Individual sports have long been a popular option for sports gamblers around the world. Instead of trying to predict how an entire team will play on a given day, the focus can be solely on two individuals. Tennis falls into that category, with matches played almost daily throughout the year.

A study by Andre Cornman, Grant Spellman, and Daniel Wright called “Machine Learning for Professional Tennis Match Prediction and Betting” attempted to use machine learning to win money over the long run. After building a model from historical match data, they had considerable success over a two-year run. They now plan to extend it over multiple years.

Forming the model

Finding statistics in tennis, as in many other sports, is getting easier thanks to several databases that are popping up. Detailed information on each match, especially from the last decade or so, is readily available on free websites. The trio used a few different sources to aggregate not only tennis match data but tennis betting information as well.


What they found is that certain pieces of data are more useful than others when predicting matches. Past head-to-head matchups between the two players were a major consideration, but so were player rankings, current ages, aces, break point chances and conversions, double faults, and more. They did not disclose exactly how many stats were used, or how much weight was put on each.

Instead of looking at a relatively small sample, they drew many of the statistics from each player's recent matches. Going back as far as the last 20 matches, an average gave them a more usable number, with the playing surface taken into account.
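To make the idea of a recent-match average concrete, here is a minimal sketch of a rolling, surface-specific average. It is not the authors' code: the function name, the dictionary fields, and the choice of aces as the example stat are all assumptions made for illustration.

```python
from collections import defaultdict, deque

def rolling_surface_averages(matches, window=20):
    """For each match, average a stat over up to `window` earlier matches
    by the same player on the same surface (None if there is no history).

    `matches` is a hypothetical list of dicts in chronological order.
    """
    # (player, surface) -> the most recent `window` values of the stat
    history = defaultdict(lambda: deque(maxlen=window))
    features = []
    for m in matches:
        past = history[(m["player"], m["surface"])]
        features.append(sum(past) / len(past) if past else None)
        # Record the current match only after computing its feature,
        # so a match never sees its own result.
        past.append(m["aces"])
    return features

matches = [
    {"player": "A", "surface": "clay", "aces": 4},
    {"player": "A", "surface": "hard", "aces": 10},
    {"player": "A", "surface": "clay", "aces": 6},
    {"player": "A", "surface": "clay", "aces": 8},
]
features = rolling_surface_averages(matches)
print(features)  # [None, None, 4.0, 5.0]
```

The hard-court match does not affect the clay averages, which is the point of keeping a separate history per surface.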

Why playing surface matters so much

A lot of tennis models have failed in the past simply by not putting much importance on playing surface. Those who aren’t familiar with the sport might not understand just how big an impact playing surface has on tennis matchups.

The three main surfaces are hard, clay, and grass. Not only is movement different for the players, but the ball bounces differently as well. It takes a lot of time playing on a surface to feel comfortable, which is why players from certain parts of the world tend to excel on different surfaces. There are a lot of hard courts in the United States, which is why Americans tend to do well on hard courts but struggle on European clay.

Attempted methods

The team decided to try four different methods, which had varying levels of success. The way they used the information they had put together needed tweaking, and each method also needed to show long-term success for it to make sense.

Logistic regression, SVM, neural network, and random forest were the four they attempted. They all showed promise in some ways, and they were put through a series of tests to see how they would perform in the long run.

Logistic regression

Logistic regression showed quite a bit of accuracy early on. The most difficult thing they ran into was training the model, since it had so many different types of information going in. They concluded that this was a decent option to go with, and that some slight tweaks might help as well. Simply put, they did not have the resources to add even more polynomial terms.
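For readers unfamiliar with the technique, the following is a small, self-contained sketch of logistic regression trained by gradient descent. It is only an illustration under stated assumptions: the two features (a scaled ranking difference and a difference in average aces between the players), the toy data, and the hyperparameters are invented here and are not taken from the study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Estimated probability that player A wins, given feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi  # derivative of log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: [ranking difference / 100, average aces difference],
# both computed as player A minus player B. Label 1 means A won.
X = [[0.5, 2.0], [0.3, 1.0], [-0.4, -1.5], [-0.2, -0.5], [0.1, 0.5], [-0.6, -2.0]]
y = [1, 1, 0, 0, 1, 0]

w, b = train_logistic(X, y)
p_strong = predict(w, b, [0.4, 1.5])    # A clearly ahead on both features
p_weak = predict(w, b, [-0.4, -1.5])    # A clearly behind on both features
print(p_strong > 0.5, p_weak < 0.5)
```

Adding polynomial terms, as mentioned above, would mean appending products such as `x[0] * x[1]` to each feature vector, which grows the number of weights quickly and helps explain why training became expensive.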


SVM

A total of three different SVM kernels were used in their testing. RBF, polynomial, and linear all showed a little bit of promise, but each had its holdups. RBF was extremely slow to train, and it tended to overfit. Polynomial was also very slow to train. Finally, linear gave them the most success, mainly because it used only the first-order terms of the feature set. It was ultimately the kernel they decided to go with when trying out the SVM model.

Random forest

This turned out to be a very fast model to train. Given its simplicity, they were able to do some additional tuning to get the type of results that made the most sense. Not only did this model predict matches accurately, it also indicated whether it was a smart move to bet on a match at all.

Neural network

It became pretty obvious to the team early on that a neural network was not going to work as they had thought. Its accuracy was never able to match the other three methods. Tuning could have helped, but they were unable to do that on the computers they were using.

Maximizing earnings with the random forest model

Once they decided to stick with the random forest model, it came down to maximizing earnings. The betting strategy was framed as a single-shot decision problem, looking at each match on its own. Studies show that there is no real connection between one match and another, so looking at matches independently made the best predictions.

The betting strategy ended up earning about 3.3% per match over two years. This figures in the decision not to bet on 29% of the matches in each event, where the two players were considered close enough that it was not worth the gamble.
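A single-match betting decision of this kind can be sketched as follows. This is a hedged illustration, not the study's actual rule: it assumes decimal odds, uses expected profit per unit staked as the criterion, and introduces a hypothetical `min_edge` threshold below which no bet is placed.

```python
def choose_bet(p_a, odds_a, odds_b, min_edge=0.0):
    """Decide which player to back in one match, or skip it.

    p_a: model probability that player A wins.
    odds_a, odds_b: decimal odds offered on each player (assumed format).
    Returns (side, expected profit per unit staked); side is None for no bet.
    """
    ev_a = p_a * odds_a - 1.0          # expected profit from backing A
    ev_b = (1.0 - p_a) * odds_b - 1.0  # expected profit from backing B
    side, ev = ("A", ev_a) if ev_a >= ev_b else ("B", ev_b)
    if ev <= min_edge:
        return None, 0.0               # too close to be worth the gamble
    return side, ev

# A clear favorite at decent odds: bet on A.
side, ev = choose_bet(0.70, 1.50, 2.60)
print(side, round(ev, 3))

# A near coin flip with the bookmaker's margin built in: skip the match.
skip_side, skip_ev = choose_bet(0.52, 1.80, 2.00)
print(skip_side, skip_ev)
```

Because each match is evaluated on its own, the function needs nothing but that match's probability and odds, which mirrors the single-shot framing.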

While it might be discouraging for some gamblers to see that virtually three out of every 10 matches shouldn’t have money on them, it’s also a way to stay disciplined. A lot of people who don’t use models make the mistake of gambling on these matches anyway, which hurts their overall earnings. The model showed a considerable amount of success on the bets it did place, but knowing which matches to stay away from is just as valuable.

Success mostly relied on betting the favorites, as the model only accurately predicted the underdog as a winner 47.1% of the time. It shows that the model is good at making bets on the favorite, but not at identifying underdog opportunities for big earnings on a single bet. This is more of a long-term model that has success over time.

Continued studies

With encouraging results, the team decided that they will continue to use the model for the next few seasons. After each year, they will analyze what slight tweaks they can make with the data available. Adding a bit more flexibility to the betting decision model is one change they think could help as well. The first iteration only considers one standard bet size, which is a major reason why so many matches were avoided. With varied amounts, it could make financial sense to bet smaller amounts on closer matchups.
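One common way to vary the bet size with the strength of the edge is the Kelly criterion. The sketch below is only an illustration of that general idea; nothing in the study says the authors plan to use it, and the `fraction` and `cap` parameters are assumptions added here to keep stakes conservative.

```python
def kelly_stake(p, odds, bankroll, fraction=0.5, cap=0.05):
    """Suggested stake for one bet using a fractional Kelly rule.

    p: model probability that the backed player wins.
    odds: decimal odds on that player (assumed format).
    fraction: share of the full Kelly bet to use (hypothetical default).
    cap: maximum share of the bankroll on any single bet (hypothetical default).
    """
    net_odds = odds - 1.0
    full_kelly = (p * odds - 1.0) / net_odds    # optimal bankroll share
    share = max(0.0, full_kelly) * fraction     # never bet on a negative edge
    return bankroll * min(share, cap)

big_edge = kelly_stake(0.70, 1.50, bankroll=1000)    # roughly 50
small_edge = kelly_stake(0.55, 1.90, bankroll=1000)  # roughly 25
no_edge = kelly_stake(0.50, 1.90, bankroll=1000)     # 0.0
print(round(big_edge, 2), round(small_edge, 2), no_edge)
```

Under a rule like this, a close matchup with a small edge gets a small stake instead of being skipped outright, which is the kind of flexibility described above.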

All in all, anyone into tennis and betting should look at this as an interesting study on machine learning and successful match betting. With a success rate of about 70% over years of practice, it shows that they have something that works.
