Spring is the time when many sports fans are glued to their brackets, hoping to pick correctly from among the roughly 150 quintillion possible outcomes of the NCAA Basketball Championship (see our blog post on this subject). My personal highlight of the spring sports calendar is the Masters Golf Tournament, which is held every year at the Augusta National Golf Club in Georgia. The professional golf schedule contains four major tournaments each year: The Masters, The US Open, The Open Championship, and The PGA Championship. Of these tournaments, only the Masters is played on the same course every year, and its champion is awarded the iconic green jacket.
What does winning mean?
Having participated in a number of fantasy sports leagues, and working as a Data Scientist at MapR, I bring a particular perspective to choosing who I think is most likely to “win” the tournament. Let’s first define what winning means. Correctly identifying the one golfer who finishes first is rare and very difficult to predict. In fantasy golf, your overall points are the sum of the points earned by the six golfers you’ve selected from the field of approximately 100 starters. Therefore, your strategy should begin with ensuring that all six make the cut and play into the weekend for a chance at finishing close to the top (all six in the Top 10 is what I seek to achieve). This is a way of shrinking the problem, since I can begin by identifying who will most likely miss the cut.
Statistical performance metrics in golf are numerous, beginning with the most basic: the score. How many strokes did it take to complete all 18 holes in a given round? For a PGA-caliber player this number will range between the low 60s and the mid 70s. Because every week is played on a different course under varying conditions, the scores from week to week need to be standardized to understand each player’s performance relative to the rest of the field (the z-score transformation works well here). A score of 70 may win one week but miss the cut the next. The same holds true for most other metrics, such as driving distance, driving accuracy, number of putts, and the rest of the 30+ metrics I use for determining a player’s form. I should also note that the game has evolved: I quantify performance only on the PGA Tour, yet many golfers also play on the European and Asian Tours, which introduces selection bias into my system (it is imperfect and I accept it).
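To make the standardization concrete, here is a minimal sketch of a per-week z-score. The player names and scores are made up for illustration, and `field_z_scores` is a hypothetical helper, not the code behind my system. It assumes each round is standardized against the mean and standard deviation of the field that played it.

```python
from statistics import mean, pstdev


def field_z_scores(scores):
    """Standardize one round's scores against the field that played it.

    `scores` maps player name -> strokes. In golf a lower score is better,
    so a negative z-score means the player beat the field average.
    """
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        # Everyone shot the same score, so nobody stands out.
        return {player: 0.0 for player in scores}
    return {player: (s - mu) / sigma for player, s in scores.items()}


# Hypothetical data: Player C shoots 70 on an easy week and on a hard week.
easy_week = {"Player A": 66, "Player B": 68, "Player C": 70}
hard_week = {"Player A": 74, "Player B": 72, "Player C": 70}

easy_z = field_z_scores(easy_week)
hard_z = field_z_scores(hard_week)
```

The same raw 70 gets a positive z-score in the easy week (worse than the field) and a negative one in the hard week (better than the field), which is exactly why raw scores cannot be compared across weeks.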
Horse for the course
As I stated earlier, the Masters Tournament is held on the same course every year, which takes some of the uncertainty out of the equation. I’ve built models to estimate which strengths give a player the best chance of taking the lead. At Augusta, these are overall driving, approach shots from 175 to 225 yards, and par four scoring average. It’s rather surprising that putting average and strokes gained putting are absent from the top statistics, since the greens at Augusta National are notoriously FAST! It has been said that most weekend golfers would shoot over par playing only on the greens there.
Collecting, parsing, and normalizing all of my discrete player data is just the start. How many prior rounds best define top form? How do I weight certain predictors? What about past champions? For the purposes of modeling the Masters Tournament, I have found that using all data from the current season (approximately 14 weeks) works best, with a slight weighting factor that rewards more recent performances. This would change if I were predicting a tournament later in the professional golfing season, due to cyclicality or seasonality effects in performances.
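One simple way to express a slight recency weighting is an exponentially decaying weight per week. The sketch below is an illustration only: the decay value of 0.95, the function name, and the sample z-scores are assumptions, not the actual weights I use.

```python
def recency_weighted_mean(values, decay=0.95):
    """Average `values` (ordered oldest to newest) with more weight on recent ones.

    The newest value gets weight 1, the one before it `decay`, then `decay**2`,
    and so on. A decay of 1.0 reduces to the plain mean.
    """
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical weekly scoring z-scores, oldest first (more negative is better).
improving = [0.5, 0.2, -0.1, -0.6, -1.0]

plain = sum(improving) / len(improving)
weighted = recency_weighted_mean(improving)
```

For a player trending in the right direction, the weighted figure comes out better (more negative) than the plain average, so recent form counts for a little more without discarding the earlier weeks.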
My first step, as I explained, is to run a clustering algorithm on the field and build a smaller candidate set of players I expect to play into the weekend. The K-means algorithm works nicely here because of its tendency, often undesirable, to build similarly sized groups; in my situation this is exactly what I want. Looking back, very few players in my “Missed Cut” cluster have ever gone on to break into the Top 10, and this has only happened in years when inclement weather delays forced certain elite players to play 36 holes on the later tournament days.
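The clustering step can be sketched with a small, dependency-free version of K-means (Lloyd's algorithm). Everything specific here is an assumption for illustration: the two features, the six made-up players, the choice of two clusters, and the farthest-first initialization, which I use only to keep the example deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm with a deterministic farthest-first start."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(d) / len(members) for d in zip(*members)))
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical features: (scoring z-score, par-four scoring z-score); lower is better.
field = {
    "Player A": (-1.0, -0.8),
    "Player B": (-1.2, -0.5),
    "Player C": (-0.7, -1.1),
    "Player D": (0.9, 1.0),
    "Player E": (1.3, 0.6),
    "Player F": (0.6, 1.2),
}
names = list(field)
labels, centroids = kmeans([field[n] for n in names], k=2)

# Treat the cluster with the worse (higher) average scoring z as "Missed Cut".
missed_cut = max(range(2), key=lambda c: centroids[c][0])
candidates = [n for n, lab in zip(names, labels) if lab != missed_cut]
```

Only the players outside the “Missed Cut” cluster are carried forward into the modeling step.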
I can now get to modeling specific performance metrics and how they affect the outcome. I start by building a Random Forest model on all of the data I have, using the finishing position as the target variable. With the model trained, I select random groups of rounds played by a given golfer and predict the finishing position for each round. In effect, I simulate 500 or so rounds for each player and average the predicted finishing positions. At the end, I build a ranked list of estimated finishing positions for all of the players I predict will make the cut this year.
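The resample-and-average step can be sketched as follows. To keep the example self-contained, a toy scoring function stands in for the fitted Random Forest; `toy_model`, its coefficients, the two features, and the player data are all hypothetical. In practice the `predict_finish` argument would wrap the trained model's prediction call.

```python
import random
from statistics import mean


def simulate_expected_finish(rounds_by_player, predict_finish, n_sims=500, seed=0):
    """Rank players by their average predicted finish over resampled rounds.

    For each player, draw `n_sims` rounds with replacement from that player's
    history, predict a finishing position for each draw, and average them.
    """
    rng = random.Random(seed)
    expected = {}
    for player, rounds in rounds_by_player.items():
        draws = rng.choices(rounds, k=n_sims)
        expected[player] = mean(predict_finish(r) for r in draws)
    # Lowest expected finishing position first.
    return sorted(expected.items(), key=lambda item: item[1])


def toy_model(round_features):
    """Stand-in for a fitted model: better driving and approach play, better finish.

    Here a higher z-score means better than the field on that metric.
    """
    driving_z, approach_z = round_features
    return max(1.0, 30.0 - 10.0 * driving_z - 8.0 * approach_z)


# Hypothetical per-round features: (driving z-score, approach z-score).
rounds_by_player = {
    "Player A": [(1.0, 0.8), (0.6, 1.1), (1.2, 0.5)],
    "Player B": [(-0.5, -0.2), (0.1, -0.4), (-0.3, 0.0)],
}

ranking = simulate_expected_finish(rounds_by_player, toy_model)
```

The result is a list of (player, expected finishing position) pairs sorted from best to worst, which is the shape of the ranked list described above.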
Not wanting to disappoint my readers, here are the six players, ranked in order, who my statistical modeling predicts are most likely to finish in 10th place or better this year (and hopefully 1st):
- Jason Day
- Dustin Johnson
- Adam Scott
- Bubba Watson
- Rickie Fowler
- Charl Schwartzel
- Adam Scott
- Rickie Fowler
- Jason Day
- Justin Rose
- Jordan Spieth
- Phil Mickelson
A few of these match my intuition picks and some surprise me. In the end, my statistical approaches have routinely beaten my intuition by 30-40%, and I hope the same is true in 2016.