The Method Behind March Madness

There are roughly 150 quintillion (i.e. the one after quadrillion) possible ways to complete your NCAA bracket. Some of us don’t have time to review them all; if you are likewise short on time, you can let MapR do the heavy lifting for you and get your personalized bracket from the Crystal B-Ball!

In this post, we describe the methodology of the Crystal B-Ball and use it to make some predictions about who’s going to the 2016 Final Four and offer some probabilities about which team will be crowned the national champion in Houston on April 4. We cannot guarantee victory in your office pool, but we can promise you won’t get a perfect bracket.


Completing a “smart” bracket is deceptively simple. All you need to do is determine a win probability for the teams in each game (the two add to 100%, so you only need to find one of them), simulate a large number of matchups, and then tabulate the most likely outcomes. As is often the case, what appears simple turns out to be incredibly challenging, and determining accurate win probabilities for each team is the main obstacle. The NY Times has published a summary of the advanced metrics currently used for filling out brackets.
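To make the "find one probability, simulate, tabulate" recipe concrete, here is a minimal sketch for a single matchup. The function name, the 73% favorite, and the fixed random seed are illustrative assumptions, not part of the Crystal B-Ball itself.

```python
import random
from collections import Counter

def simulate_matchup(p_team_a, n_games, seed=0):
    """Play the same matchup n_games times and count wins per team."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    tally = Counter({"A": 0, "B": 0})
    for _ in range(n_games):
        # Team B's win probability is implied: 1 - p_team_a.
        winner = "A" if rng.random() < p_team_a else "B"
        tally[winner] += 1
    return tally

# Hypothetical favorite that wins 73% of the time.
tally = simulate_matchup(p_team_a=0.73, n_games=100_000)
share_a = tally["A"] / sum(tally.values())
print(f"A won {share_a:.1%} of simulated games")
```

With enough simulated games, the tabulated share settles close to the probability that went in, which is the behavior a full bracket simulation relies on.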

The goal of generating a customized bracket is to link the fundamentals of the game with historical outcomes and present those options to a user such that he or she can select the importance of each. Here is a list of those options:

  • Seed-Matchup History – tournament history is available in several places. This informs us that when #2 plays #7, #2 wins 73% of those games. It’s important not to put too much emphasis on the seeds (i.e. #1-#16) because doing so diminishes the potential for a lower (i.e. worse) seed to emerge from each region.
  • Performance vs. Common Opponents (“MapR Rank”) – a graph-based technique based on Google’s PageRank, described in a 2015 MapR blog post. The kernel of these rankings is already baked into the seed matchups, so this doesn’t really kick in until later rounds (when there are 1 vs. 2 and 1 vs. 1 games). A significant difference in season rank increases the win probability of the team with the lower (i.e. better) rank. The MapR rank considers the strongest teams to be Kansas, Oregon, Villanova, Michigan State and North Carolina.
  • Hoops Fundamentals – boosts for these categories are based on downloaded statistics from the 2015-2016 season. The adjustments to win probability for each team are based on predictive models built on game data since 2003, made available for a recent Kaggle competition.
    • Foul shot conversion – adjusting the win probability based on the team with the better seasonal FT%. The boost is determined by a classification model using the difference between FT% as its only input.
    • 3-point accuracy – similar to the free-throw correction, this adjustment is based on a model with “difference in seasonal three-point accuracy” as its sole input.
    • Rebounding proficiency – if you played organized basketball, your coach probably lectured you endlessly about the importance of defense. Dedicated to all the coaches out there, this adjustment awards the team with more rebounds per game a slight bump (also based on a model built on the Kaggle data) to its win probability.
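The post does not say which kind of classification model sits behind each boost. One common choice for a model with a single "difference" input is logistic regression, so the sketch below assumes that form. The function name and the coefficient value are hypothetical; a real coefficient would be fitted to the historical game data.

```python
import math

def win_probability_from_gap(stat_gap, coefficient):
    """Logistic curve: P(team A wins) given A's stat minus B's stat."""
    return 1.0 / (1.0 + math.exp(-coefficient * stat_gap))

# Hypothetical coefficient, chosen only for illustration.
FT_COEFFICIENT = 3.0

# Toy example: team A shoots 75% from the line, team B shoots 68%.
p_a = win_probability_from_gap(0.75 - 0.68, FT_COEFFICIENT)
p_b = win_probability_from_gap(0.68 - 0.75, FT_COEFFICIENT)
print(f"A: {p_a:.3f}  B: {p_b:.3f}")
```

A useful property of this form is symmetry: evenly matched teams get exactly 50%, and swapping the two teams flips the probability to its complement, so the pair always adds to 100%.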

Based on the weight the user supplies for each category, the probabilities are adjusted for each match (with a small dose of randomness to simulate the unpredictability of the tournament) and the games are “played”, round-by-round, until a winner has been determined and the bracket completed.
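The round-by-round procedure described above can be sketched as follows. Everything specific here is an assumption made for illustration: the four-team field and its stats are toy values, the weights are applied as simple multipliers on stat gaps, the noise is uniform, and a flat 50% stands in for the seed-history lookup. The Crystal B-Ball's actual internals are not described in enough detail to reproduce.

```python
import random

# Weights echo the fundamentals settings used in this post; how the real
# tool applies them is an assumption of this sketch.
WEIGHTS = {"free_throws": 0.84, "three_point": 0.92, "rebounds": 0.85}

# Toy season stats for a hypothetical four-team field (not real data).
# "rebounds" is expressed as a share of available rebounds to keep scales alike.
STATS = {
    "A": {"free_throws": 0.75, "three_point": 0.38, "rebounds": 0.53},
    "B": {"free_throws": 0.68, "three_point": 0.35, "rebounds": 0.49},
    "C": {"free_throws": 0.71, "three_point": 0.33, "rebounds": 0.51},
    "D": {"free_throws": 0.66, "three_point": 0.36, "rebounds": 0.47},
}

def adjusted_probability(base_p, boosts, weights, rng, noise=0.05):
    """Shift a base win probability by weighted boosts plus a little noise."""
    shift = sum(weights[name] * boosts.get(name, 0.0) for name in weights)
    p = base_p + shift + rng.uniform(-noise, noise)
    return min(max(p, 0.01), 0.99)  # either side can always win

def matchup_probability(a, b, rng):
    """P(a beats b); the 0.5 base stands in for a seed-history lookup."""
    boosts = {name: STATS[a][name] - STATS[b][name] for name in WEIGHTS}
    return adjusted_probability(0.5, boosts, WEIGHTS, rng)

def play_bracket(teams, prob_fn, rng):
    """Play adjacent pairs round by round (field size must be a power of two)."""
    rounds = [list(teams)]
    while len(rounds[-1]) > 1:
        field = rounds[-1]
        winners = []
        for a, b in zip(field[0::2], field[1::2]):
            winners.append(a if rng.random() < prob_fn(a, b, rng) else b)
        rounds.append(winners)
    return rounds

rounds = play_bracket(["A", "B", "C", "D"], matchup_probability, random.Random(2016))
print("Champion:", rounds[-1][0])
```

Clamping the probability keeps a heavily boosted favorite from becoming a guaranteed winner, which preserves the possibility of an upset in every game.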


The randomness in the methodology generates an endless supply of unique brackets, especially in the earlier rounds. To summarize the results, we need to generate many brackets and tabulate the outcomes. When numerous simulations of random events are aggregated, probabilities emerge in the long run, filtering out the “madness” of long shots, dark horses and Cinderella teams. Keep in mind while assessing these results that Goliath beats David 99 out of 100 times, but if they only fight once, there’s always a chance.
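How many brackets count as "many"? A rough guide, assuming each simulated bracket is an independent draw, is the standard error of a simulated frequency, sqrt(p(1 - p) / n). The sketch below applies it to a David-sized 1% chance. The function name is hypothetical, and the figure describes only simulation noise, not how accurate the underlying win probabilities are.

```python
import math

def simulation_standard_error(p, n_brackets):
    """Standard error of a frequency estimated from n independent brackets."""
    return math.sqrt(p * (1.0 - p) / n_brackets)

# A 1-in-100 upset chance, estimated from progressively more brackets.
for n in (100, 10_000, 1_000_000):
    se = simulation_standard_error(0.01, n)
    print(f"{n:>9,} brackets: 1% +/- {100 * se:.3f} percentage points")
```

Because the error shrinks with the square root of the number of brackets, each hundredfold increase in simulations buys only a tenfold tighter estimate.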

The following results are based on 1 million brackets generated from the Crystal B-Ball with the following weights that favor fundamentals over reputation (you can set your own, plus award one team an automatic spot in the Final Four):

Category               Weight
Seed Matchup           15%
Common Opponents       15%
Foul Shot Conversion   84%
3-Point Accuracy       92%
Defense (rebounds)     85%


The table below shows the percent of simulations in which each team reached the Final Four. 


Table 1 - Final Four Probabilities, by Region

Region     Top team     % of brackets
South      Kansas       34
West       Oklahoma     37
East       Xavier       25
Midwest    Mich. St.    44

Kansas represented the South in the Final Four in 34% of the brackets. This was the only region where the #1 seed dominated the others. In the Midwest region, Michigan State, the #2 seed, appeared to be an overwhelming favorite to reach the Final Four (44% of brackets compared to Virginia at 18%). Oklahoma and Xavier, also #2 seeds, rounded out the most likely Final Four participants.

Using a similar aggregation, the teams that were most frequently crowned as champions among the 1 million brackets appear in the table below.

Table 2 - Championship Probabilities

Rank   Team             % of brackets
1      Michigan State   19


The numbers suggest Michigan State is the team to beat this year. It appears that the Jayhawks are poised to give them their toughest challenge, if they survive a potential semi-final against Oklahoma.

Predicting an event doesn’t seem to influence the outcome but it does make it more fun, especially when this particular event can be considered, in statistical terms, a stochastic mystery wrapped in a random puzzle. In other words, the Spartans should wait to win before cutting down the nets in Houston.

Some of these games may come down to a buzzer-beater, questionable officiating, or one of those random bounces that makes a ball destined for the basket roll around the rim and out. We have yet to build a model that reliably accounts for those “one-in-a-quintillion” shots. 


