Unlocking March Madness with Data Science

March Madness

Since 1939, the NCAA has hosted a single-elimination basketball tournament to determine its national champion every spring. This event is unique in that it draws interest across a wide public spectrum, from hoops enthusiasts to casual fans who can’t tell a shot clock from a baseline. Regardless of your basketball IQ, “filling out your bracket” has become a national obsession.

There is no shortage of opinions when it comes to ranking the teams. Surely, each pollster brings his or her own biases to any such attempt. Bias in this case comes in the form of attributing predictive power to attributes that have little or none to offer (including preference for an alma-mater, uniform color, pre-season rankings or last year’s performance). But the teams have been playing each other all season. Why not just leverage this information directly to sort the schools and exclude the bias?

And that is exactly what you can do, using a well-known machine learning algorithm to rank the teams based on their performance; and you don’t have to be a data scientist (or a broadcaster) to apply it.

PageRank Algorithm

The Algorithm

PageRank was derived in the early days of Google to determine the relevance of web pages. The philosophy of the algorithm is this: the most important page is the one that the most important pages link to. The algorithm assumes that if page A links to B, then B is more “important” than A. But there are many pages and many links.

Two simple formulae explain the basics of PageRank:

  1. G = α S + (1-α) E
  2. πT = πT G

G refers to the square “Google” matrix. It is composed of a matrix (S) where each row contains the signals being sent out to other pages from one particular page plus the “teleportation matrix” (E) that allows for a user to jump randomly from any page to any other - even when there’s no link. The weights given to S & E are controlled by a constant α, which is usually set around 0.9.

The value πT is the stationary row vector of G (in an infinite number of browsing attempts, these are the probabilities you’ll end up at each page). Start with all pages assigned the same weight. Iterate through a process known as the power method to derive πT. Sort this vector and you have your rankings. This is an extension of Markov Theory, a good topic for further study.

Google has much more complex approaches now, but this relatively simple method can be applied to situations in which we want to rank entities based on a signal of “importance” they send when interacting, just like college basketball.

Application to College Basketball

What happens when team A loses to B? If we interpret team A’s loss as a signal indicating that B is superior, then the PageRank algorithm should give us the teams in order of the strength of the “signals” they get from the other teams. It certainly rewards teams for winning games, but also considers the quality of the teams that beat them.

Since Google already did the hard work, all we have to do is get the game results, assign each team an equal rank, and then build the matrix of teams sending “signals” to the those that defeated them. We iterate for the stationary vector (πT) which should give us our rankings.


The code and game data to reproduce or experiment with these results can be downloaded from the following repository: https://github.com/joebluems/MarchMadness.git

Game Data

Since this algorithm is data-driven, we’re going to need data. This data is available from a variety of internet sources - use the csv file on the repository or pull the scores yourself with a series of curl commands from your favorite sports site. Game data comprised 5,778 matches through March 15, 2015 including 631 different colleges.

Division I schools can play non-Division I schools, which generates activity for teams with very few games (sorry, Daniel Webster and Sarah Lawrence College). The algorithm automatically weights these teams low to a lack of data, so no adjustments to the raw results are necessary.

Running the Algorithm

Applying the algorithm to generate PageRanks requires two steps:

Step 1 (create the matrix) – read the game data and assign a unique integer to each school. This information is helpful when displaying the results in the next step. Once the data is read, keep track of which team loses to whom and assign a “weight” to each loss. This information comprises the H matrix (a more pliable form of the S matrix), which is key to the PageRank algorithm.

Step 2 (find the ranks) – execute the power method to solve for the stationary row vector (πT) of the G matrix. Sorting this vector in descending order provides the rankings. The alpha parameter was set to 0.90 and the power method continues until the convergence criteria are satisfied (residual <= 0.001). The algorithm was adjusted to base the outgoing signal (loss) on a minimum denominator of 25 games (to account for non-Division I teams with incomplete schedules).


The following table shows the PageRank outcome and the AP ranks for comparison (the Associated Press ranks are compiled from votes by 64 sportswriters and broadcasters - which may average bias or compile it):

March Madness data science top teams

Rounding out the top 25 by PageRank are the following: Baylor, Wichita State, North Carolina, Oregon, San Diego State, Louisville, Colorado State, Dayton, Michigan State and Wofford. Every team in the top 25 made the tournament, though Dayton was assigned a play-in game to join the final 64-team bracket.

The top 10 for both lists are slightly shuffled.  It’s important to keep in mind that the PageRanks are based only on win-loss results. As the sole undefeated team, Kentucky is a no-brainer, consensus #1. Villanova’s record earns it the second spot from PageRank, but the AP ranks it 4th. Conversely, the algorithm penalizes Virginia down to 6th from 3rd in the AP poll. Based on the strength of their wins (and losses), Arkansas improves from 21st on AP to #15 on PageRank.

PageRank also made a few pulls from outside the AP top-25, like Virginia Commonwealth, Wofford and Colorado State. One of the differences of this approach is that it treats all games equally - the polls will give greater weight to more recent games (which may not be a bad idea). Understanding a method’s limitations will you you the clues to improve it to fit your situation.


Deeper analysis may be achieved through smarter weightings. Should a blowout win on the road count as much as a triple OT nail-biter at home? Outscoring a team below your division might also be weighted less. These are just a few examples; you are encouraged to experiment with the scripts to explore the algorithm in depth. Maybe you can do better!

This approach may be applicable to your business problem if you’re looking for insights from how entities interact with each other. As the size and complexity of your problem grows, you may consider Hadoop for storing the relationships, distributing the calculation or streaming to update the network more quickly.  Apache Spark is a good starting point for scaling up (Spark’s GraphX library has PageRank implementations).

This analysis comes short of what we really wantthe true ranking of the best teams. Despite our best efforts, there’s no foolproof way of picking winners even when the data tells us which one should. After all, that is why they play the games.

Streaming Data Architecture:

New Designs Using Apache Kafka and MapR Streams




Download for free