Data Analysis: March Madness Predictions

I didn’t really pay attention to college basketball this year, so I decided to take a different approach to filling out my bracket.

I started by downloading the full Division 1 men’s basketball schedule (pulled from rivals.yahoo.com), along with each game’s score, date, and home team.

In the model, I assume that each team has two (unknown) vectors of real numbers that describe how good its offense and defense are, respectively, on various attributes. For example, we might want to represent how good each team’s guards, forwards, and centers are, both on offense and on defense. We could do this using an offensive and a defensive vector:

Offense: [5, 10, 4]

Defense: [2, 3, 10]

This means that the guards rate a 5 on offense and a 2 on defense, and so on. In my model, it is easier to assume that high numbers are better for offenses and low numbers are better for defenses.

The score of a game between team i and team j is generated as the dot product of team i’s offensive vector with team j’s defensive vector, and vice versa. In our running example, suppose the team from before played a team with these vectors:

Offense: [3, 2, 4]

Defense: [2, 5, 5]

Then the first team’s score is predicted to be 5 * 2 + 10 * 5 + 4 * 5 = 80

and the second team’s score is predicted to be 3 * 2 + 2 * 3 + 4 * 10 = 52

What a blowout!
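A minimal sketch of this calculation, assuming NumPy. The function name predict_score is made up for illustration; the vectors are the ones from the running example.

```python
import numpy as np

# Vectors from the running example, ordered [guards, forwards, centers]
team1_offense = np.array([5, 10, 4])
team1_defense = np.array([2, 3, 10])
team2_offense = np.array([3, 2, 4])
team2_defense = np.array([2, 5, 5])

def predict_score(offense, opponent_defense):
    """Predicted points: dot product of an offense with the opposing defense."""
    return float(np.dot(offense, opponent_defense))

team1_score = predict_score(team1_offense, team2_defense)
team2_score = predict_score(team2_offense, team1_defense)
print(team1_score, team2_score)  # 80.0 52.0
```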

Now, the only problem is that we don’t really know the vectors that describe each team’s offense and defense. Okay, we’ll learn them from the data.

Formally, the goal is to find latent matrices O and D that minimize the sum of the squared error between the predicted and observed scores. In mathematics,

sum_g (gi_score - O_i: * D_j:)^2 + (gj_score - O_j: * D_i:)^2

where team i played team j in game g (i and j depend on g, but I drop this dependency from the notation to keep things simple), O_i: is team i’s row of O, and D_j: is team j’s row of D.*

I won’t go into detail, but we can take the derivative of the error function with respect to each latent vector to find changes to the vectors that make them match the results of all the games played so far this season more closely. I repeat this until no further change improves the error (batch gradient descent, for the detail-oriented).
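The procedure just described might look roughly like the sketch below. It is an illustration under stated assumptions, not the code behind this post: the function names, the learning rate, the fixed iteration count (instead of a stop-when-nothing-improves check), the random initialization, and the toy games are all made up, and the regularization from the footnote is left out.

```python
import numpy as np

def squared_error(games, O, D):
    """Sum of squared errors between observed and predicted scores."""
    total = 0.0
    for i, j, score_i, score_j in games:
        total += (score_i - O[i] @ D[j]) ** 2 + (score_j - O[j] @ D[i]) ** 2
    return total

def fit_latent_vectors(games, n_teams, k=1, lr=0.005, n_iters=500, seed=0):
    """Batch gradient descent on the squared error.

    games: list of (i, j, score_i, score_j) tuples with integer team ids.
    lr and n_iters are illustrative and would need tuning on real scores.
    """
    rng = np.random.default_rng(seed)
    O = rng.uniform(0.5, 1.5, size=(n_teams, k))  # offensive vectors, one row per team
    D = rng.uniform(0.5, 1.5, size=(n_teams, k))  # defensive vectors, one row per team
    i_idx = np.array([g[0] for g in games])
    j_idx = np.array([g[1] for g in games])
    s_i = np.array([g[2] for g in games], dtype=float)
    s_j = np.array([g[3] for g in games], dtype=float)
    for _ in range(n_iters):
        # Residuals for both teams in every game
        err_i = s_i - np.sum(O[i_idx] * D[j_idx], axis=1)
        err_j = s_j - np.sum(O[j_idx] * D[i_idx], axis=1)
        grad_O = np.zeros_like(O)
        grad_D = np.zeros_like(D)
        # Derivative of (s - o.d)^2 is -2*err*d w.r.t. o and -2*err*o w.r.t. d;
        # np.add.at accumulates over all games a team appears in.
        np.add.at(grad_O, i_idx, -2 * err_i[:, None] * D[j_idx])
        np.add.at(grad_D, j_idx, -2 * err_i[:, None] * O[i_idx])
        np.add.at(grad_O, j_idx, -2 * err_j[:, None] * D[i_idx])
        np.add.at(grad_D, i_idx, -2 * err_j[:, None] * O[j_idx])
        O -= lr * grad_O
        D -= lr * grad_D
    return O, D

# Made-up toy schedule: three teams, each pair plays once
toy_games = [(0, 1, 3.0, 1.5), (0, 2, 4.0, 1.0), (1, 2, 3.0, 1.5)]
O_fit, D_fit = fit_latent_vectors(toy_games, n_teams=3)
print(squared_error(toy_games, O_fit, D_fit))
```

Every update uses all games at once, which is what makes this batch rather than stochastic gradient descent.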

Results

If I choose one-dimensional latent vectors, the output is a single offensive rating and a single defensive rating for each team. Remember, to predict the first team’s score against another team, multiply the first team’s offensive rating (higher is better) by the second team’s defensive rating (lower is better).

Here are the top 10 offenses and defenses, as learned by the 1D version of my model:

Offenses

North Carolina (9.79462281797)

Pittsburgh (9.77375501699)

Connecticut (9.74628326851)

Memphis (9.71693872544)

Louisville (9.69785532917)

Duke (9.65866585522)

UCLA (9.59945808934)

West Virginia (9.56811566735)

Arizona St. (9.56282860536)

Missouri (9.55043151623)

Defenses

North Carolina (7.02359489844)

Pittsburgh (7.0416251036)

Memphis (7.05499448413)

Connecticut (7.07696194481)

Louisville (7.14778041166)

Duke (7.18950625894)

UCLA (7.21883856723)

Gonzaga (7.22607569868)

Kansas (7.2289767174)

Missouri (7.2395184452)
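As a quick illustration of how to read these 1D ratings, here is the prediction for a hypothetical North Carolina vs. Pittsburgh game, using only the numbers listed above. The variable names are made up for the example.

```python
# 1D ratings copied from the lists above
nc_offense, nc_defense = 9.79462281797, 7.02359489844
pitt_offense, pitt_defense = 9.77375501699, 7.0416251036

# Each team's score is its own offense times the opponent's defense
nc_score = nc_offense * pitt_defense
pitt_score = pitt_offense * nc_defense
print(f"North Carolina {nc_score:.2f}, Pittsburgh {pitt_score:.2f}")
```

Under the 1D model this comes out as a near toss-up, roughly 69.0 to 68.6 in North Carolina’s favor.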

And here are the results of simulating the entire tournament with a 5-dimensional model. For each game, I report the predicted score, but for the bracket I just picked the predicted winner.

==================== ROUND 1 =====================

Louisville 75.8969699266, Morehead St. 54.31731649

Ohio St. 74.9907105909, Siena 69.6702059811

Utah 69.7205426091, Arizona 69.2592708246

Wake Forest 72.3264784371, Cleveland St. 64.3143396939

West Virginia 66.7025939102, Dayton 57.550404701

Kansas 84.0565034675, North Dakota St. 71.281863854

Boston Coll. 65.0669174572, USC 68.7027018576

Michigan St. 77.3858437718, Robert Morris 59.6407479

Connecticut 91.9763662649, Chattanooga 63.9941388666

BYU 74.7464520646, Texas A&M 70.5677646712

Purdue 69.8634461612, Northern Iowa 59.4892887466

Washington 81.8475059935, Mississippi St. 74.6374151171

Marquette 73.4307446299, Utah St. 69.1796188404

Missouri 83.8888903275, Cornell 68.1053984941

California 74.9638076999, Maryland 71.2565877894

Memphis 78.3145709447, CSU Northridge 59.0206289492

Pittsburgh 85.5983991252, E. Tennessee St. 64.8099546261

Oklahoma St. 81.6131739754, Tennessee 81.8021658489

Florida St. 59.994769086, Wisconsin 60.9139371828

Xavier 77.3537694, Portland St. 63.8161558802

UCLA 76.790261041, VCU 65.2726887151

Villanova 72.9957948506, American 58.6863439306

Texas 64.5805075558, Minnesota 62.3595994418

Duke 85.084666484, Binghamton 61.1984347353

North Carolina 99.2788271609, Radford 69.7291392149

LSU 65.0807263343, Butler 64.9895028812

Illinois 70.6250577544, West. Kentucky 57.6646396014

Gonzaga 75.0447785407, Akron 61.0678281691

Arizona St. 64.7151394863, Temple 58.0578420156

Syracuse 74.7825424779, Stephen F. Austin 60.5056731732

Clemson 74.4054903161, Michigan 70.8395522274

Oklahoma 78.5992492855, Morgan St. 59.7587888038

==================== ROUND 2 =====================

Louisville 67.3059313968, Ohio St. 60.5835683909

Utah 71.3007847464, Wake Forest 73.2895225467

West Virginia 67.9574088476, Kansas 67.4869037187

USC 62.1192840465, Michigan St. 64.56295945

Connecticut 76.8719158147, BYU 71.8412099454

Purdue 74.245343296, Washington 73.6100911982

Marquette 76.4607554812, Missouri 80.5497967091

California 64.7143532135, Memphis 70.9373235427

Pittsburgh 79.1278381289, Tennessee 70.6786108051

Wisconsin 63.0943233452, Xavier 63.5379857382

UCLA 74.1282015782, Villanova 71.4919550735

Texas 66.3817261194, Duke 70.9875941571

North Carolina 86.2296333847, LSU 73.8695973309

Illinois 62.6218220536, Gonzaga 65.6078661776

Arizona St. 74.0588194422, Syracuse 71.254787147

Clemson 76.9943827197, Oklahoma 78.9108038697

==================== SWEET 16 =====================

Louisville 72.8097088102, Wake Forest 68.2411945982

West Virginia 66.1905929215, Michigan St. 65.2198396254

Connecticut 70.4975234274, Purdue 67.014115714

Missouri 66.6046145365, Memphis 69.9964130636

Pittsburgh 72.8975484716, Xavier 64.848615134

UCLA 72.3676109557, Duke 73.1522519556

North Carolina 84.6606149747, Gonzaga 80.3910425893

Arizona St. 67.8668018941, Oklahoma 67.0441371239

==================== ELITE EIGHT =====================

Louisville 64.0822047092, West Virginia 61.7652102534

Connecticut 64.875382557, Memphis 65.9485921907

Pittsburgh 72.8027424093, Duke 70.5222034022

North Carolina 76.2640153058, Arizona St. 72.3363504426

==================== FINAL FOUR ======================

Louisville 60.7832463768, Memphis 61.4830569498

Pittsburgh 80.3421788636, North Carolina 81.0056716364

==================== CHAMPIONSHIP =====================

Memphis 73.8935857273, North Carolina 74.259537592
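The round-by-round simulation above can be sketched as follows: predict both scores for each pairing, advance the higher-scoring team, and repeat until one team is left. This is a sketch under assumptions, not the code used for the results. The function names, the four-team field, and its ratings are made up, and breaking ties in favor of the first-listed team is an arbitrary choice.

```python
import numpy as np

def predict_game(a, b, O, D):
    """Return (score_a, score_b) predicted from the latent vectors."""
    return float(O[a] @ D[b]), float(O[b] @ D[a])

def simulate_bracket(teams, O, D):
    """teams: team ids in bracket order (length a power of two).

    Returns the champion and a list of rounds, each a list of
    (team_a, score_a, team_b, score_b) tuples.
    """
    teams = list(teams)
    rounds = []
    while len(teams) > 1:
        results, winners = [], []
        # Adjacent entries in the list play each other
        for a, b in zip(teams[0::2], teams[1::2]):
            score_a, score_b = predict_game(a, b, O, D)
            results.append((a, score_a, b, score_b))
            winners.append(a if score_a >= score_b else b)  # ties go to team a
        rounds.append(results)
        teams = winners
    return teams[0], rounds

# Made-up 1D ratings for a hypothetical four-team field
O = np.array([[9.8], [9.5], [9.0], [8.5]])
D = np.array([[7.0], [7.2], [7.5], [8.0]])
champion, rounds = simulate_bracket([0, 1, 2, 3], O, D)
print(champion, len(rounds))  # 0 2
```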

In the end, these predictions were good enough to win my pool. Obviously, everything should be taken with a grain of salt, but as a doctoral student in machine learning [http://www.machinelearningphdstudent.com/], it was fun to put my money where my mouth is.

Oh, and let me know if you want the data I collected or the code I wrote to make this work. I am happy to share it.

* I also regularize the latent vectors by placing independent zero-mean Gaussian priors on them (equivalently, adding a penalty proportional to the squared L2 norm of each latent vector). This is known to improve matrix-factorization-like models by encouraging them to be simpler and less prone to picking up spurious features in the data.
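Written out, the regularized objective would look something like the following. Here s_{g,i} stands for gi_score, and the weight \lambda and the choice to share one weight between O and D are assumptions made for illustration; the post does not state them.

```latex
E(O, D) = \sum_g \left[ \left( s_{g,i} - O_{i:} \cdot D_{j:} \right)^2
                      + \left( s_{g,j} - O_{j:} \cdot D_{i:} \right)^2 \right]
        + \lambda \left( \sum_t \lVert O_{t:} \rVert_2^2 + \sum_t \lVert D_{t:} \rVert_2^2 \right)
```

In gradient descent, the penalty simply adds 2\lambda O_{t:} (or 2\lambda D_{t:}) to each vector’s gradient, pulling every rating toward zero unless the games justify it.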
