Introduction/Motivation
As the 2023-24 NBA regular season comes to an end, Nikola Jokic is poised to hoist his third MVP trophy. The Serbian superstar is widely regarded as the best basketball player in the world and has led his Denver Nuggets to a top seed in the competitive Western Conference this season, all of this coming off an NBA championship last season, the first in Nuggets history. MVP awards are indicative of individual greatness and are usually the second most cited accolade when discussing all-time great players, behind only championship rings. Nikola Jokic is my favorite NBA player. His finesse, strength, and passing mastery are beautiful to watch. But how do these basketball skills translate to cold, hard numbers? Would a statistical model be able to pick up on the attributes most influential to MVP voting? To answer these questions, as well as hone my data science and machine learning skills using Python, I decided to embark on a project with the goal of predicting this year’s MVP race.
Many argue that the MVP is primarily a narrative-driven award, and that the players with the best story behind them end up being selected by the media members who vote on it. I wondered how accurately it could be predicted using player and team statistics alone. While predicting the frontrunners for the award can be relatively easy (just pick the best players on the best teams), I assumed it would be more difficult to predict the overall winner, as races can often be very competitive. For example, last season three players (Jokic, Joel Embiid, and Giannis Antetokounmpo) all received first-place votes. They each had gaudy statistical seasons and their teams all won at least 50 games. They also each had their own particular narrative behind why they should win the award. Differentiating between them statistically would be difficult. That’s why I set out to create a machine learning model, trained on historical data, which could be used to predict who would win the award going forward. If the MVP award turned out to be predictable, it would suggest that narrative might not play as big a role in voting as the public suspects. The following will be both a technical write-up of my project and an analysis of the NBA MVP award.
Data
The first step towards creating any kind of machine learning model is collecting the requisite data. In order to predict future outcomes of something you need enough historic data on the topic to unearth trends and relationships between what you’re trying to predict and other measurable variables. For this specific project the first piece of data I needed was MVP voting results for each NBA season. I decided to start my analysis with the 1980-81 season, as that was the first season in which the league used media voting to determine the MVP award. The next step was collecting season-level data on each NBA player and team in every season since 1980-81. To get all this lovely data I used the Beautiful Soup package in Python to scrape information from various basketball-reference.com pages. This automates what would otherwise be a laborious and time-consuming task.
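As a rough illustration of the table-extraction step, here is a minimal sketch. It is not the scraper used for this project: it uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs without extra installs, and the HTML snippet, table id, and column names are all made up for the example.

```python
from html.parser import HTMLParser

# Made-up HTML in the general shape of a stats table; not taken from a real page.
SAMPLE_HTML = """
<table id="mvp">
  <tr><th>Player</th><th>Share</th></tr>
  <tr><td>Player A</td><td>0.915</td></tr>
  <tr><td>Player B</td><td>0.674</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return one dict per data row, keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = parse_table(SAMPLE_HTML)
```

With Beautiful Soup the same idea is usually a few lines shorter, since the library handles the row and cell bookkeeping. Either way, the values come back as strings and still need converting to numbers before modeling.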
Once collected, the data still needs to be manipulated into the proper format for analysis, and the specific stats likely to play a role in MVP voting must be chosen (think scoring lots of points, or being the best player on a really good team). This is known as feature engineering and selection. Using my knowledge of basketball I narrowed down the potential stats to include in the model, choosing ones that revolve around a player’s impact and their team’s success. Some of these stats (or features), like points per game, were already available in the data, while others I had to create myself. For example, it’s often mentioned that players who have already won the MVP award are less likely to win it again because of “voter fatigue.” Essentially, this theory states that the media members who vote on the award get tired of voting for the same player every year. To capture this I added a variable counting the number of MVPs each player had won before the season in question. This phenomenon is potentially what kept Jokic from winning his third MVP last season, when voters awarded it to first-time winner Joel Embiid instead.
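The prior-MVP feature described above could be built along these lines. This is a minimal sketch, not the project's code; the row layout, the function name add_prior_mvps, and the placeholder players are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (season_end_year, player, won_mvp). Names are placeholders.
seasons = [
    (2001, "Player A", True),
    (2001, "Player B", False),
    (2002, "Player A", True),
    (2002, "Player B", False),
    (2003, "Player A", False),
    (2003, "Player B", True),
]


def add_prior_mvps(rows):
    """Attach the number of MVPs each player had won before each season."""
    wins_so_far = defaultdict(int)
    out = []
    # Walk the seasons in chronological order so only earlier wins are counted.
    for year in sorted({r[0] for r in rows}):
        this_year = [r for r in rows if r[0] == year]
        for _, player, won in this_year:
            out.append({
                "year": year,
                "player": player,
                "won_mvp": won,
                "prior_mvps": wins_so_far[player],
            })
        # Update the counts only after the whole season is recorded, so a
        # season's own result never leaks into its own feature.
        for _, player, won in this_year:
            if won:
                wins_so_far[player] += 1
    return out


features = add_prior_mvps(seasons)
```

The important detail is the ordering: the count for a given season must only include awards from earlier seasons, otherwise the feature would give away the answer the model is trying to predict.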
What’s in the Model?
Through the iterative process of feature selection and engineering I wound up with ten variables to use for predicting MVP vote share. I won’t list them all here, but they are primarily advanced metrics like win shares and box plus-minus, which can be calculated using only box score data, making them available all the way back to 1980-81. One issue with a project like this is that the data is inherently imbalanced. There are far more cases of players who didn’t receive any MVP votes than players who did. This can cause problems with some machine learning methods. To alleviate this concern I chose a gradient boosting model, as these models are known to handle imbalanced data more effectively. I also chose to approach the problem as a regression problem (predicting a continuous number like vote share) rather than a classification problem (predicting a category, such as won/didn’t win MVP). Gradient boosting is effective for this project because it constructs a series of decision trees, each designed to correct the errors of the previous ones. This approach is well suited to the imbalanced nature of MVP voting data: because predictions are improved incrementally, the model can adaptively focus on the harder-to-predict players, who are often the nuanced cases in MVP selections.
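To make the "each tree corrects the errors of the previous one" idea concrete, here is a from-scratch sketch of gradient boosting for regression with squared-error loss, using one-split trees (stumps). It is an illustration only, not the model built for this project: the two features, the toy numbers, and every function name are assumptions, and a real project would more likely use a library implementation.

```python
def fit_stump(X, residuals):
    """Find the one feature/threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue  # not a real split
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    # Assumes at least one valid split exists (true for the toy data below).
    return best[1:]  # (feature index, threshold, left value, right value)


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


def mse(model, X, y):
    return sum((yi - predict(model, row)) ** 2 for row, yi in zip(X, y)) / len(y)


# Toy, made-up data: [advanced metric, team win percentage] -> vote share.
# Most rows have zero vote share, mimicking the imbalance described above.
X = [[2.0, 0.45], [3.5, 0.55], [1.0, 0.30], [6.0, 0.60],
     [8.5, 0.70], [9.0, 0.65], [4.0, 0.50], [2.5, 0.62]]
y = [0.0, 0.0, 0.0, 0.05, 0.90, 0.60, 0.0, 0.0]

model = fit_gradient_boosting(X, y)
baseline = (sum(y) / len(y), [], 0.1)  # predicts the mean for everyone
baseline_mse = mse(baseline, X, y)
boosted_mse = mse(model, X, y)
```

Comparing boosted_mse with baseline_mse shows the effect of the added stumps on the training data. It says nothing about accuracy on unseen seasons, which is what a held-out test set is for.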
To create the model I split my historical data into training and testing sets. The training data is used to fit the model, which is then evaluated for accuracy on the testing set. To evaluate the model I used mean squared error (MSE). I iteratively added and removed variables from the model with the goal of minimizing MSE. Once the model was trained, I was able to evaluate it on the test data and see which of the features were most important. Here are the top five features in terms of importance for predicting MVP vote share:
Offensive Box Plus-Minus (OBPM) is a box score-based metric that measures a basketball player's offensive contribution in terms of points per 100 possessions. OBPM extends beyond traditional stats like points per game by accounting for a player's efficiency and overall offensive impact while they are on the court. OBPM is particularly relevant to MVP discussions because it isolates the offensive aspect of a player’s game, which MVP voters tend to regard more highly than defensive performance. A high OBPM indicates that a player is not only scoring but doing so efficiently and in ways that significantly boost their team's offensive output.
Win Shares per 48 minutes (WS/48) is a stat which attempts to measure how many wins a player contributes to their team per 48 minutes played. Higher WS/48 values suggest that a player is not just accumulating statistics, but doing so in a way that significantly enhances their team's ability to win games. Given that MVP voters often favor players who lead successful teams, it makes sense that this metric is so influential. It reflects not just performance but impactful performance.
Having these two advanced metrics as the two most important features shows how helpful comprehensive one-number analytics can be in assessing players. Features three through five are much more straightforward: team win percentage, points per game, and turnovers per game. Now that the model is fully operational we can check how it performed on past seasons and predict the 2023-24 NBA MVP.
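The train/test split and MSE evaluation described earlier in this section can be sketched as follows. The write-up does not say how the split was made, so holding out whole seasons is an assumption here, as are the row layout, the function names, and the placeholder data.

```python
import random


def split_by_season(rows, test_fraction=0.25, seed=0):
    """Hold out whole seasons so one MVP race never straddles both sets."""
    seasons = sorted({r["season"] for r in rows})
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(seasons)
    n_test = max(1, round(len(seasons) * test_fraction))
    test_seasons = set(seasons[:n_test])
    train = [r for r in rows if r["season"] not in test_seasons]
    test = [r for r in rows if r["season"] in test_seasons]
    return train, test


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Placeholder data: eight seasons with two players each.
rows = [
    {"season": season, "player": player, "share": 0.0}
    for season in range(2001, 2009)
    for player in ("Player A", "Player B")
]
train, test = split_by_season(rows)

# Hand-checkable example: squared errors are 0, 0.25, and 0.25.
example_mse = mean_squared_error([0.0, 0.5, 1.0], [0.0, 0.0, 0.5])
```

Because most players have a vote share of zero, a model can score a low MSE while still ranking the top candidates poorly, so it is worth checking season-level winners alongside the error figure.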
Past Season Predictions
With the dry technical sections behind us, we can get to the results of the project. Using predicted vote share, the model picked the correct MVP in 37 of the past 43 seasons (86%)! I was surprised by this result; I assumed the model would struggle in years when the winner wasn’t cut and dried. Of the six misses, four were seasons in which the model ranked the actual MVP second. These years (1989, 1994, 1997, and 2008) were some of the closest races of all time. In 1989, the model predicted Michael Jordan to take home his second MVP, but instead it was Magic Johnson securing the award. This was the closest margin of any season predicted by the model, with the two separated by less than half a percent of predicted vote share. In 1994, Hakeem Olajuwon and David Robinson both had tremendous seasons; the model chose Robinson while the media chose Hakeem. The same goes for 1997 and 2008. In 1997 Michael Jordan was edged out by Karl Malone for an MVP that Jordan probably should’ve won, an opinion corroborated by the model. In 2008 Kobe Bryant won when many, including the model, thought Chris Paul to be more deserving.
So in all but two seasons, this simple machine learning model ranked the eventual MVP in the top two in predicted vote share. But what’s up with those two other seasons? It turns out that my model hates Steve Nash. In 2005 and 2006 Nash won back-to-back MVP awards, yet the model predicted him to finish outside the top six in voting in both seasons (9th in 2005 and 7th in 2006). It instead predicted Shaq to win in 2005 and LeBron in 2006, both of whom ended up finishing second to Nash. Luckily for the credibility of my model, these seasons have been deemed two of the most controversial finishes in the award’s history. In my mind there are a couple of things limiting Nash statistically. He was a poor defender, with a negative defensive box plus-minus dragging him down. He didn’t average over 20 points per game in either season, and in 2005 he didn’t even lead his own team in win shares. There have been accusations of racial bias by the media voters in these seasons, something the model is definitely not trained to pick up on.
In Bill Simmons’ The Book of Basketball he categorizes the MVP winners. Both 2005 and 2006 fall into the last category: Outright Travesties (along with 1997 and 2008, meaning four of the model’s six misses were deemed travesties by Simmons). Regarding Nash’s 2005 MVP campaign he says that it “initially seemed preposterous because it would have been the first time (a) a table setter won the award; (b) a non-franchise player won; and (c) a defensive liability won. Those are three pretty big leaps. Were there racial implications to the Nash/MVP bandwagon? In a roundabout way”. On the 2006 award he quotes a column of his from the time: “if you actually end up picking him [Nash], either you’re not watching enough basketball or you just want to see a white guy win back-to-back MVP’s”. Statistically, these Nash MVP seasons are an anomaly and an example of narrative driving MVP voting.
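The season-level figures in this section (how often the model's top pick won, and how often the real winner landed in the predicted top two) come down to a simple ranking tally. The following is a minimal sketch with made-up seasons and vote shares; the row layout and the function name actual_winner_ranks are assumptions, not the project's code.

```python
from collections import defaultdict


def actual_winner_ranks(rows):
    """For each season, return where the model ranked the real MVP (1 = top)."""
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row)
    ranks = {}
    for season, players in by_season.items():
        # Order players by predicted vote share, highest first.
        ordered = sorted(players, key=lambda r: r["predicted_share"], reverse=True)
        # The real MVP is the player with the highest actual vote share.
        real_mvp = max(players, key=lambda r: r["actual_share"])
        ranks[season] = ordered.index(real_mvp) + 1
    return ranks


# Made-up data: the model is right in seasons 1 and 3 and ranks the real
# winner second in season 2.
rows = [
    {"season": 1, "player": "A", "predicted_share": 0.80, "actual_share": 0.90},
    {"season": 1, "player": "B", "predicted_share": 0.40, "actual_share": 0.30},
    {"season": 2, "player": "A", "predicted_share": 0.70, "actual_share": 0.60},
    {"season": 2, "player": "B", "predicted_share": 0.65, "actual_share": 0.75},
    {"season": 3, "player": "A", "predicted_share": 0.20, "actual_share": 0.10},
    {"season": 3, "player": "B", "predicted_share": 0.90, "actual_share": 0.95},
]

ranks = actual_winner_ranks(rows)
exact_hits = sum(1 for rank in ranks.values() if rank == 1)
top_two_hits = sum(1 for rank in ranks.values() if rank <= 2)
```

Applied to the results reported here, the same tally gives 37 of 43 seasons correct (about 86%) and 41 of 43 with the real winner in the predicted top two (about 95%).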
2023-24 MVP Predictions
Above are the top-5 finishers in predicted vote share for the 2024 award. The 2023-24 season came with new eligibility criteria for awards. Players must have played at least 20 minutes in at least 65 games, with an allowance for only two “near-misses” in which they played 15-19 minutes. I applied these restrictions to my predictions even though they removed Joel Embiid from contention and I was interested to see where his injury-marred season would stack up. From fifth to first, here are the predicted MVP candidates:
Jayson Tatum - Being the best player on the league’s best team is usually a good case for picking up some MVP votes. Tatum is hindered by his elite teammates, who keep him from putting up more impressive stats. The advanced stats aren’t as favorable to him as they are to the rest of these candidates, likely due to Jaylen Brown and Kristaps Porzingis’ impact.
Shai Gilgeous-Alexander - I thought Shai would finish second in predicted share. He has been the picture of consistency for the Thunder this season, leading them to a top-3 record in the West. He averages over 30 points per game and is second in the league in WS/48 behind Jokic.
Giannis Antetokounmpo - His pairing with Damian Lillard was perhaps the most talked-about storyline coming into the season. If it’s possible to quietly average 30/11/6 on the second-best team in a conference, Giannis has done just that.
Luka Doncic - The NBA’s leading scorer is a one-man wrecking crew. He puts up absolutely ridiculous box score numbers nearly every night and his team has won over 50 games.
Nikola Jokic - Statistically nobody can touch Jokic. The advanced numbers he puts up season after season are historic. The Nuggets currently have the best record in the West and he is miles ahead of the rest of the pack in WS/48 and OBPM. This season’s MVP has been all but wrapped up since February.
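The eligibility filter applied before ranking these candidates can be written as a small check on each player's game log. This sketch encodes only the rule as summarized in this section (65 games of at least 20 minutes, with up to two games of 15-19 minutes counting toward the total); the function name and parameters are assumptions, and any other exceptions in the league's actual rule are not modeled.

```python
def is_award_eligible(minutes_by_game, min_games=65, full_minutes=20,
                      near_miss_minutes=15, max_near_misses=2):
    """Return True if a player's game log meets the games-played threshold."""
    # Games that count outright: at least `full_minutes` played.
    full_games = sum(1 for m in minutes_by_game if m >= full_minutes)
    # "Near-misses": at least 15 but fewer than 20 minutes.
    near_misses = sum(
        1 for m in minutes_by_game if near_miss_minutes <= m < full_minutes
    )
    # Only a limited number of near-misses may be counted.
    return full_games + min(near_misses, max_near_misses) >= min_games


# Hypothetical game logs (minutes played per game).
exactly_enough = [30] * 65
saved_by_near_misses = [30] * 63 + [17] * 2
too_many_near_misses = [30] * 62 + [17] * 3
short_stints = [30] * 64 + [10] * 5

eligibility = [
    is_award_eligible(exactly_enough),
    is_award_eligible(saved_by_near_misses),
    is_award_eligible(too_many_near_misses),
    is_award_eligible(short_stints),
]
```

Players who fail the check are dropped before predicted vote share is ranked, which is how a player with an injury-shortened season ends up excluded from the list regardless of his per-game numbers.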
Conclusion/Future Additions
In this project I investigated the predictability of the NBA MVP award and which factors contribute most to an MVP-level season. To do this I scraped data from basketball-reference.com, created a gradient boosting model, and applied the model to 40+ seasons of NBA data. I found advanced one-number metrics like offensive box plus-minus and win shares per 48 minutes to be among the most important predictors of the award, and the model correctly predicted the winner in 37 of 43 past seasons. Turning my attention to the future, I predicted another win for Nikola Jokic in 2024 and analyzed the cases for each of the top-5 candidates. This project was not without its limitations. The most glaring of these was the inability to quantify media narrative during the season. In the future I could scrape news articles about the NBA and conduct sentiment analysis to see which players were viewed most favorably over the course of the season. This could then increase the predictive accuracy of the model, potentially allowing it to correctly choose the winner in those close seasons it missed or pick up on the media’s enthusiasm for Steve Nash. Overall, this was an enjoyable, insightful project where I learned a lot about data science in Python as well as historic NBA MVP races.