Computing Pythagorean Wins For The WNBA
Like many people out there, I only recently got into the WNBA. As with most new things I encounter, I wanted to see if I could get a quantitative understanding of what’s going on. With the new season coming up soon and a brand new team being created in my area (go Valkyries!), I wanted to see what sort of prior I could come up with for each team’s rating.

As a starting point, I wanted to look at the history of the league and understand how things change for teams from year to year. To do so, I scraped the entire history of the league to get every regular-season and postseason game, including scores and participants. One modification I made to the data was to normalize the teams so that a “new” team that was simply a relocated team with a changed name was treated as the same team. (Side note: the Wikipedia rabbit holes this led me down were pretty interesting. The number of teams that were created and shuttered since the league began play in 1997 is a lot higher than I would have guessed!)

The first thing I looked at was the year-to-year correlation between win percentages for a given team. That turns out to be 0.396. This is actually higher than I would have expected, given that we are treating each team as a black box and totally ignoring injuries, drafts, trades, coaching changes, etc.
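For anyone who wants to reproduce this kind of number, here is a minimal sketch of the two steps involved: folding relocated teams into one franchise and correlating each season's win percentage with the next. It is not the code behind the 0.396 figure. The `FRANCHISE` mapping covers only a few relocations as an illustration, the function names are my own, and the win percentages in `toy_win_pct` are invented for the example rather than taken from real records.

```python
from math import sqrt

# Illustrative subset of relocations/renames, mapped to the current team name.
FRANCHISE = {
    "Utah Starzz": "Las Vegas Aces",
    "San Antonio Silver Stars": "Las Vegas Aces",
    "San Antonio Stars": "Las Vegas Aces",
    "Detroit Shock": "Dallas Wings",
    "Tulsa Shock": "Dallas Wings",
    "Orlando Miracle": "Connecticut Sun",
}


def normalize(team):
    """Return the franchise name used to track a team across relocations."""
    return FRANCHISE.get(team, team)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def year_to_year_pairs(win_pct):
    """Pair each (team, season) win percentage with the same franchise's next season.

    win_pct maps (team_name, season) -> win percentage.
    """
    by_franchise = {(normalize(t), s): w for (t, s), w in win_pct.items()}
    prev, nxt = [], []
    for (team, season), w in by_franchise.items():
        following = by_franchise.get((team, season + 1))
        if following is not None:
            prev.append(w)
            nxt.append(following)
    return prev, nxt


# Invented win percentages, only to exercise the functions.
toy_win_pct = {
    ("Tulsa Shock", 2015): 0.5,
    ("Dallas Wings", 2016): 0.4,
    ("Dallas Wings", 2017): 0.6,
    ("Connecticut Sun", 2016): 0.3,
    ("Connecticut Sun", 2017): 0.7,
}

prev, nxt = year_to_year_pairs(toy_win_pct)
print(len(prev), "season pairs, r =", pearson(prev, nxt))
```

Note that the Tulsa 2015 and Dallas 2016 rows only pair up because of the normalization step; without it the relocation would look like one team folding and another appearing.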
Is there a way to do better with the same black box approach?
One thing we could try is Pythagorean expectation. The idea is that a team’s points scored and points allowed tend to be more predictive of its future win percentage than simply extrapolating from its prior-season win percentage. The formula is as follows:
$$WinPercentage = \frac{P_S^x}{P_S^x + P_A^x}$$
Here, $WinPercentage$ is the Pythagorean expectation of a given team’s win percentage based on their points scored ($P_S$) and points allowed ($P_A$), and $x$ is an exponent whose value depends on the sport being evaluated. If we want to do this for the WNBA, we first need to figure out what the proper value of $x$ should be. Rather than do the algebra here to derive $x$, I will simply refer you to this article, which goes through the process of working out the value of $x$ for a cricket league. The tl;dr is that the formula above implies $\frac{W}{L} = \left(\frac{P_S}{P_A}\right)^x$, where $W$ and $L$ are a team’s wins and losses, so taking logarithms of both sides gives:
$$x = \frac{\log\left(\dfrac{W}{L}\right)}{\log\left(\dfrac{P_S}{P_A}\right)}$$
To estimate $x$, all that is required is to compute $\log\left(\frac{W}{L}\right)$ and $\log\left(\frac{P_S}{P_A}\right)$ for each team and season. Once we do that, we can compute a best fit line of the former against the latter, and the slope will be our $x$ value. The following shows a scatterplot of our computed values as well as the best fit line with the derived coefficients:
As we can see from the slope of the best fit line, the value of $x$ is almost exactly 9. Now that we have this value, we can compare the Pythagorean expectation approach against the prior-season win percentage to see which has more predictive value.
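The slope fit described above can be sketched with an ordinary least-squares line through the log ratios. This is a hedged illustration rather than the code that produced the value of 9: `fit_exponent` is a name I made up, the input layout is an assumption, and dropping team-seasons with zero wins or zero losses (where the log is undefined) is my own choice. The records at the bottom are synthetic, built to follow an exponent of exactly 9 so the fit can be sanity-checked.

```python
from math import log


def fit_exponent(records):
    """Least-squares fit of log(W/L) against log(P_S/P_A).

    records: iterable of (wins, losses, points_scored, points_allowed),
    one entry per team-season. Returns (slope, intercept); the slope is
    the estimate of the Pythagorean exponent x.
    """
    # Assumption: skip winless or undefeated seasons, since log(W/L) is undefined.
    usable = [r for r in records if r[0] > 0 and r[1] > 0]
    xs = [log(ps / pa) for _, _, ps, pa in usable]
    ys = [log(w / l) for w, l, _, _ in usable]

    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic records where W/L equals (P_S/P_A)**9 by construction,
# so the fitted slope should come out close to 9.
synthetic = [(ratio ** 9, 1.0, 80.0 * ratio, 80.0) for ratio in (0.9, 1.0, 1.1, 1.2)]
slope, intercept = fit_exponent(synthetic)
print("slope:", slope, "intercept:", intercept)
```

An intercept near zero is what the formula predicts, since a team that scores exactly as much as it allows should win about half its games.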
Does the Pythagorean approach out-predict using prior year records?
It turns out that it does! Whereas we got a year-to-year correlation in win percentages of 0.396, a season’s Pythagorean expectation correlates with the following season’s win percentage at an improved rate of 0.457. While this does represent an improvement, the scatterplot below highlights that there is a lot left to be done if we want forward projections of team performance that we can feel confident in.
Future Work
To get better than this, we are going to have to dig in further to understand how teams are winning and losing, what players are doing to contribute to those outcomes, and the impact of coaching. One aspect of player and coaching impact that gets entirely glossed over by the approaches discussed so far is that players get traded and coaches get fired mid-season all the time, yet we are binning everything into buckets of team + season. Similarly, we are ignoring roster changes that result from injury (e.g., the 2024 LA Sparks looked very different before vs. after Cameron Brink’s season-ending ACL tear). We also need to understand what kinds of positive or negative variance players and teams experienced so that we can try to anticipate regression.