The Correlation Conundrum

The K Zone

December 21st, 2017

The Correlation Conundrum, by Ian Joffe

If there is one existential truth about baseball players, it’s that they’re wildly inconsistent. Sometimes players will be league average or below, and then suddenly break out, for one year or for the rest of their career. Other times a strong player will have an off year or two before bouncing back (hopefully). These patterns in season-to-season output can be the result of a myriad of factors, including mechanical changes, aging curves, psychology, and luck, each of which teams employ small armies to study. While each factor will affect each player differently, the average player, over a large enough sample size, should experience similar trends in year-to-year production. My goal here is to try to measure those trends, and figure out the likelihood of a contextless player having a breakout, bounceback, or fallout season.

In this article, you’re going to see a ton of scatter plots. For a multitude of statistics, I will compare players’ output in a fourth year with their stats from each of the previous three, and with averages of the previous three. I did this by writing a Python program to sift through Fangraphs leaderboards for players that qualified four years in a row, and then comparing their numbers. While this requirement gives my group a significant bias in talent level (average WAR 3.7), the year-to-year difference in talent with these players should be similar to that of the average player, especially considering that my data goes back to 2010 and covers a large sample of 200 hitters and 113 pitchers.
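The filtering step described above might look roughly like the following. This is a minimal sketch, not the program actually used for this article: the `seasons` data layout, the function name `four_year_windows`, and the sample numbers are all made up for illustration, and it assumes the leaderboard data has already been loaded into a dictionary keyed by player and year.

```python
def four_year_windows(seasons, stat):
    """Collect (year1, year2, year3, year4) stat tuples for every player
    who qualified in four consecutive seasons.

    `seasons` is a hypothetical layout: {player: {year: {stat_name: value}}},
    containing only the seasons in which the player qualified.
    """
    windows = []
    for player, by_year in seasons.items():
        for start in sorted(by_year):
            years = [start, start + 1, start + 2, start + 3]
            # Keep the window only if the player qualified in all four years.
            if all(y in by_year for y in years):
                windows.append(tuple(by_year[y][stat] for y in years))
    return windows


# Made-up numbers, purely for illustration.
seasons = {
    "Player A": {2010: {"WAR": 3.1}, 2011: {"WAR": 4.0},
                 2012: {"WAR": 2.5}, 2013: {"WAR": 3.6}},
    "Player B": {2010: {"WAR": 1.9}, 2012: {"WAR": 2.2},
                 2013: {"WAR": 2.0}},  # missed 2011, so no valid window
}

windows = four_year_windows(seasons, "WAR")
```

Whether a player with five or more qualifying seasons should contribute several overlapping windows is a design choice; this sketch keeps them all, which is an assumption rather than a description of the original program.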

One tool I will use a lot is Pearson’s Correlation Coefficient. You can see the Wikipedia page for Pearson’s to get a grasp of the formula and mathematical intricacies, but basically it measures how closely one set of data tracks another along a straight line. In this study’s context, it will indicate how much players’ stats from one year match up with their stats from another. Strictly speaking, the coefficient (usually written r) runs from negative one to one, with negative values meaning one set tends to fall as the other rises, but for this study the range that matters is zero to one, lower decimals representing a weaker correlation, and higher ones representing a stronger correlation. I’ll call each coefficient a “P-Value” for short (not to be confused with the p-value of a significance test). Two identical data sets will have a P-Value of one, and two truly random, unrelated data sets will have a P-Value near zero. Another way to look at Pearson’s is as a numeric representation of each graph (or one could look at the graph as a visual for Pearson’s). A more stable scatter plot, where the points make a straight line, will have a higher P-Value, and a more random plot, with points scattered all over the place, will have a lower P-Value. So, without further ado, here is the data I collected:
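For readers who would rather see the calculation than the formula, here is a minimal sketch of Pearson’s coefficient in plain Python. The function name `pearson_r` and the sample lists are my own illustrations. Note that the coefficient can also come out negative, down to negative one, when one data set tends to fall as the other rises.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance term: how the two lists move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Spread of each list around its own mean.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: this divides by zero if either list is constant.
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only.
identical = pearson_r([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])  # very close to 1
reversed_ = pearson_r([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])  # very close to -1
```

In this study, `xs` would be a stat from an earlier year (or an average of earlier years) and `ys` the same stat in the fourth year.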

For Batters:







And for the Pitchers:




So, let’s start by examining the overall trends. As predicted, among the three individual years, numbers from one year ago best predict the output for this year. But, in almost all cases, stats from two and three years prior have about equal value to each other. In the few exceptions, like batter WAR, the previous two years had similar correlations, with a significant drop-off after that. With every statistic, though, the averages were the most predictive of the coming season’s outcome. The three-year averages were better than the two-year averages, but consistently only very slightly better. In fact, I’d say that for the everyday fan, taking a two-year average is your best bet with minimal effort. The ideal player predictor, then, would probably be a two-year average with a slight weight on the most recent season, but when I tried to use machine learning to determine the weight, I got some really weird results. For example, on the three-year average, it always gave the two most recent seasons the same weights, and the stats from three years ago the biggest weights. Perhaps the machines are already smarter than us, and my computer is working in ways I cannot understand. But, a computer uprising notwithstanding, it’s a safe bet to say that my machine learning algorithm was a failure. So, I can’t provide you with exact weights, but as a general rule, my research shows that a two-year average is best for predicting whether or not a player will succeed in a coming season.
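To make the “two-year average with a slight weight on the most recent season” idea concrete, here is a minimal sketch. The function name `weighted_two_year` is hypothetical, and the default weight of 0.6 is an arbitrary placeholder, not a value derived from my data (as noted above, I could not pin down exact weights).

```python
def weighted_two_year(last_year, two_years_ago, recent_weight=0.6):
    """Blend the last two seasons of a stat into a single projection.

    recent_weight is the share given to the most recent season;
    0.5 reproduces a plain two-year average. The default of 0.6 is
    an assumed placeholder, not a fitted value.
    """
    if not 0.0 <= recent_weight <= 1.0:
        raise ValueError("recent_weight must be between 0 and 1")
    return recent_weight * last_year + (1.0 - recent_weight) * two_years_ago


# Illustrative values only: a hitter with 4.0 WAR last year and 3.0 the year before.
projection = weighted_two_year(4.0, 3.0)
plain_average = weighted_two_year(4.0, 3.0, recent_weight=0.5)
```

With the placeholder weight, the projection simply lands a little closer to last year’s number than the plain average does.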

A few statistics had more irregular results than others. WAR, for example, had similar correlations between one year ago and two years ago. I would probably attribute this to the fact that it is composed of so many different statistics. If each of those stats has its own pattern of positive and negative regression, at least in part independent of the other components of WAR, then it would stand to reason that WAR as a whole is more variable year-to-year than each of those individual components. Therefore, last year’s WAR makes a worse predictor of the coming year, and a more similar predictor to two years ago, at least in comparison to any of those components. Fangraphs’ defense metric also had a somewhat strange outcome, where each year was a similar predictor, instead of last year being the best predictor, but that makes sense for defense in particular, because it is more ability-driven than hitting or pitching, which are more luck-driven. Earned Run Average also experienced strange correlation patterns. This could have a similar explanation to WAR, a run being composed of many different factors, but all in all I would attribute this strange pattern, along with overall low correlations, to ERA being a garbage statistic, even in my large sample size. However, among all of these different exceptions, it still held true that a two-year average is the best tool to use, by a meaningful margin over any single year.

One stat that surprised me was HR/9 for pitchers. Sabermetricians have long heralded homers as one of the “three true outcomes,” but it showed both an overall low correlation and a weird pattern, with the one-year and three-year correlations being equal, and the two-year correlation being the lowest. Additionally, the averages only improved the correlation by a little. Unlike the other exceptions, I can’t easily explain away why this is happening. Maybe statisticians should consider relying less on home runs given up by pitchers to evaluate an arm.

HR/9 turned out to be a notable exception, but for predicting most any other stat (power, contact, speed, control, or command), to gain an advantage, take a two-year average.







  1. Mike says:

    You don’t understand the data you’re seeing here, young man.

