– The K Zone –
December 3rd, 2018
One in 49 Million, by Ian Joffe
The hitting streak is among the most exciting phenomena of the game of baseball. We like to think of them as incredible feats, accomplished only by a unique combination of mental and physical skills manifesting themselves over a month-long period. There is another view, however, on the creation of hitting streaks: that they are actually statistical likelihoods which are all but bound to occur within a given period of time, controlled by data’s randomness alone. Both explanations seem reasonable. The perfectly robotic sabermetrician would argue for the latter, for in a game driven by statistics, things like the hitting streak can be predicted rather perfectly using data and probability. But the first argument, too, has logical merit. Players are human, and it’s very possible that they are able to get “locked in” to some mechanical or psychological state that increases their odds of getting hits in each game.
To determine which argument is true, and whether hitting streaks exist as anything more than statistical illusions, I compared data about real hitting streaks from Baseball Reference’s Play Index to simulated data from a Python program I wrote that determines the odds of certain hitting streaks occurring over a given time period. If the real MLB data matches the statistically expected data, it is reasonable to assume that real hitting streaks are based in nothing more than statistical probabilities, but if the MLB data is distinguishable from the expected results, it would appear that there is something special going on with players who have lengthy streaks.
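To make the idea of simulated streak data concrete, here is a minimal sketch of how such a simulation could look. It is not the program used for this article: the function name `simulate_streaks` and the parameters `n_games`, `hit_prob`, and `seed` are hypothetical, and it assumes every game is an independent trial with the same chance of producing a hit.

```python
import random
from collections import Counter


def simulate_streaks(n_games: int, hit_prob: float, seed: int = 0) -> Counter:
    """Count hitting-streak lengths over n_games independent simulated games.

    Assumption: each game has the same probability hit_prob of containing a hit.
    """
    rng = random.Random(seed)
    streaks = Counter()
    current = 0  # length of the streak currently in progress
    for _ in range(n_games):
        if rng.random() < hit_prob:
            current += 1
        else:
            if current > 0:
                streaks[current] += 1
            current = 0
    if current > 0:
        streaks[current] += 1  # close a streak still open at the end
    return streaks


# Illustrative run; 0.6 is a placeholder hit probability, not a measured value.
counts = simulate_streaks(n_games=100_000, hit_prob=0.6)
for length in sorted(counts)[:5]:
    print(length, counts[length])
```

Comparing the simulated counts against real counts of the same streak lengths is one way to see whether real players behave like this kind of coin-flip model.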
To find the number of expected streaks in a given period of time, one must apply a geometric distribution, which is based on a string of events, each of which is labeled a success or a failure. A success, whose probability is denoted by $p$, is in this case a game played without a hit. A failure, then, is a game with a hit. To find the number of trials (games) it takes for a batter to not get a hit, or the number of consecutive games with a hit before a batter fails to get one, one applies two conditions. First, a batter must fail to get a hit in the game in question ($p$), and second, the hitter must get a hit in all previous games ($(1-p)^{x-1}$), where $1-p$ is the probability of a hit (or more specifically, the odds of not not getting a hit), and $x$ is the number of games in the streak, the last game being the one without a hit. So, the probability that both conditions occur is the product of the two, or $(1-p)^{x-1}p$. The data that I used extends from 2000-2018, over which the MLB batting average was .260. The average player had 3.134 at bats per game during that period (although this is a very, very slight overestimate because in order to avoid adding too many games without at bats for players like AL pitchers, I had to purge from my data players with fewer than one at bat per game on average). So $p$, the probability of not getting a hit in a game, equals $(1-.260)^{3.134}$, or 0.389. From this, I was able to plug in and find the expected number of each length of hitting streak.
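The arithmetic above can be sketched in a few lines of Python. The function names `no_hit_probability` and `streak_probability` are hypothetical and chosen for illustration. The sketch assumes, as the calculation above does, that every at bat is an independent trial at the league batting average.

```python
def no_hit_probability(batting_average: float, at_bats_per_game: float) -> float:
    """Probability of a hitless game, treating each at bat as independent."""
    return (1.0 - batting_average) ** at_bats_per_game


def streak_probability(p: float, x: int) -> float:
    """Geometric probability: x - 1 games with a hit, then one hitless game."""
    if x < 1:
        raise ValueError("x must be at least 1")
    return (1.0 - p) ** (x - 1) * p


p = no_hit_probability(0.260, 3.134)
print(round(p, 3))  # about 0.389, matching the value in the text

# Model probability for a few streak lengths (the streak itself lasts x - 1 games).
for x in (2, 11, 21):
    print(x - 1, streak_probability(p, x))
```

Multiplying each probability by the total number of streaks in the sample would give an expected count per length. That total is not stated here, so it is left out of the sketch.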
To find the real number of hitting streaks since 2000, I wrote a script that put together data from Baseball Reference’s Play Index. The longest hitting streak in that period is Dan Uggla’s 33-game streak back in 2011, so I calculated the odds of each streak length up to there. Here are the results:
Looking at the shorter streaks where length < 9, the expected values are actually greater than the observed values, which suggests that getting in a short groove has no psychological or mechanical advantage. Having a three-game hitting streak does not make a player any more likely to have a four-game hitting streak. So, where did the extra frequencies go? For starters, the observed number of one-game streaks is much higher than expected, which is strange. I have no explanation for that. But a lot of frequencies went to longer streaks as well. Here’s the graph zoomed in on lengths > 10:
There’s a critical point after about 10 games where the observed frequencies overtake the expected frequencies, and they do so by a very significant amount. The chi-square P-value was way under 0.001. That’s probably because this effect becomes even more exaggerated as the hitting streaks get longer. Here’s the data for hitting streaks longer than 20 games:
The observed values start to lose their perfect exponential curve because of the smaller sample, but the effects are still very clear. Very, very few hitting streaks over 20 games are expected. Yet, many occurred. In total, the model expected 10.28 hitting streaks longer than 20 games in the 19-year period. We got 81 – an increase by nearly a factor of eight. The model predicted 1.49 hitting streaks of 23 games. The actual value: 14. The odds of a hitting streak like Dan Uggla’s occurring during the new millennium were just over 1 in 100. I would say we should consider ourselves lucky to be able to see such incredible statistical feats – and we are – but this is clearly more than luck. There is no way that so many of these lengthy hitting streaks occurred in a non-mental, non-physical game of randomness. While there is little evidence to suggest a 4-game hitting streak is any more likely than expected, it is clear that players are far more likely to go on hitting streaks over 20 games than statistics would expect. A player who already has a hit in 22 games is much more likely than expected to get a hit in the 23rd. This is probably because there’s little pressure involved on a short streak. I doubt a hitter would even be aware that they have a hit for four games in a row. But as the streaks climb above 10 and 20 and the media starts to pay attention, it’s impossible not to be aware of them. The players who perform well on the big stage start to improve. Based on the data, we can be all but certain that the mental factor is there.
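A short sketch helps show what "expected" means in that comparison. Under the geometric model, the chance of extending a streak by one more game is the same no matter how long the streak already is. The helper names `prob_streak_at_least` and `prob_extend` are hypothetical, and the sketch assumes the constant hitless-game probability of 0.389 from earlier in the article.

```python
P_NO_HIT = 0.389  # per-game hitless probability quoted in the article


def prob_streak_at_least(k: int, p: float = P_NO_HIT) -> float:
    """Model probability that a streak reaches at least k games: (1 - p)^k."""
    return (1.0 - p) ** k


def prob_extend(k: int, p: float = P_NO_HIT) -> float:
    """Model probability of a hit in game k + 1, given hits in the first k."""
    return prob_streak_at_least(k + 1, p) / prob_streak_at_least(k, p)


# The model gives the same answer for a 3-game and a 22-game streak.
print(prob_extend(3), prob_extend(22))

# Ratio of observed to expected streaks longer than 20 games (figures from the text).
observed, expected = 81, 10.28
print(observed / expected)
```

Because the model is memoryless, any real tendency for long streaks to be extended more often than about 61 percent of the time shows up as an excess of observed long streaks over expected ones.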
I found this a rather relieving conclusion. Some of my previous articles, like those about taking revenge on old teams, or players on their birthdays, found little evidence for a mental factor in baseball. They suggested that the game is perfectly predictably random. This data, however, suggests otherwise. It shows that there is an element to how hitters perform above the statistics. It’s still incredibly scientific – my opinion is that psychology and next level sports medicine will be the next Moneyball-esque breakthrough in the game – but it shows that players are more than numbers. I love statistics, which you know because you just read my article, but it’s still nice to think that players operate on a field above the random, and from this, one can argue that they do.
Of course, I couldn’t finish an article about hitting streaks without mentioning Joe DiMaggio. His 56-gamer in 1941 is still the gold standard for hitting streaks, and feels as unbreakable as a record gets. The purely statistical odds of any player having such a streak since the dead-ball era are 1 in 49,000,000. In other words, he did something in one short century that should have taken about five billion years, roughly the age of the Earth, to accomplish. Yeah, DiMaggio was pretty great.
If you found this article interesting, make sure to follow The K Zone on Twitter and be the first to know when we post brand new research and interviews. Thanks!
Ms. Christine Robbins
Statistics How To
Image Attributed to:
The Associated Press
You did all the numbers assuming that everyone is the average MLB hitter. Your model literally just took the average BA and number of at bats. But that average is made by an uneven distribution of subpar and elite players. Throw a basket of elite players into your model and you’ll see where the longer hit streaks come from.
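The point raised in this comment can be checked with a small sketch. The three batting averages below are hypothetical values chosen only so that their mean is .260. They are not real player data, and the function names are made up for illustration. The sketch compares the chance of a streak reaching 20 games for one "average" hitter against the average chance across a mixed pool of hitters.

```python
AT_BATS_PER_GAME = 3.134  # figure quoted in the article


def hit_game_probability(batting_average: float) -> float:
    """Chance of at least one hit in a game, treating at bats as independent."""
    return 1.0 - (1.0 - batting_average) ** AT_BATS_PER_GAME


def prob_streak_at_least(batting_averages: list[float], k: int) -> float:
    """Average probability, across the pool, of hitting in k straight games."""
    probs = [hit_game_probability(ba) ** k for ba in batting_averages]
    return sum(probs) / len(probs)


average_only = prob_streak_at_least([0.260], 20)
# Hypothetical pool with the same mean batting average of .260.
mixed_pool = prob_streak_at_least([0.220, 0.260, 0.300], 20)

print(average_only, mixed_pool, mixed_pool / average_only)
```

In this sketch the mixed pool produces long streaks more often than the single average hitter, because the better hitters contribute far more to the tail than the weaker hitters take away. How much of the real gap this explains would depend on the actual spread of batting averages and at bats, which the sketch does not model.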