The Bully Factor
Bill James has a new article up entitled "The Bully Factor" that examines pitcher performance on the basis of quality of the opponent. The article was prompted by an inquiry from a subscriber to his website, but Bill says the idea of breaking down a pitcher's performance by quality of opponent originated in 1969 when Bill argued to a college buddy that Marichal was better than Gibson and his buddy (a big Cards fan) responded by insisting that Marichal tended to beat up on the weak sisters in the league (in the '60s NL, that would be teams like the Mets and Astros). Bill's response at the time, without knowing any of the actual facts, was "bullshit."
Bill finally got around to crunching the numbers and posted his spreadsheet online for downloading (clicking on the preceding link will automatically download the Excel spreadsheet to your hard drive). I believe the article itself is only available at Bill's subscriber-only website. Basically, Bill divided teams into four quality categories based on their aggregate records by each decade and then broke down a pitcher's starts against teams in each category. The findings are interesting, if not all that significant. Bill himself makes no great claims as to the significance of his research in judging pitchers. Bill doesn't really make this point but I will: given two pitchers with identical records, one should prefer the pitcher who pitches better against A-list competition. Why? Because if the pitcher's team is in contention, games against other contenders are two-fers. A win against another contender is not only a win for your team but a loss for the other contender.
Here's how Bill described his methodology in arriving at a single metric he refers to as "the Bully Factor":
How do we measure the extent to which each pitcher dominated inferior competition? I looked at six factors relative to that issue, which were: 1) The percentage of the pitcher’s wins that came over “D” quality competition, 2) The difference in the pitcher’s winning percentage versus “A & B” teams and his winning percentage versus “C & D” teams, 3) The difference in the pitcher’s ERA versus “A & B” teams and his ERA versus “C & D” teams, 4) The difference in the pitcher’s overall effectiveness RANK (1 to 702) versus “A & B” teams and his overall effectiveness rank versus “C & D” teams, 5) The difference in the pitcher’s overall effectiveness rank (1 to 702) versus “A” teams compared to his overall effectiveness rank versus all teams, and 6) The player’s career win total versus “A & B” teams compared to his career wins versus “C & D” teams.
I made up an index of these six indicators, which I called the “Bully Factor”; a high Bully Factor indicates that the pitcher pitched much better against weak competition than against strong competition—much better, or in some cases much more. Later, I’ll list the pitchers at the top and bottom of the chart, but first, let’s look at the guys with the most “normal” data, the guys in the center of the chart.
So who are the biggest bullies among notable pitchers? Well, to begin with, Bill was pretty much on target with his "bullshit" response to his buddy's assertion that Marichal was a bully and Gibson wasn't: Marichal generally performed better against the quality competition, whereas Gibson had a greater tendency to beat up on the weak sisters in the league. Bill is careful not to draw any grand conclusions from this fact, as well he should be, because Gibson's spectacular big-game record certainly refutes any argument that Gibson couldn't step it up against good teams in big games. But the fact remains that as between the two Gibson did more padding of his stats against the bad teams than Marichal did.
Here are the biggest bullies among the more notable pitchers of the last 60 years (Bill's data covers pitchers with 100 or more starts since 1952): Bob Turley, Denny McLain, C.C. Sabathia, Early Wynn, Jack Morris, Justin Verlander, Roy Oswalt, Bob Lemon, Tim Wakefield, Ken Holtzman, Herb Score, Mel Parnell, Joe Niekro, Camilo Pascual, Derek Lowe and Mark Buehrle. Zack Greinke also has a pretty big Bully Factor so far in his brief career.
Some other notable pitchers who had Bully Factors well above average are Sam McDowell, Luis Tiant, Tim Hudson, Dave Stewart, Mike Hampton, J.R. Richard, Steve Rogers, Don Newcombe, Vida Blue, Bob Gibson, Andy Pettitte, David Wells, Randy Johnson and Bert Blyleven.
Notable pitchers with very low Bully Factors include Frank Lary (aka "the Yankee Killer"), Carlos Zambrano, A.J. Burnett, Kenny Rogers, Bartolo Colon, Jarrod Washburn, Phil Niekro, Dave Stieb, Floyd Bannister, Bob Welch, Frank Viola, Mel Stottlemyre, John Lackey, Al Leiter and Bret Saberhagen.
Some other notable pitchers who had Bully Factors distinctly below average are Bret Saberhagen, Fernando Valenzuela, Nolan Ryan, Tommy John, John Candelaria, Juan Marichal, Mickey Lolich, Cliff Lee, Robin Roberts, Sandy Koufax, John Smoltz, Tom Glavine, Mike Cuellar, Dennis Eckersley, Ron Guidry, Dwight Gooden, Dave McNally, John Tudor, Johan Santana, Curt Schilling and Frank Tanana.
Of most interest to me were the pitchers who performed particularly well against the A category teams. Generally speaking these teams had winning percentages over .550 for the decade. There are five pitchers who really stand out, compiling excellent winning percentages and ERAs against A category teams: Whitey Ford, Sandy Koufax, Bret Saberhagen, Pedro Martinez and Johan Santana. Against A category competition, each had a winning percentage above .630 and an ERA below their career ERA.
There are ten other pitchers who had a winning percentage above .570 against A quality competition (min. 25 wins against A competition): Dwight Gooden, Freddy Garcia, Roy Halladay, Jack Sanford, David Wells, Jim Maloney, Juan Marichal, Tom Glavine, Ron Guidry and John Candelaria.
Categorizing teams based on their records over a decade rather than annual records will produce some anomalies. Just for example, a pitcher who just came into the AL within the last few years will have his games against the Tampa Bay Rays thrown into the D category of weak sisters even though the Rays have been anything but weak the last few years. As another example, Bill's data shows Saberhagen with a .570 winning percentage against teams with decade records above .500 and .601 against teams with decade records below .500. Splits based on annual team records, however, show that Saberhagen's numbers are flipped: he had a .606 W% against teams with records above .500 and a .571 W% against teams under .500.
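To make the decade-versus-annual point concrete, here is a minimal sketch of how the same set of decisions can split differently depending on which record is used to grade the opponent. The team "X", its records and the two decisions are made up for illustration, and `split_wpct` is a hypothetical helper, not anything from Bill's spreadsheet.

```python
def split_wpct(decisions, opponent_wpct):
    """Winning percentage against opponents above and below .500.

    decisions: iterable of (opponent, season, won) tuples.
    opponent_wpct: function (opponent, season) -> the winning percentage
    used to grade that opponent (annual or decade, caller's choice).
    """
    tally = {"above": [0, 0], "below": [0, 0]}  # side -> [wins, losses]
    for opponent, season, won in decisions:
        side = "above" if opponent_wpct(opponent, season) > 0.500 else "below"
        tally[side][0 if won else 1] += 1
    # Skip a side that has no decisions at all.
    return {side: w / (w + l) for side, (w, l) in tally.items() if w + l}

# Made-up data: team X was bad in 2005, good in 2008, poor over the decade.
annual = {("X", 2005): 0.414, ("X", 2008): 0.599}
decade = {"X": 0.450}
decisions = [("X", 2005, True), ("X", 2008, False)]

by_annual = split_wpct(decisions, lambda team, season: annual[(team, season)])
by_decade = split_wpct(decisions, lambda team, season: decade[team])
```

Graded annually, the loss to the 2008 club counts against a good team; graded by decade, both decisions land in the "below" bucket.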
Still, as always, James is provocative. And some of the findings are very striking. Ford and Koufax were great against top flight competition. Jack Morris and Justin Verlander really feasted on the worst teams. Draw your own conclusions as to the significance of these facts.
ERA+: Looking Behind The Stat
ERA+ is a great analytical tool. It permits comparisons of ERAs across different eras and different run environments by adjusting for general league scoring levels and park factors. Its advantages over simple ERA are obvious. It is the single pitching statistic most often regarded as the definitive tool for analyzing pitching careers. Some stat geeks have become so enamored of ERA+ and its derivatives that they deny certain baseball truisms that might call into question the validity of judging pitchers primarily on the basis of ERA+. They tend to deny the concept of clutch pitching, despite the fact that certain pitchers evince a tendency to pitch measurably better or worse in high leverage situations (see this post for a discussion of Leveraged ERA+, or LevERA+, which weights runs allowed (and runs prevented) based on the impact on win expectancy). They also tend to discount the theory that most pitchers "pitch to the score" by changing their pitching approach depending on the game situation.
A host of statistics confirm that most pitchers do indeed pitch to the score. Pitchers as a group subscribe to the theory that when granted a big lead it is better to put the ball over the plate and make the opposition hit their way back into the game rather than risking a rally fueled by bases on balls. Virtually all successful pitchers walk fewer batters when working with a significant lead. Virtually all pitchers, successful or not, surrender more runs when working with tremendous run support from their teammates. Baseball-Reference.com recently added pitching splits based on team run support, showing a pitcher's performance in games in which they received between 0 and 2 runs of support, 3 to 5 runs of support and 6 or more runs of support. The vast majority of pitchers will surrender more runs on average when working with 6 or more runs than they do when working with 5 or fewer runs. The run-support splits further confirm that variations in ERA in high run-support scenarios have little or no impact on a pitcher's winning percentage in these scenarios, with good pitchers winning between 90% and 95% of these decisions regardless of how much their ERAs increase with great run support.
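For anyone who wants to reproduce this kind of split from game logs, a minimal sketch follows. It assumes each start is recorded as a (runs of support, earned runs, innings) tuple with innings as a true decimal (6.333, not the box-score 6.1); the function names and the sample starts are made up for illustration and are not Baseball-Reference's own method.

```python
from collections import defaultdict

def support_bucket(runs_of_support):
    """Map a run-support total to the three split labels: 0-2, 3-5, 6+."""
    if runs_of_support <= 2:
        return "0-2"
    if runs_of_support <= 5:
        return "3-5"
    return "6+"

def era_by_support(starts):
    """ERA within each run-support bucket.

    starts: iterable of (runs_of_support, earned_runs, innings_pitched).
    """
    totals = defaultdict(lambda: [0, 0.0])  # bucket -> [earned runs, innings]
    for support, earned_runs, innings in starts:
        bucket = totals[support_bucket(support)]
        bucket[0] += earned_runs
        bucket[1] += innings
    return {label: 9 * er / ip for label, (er, ip) in totals.items() if ip > 0}

# Made-up starts, for illustration only.
sample = [(1, 2, 9.0), (4, 3, 9.0), (7, 4, 9.0), (8, 5, 9.0)]
splits = era_by_support(sample)
```

In this toy sample the ERA climbs from bucket to bucket, which is the pattern described above for most pitchers.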
These statistics don't reveal defects in the ERA+ statistic but rather reveal the limitations of the statistic. They reveal that the ERA+ of pitchers who are blessed with generally superior run support, like Jack Morris, may be misleading. In games in which Morris received six or more runs of support he allowed 18% more earned runs than he did when working with 3 to 5 runs of support. This didn't prevent Morris from winning 93.3% of his decisions in these games, approximately the same percentage as pitchers who had much smaller increases in ERA in similar situations. The incremental runs allowed by Morris in high run-support games significantly inflated his ERA and ERA+ but had virtually no impact on game outcomes or his teams' fortunes.
Morris is representative of most elite starting pitchers in this regard. They tend to allow significantly more runs when they have good run support to work with. The following list shows the percentage by which these pitchers' ERAs increased or decreased in games in which they received 6 or more runs of support (relative to games in which they received 5 or fewer runs of support).
Obviously, for a given ERA (or ERA+) the optimal distribution of runs allowed by a pitcher would have the pitcher allowing the fewest runs in games in which his run support was weak and the most runs in games where his run support was strong. Pitchers who pitch relatively better where their run support is particularly weak or strong see little benefit to their winning percentages; even the best pitchers in the lowest run scoring environments will win less than 25% of their decisions when they receive 2 or fewer runs of support, and even average pitchers will generally win nearly 90% of their decisions in games in which they receive 6 or more runs of support. The impact of a pitcher's performance is greatest in those games where his run support is in the middle range - three to five runs of support - and those pitchers who pitch well in those games see the most beneficial impact on their winning percentages.
Palmer pitched in a slightly lower run scoring environment in Baltimore, and accordingly 3 to 5 runs represented slightly better run support than the same number of runs when scored in the parks Blyleven pitched in during the '70s. However, this potential mitigating factor is offset by the fact that Blyleven received better run support overall when receiving 3 to 5 runs of support, getting an average of 3.93 runs/game as compared to Palmer's 3.77 runs/game. After adjusting for the different scoring environments, the run support received by each within the 3 to 5 run category is almost precisely the same. The huge disparity in their winning percentages when receiving between 3 and 5 runs of support cannot be explained by disparate run support, and is almost solely a function of the fact that Palmer pitched significantly better when receiving middling run support.
Blyleven had a slightly better ERA+ than Palmer when receiving 6 or more runs of support, but winning percentage in this category is largely inelastic (meaning that it doesn't vary much even with significant fluctuations in ERA+). Palmer lost only one such game in the '70s, Blyleven lost two. Blyleven also had a better ERA+ than Palmer when receiving between 0 and 2 runs of support, but Palmer had a significantly better winning percentage, .267 to Blyleven's .211. Palmer's advantage when receiving weak run support can be explained by Palmer's far superior record in one-run games, which will constitute a significant percentage of games in which a pitcher receives two or fewer runs of support.
As the Palmer/Blyleven comparison demonstrates, relatively similar ERA+ figures can mask significant differences in pitcher performance. Although Palmer's ERA+ in the '70s was only marginally better than Blyleven's, Palmer's substantially better performance in high leverage situations and better performance in those games where pitcher performance is most likely to affect the outcome (i.e., the 3 to 5 run support category) produced a substantially better W-L record.
Ron Guidry. Guidry pitched much better in higher leverage situations, compiling a LevERA+ more than five points higher than his nominal ERA+. Guidry also pitched significantly better in games where he received 3 to 5 runs of support, compiling an ERA+ in those games of 130.5 as compared to an overall ERA+ of 119 and an ERA+ of 109.4 in games in which he had run support of 6 runs or more.
John Tudor. Tudor had nearly a 129 LevERA+ (as compared to a 124 ERA+). He also excelled in matching his performance to the game scoring environment, pitching his best in lower scoring games while allowing more runs in high run support scenarios.
Whitey Ford. Ford's LevERA+ of 137 was even more impressive than his outstanding 133 ERA+. Ford also allowed nearly 9% fewer runs when receiving 5 or fewer runs of support than he did with 6 or more runs of support.
Tommy John. John's 114 LevERA+ was approximately three points higher than his ERA+, and his ERA was nearly a full run higher when receiving support of 6 runs or more than when he was working with 5 runs or fewer. His ERA in high run support scenarios hurt his ERA and ERA+ but not his winning percentage, and accordingly his ERA+ is deceptively low.
Juan Marichal. Marichal had a slightly higher LevERA+ than ERA+, 125 to 123, and he allowed approximately half a run more when supported with 6 or more runs than he did when working with 4 to 5 runs. Between his fine clutch pitching and his tendency to allow insignificant runs when working with great run support, Marichal's 123 career ERA+ is deceptively low.
On the other end of the spectrum - the Blyleven end, so to speak - Dave Stieb, Curt Schilling, Orel Hershiser and Steve Rogers are notable examples of pitchers whose LevERA+s were lower than their ERA+ and who tended to pitch better when graced with huge run support than they did in games in the critical 3 to 5 run support category. Like Blyleven, their ERA+ figures don't tell the full story.
In short, any apparent comparability between Bert Blyleven's performance in the '70s and Jim Palmer's is illusory. Palmer was clearly the better pitcher and it's not even particularly close. This may not be apparent if one looks only at ERA+, but one doesn't have to look too hard behind the ERA+ stat to learn that while they may have allowed a similar number of runs, Palmer generally allowed them when he could afford to and Blyleven too frequently allowed them at the worst possible times. This fact, not disparate run support, accounts for the huge difference in their W-L records. ERA+ won't tell you that. It's still an important measure of pitching performance, but there are now statistics readily available that, when viewed together with ERA+, give a much fuller and more accurate picture of a pitcher's performance.
_____________________
* Koufax's +59% figure is an anomaly produced by the fact that Koufax played in wildly disparate scoring environments, pitching in distinctly hitter-favorable parks until '62, and then switching to the pitcher friendly Dodger Stadium just as he was hitting his stride. As a consequence, a disproportionate number of games in which Koufax received 6 or more runs of support occurred early in his career when he was not yet the Koufax of legend, and this significantly skews the numbers.
The Theory of Relativity
I love a lot of the new pitching stats. They're great analytical tools. Take FIP, for example ("fielding independent pitching"). It's based on the proposition that what happens on a ball put in play is frequently a function of random chance and team fielding. Bill James recognized its utility and cited Wally Bunker's 1964 season as an example of a pitcher apparently benefiting from some good luck insofar as his BAbip that year was .216. It turns out that Bunker in fact had a pretty good facility for generating low BAbip's in his career, presumably because, like Maddux in his prime, he was adept at keeping the ball away from the fat part of the bat and inducing batters to hit pitches outside the hitter's sweet spots in the strike zone. But Bunker never again came close to posting the .216 BAbip he posted in '64, despite being backed by the legendary team defense of the '60s Orioles.
As Bill James has noted regarding FIP and various other new and sophisticated measures of pitching performance, they have a tendency to throw out a lot of information in an effort to isolate and identify a pitcher's performance independent of non-pitching factors. Bill is a little unsettled by this, and so am I. As he's argued, W-L records are the antipode to FIP and similar stats, incorporating all information, including unfortunately things that have nothing to do with a pitcher's performance, like offensive support and team fielding. However, the inclination of the stat geeks to summarily dismiss W-L records is extremely misguided. It is possible to start with W-L records and make appropriate adjustments, and that's what I'm about to propose.
The Theory of Relativity, in contrast to FIP, throws out nothing but attempts to adjust for everything (or at least most things) that happens outside of the pitcher's performance. Simply put, it compares a pitcher's W-L record to his team's record in games where the pitcher was not the pitcher of record (i.e., it subtracts the pitcher's W-L record from the team's), adjusting for factors that affect the pitcher's and team's W-L records but are largely unrelated to the pitcher's own performance. If a pitcher received run support better or worse than the run support a team generally provided its pitchers, the pitcher's W-L record is adjusted (via the Pythagorean theorem) to reflect what his W-L record would have been had he received run support equal to his team's average. It also adjusts for the performance of the rest of the team's pitching staff, because even a good pitcher who receives excellent run support will appear to fare poorly relative to his team's W-L record if the rest of the starting pitching staff is comprised of Walter Johnson, Pete Alexander, Tom Seaver and Randy Johnson, with Gossage, Eckersley and Rivera coming out of the bullpen.
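One way the run-support adjustment could be sketched is below. It uses the classic Pythagorean formula with an exponent of 2 and shifts the pitcher's actual W% by the difference between his Pythagorean expectation at team-average support and at the support he actually received. That shift rule, the function names and the sample numbers are all assumptions made for illustration; the exact mechanics of the adjustment aren't spelled out above.

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean expected winning percentage."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

def support_adjusted_wpct(actual_wpct, support_per_game,
                          team_support_per_game, runs_allowed_per_game):
    """Shift a pitcher's W% to what team-average run support would imply.

    Assumption: the adjustment is the gap between the Pythagorean
    expectation at team-average support and at the pitcher's own support,
    holding his runs allowed per game fixed.
    """
    expected_actual = pythagorean_wpct(support_per_game, runs_allowed_per_game)
    expected_neutral = pythagorean_wpct(team_support_per_game, runs_allowed_per_game)
    return actual_wpct + (expected_neutral - expected_actual)

# Made-up pitcher: .650 W%, 4.8 runs of support per game on a team
# averaging 4.5, while allowing 3.5 runs per game himself.
adjusted = support_adjusted_wpct(0.650, 4.8, 4.5, 3.5)
```

A pitcher with better-than-average support comes out with a lower adjusted W%, and one with worse support comes out higher.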
These pitching staff adjustments are accomplished by taking the team's ERA+ (exclusive of the subject pitcher's own ERA+) and adjusting the team's W-L record to reflect what it would have been had the rest of the staff generated a 100 ERA+ (again, based on the Pythagorean theorem). It simply takes the team's ERA+ exclusive of the subject pitcher's ERA+, calculates the runs allowed or saved by the staff's performance above or below the assumed 100 ERA+, and adds or subtracts those incremental runs to the team's runs allowed. A Pythagorean record is then generated assuming a league-average staff.
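A rough sketch of the staff adjustment follows. Since ERA+ is 100 times league ERA over the staff's ERA, a league-average staff would allow roughly the actual runs multiplied by ERA+/100. Applying an earned-run measure to total runs allowed is a simplification, and the function names and the 700/650 run totals are made up for illustration; only the 106.9 figure (the Jays' ERA+ without Stieb) comes from the numbers discussed in this post.

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean expected winning percentage."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

def neutral_staff_runs_allowed(runs_allowed, era_plus):
    """Runs a 100 ERA+ staff would have allowed in the same innings.

    Assumption: ERA+ is applied to all runs allowed, not just earned runs.
    """
    return runs_allowed * era_plus / 100.0

def neutral_staff_wpct(runs_scored, runs_allowed, era_plus, exponent=2.0):
    """Pythagorean W% after swapping in a league-average staff."""
    neutral_ra = neutral_staff_runs_allowed(runs_allowed, era_plus)
    return pythagorean_wpct(runs_scored, neutral_ra, exponent)

# Made-up season totals for the team in games not decided by the pitcher.
raw = pythagorean_wpct(700, 650)
neutral = neutral_staff_wpct(700, 650, 106.9)
```

With an above-average staff (ERA+ over 100), the neutral record is worse than the raw one, which is the direction of the Jays' adjustment.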
Once you've adjusted the pitcher's record for run support and adjusted the team's record for the rest of the pitching staff's performance, you compare the pitcher's adjusted W-L record to his team's adjusted W-L record. The impact of run support on the pitcher's W-L record relative to his team's is thereby eliminated, and the impact of the rest of the staff's performance on the team's W-L record is similarly eliminated. A good pitcher will have an adjusted W-L record much better than his team's adjusted record, and a poor pitcher will have a worse one. Measuring the difference between the adjusted records of the pitcher and the team provides a good measure of the pitcher's performance. It doesn't expressly adjust for team defense (a notoriously difficult aspect of team performance to measure), but it implicitly incorporates it because bad team defense will lower the denominator representing the team's W-L record and therefore increase the relative impact of the pitcher's W-L record (adjusted for run support) relative to his team's W-L record (adjusted for the performance of the rest of the pitching staff).
The concept of simply comparing a pitcher's record to his team's is not novel, but the defects in the system became apparent to me when I was comparing Phil Niekro's relative W-L record to Don Sutton's. Even if the records were adjusted for variations in run support, Sutton would still tend to fare poorly compared to Niekro because Niekro would benefit by being compared to the poor Braves pitching staffs of the '70s, while Sutton would suffer from being compared to the generally excellent Dodgers pitching staffs of the '70s. It was easy for Niekro to outperform the sub-average pitchers on the Braves staff, but more difficult for Sutton to outperform the Tommy Johns, Claude Osteens and Andy Messersmiths who generally populated the Dodger staffs. It's fairly easy, however, to adjust for this, and the conceptual validity of the adjustment should be obvious. Still, the process of collating the team pitching data from different years, incorporating it into the adjustment formulas and generating the Pythagorean adjustments is a little involved and so for the moment I'll only present an analysis of three pitchers: Tom Seaver, Ron Guidry and Dave Stieb.
I selected these three pitchers because I thought they would be illustrative. Bill James has noted how spectacular Seaver's winning percentage was given the generally mediocre nature of the Mets teams he pitched for in the late '60s/early and mid-70s. I selected Guidry because I knew that his record was spectacular even after accounting for the fact that the Yankees teams he pitched for were generally pretty good, but I didn't know how his relative record had been affected by his run support and the quality of the Yankee pitching staffs. And I selected Stieb because (i) I knew that he had significantly underperformed relative to Pythagorean projections during his prime years in the early and mid-80s, and (ii) I was tired of beating up on Bert Blyleven. (I knew Blyleven also underperformed his Pythagorean projections in his prime, but I genuinely like the guy and he was by many measures a borderline great pitcher - certainly better than Stieb - albeit not a Hall of Famer).
I compared nine-year peaks for each of the pitchers. This was convenient because both Guidry and Stieb had distinct nine-year peaks that account for all of their superior seasons. One could select various nine-year periods for Seaver, because his peak extended well beyond nine years, but I selected his first nine seasons, comprising substantially his entire Met career. I'll begin the comparison by noting some things that you probably already know. For instance, the Mets were not a good team once you subtract Seaver, notwithstanding their two NL pennants and their '69 World Series championship. Their team winning percentage from '67 to '75 was .495 (mediocre, but not bad), but was only .463 once you subtract Seaver's .636 winning percentage from the equation, and that obviously stinks. You probably also knew that the Mets' problem was poor hitting. They actually had very good pitching, even after stupidly trading Nolan Ryan, posting a team ERA+ of 108 from '67 to '75. Once you subtract Seaver's superlative ERAs, however, the team ERA+ was 102.2. That's not great, but it's pretty good considering the staff's ace pitcher is excluded. Another way to look at it is that the Mets staff was above average even without the great Seaver.
I was somewhat surprised by how good Stieb's winning percentage was from '82 to '90. He was 135-90 for a very good .600 winning percentage. But I was also slightly surprised by how good the Jays teams were in that period. They had a .548 winning percentage, and were generally a pretty good team even aside from the excellent '85 and '87 seasons, other than in '82. Even subtracting Stieb's W-L record the Jays still had a .539 winning percentage. I was very surprised, however, by how good the Jays pitching was in that period. They had a team ERA+ of 109.9 and an ERA+ of 106.9 even after subtracting Stieb. Even without Stieb the Jays staff in the '80s was as good as the Yankees pitching in the period '77 to '85 (primarily because the Yankees pitching sagged significantly from '82 to '84). Jimmy Key and Doyle Alexander were no slouches, and Jim Clancy was a pretty good No. 4 starter. And the Tom Henke-led bullpen was generally pretty solid and sometimes excellent.
The Yankees had a team winning percentage of .575 from '77 to '85, and were well over .500 every year other than '82. The Yanks' winning percentage drops to .552 without Guidry, still very good but not that much better than the Jays' .539 W% without Stieb. The Yanks pitching was better than the Mets but not as good as the Jays, posting an overall 106.3 ERA+ and a 103.7 ERA+ without Guidry. The period of '77 to '85 was really a tale of two Yankee pitching staffs: the excellent staff from '77 to '81 and the generally mediocre staff from '82 to '85.
On the offensive support side both Guidry and Seaver received run support slightly better than team average, in each case about 3%. Stieb's run support was 1.2% below team average. Accordingly Guidry's and Seaver's adjusted W% was slightly lower than their actual W% and Stieb's slightly higher. The adjustments were quite small in each case, with Stieb's W% going up from .600 to .606. Guidry's adjusted W% dropped 18 points to .679 and Seaver's dropped 14 points to .622.
The big beneficiary of the adjustment to team W% by assuming an average pitching staff was Stieb. The Jays W% (exclusive of Stieb) drops from .539 to .518. A Jays staff with a 100 ERA+ would have added about 36 runs per year to the Jays' runs allowed total.
The effects of these adjustments were essentially negligible for Guidry and Seaver, with the reduction in their personal W%'s being largely offset by the reduction in the team W% resulting from translating their good team pitching staffs into average staffs. Stieb, by contrast, saw a significant increase in his W% relative to his team's. Simply comparing Stieb's .600 W% to his team's .548 W% shows that Stieb outperformed his team by 9.5%. Adjusting for run support and pitching staff, however, increases Stieb's relative performance figure to 17%. That's a pretty good figure, and though I've not yet run the figures for various HOFers I'm willing to bet that it compares favorably to some of the more marginal inductees.
Seaver outperformed his team after adjusting for run support and pitching staff by a tremendous 37%, which is almost precisely the figure obtained by comparing his straight W% to his team's.
Guidry outperformed his team after adjusting for run support and pitching staff by 27%, which represents less than a one point increase over the approximately 26% figure obtained by comparing his .697 W% to his team's .552 W% without Guidry.
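The relative performance figures quoted above are consistent with a simple ratio of the pitcher's W% to his team's W%, less one. A minimal sketch, with `relative_performance` as an illustrative name of my own:

```python
def relative_performance(pitcher_wpct, team_wpct):
    """Percent by which a pitcher's W% exceeds his team's W%."""
    return (pitcher_wpct / team_wpct - 1.0) * 100.0

# Figures taken from the discussion above.
stieb_straight = relative_performance(0.600, 0.548)   # raw W% vs. team W%
stieb_adjusted = relative_performance(0.606, 0.518)   # after both adjustments
guidry_straight = relative_performance(0.697, 0.552)  # vs. Yanks without Guidry
seaver_straight = relative_performance(0.636, 0.463)  # vs. Mets without Seaver
```

Rounded, these come out to about 9.5%, 17%, 26% and 37%, matching the numbers in the text.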
Just to give some idea of how astounding Seaver's figure is, my preliminary calculations appear to suggest that Koufax outperformed his team during his historic five-year run from '62 to '66 by slightly north of 40%. Seaver's 37% relative performance figure maintained over a nine-year period, therefore, appears to be a historic feat, and I'm willing to bet that few other pitchers since 1920, if any, can match it.
Stieb's figures demonstrate how a pitcher who had run support below team average and pitched on a good staff can have actually outperformed his team by a larger margin than a simple comparison of W% between pitcher and team would indicate. On the flip side, a pitcher whose performance relative to his team's at first glance appears to be superlative can be revealed as a fundamentally average pitcher if he received both great run support relative to his team's average run support and pitched on a team with an inferior pitching staff. Obviously neither Seaver nor Guidry is an example of this, and I'm not sure off the top of my head which pitcher might fit this profile. I know Andy Pettitte has received tremendous run support throughout his career, but he's also pitched on generally excellent pitching staffs. If anyone can suggest such a pitcher in the comments section I'd appreciate it. I'm going to start looking by first identifying poor pitching staffs from recent decades and then examining the run support received by their starting pitchers.
The performance of Seaver, Guidry and Stieb relative to each other was not a complete surprise. For one thing, Stieb slightly underperformed his Pythagorean record from '82 to '90, compiling a .600 W% relative to a .613 Pythagorean projection (Stieb significantly underperformed the Pythagorean projection during his very best years of '82 to '85, indicating that he slightly outperformed Pythagoras over the balance of his nine-year stretch). The Pythagorean comparison doesn't provide for any of the adjustments in the Relativity method I've described, but it does indicate that Stieb didn't make particularly good use of his run support. Guidry, by contrast, hugely outperformed his Pythagorean projection from '77 to '85, posting a .697 W%, more than 40 points higher than his .654 Pythagorean projection. That's a big difference. Seaver underperformed his Pythagorean projection but by an insignificant amount, posting a .636 W% from '67 to '75 as compared to a .641 Pythagorean projection, well within the margin of error in Pythagorean projections.
What did we learn by comparing a pitcher's performance to his team's after making the Relativity adjustments? Well, without having finished fully computing the figures for a meaningful number of other pitchers, I think we learned that Stieb was a pretty good pitcher; Seaver, as one must have expected, was a truly great pitcher and a worthy member of the inner sanctum in the Hall of Fame; and Guidry was precisely between Stieb and Seaver. My own takeaway is that the gap between Seaver and Guidry was about what I'd expected: it's significant, because Seaver is unquestionably among the very elite in the history of baseball, and Guidry, although deserving of HOF induction in my opinion, is admittedly a marginal candidate if one focuses solely on career statistics and ignores the astounding big-game record and his degree of dominance over a decade. I think the Relativity analysis also suggests strongly that the gap between Stieb and Guidry is about as big as the gap between Guidry and Seaver. It's significant, and it belies any comparison of the two based on nothing more than ERA+.
The results for Stieb and Guidry confirm a few things and dispense with a few myths. They confirm that Guidry's improved performance in high leverage situations translated into incremental wins, and Stieb's poor performance in high leverage situations translated into incremental losses. Stieb may have had the superior ERA+, but Guidry's LevERA+ was distinctly superior, and the difference explains in part the disparity in their ability to outperform their teams. The Relativity analysis also dispenses with the myth that Guidry's outstanding career winning percentage was just a product of good run support and great teams. Guidry did indeed get good run support and pitched for good teams, but the fact remains that he outperformed his teams by a huge margin. A .600 winning percentage for a Yankee pitcher in the years '77 to '85 would be good but not that much better than the Yanks' record for those years. A .697 winning percentage, however, is spectacular even after adjusting for run support and the quality of the Yankee teams.
Based on what we've seen so far I think it's clear that elite pitchers will outperform their teams by 17% after adjusting for run support and the quality of the rest of the pitching staff. All-time greats - and I mean pitchers among the top six or eight of all time - may outperform their team on an adjusted basis by more than 35%. And it should be clear that pitchers who outperform their team on an adjusted basis by more than 25% are no doubt Hall of Famers. If there are any doubts about that, the Relativity analyses of pitchers like Drysdale, Bunning, Sutton, Niekro, Ryan and Palmer are likely to resolve those doubts.
UPDATE: I just ran the numbers for Greg Maddux for the period '92-'02. He's an interesting case, of course, because he pitched on such great pitching teams, and so his 15% outperformance of his team's record on a straight comparison of W% could be expected to rise significantly. But - wow. Maddux shoots up to a relative performance index of 42% when adjusted for run support and pitching staff. I didn't appreciate how poor Maddux's run support was relative to team average. The Braves scored 4.86 runs/game when Maddux wasn't pitching, but only 4.41 for Greg. Maddux is just north of Seaver's 37%. I guess no one should be surprised.
Leverage Adjusted ERA (Or "Not All Runs Are Equal")
It's been surprising to me, given the profusion of new pitching statistics (FIP, VORP, Component ERA), that we haven't seen an expression of ERA or ERA+ that adjusts for leverage, weighing runs allowed in high-leverage situations more and runs allowed in low-leverage situations less. The data is available in the game logs at Baseball-Reference.com, but aggregating the data would be a tedious exercise. Fangraphs.com aggregates the data on a seasonal basis in the WPA, WPA/LI and Clutch statistics, but expresses the statistics in terms of incremental games won or lost rather than adjusted ERA.
Fangraphs calculates "Clutch" by subtracting WPA/LI, which aggregates the unleveraged increase or decrease in win probabilities associated with each plate appearance against a pitcher, from WPA, which also aggregates the win probabilities but reflects the leverage of each event based on the game situation (score, inning, base and out situation). (Strictly speaking, Fangraphs divides WPA by the pitcher's average leverage index, pLI, before subtracting.) Generally speaking, a pitcher with a positive Clutch factor performed better in high-leverage situations relative to his overall seasonal performance, or declined in performance in low-leverage situations relative to his overall seasonal performance, or some combination of the two. A better performance in high-leverage situations means that the incremental outs the pitcher got in high-leverage situations count for more than an average out (i.e., an out obtained in a game situation with a leverage factor of 1.0). A worse performance in low-leverage situations means that the incremental runs the pitcher allowed in low-leverage situations count for less than the average run (i.e., a run scored in a game situation with a leverage factor of 1.0).
The significance of the Clutch statistic should be obvious: not all runs allowed (and runs prevented) are equal. For example, the run surrendered in the bottom of the ninth of a tie game should be counted differently than the run surrendered in the bottom of the first inning after the visiting team took a six-run lead in the top half of the inning. ERA and ERA+ count each run the same, notwithstanding that the two runs I used as examples are likely to have had hugely disparate impacts on the outcome of the game. The advantage of expressing the number of leverage-weighted runs allowed as a variation on ERA should also be obvious: most fans will not know whether a Clutch factor of 0.74 is merely above average, or very good, or a spectacular achievement, but fans know how to compare a 116 ERA+ to a 135 ERA+.
It turns out that a Clutch factor of 0.74 - meaning that the pitcher's clutch performance was worth 0.74 wins for the season - is very high. A Clutch factor of 2.0 in a season is truly spectacular, and 3.0 or above exceedingly rare. Curt Schilling was a spectacular clutch performer in 2001, improving on his usual performance that year by 30% with runners on, by 48% with runners in scoring position and two out, and by 20% in "late and close" situations. These spectacularly clutch performances translated into 2.03 incremental wins for Schilling as measured by win probabilities added. But if one expresses this same statistic by adjusting his ERA and ERA+ to reflect not only how many runs he allowed, but the impact of these runs given the game situation, what happens to Curt's 2.98 ERA and 157 ERA+ for 2001?
Leverage-adjusted ERA+ (or "LevERA+") is calculated by expressing the 2.03 wins Curt added by virtue of his clutch performance in terms of the equivalent number of runs. The concept of expressing wins in terms of equivalent runs is a common one in sabermetrics, although the appropriate win-to-runs conversion factor is difficult to calculate and varies depending on the league scoring level and "park run environment". Fortunately, Fangraphs has already calculated the conversion factor for us, and it can be found simply by dividing RE24 by REW (here's the Fangraphs glossary that describes these two stats). Curt's RE24 for 2001 was 50.87 and his REW was 5.09. That means the appropriate win-to-runs conversion factor for Curt in 2001 was 50.87/5.09, or 9.99 (which is a fairly typical conversion factor in today's game). Multiplying Curt's 2.03 Clutch factor by 9.99 reveals that Curt's clutch performance was the equivalent of allowing 20.28 fewer runs than the 86 runs Curt allowed in 2001, or 23.58% fewer runs. Reducing Curt's earned runs allowed by the same 23.58% results in a figure of 65 leverage-adjusted earned runs (as compared to Curt's actual 85 earned runs in 2001). That means Curt's LevERA in 2001 was 2.28 and his LevERA+ was 205.
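For anyone who wants to repeat the arithmetic for another pitcher, here is a minimal Python sketch of the steps just described. The function name and its parameters are my own labels, not anything Fangraphs publishes. It also takes one shortcut that should be treated as an assumption: instead of rounding leverage-adjusted earned runs to a whole number and recomputing ERA from innings pitched, it scales ERA directly by the percentage of runs saved, and scales ERA+ by the inverse (which holds only if the league and park baseline behind ERA+ is left unchanged).

```python
def leverage_adjusted(era, era_plus, runs_allowed, clutch_wins, re24, rew):
    """Return (LevERA, LevERA+) following the steps described in the text.

    All parameter names are illustrative labels for the Fangraphs and
    Baseball-Reference figures quoted in the article.
    """
    # Win-to-runs conversion factor: RE24 divided by REW.
    runs_per_win = re24 / rew
    # Clutch wins expressed as runs saved (negative Clutch = runs added).
    clutch_runs = clutch_wins * runs_per_win
    # Runs saved as a share of the runs the pitcher actually allowed.
    share = clutch_runs / runs_allowed
    # Assumption: scale ERA by the same share instead of rounding earned runs.
    lev_era = era * (1 - share)
    # Assumption: ERA+ moves inversely with ERA when the baseline is fixed.
    lev_era_plus = era_plus / (1 - share)
    return lev_era, lev_era_plus


# Schilling, 2001, using the figures quoted above.
lev_era, lev_era_plus = leverage_adjusted(
    era=2.98, era_plus=157, runs_allowed=86,
    clutch_wins=2.03, re24=50.87, rew=5.09,
)
print(round(lev_era, 2), round(lev_era_plus))
```

With Schilling's 2001 inputs this lands on roughly the same 2.28 and 205 quoted above; small differences in the later decimals come from not rounding the intermediate steps.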
This is an extreme example, of course, because a Clutch factor of 2.03 for a season is extremely high. Most pitchers will have a Clutch factor much closer to 0 (that is, they were neither "clutch" nor "anti-clutch") and accordingly a LevERA+ that varies very little from their ERA+.
Just to give a further idea of how extreme Curt's 20-run Clutch improvement was, consider that since 1974 (the date from which the Clutch stats are available at Fangraphs) the largest career Clutch improvements and declines measure less than 100 runs. Curt, oddly enough, had a career Clutch factor of negative 5.26, which translates into 53.4 more runs and 50.75 more earned runs over his career. Again, that may not sound like much but it moves Curt's career ERA+ of 127 to a 122.1 LevERA+. That's still excellent, of course, but it's a difference to which most fans, and certainly most sabermetrically inclined fans, would attach some significance.
Another HOF aspirant with a significantly negative Clutch factor is our old buddy Bert Blyleven. Bert's Clutch factors for the first four seasons of his career - '70 to '73 - are not available, but it can be estimated on the basis of the leverage statistics at Baseball-Reference.com that Bert's Clutch factor for those four years would be slightly negative (very negative in '70, very positive in '71, mildly negative in each of '72 and '73). Let's assume for purposes of calculating career LevERA+ that his Clutch factor was precisely zero for those first four years. That leaves Bert with the -3.88 Clutch factor he accumulated from '74 to the end of his career. That translates to 37.12 more runs and 33.48 more earned runs over his career, making Bert's career LevERA+ 115.9.
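The career conversion factors behind the Schilling and Blyleven figures aren't stated above, but they can be backed out of the numbers that are. The short Python sketch below does only that: it divides the extra runs by the size of the Clutch factor to get the implied runs per win, and divides earned runs by runs to get the implied earned share. The helper names are mine, and the results are inferences from the quoted figures rather than published values.

```python
def implied_runs_per_win(clutch_wins, extra_runs):
    """Runs per win implied by a Clutch factor and its run equivalent."""
    return extra_runs / abs(clutch_wins)


def implied_earned_share(extra_runs, extra_earned_runs):
    """Share of the leverage-adjusted runs that were treated as earned."""
    return extra_earned_runs / extra_runs


# Career figures quoted in the text: (Clutch, extra runs, extra earned runs).
careers = {
    "Schilling": (-5.26, 53.4, 50.75),
    "Blyleven": (-3.88, 37.12, 33.48),
}

implied = {}
for name, (clutch, runs, earned) in careers.items():
    implied[name] = (
        implied_runs_per_win(clutch, runs),
        implied_earned_share(runs, earned),
    )
    print(name, round(implied[name][0], 2), round(implied[name][1], 3))
```

Both implied factors come out close to ten runs per win, in the same neighborhood as the 9.99 used for Schilling's 2001 season, with Blyleven's a bit lower.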
Here are the top 25 LevERA+'s among pitchers with 2000 or more innings pitched since 1952:
Guidry, Appier, Tudor, Palmer and Santana had the largest increases over ERA+. Maddux and Brown had the largest decreases.
There were three pitchers who ranked in the top 15 in ERA+ but not in LevERA+: Schilling, Smoltz and Mussina. Schilling had the largest drop (127 ERA+, 122.1 LevERA+). Smoltz dropped from a 125 ERA+ to a 122.4 LevERA+. Mussina dropped from a 123 ERA+ to a 122.4 LevERA+.
In evaluating levels of run support provided to pitchers, increasing attention has been focused on run distribution and the potential that an adverse run distribution can make a pitcher's run support look better than it really was. A pitcher whose run support looks healthy as measured by average runs/game scored by his team in his starts may really have received relatively poor run support because he received a very large number of runs in a small number of games, or had a concentration of games at both the low end and the high end of the spectrum. Although these kinds of adverse run distribution situations can (and do) occur within seasons, it is very unlikely that the phenomenon could persist over a lengthy pitching career, and I've seen no data that suggests that any pitcher in fact suffered from adverse run distribution over the course of his career.
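To make the run distribution point concrete, here is a small Python illustration. The two lists of runs scored per start are invented for the example and do not belong to any real pitcher. Both average 4.5 runs per start, but in the second list most of the runs arrive in three blowouts.

```python
from statistics import mean, median

# Hypothetical runs scored by a pitcher's team in ten of his starts.
steady = [4, 5, 4, 5, 4, 5, 4, 5, 4, 5]
lumpy = [1, 1, 2, 1, 14, 2, 1, 12, 10, 1]


def low_support_share(runs_by_start, threshold=2):
    """Fraction of starts in which the team scored `threshold` runs or fewer."""
    low = sum(1 for runs in runs_by_start if runs <= threshold)
    return low / len(runs_by_start)


for label, runs in (("steady", steady), ("lumpy", lumpy)):
    print(label, mean(runs), median(runs), low_support_share(runs))
```

The averages are identical, yet the second pitcher was handed two runs or fewer in seven of his ten starts. That is the sense in which an average can flatter a pitcher's run support over a short stretch.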
Very little attention has been paid, however, to the distribution of runs allowed by pitchers, despite the fact that certain pitchers have exhibited distinct tendencies to pitch differently in high-leverage and low-leverage situations. Unlike distribution of run support, adverse distribution of runs allowed is far more likely to persist over a career because of the potential that a given pitcher possesses the tendency to pitch better or worse in high-leverage situations (or pitch better or worse in low-leverage situations, or some combination of the two). LevERA+ is a measure of the impact of the distribution of runs allowed by a pitcher. It reveals that certain pitchers, like Bert Blyleven, contributed to their own mediocre W-L records by performing relatively poorly in critical situations, and it reveals that other pitchers, like Ron Guidry, produced spectacular W-L records not only because of superior run support but because of their superior performance in critical situations.
BlyLeverage
Bill James did a piece a few years ago on Bert Blyleven in which he addressed the great mystery surrounding Blyleven's conspicuously mediocre W-L record. While conceding that Bert's critics make some good points - "Blyleven did not do an A+ job of matching his effort to the runs he had to work with" - he ultimately concluded that Bert's biggest problem was his lack of run support, not his failure to pitch better in critical situations. Bill attributed roughly two-thirds of Bert's relatively poor record to lack of run support and one-third to Bert's tendency to pitch relatively poorly in tight games.
Bill's analysis was disappointing in certain respects, however. First, he didn't note that Bert's relatively poor career W-L record is almost purely a function of his performance in the first nine years of his career ('70 to '78). Had Bert compiled a W-L record commensurate with his ERAs and run support in the '70s, Bert would already be in the Hall and Bill James and I wouldn't be writing about him. Second, Bill didn't discuss Bert's pertinent statistics from this period that likely explain the disparity between Bert's excellent ERAs during that period and his pedestrian W-L record. As I've previously noted, Bert had a terrible record in "late and close" situations in that period, far worse than any premier pitcher of that era that I've examined, and lost a disproportionate number of close games. While it strikes me as reasonable and logical to infer that a pitcher who performs poorly in the late innings of tight games will lose a disproportionate number of close games, I thought I'd look at the records of various pitchers in one-run games and attempt to determine if there is any significant correlation between a pitcher's performance in close games and his record in one-run games.
I began by identifying pitchers who either distinctly improved their performance in high-leverage situations or exhibited a distinct decline in performance in high-leverage situations.* I then compiled their records in relatively low-scoring one-run games in which they started and pitched at least 5 innings, reasoning that higher-scoring one-run games and games in which they pitched fewer than five innings are less a function of their performance and more a function of other factors. Accordingly, I looked at one-run games with scores of 4-3, 3-2, 2-1 and 1-0. A comparison of these one-run games to Bill James's data on all one-run games pitched by the pitchers referenced in his Blyleven article indicated no significant differences, meaning that none of the pitchers performed materially differently in higher-scoring one-run games.
Here are the pitchers in the two categories:
Now, to the analysis. The six pitchers who improved in HL situations improved by an average of 7.5%, ranging from Guidry at 3% to Palmer at 13%. The nine pitchers who declined in HL situations did so by an average of 7.44%, ranging from Gibson at 2% to Rogers at 18% (I probably should have excluded Gibson, the only pitcher whose performance varied by less than 3%, but I left him in to make the point that this analysis is not intended to be any kind of dispositive argument about clutchness). The six improvers had a winning percentage of .614 in one-run games in which they started and pitched at least five innings. The nine decliners had a winning percentage of .520 in such games. The correlation coefficient between performance in HL situations and one-run game winning percentage was a fairly strong .69.
There were outliers in each category. Ford improved by 6% in HL situations but had only a 32-29 record in one-run games (but, as with Gibson and Hunter, you can't tell me Whitey wasn't clutch). On the other end, Carlton and Sutton each declined by 12% but had winning percentages of .566 and .545, respectively. The best record in one-run games was Koufax, who had a winning percentage of .682 (and improved in HL situations by 6%). The worst record in one-run games was Blyleven, who had a winning percentage of .432 (and declined in HL situations by 6%).
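For readers who want to run the same kind of check themselves, here is a minimal Python sketch of the correlation calculation. It uses only the five pitchers whose HL change and one-run winning percentage are both quoted in the paragraph above (Ford's percentage is derived from his 32-29 record), so it will not reproduce the .69 figure, which is based on all fifteen pitchers. The function and variable names are my own.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (HL change in percent, one-run winning percentage) as quoted in the text.
quoted = {
    "Ford": (6, 32 / (32 + 29)),
    "Koufax": (6, 0.682),
    "Blyleven": (-6, 0.432),
    "Carlton": (-12, 0.566),
    "Sutton": (-12, 0.545),
}

hl_change = [pair[0] for pair in quoted.values()]
one_run_pct = [pair[1] for pair in quoted.values()]
r_subset = pearson(hl_change, one_run_pct)
print(round(r_subset, 2))
```

Because this subset is made up mostly of the outliers and extremes singled out above, its correlation is noticeably weaker than the full-sample figure, which is a reminder of how sensitive the coefficient is with so few pitchers.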
This is obviously a very small sample set. There are more pitchers in the "decline" category than the "improve" category simply because that category seemed to fill out faster (primarily because I began by looking at pitchers referenced in James's Blyleven article and most of them just happened to exhibit performance declines in HL situations). I'm considering adding more pitchers to the analysis but compiling the records of one-run games is a fairly tedious exercise. If I can bring myself to pore over the game logs I'll update this analysis.
_______________________
* I opted to go with the high-leverage statistics at Baseball-Reference.com rather than the "late and close" statistics for various reasons but principally because the "late and close" statistics are just too narrow for this purpose, excluding anything before the sixth inning and even many situations in the late innings in which the difference in the score is only two runs. Additionally, "late and close" statistics have become increasingly less relevant over the last 30 years, as pitchers accumulate very few innings beyond the sixth inning. Whereas "late and close" situations typically constituted between 15% and 20% of a pitcher's innings in the '60s and '70s, they generally constitute less than 10% of a contemporary pitcher's innings.
On The Subject of "Clutch"
It's a word you hear a lot about in discussions of athletics. It's a given among most sports fans and commentators that some performers are clutch and some aren't. Does anyone dispute that Michael Jordan was clutch? Does anyone dispute that John Elway was clutch? We all remember those game-winning shots and game-winning 4th quarter drives. Those were clutch, right? Ron Guidry's 26 wins in 30 September pennant race starts? That's gotta be clutch, doesn't it? And does anyone really dispute that Derek Jeter is clutch?
Well, yes, some people do dispute that Derek Jeter is clutch. And, frankly, they make some pretty good points. They correctly caution us that we should be careful about placing too much emphasis on "the flip" in the '01 ALDS against the A's, or the walk-off home run against the D'backs in game 4 of the 2001 World Series. And they're right to warn against relying on anecdote, or isolated instances of "clutch plays", or, more generally, very small sample sets. Those may have been clutch plays, but do they necessarily make Derek Jeter a clutch player? They point out that Derek Jeter in the post-season is pretty much like Derek Jeter in the regular season - almost identical batting average, OPS, and just a little bit more HR power in Oct/Nov than in April to September. Jeter's not being "clutch", they argue; he's merely being Jeter.
Reggie Jackson? Surely a .755 slugging average across five World Series establishes beyond question Reggie's clutch bona fides, right? Well, what about those 11 ALCS series, the skeptics ask. Those were big games, too, and Reggie slugged .380 and had an OBP under .300.
Here's Bill James on the subject of "clutch":
"The prominence of clutch performance as an element in player ratings can be attributed to three factors: (1) Hero worship journalism; (2) Self-aggrandizement by athletes, particularly retired athletes serving as TV announcers; (3) The fact that we all need, at times, to escape the implications of our logic."
Bill then cites ex-athletes like Joe Morgan, Ray Knight and Reggie Jackson as commentators who have a tendency to cast every contest as a "test of character, determination, and fortitude."
"My attitude toward this can probably be inferred from my tone. I do not believe that athletes are better people than the rest of us, I do not believe that athletic contests are tests of character, and I do not believe that there is any such thing as an ability to perform in clutch situations. It's just a lot of poppycock."
While rejecting the notion of an ability to perform in the clutch, however, Bill agrees that certain players have performed so well in clutch situations, for whatever reason, that they deserve credit for it and extra consideration when assessing their historical standing.
Bill contrasts Don Drysdale and Bob Gibson to illustrate his point. Bill cites Gibson's well-known big-game reputation, his tremendous performance down the stretch in the 1964 NL pennant race, and his remarkable World Series record (7-2, seven straight wins, two game 7 victories, 2 World Series MVPs) and contrasts Gibson's clutch achievements with Drysdale's pennant race performances.
"This is an absolute fact that doesn't change depending on how you feel about it: Don Drysdale started 13 games in his career in the heat of the pennant race against the team the Dodgers were trying to beat - and never won. Not even once. He never pitched particularly well without winning; 0 for 13.*
"I don't believe that this reflects a character failing on Drysdale's part. I think it's just something that happened. Sometimes he had been overworked; sometimes maybe a pitch or two got away from him. Sometimes you make good pitches and get beat. If there was a big game next week, I'd as soon have Drysdale pitching for me as anybody else.
"Nonetheless, it did happen; he did, in general, pitch poorly in pennant races (with some exceptions), and he did repeatedly fail to beat the Dodgers' key opponent in the heat of the pennant race. In rating Drysdale's career, is this something that should be ignored, or something that should be considered?"
Bill answers his own question directly and succinctly, stating "if a player really does come through in big games or fail in big games, I don't think we can afford to ignore that."
Bill then argues that there are, in his opinion, about 20 players who should be rated up or down "a little bit" because of their clutch performances. In addition to Drysdale and Gibson, Bill mentions five other players for whom the clutch factor would figure in Bill's analysis: Yogi Berra, Joe Carter, George Brett, Steve Garvey and Reggie Jackson. Although Bill doesn't say so, I think it's fairly clear that each of these five would be uprated by Bill for clutch performance. But Bill doesn't explain why, and it's really not clear to me what Bill's methodology was in arriving at these five examples. If it's post-season performance (and I believe that is what Bill primarily relied on) then it should be noted that none of these players has aggregate post-season numbers that put him among the all-time post-season performers (with the possible exception of Reggie Jackson). And each of them has very notable chinks in his post-season record. The point is this: why these guys but not Lou Gehrig? Henry Aaron? Lou Brock? Allie Reynolds? Lefty Gomez? Babe Ruth? Mickey Mantle?
The fact is that there were two different Yogi Berras in the World Series - the one that hit .188 in his first five World Series and the one that was a fire-breathing monster in the next 7 Series in which he played. Gehrig and Ruth virtually never had a poor World Series - why don't they deserve Bill's uprate?
Why Joe Carter? His aggregate post-season numbers are even weaker than his generally mediocre regular season numbers. And he only played in five post-season series. What about Lou Brock and Henry Aaron, each of whom may have played in only three post-season series but put up numbers that are off the charts? And if it's Carter's World Series winning walk-off HR in the '93 WS that qualifies as a clutch uprate for Bill, then what about Bill Mazeroski, whose HR to win the classic 7 game Series in 1960 is even bigger than Carter's walk-off, and whose aggregate post-season numbers are far better than Carter's? Or what about Mantle, who hit more hugely consequential World Series HRs than anyone?
Well, Bill himself pointed out that the subject of clutch performance is inevitably very subjective and, as he put it so eloquently, "it's a dangerous area to get into, because when you reach into the bullshit dump, you're not going to come out with a handful of diamonds."
Still, you can't avoid the whole "clutch" debate; it's a classic sports fan subject. And it's a subject that in many ways is an implicit premise of this blog about so-called big-game pitchers. There are some players who were so undeniably great in big games, in tight pennant races, or in post-season competition that you have to take notice. And in the final analysis, I suppose I don't care if these performances were the result of some innate clutch gene, or some identifiable super-ability in the clutch. These performances occurred, they took place in the biggest games on the biggest stage, and the implications for their team and for baseball history were profound. So I'm with Bill, here: I don't think we can afford to ignore that.
____________________
* Bill, although generally correct about Drysdale's conspicuously poor record in September against other contenders, is simply wrong in his claim that Drysdale never won such a game. Drysdale beat the Giants on September 19, 1959 to draw the Dodgers even with the Giants with six games to go, pitching six innings, giving up one unearned run and striking out 8; and Drysdale beat the Pirates on September 15, 1966 to put the Dodgers up by 2.5 games over the Pirates with 17 games to go, going 8.2 innings and giving up 5 hits and 3 runs.