Thursday, December 24, 2009

Do pitchers throw no worse in their bad starts than in their good starts?

Suppose your starting pitcher does great against the first nine batters he faces, getting them all out. You'd think he's got good stuff tonight, right? And that he'd do better than normal the rest of the game.

It turns out that's not the case. In a study I did nine years ago (.pdf, page 13), I found that when a pitcher had a no-hitter going through three, four, or five innings, his performance in the remainder of the game was almost exactly what you'd expect out of a pitcher of his general quality. The great early performance didn't signal a better performance later on.

The authors of "The Book" did a more comprehensive study (page 189), and came to the same conclusion. If a pitcher is excellent early, that doesn't give you any extra information on how he'll perform later.

What if the pitcher is really crappy his first couple of innings? My study showed that, again, there was no effect. After giving up three runs in the first inning, starters reverted to almost exactly their expected form in the remainder of the game. In this case, though, "The Book" found a different, and more intuitive result: pitchers were actually a little worse than usual after getting hammered: you would have expected them to give up a wOBA of .360, but they got hit for .378. That's statistically significant, but still fairly small.

Anyway, the point is: there seems to be a whole lot of luck involved in pitching. The pitchers who got hammered gave up a wOBA of .701 to their first nine batters, but only .378 afterwards. Now, I suppose you could argue that they were legitimately unskilled at first, so bad that the entire lineup hit better than Barry Bonds or Babe Ruth at their peak. But it's much more likely that they just had bad luck, that they were pretty much the same pitcher before and after, and just happened to throw pitches a bit worse than usual at first. Or, that the hitters were lucky enough to get good wood on the ball during that inning.

If you think about it, that's roughly how we think of hitters. When a career .250 hitter goes 3-for-4 one day, we praise him for what he did, and might even name him the MVP of the game. But, still, we don't assume that he somehow played better than usual. We don't say it in so many words, but we understand that every player has good games and bad games, lucky at-bats and unlucky at-bats, and 3-for-4 is not that unusually lucky. He was the same hitter as always, but happened to get better results that day.

For pitchers, on the other hand, we're a little bit less understanding that way. When a pitcher gets hammered, we usually talk about how he gave up bad pitches, or how he didn't have his control, or some such. It's rare that a pitcher will give up five runs in the first inning without everyone trying to figure out what's wrong.

But if you think about it for a bit, you'll understand there must be a lot of luck. If you've played simulation games, like APBA or Strat-O-Matic or Diamond Mind, you've experienced that the same pitcher can appear to pitch awesome, or get hammered, just by a few lucky or unlucky rolls of the dice. I think I found once that an average team has a standard deviation of 3 runs scored per game due to luck. If you assume that runs scored is normally distributed (which it's not, but that doesn't affect the argument much), you'll see that a decent starter with an ERA of 4.00 would be expected to give up 10 runs in a complete game, roughly one time a year, just by luck alone. (Of course, he'll likely be relieved long before he gets to 10 runs, but the point remains.)
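Here's a rough sketch of that back-of-the-envelope calculation. The 4.00 mean and the 3-run standard deviation come from the paragraph above; the 34 starts per season is my own assumption, and the normal distribution is, as noted, only an approximation.

```python
import math

def normal_tail(x, mean, sd):
    """P(X >= x) for a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

MEAN_RUNS = 4.0        # from the post: a starter with a 4.00 ERA
LUCK_SD = 3.0          # from the post: SD of runs per game due to luck
STARTS_PER_YEAR = 34   # assumption: a full season of starts

p_ten = normal_tail(10, MEAN_RUNS, LUCK_SD)
expected_per_year = p_ten * STARTS_PER_YEAR

print(f"P(10+ runs in a complete game) = {p_ten:.4f}")
print(f"Expected such games per year   = {expected_per_year:.2f}")
```

Ten runs is two standard deviations above the mean, so the chance is a little over 2% per start, which over a season of starts comes to a bit less than one game a year.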

Which brings me to the point of this post, which is an awesome pitching analysis by Nick Steiner, over at The Hardball Times. Steiner looked up PITCHf/x data for A.J. Burnett, during his good starts, and during his bad starts, and found ... pretty much *no difference*. If you haven't seen the article yet, you should go there now, scroll about halfway down, and look at the two scatterplots. They look almost the same to me. Also, the pitch selection bar graphs, at the bottom of the piece ... they look pretty much the same, too.

That's a bit different from what we were talking about before, where pitchers "got better" after a bad start. There, it was still possible that they threw worse pitches even though their "stuff" was just fine. But here, we're finding that not only was their skill level apparently not changed, but the *actual pitches* were the same.

That is, there are two different kinds of luck. First, there's the possibility that, even though your skill is fine, your pitches are a bit unlucky and don't quite work -- they don't hit the corners, or the specific curve ball hangs a little more than normal. Second, there's the possibility that your pitches are just as good as any other day -- but you're unlucky enough that the batters just happened to hammer them.

It's the second kind of luck that we're talking about here. Of course, even with Steiner's data, we're not 100% sure it's luck. There are other things it could be, as some of the commenters have pointed out:

-- Mistake pitches. Maybe the difference between a good and bad outing is just the number of mistakes. From the charts, it would be easy to miss a few hanging curves, or fastballs down the middle.

-- Combinations. Maybe it's not just what kind of pitch, how fast it is, and how it breaks: maybe it's the timing of certain pitches relative to others. If it's hard to hit a curve ball after a fastball, and the pitcher doesn't choose that combination often enough, the batters will have better results.

-- Slight differences. Maybe a small difference in location makes a big difference in results. It could be that there *are* differences in those two scatterplots, but we can't pick them up with the naked eye.

-- Umpires. Part of the effect could be a different strike zone between the "good" days and the "bad" days -- the same pitch that was called a strike five days ago is called a ball today.

-- Differences that PITCHf/x doesn't tell you about. Maybe certain pitches are deceiving in ways that type, velocity, and spin don't capture: two pitches that look identical on paper might look very different to the batter.

All those things are possible, but they just don't seem very likely to me. The most plausible one, to me, is differences that aren't captured in the data. I'm not well-enough informed to know if that could be happening.

My feeling is that a lot of the luck comes from the batter side. The pitcher has time to plan and decide; the batter has very little time to react. If the batter has to "guess" what kind of pitch is coming, and roughly where it's going to cross the plate, that's inherently a random process. If the hitter is "waiting on a fastball," and he gets one, things are going to work out well for him. If it's a curve ball, not so much.

Doesn't it seem reasonable that a certain pitch might be a "good" pitch only in the sense of probability? Maybe a certain pitch is a strike 50% of the time, a ball 20% of the time, an out 20% of the time, and a hit 10% of the time. But instead of ten pitches going 5/2/2/1, one day they might go 5/2/1/2 -- only because the batter happened to guess right one extra time, and got good wood on the ball. That one extra hit is worth an average of more than three-quarters of a run. Depending on circumstances, it could be several runs. And not because of anything the pitcher did differently, but just because the batter decided to wait on a curve ball instead of a slider.
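To put a number on how often that "one extra hit" shows up by chance alone, here's a small sketch. It assumes each of the ten pitches is independent and has the hypothetical 10% chance of becoming a hit; both the independence and the 10% figure are simplifications for illustration, not measured values.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

N_PITCHES = 10
P_HIT = 0.10   # hypothetical hit probability per pitch, from the example above

p_zero = binom_pmf(0, N_PITCHES, P_HIT)
p_one = binom_pmf(1, N_PITCHES, P_HIT)
p_two_or_more = 1.0 - p_zero - p_one

print(f"P(exactly 1 hit in 10)  = {p_one:.3f}")
print(f"P(2 or more hits in 10) = {p_two_or_more:.3f}")
```

Under those assumptions, the identical ten pitches produce two or more hits about a quarter of the time, with no change at all in what the pitcher threw.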

Anyway, it shouldn't be too hard to check: find a bunch of pitchers of roughly the same ability. Figure out the variation of their results. Then, run a simulation, and check the variation of *those* results. If they're the same, you've just shown that pitchers perform like their own APBA cards, and that it's likely that almost all of what you see is randomness. If not, then the difference between the two variances (or, in standard deviation terms, the square root of that difference) is something other than luck.
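The last step of that check can be sketched as follows. It assumes luck and non-luck are independent, so that their variances add; the function name and the sample numbers are made up for illustration.

```python
import math

def non_luck_sd(observed_sd, simulated_sd):
    """SD left over after removing simulated (luck-only) variance.

    Assumes observed variance = luck variance + non-luck variance.
    Returns 0.0 if the simulation already explains all the variation.
    """
    observed_var = observed_sd ** 2
    luck_var = simulated_sd ** 2
    leftover_var = max(observed_var - luck_var, 0.0)
    return math.sqrt(leftover_var)

# Hypothetical numbers: real pitchers vary with SD 5, the simulation with SD 4.
print(non_luck_sd(5.0, 4.0))  # prints 3.0
```

Note that the leftover is not 5 minus 4: standard deviations don't subtract directly, only variances do.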

What might that be? 20 years ago, we probably thought it was pitchers having better stuff some days, and not other days. But now, we know that's a fairly small effect, except (as "The Book" found) for inexperienced pitchers.

Up until a few days ago, we thought a lot of it might be pitchers having their stuff, but just getting unlucky and happening to throw bad pitches that day. But now, we seem to have evidence that that's not a big factor either. But even though it's not a *big* factor, it must have *some* effect. That's because we know that some pitches are easier to hit than others; there have been PITCHf/x studies that have shown, as expected, that pitches down the middle get hammered, and that certain levels of movement are easier or harder to hit than others. So it can't be that the pitches actually thrown make *no* difference.

But it does appear that the difference is small, at least compared to luck -- because, when we compare good games to bad games, the difference in the scatterplot of pitches is too small to notice!

What does that mean, in practical terms? It means that you shouldn't necessarily take out your starter just because he gives up a lot of runs, because he's likely just his usual self. That might be hard for some managers, when their ace gives up 7 runs in the first inning.

But managers already know that, perhaps. On "The Book" blog, Tangotiger found that A.J. Burnett threw almost as many pitches in his bad starts as in his good starts (101 vs. 105).

So what else do we learn? Well, from the experience of DIPS, it could turn out that Steiner has found a way to eliminate a lot of noise from a pitcher's record. If and when we can associate a firm run value to a specific pitch, based on type, speed, location, and spin, we might be able to spot those pitchers who were unlucky: who threw good pitches, but were hit hard anyway.

That might be a ways off: it might be that there are things about the individual pitcher that go beyond those measurements that PITCHf/x makes, and it could be that the "mistake" pitches get lost in the scatterplot. But, as a starting point, I think teams would have at least a bit of an edge applying Steiner's conclusions anyway.


Wednesday, December 16, 2009

The renowned neurosurgeon vs. Don Cherry

Dr. Charles Tator is a Toronto neurosurgeon who has treated many young people for concussions and spinal injuries. He's not just a doctor: he's an active medical research scientist with an impressive resume. He's also the founder of "Think First," a program that attempts to help prevent concussions and spinal injuries by promoting the use of safe practices and proper equipment.

On Saturday, at a conference in Regina, Dr. Tator argued that hockey is too aggressive. He said that players are taking too many hits to the head, and that concussions are more frequent than they should be. He said NHL players used to have "respect for their own safety and respect for the safety of their opponents," but not any more.

That probably wouldn't have made the news, except that Tator went on to blame Don Cherry for contributing to the problem. Cherry, a plain-spoken (and often controversial) commentator on "Hockey Night In Canada," advocates an aggressive style of hockey, and has created a series of "Rock 'Em Sock 'Em" videos that prominently feature NHL fights and spectacular checks.

"I think [Cherry] is a negative influence because he promotes aggressive hockey," Tator said.

I'm on Don Cherry's side. I think Tator is out of line.

The reason is: this is not a medical question. The general effects of concussions and spinal injuries are not expert or specialized knowledge: we all understand what it means to have a concussion, or to be paralyzed. We read about these kinds of injuries in the sports pages all the time, about Tim Tebow, or Brett and Eric Lindros, or Kevin Everett, or any number of other players. We laymen understand that less aggressive play will lead to fewer injuries.

We laymen understand other parts of the issue as well. We know that we could eliminate hockey injuries completely if we eliminated hockey. If that's too extreme, we know we could eliminate a very large proportion of serious injuries by banning bodychecking, so that every NHL game is as contact-free as the All-Star game. There are lots of other ways we could reduce injuries, too. We could make the puck a little softer -- that might have prevented Trent McCleary's frightening injury. We could force goalies to wear a Kevlar neck brace (which would have saved Clint Malarchuk from his near-death experience), limiting their mobility but increasing their safety.

That's not a question of medical expertise. It's a question of tradeoffs. It's a question of, how much injury are we willing to put up with while keeping the quality of game where we think it should be? Or, put another way: how much are we willing to lose in the quality of or interest in the game in order to save a certain number of injuries per year?

It would be nice if there were no tradeoff at all, and, indeed, some people might say that banning aggressive play will make the game *better*, not worse. But if that were the case, the NHL would have done it already, and there would be no controversy. There are already rules banning many dangerous practices, rules that have a very strong consensus of approval in and out of the game. In any case, even in the unlikely event that Tator believes that there is absolutely no cost to implementing his view of what should be allowed in the NHL, that's an opinion about hockey, not about medicine.

Which is my argument, in one sentence: the issue is about hockey, not about medicine. Is there any reason to believe that Dr. Tator's opinion about what's good for hockey is more valid than Don Cherry's? Absolutely not. Certainly, Dr. Tator might have expertise on what kind of hits cause what kind of injuries. But in terms of whether the tradeoff is desirable, and what the rules of hockey should be ... well, on that score, there is a strong argument to be made that Tator is biased, much more biased than Cherry.

If you're Dr. Tator, what is your experience with spinal injuries? Pretty direct. You see many, many victims of sports injuries, some of whom are very badly hurt. Every day, you see their despair and their pain, and you identify with them, and try to help them as best you can. Sometimes, there's nothing you can do. You, and the victim, constantly reflect on what could have been. If only the opposition had been a little more careful; if only the game hadn't gotten just a little too chippy in the third period; if only the players had played in a non-contact league. Then, everything would have been fine.

It's easy to understand why victims and doctors are so concerned with averting as many future injuries as possible, and why some of them, like Tator, start organizations to promote education and prevention. Every day they see the costs, intimately and emotionally. We, the fans, do not. We may feel bad for Kevin Everett and Eric Lindros, but we quickly forget and move on. Dr. Tator cannot do so quite as easily.

But what about the benefits? Dr. Tator doesn't see them nearly as much as he sees the costs. There are literally millions of North Americans who play hockey, without incident. There are millions of us who watch hockey, and many of us see aggressive play as one of the fundamental characteristics of what makes the game great. Suppose reducing body contact will save 20 injuries a year, but reduce fan interest by 5%. Do the benefits outweigh the risks? Dr. Tator will have to deal with some of the 20 injuries, but won't be one of the 5% who lose a little bit of interest in the game.

I don't play ice hockey, but I play ball hockey four times a week. Two weeks ago, one of our players got hit in the hand by a stick. His right index finger was broken in three places; he'll be unable to use it for a month. If you're a doctor, an indexfingerologist, and you see three of these broken fingers a day, wouldn't it be easy for you to get the idea that floor hockey is dangerous, and should be banned -- or at least that everyone should have to wear gloves? But we, the participants, just kept playing. We understand there's a risk of getting our finger broken, or worse, and we accept that risk. Floor hockey is fun, and the risk seems quite reasonable compared to the benefits. My life would be a lot less interesting without ball hockey -- it's my main source of recreation and exercise, and a good part of my social life. If the doctor sees only the thousand broken fingers, but not the millions of happy players who get all these benefits, while knowing about the danger and being willing to put up with it, isn't it the doctor who's seeing things the wrong way?

I recall a few years back, there was another chapter in the ongoing debate on whether motorcycle helmets should be compulsory. A doctor wrote something like: "if you don't think helmets should be mandatory, it's because you're not a doctor dealing with the victims every day. You should come to the emergency room and see how mangled these riders' heads are when their cranium hits the pavement. If you saw a few of these, you'd change your mind."

And that, frankly, is a bulls**t argument. You don't need to see the blood and guts to understand that the accident killed the victim, or turned him into a vegetable. You don't need to have a medical degree. You need to carefully weigh the risks and benefits, and come up with an argument. You need to study the issue. If helmets saved one life a year, it wouldn't be worth it: you could take all the money spent on helmets, use it to buy medical tests, and probably save hundreds of lives. On the other hand, if helmets saved a hundred thousand people a year, then, yeah, there's an argument for requiring them. But "blood and guts are disgusting and tragic, therefore helmets should be mandatory," is not a reasonable argument. It's an argument from a misguided person who thinks the fact that he treats the victims gives him a special moral insight into what risks society should tolerate and what risks it shouldn't.

To his credit, Dr. Tator doesn't make such an argument, but the idea is roughly the same. The argument has to be one of costs vs. benefits, and Dr. Tator's involvement with the victims, no matter how expert, charitable, and concerned, makes him likely to be *more* biased and *less* credible in analyzing the issue. Of course, he could make an argument, with numbers and logic, to show us that he has used valid analysis to overcome the possibility of bias. If he's done that (and the press didn't report that he did), I'm absolutely willing to look at it.

Now, you could argue that for the NHL, the fans are biased just as much, in the other direction. After all, in ball hockey, the injury is mine. In the NHL, the injury is a stranger's. As a fan, I get all the benefits, the entertainment value of fights and bodychecks, and it's the players who pay the price. If I'm willing to criticize the doctor for overemphasizing the costs, shouldn't I also criticize myself for *underestimating* the cost?

Well, yes and no. From an economic standpoint, we fans *are* bearing the costs: more enthusiastic fans create more revenue for the league, which means the players get paid more, and are therefore compensated for the risks inherent in the kind of hockey we demand. On the other hand, Dr. Tator doesn't have to pay anything for his demand that hockey get less aggressive. He gets all the benefits, in terms of having to give bad news to fewer victims of spinal injuries, but pays a very small portion of the costs, being only one hockey fan out of millions (and perhaps not liking aggressive play in the first place).

But never mind that argument. Suppose we ignore that we fans are compensating the players for their preference, and we assume that we are so insulated from the reality of career-ending injuries that we're too biased against safety, and are demanding more than the "optimal" amount of hockey aggressiveness. Then we're biased one way, and Dr. Tator is biased the other way.

So it seems like, so far, everyone is biased. What we need is an opinion from someone not so far from the fans and the game, and someone not so far from the victims of injury. Someone who is intimately familiar with both the costs and the benefits.

That's Don Cherry, isn't it? Cherry may have strong (and sometimes controversial) political views, he is often politically incorrect, and it seems to me that he's disliked by many who don't like his blunt style and uneducated way of speaking. But in my (admittedly untested) opinion, he is one of the foremost experts on NHL hockey anywhere. He has a huge fan following, and, more importantly, inordinate respect from the players. He is not in favor of recklessness on the ice; he's spoken out many times against aspects of hockey he thinks are dangerous. For years, he's waged a campaign in favor of "no-touch" icing. Every year, he shows a video, quite unpleasant to watch, of injuries incurred by players chasing each other after an iced puck, and rants against the stupidity of the league for not changing the rule. He's been active in injury prevention in youth hockey, promoting the "STOP" program to help prevent hits from behind. I don't follow Cherry as much as some others, but I have never heard him condone dangerous play.

Cherry has been an NHL coach, and he's active in the league's social circles. He's certainly seen many of his player friends and acquaintances felled by injury, unlike most of us fans, which means that, like Dr. Tator, he has first-hand experience with the costs of aggressive play. But he knows the game, and he knows, intimately, the risks involved. He intuitively knows what types of "aggressive" play are risky, and which ones are not. He has opinions, probably as good as anyone's, with credentials as good as anyone's, on what types of aggression are good for the game, and which ones are not.

Intuition is no substitute for a well-formed argument, backed by evidence and logical argument and measurement of costs and benefits and risks. But if you asked me whose gut argument I would want to hear first, it would be Don Cherry's. It doesn't matter if you're one of Canada's foremost experts in spinal injury treatment and prevention, because the question is, what's the proper balance between aggressiveness and risk? That's a question of opinion, not of science.

Dr. Tator may well be right, that hockey is too aggressive and therefore too dangerous. But until he comes up with numbers and arguments and a way to measure the tradeoffs, I am less inclined, not more, to take his word for it simply on account of his profession.


Sunday, December 13, 2009

Did Tim Donaghy really win 70% of his bets against the spread?

According to disgraced NBA referee Tim Donaghy’s new book, Donaghy won 70 to 80 percent of his NBA bets. (The link is to an excellent espn.com article by TrueHoop's Henry Abbott, which I recommend highly.)

70 to 80 percent is huge: these were bets against the spread, so you’d have expected that Donaghy's winning percentage would only be around 50 percent, unless he had some kind of edge.


What was his edge? Did he fix the outcomes of games with biased refereeing? Donaghy says no: the way he won so many games was by knowing which *other* referees were biased. Not corruptly biased, mind you, sometimes just unconsciously biased. He says,

"I listened to the directives from the NBA office, I considered the vendettas and grudges referees had against certain players or coaches, and I focused in on the special relationships that routinely influenced the action on the court. Throw in some quirks and predictable tendencies of veteran referees and the recipe was complete. All I had to do was call it in and let the law of averages take over. During the regular season, I was right on the money seven out of 10 times. There was even a streak when I simply couldn’t miss, picking 15 winners out of 16 games. No one on the planet could be that lucky. Of course, luck had little to do with it."


Does that make sense? I don't think it does. I don't think that even perfect knowledge of the tendencies of referees can get you a winning percentage of .700.

According to basketball researcher Wayne Winston, a reasonable pythagorean exponent for basketball is 14. (That's from his book "Mathletics," which I've been planning to review for a while now -- I'll do it this week, I swear.) To increase your winning percentage from .500 to .700, then, requires an extra 6% of points, or about 6 points in a 100-point game.
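For anyone who wants to check the 6% figure, here's a minimal sketch. It assumes the usual pythagorean form, winning percentage = PF^e / (PF^e + PA^e), solved for the ratio of points scored to points allowed; the function name is just for illustration.

```python
def points_ratio_for_win_pct(win_pct, exponent):
    """Points-for / points-against ratio implied by a winning percentage.

    Inverts win_pct = r**e / (r**e + 1), where r is the points ratio.
    """
    return (win_pct / (1.0 - win_pct)) ** (1.0 / exponent)

EXPONENT = 14  # Winston's basketball exponent, as quoted above

ratio = points_ratio_for_win_pct(0.700, EXPONENT)
extra_points_per_100 = (ratio - 1.0) * 100

print(f"Points ratio needed for .700: {ratio:.4f}")
print(f"Extra points per 100 allowed: {extra_points_per_100:.1f}")
```

The ratio comes out a little above 1.06, which matches the "about 6 points in a 100-point game" in the text.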

Six points is huge. It's twice what home field advantage is worth. If you assume that teams score an average of one point per possession when they're not fouled, but 1.5 points per two-shot foul, it would take 24 extra foul shots (12 fouls) in a game to account for six points.

Teams shoot about 25 foul shots a game on average. 24 extra foul shots would basically double the number of foul calls per game. If you assume that each of the three referees calls 8 foul shots, one biased referee would have to call *four times as many* foul shots for one team to raise its winning percentage to .700.

Of course, the biased ref could also *refrain* from calling foul shots for the other team ... but, with only 8 shots called per game per ref, there's a natural limit to what you can *not* call.
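The foul-shot arithmetic above can be laid out step by step. Every input below is one of the rough assumptions already stated in the text (one point per normal possession, 1.5 per two-shot foul, 8 foul shots called per referee); none of them is a measured value.

```python
POINTS_NEEDED = 6.0              # extra points required to reach .700
POINTS_PER_POSSESSION = 1.0      # assumed value of an unfouled possession
POINTS_PER_TWO_SHOT_FOUL = 1.5   # assumed value of a two-shot foul
SHOTS_PER_FOUL = 2
SHOTS_PER_REF = 8                # assumed foul shots called per referee

# Each foul call is worth the difference between the two outcomes.
gain_per_foul = POINTS_PER_TWO_SHOT_FOUL - POINTS_PER_POSSESSION
extra_fouls = POINTS_NEEDED / gain_per_foul
extra_shots = extra_fouls * SHOTS_PER_FOUL

# If one referee supplies all the extra shots on top of his usual 8:
ref_multiplier = (SHOTS_PER_REF + extra_shots) / SHOTS_PER_REF

print(f"Extra fouls needed:      {extra_fouls:.0f}")
print(f"Extra foul shots needed: {extra_shots:.0f}")
print(f"Biased ref's multiplier: {ref_multiplier:.0f}x")
```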

It just doesn't seem at all plausible that one "vendetta" or "special relationship" could have such a large effect. Especially considering that the examples Donaghy provides are pretty weak. For instance:

"Referee Joe Crawford had a grandson who idolized [Allen] Iverson," writes Donaghy. "I once saw Crawford bring the boy out of the stands and onto the floor during warm-ups to meet the superstar. Iverson and Crawford’s grandson were standing there, shaking hands, smiling, talking about all kinds of things. If Joe Crawford was on the court, I was pretty sure Iverson’s team would win or at least cover the spread."


Doesn't that sound pretty much impossible? First, could any professional referee call an extra 24 foul shots a game for Allen Iverson without drawing some kind of attention? And, second, isn't it completely implausible that anyone could rise to the ranks of NBA referee with judgment so bad that he would be *that biased* in favor of his grandson's idol?

Anyway, we don't have to take Donaghy's word for it: we can check the records. Actually, ESPN's Abbott already checked. It turns out that Donaghy's accusations are completely false. With Crawford refereeing, Iverson's teams went 5-9 against the spread. That's .357, not .700.

Abbott checks a few other of Donaghy's claims, and references others who have done similar checks. The bottom line: absolutely no evidence of any bias at all, much less the kind of bias that would let you pick 70% winners.

That leaves at least four possibilities:

1. Donaghy was not specific enough in describing the circumstances in which he knew he had a .700 chance of winning. Maybe, for instance, that only happened when Joe Crawford's grandson was actually at the game, rather than watching at home.

2. There was other information Donaghy used to make his picks, not just his knowledge of referee bias. As he said: "There were other factors that came into play. Inside information about injuries. Home game or away game. Home crowd. Many more factors to take into consideration."

3. Donaghy himself rigged the games in order to win his bets.

4. Donaghy didn't actually win 70% of his bets.


Number 1 still doesn't seem very plausible: as I argued, it's very hard for a referee to make a team lose 20% of the games it otherwise would have won, and make it look natural. This is especially the case if the grudge the referee holds is against a player, and not a team: can anyone really cause Allen Iverson to lose 6 points a game, without making it obvious?

Number 2 is implausible too: there are thousands of bookies and gamblers analyzing basketball much more thoroughly than Donaghy did. If most of the information he claims to have used was public, the betting line would have adjusted for those factors already.

Number 3 is implausible, for similar reasons. Actually, it's a bit more plausible than number 1, because, for one thing, Donaghy would care about the team, not any individual player, so he could spread out his biased calls. Secondly, he could concentrate his fixes in close games, so that it may not take 6 points a game, but perhaps only 1 or 2 points in a game that's tied in the last minute.

But, to me, Number 4 is the most plausible. It does require you to assume that Donaghy and the FBI are incorrect about the results of the wagers: but if Donaghy can be so wrong about the results of his strategies (which can be verified), why can't he also be wrong about the results of his bets (which cannot)?

Anyway, there would be easier ways to figure this stuff out, if there were a list of games that Donaghy bet on. Just check those games, and the betting lines, and see if there was anything unusual there, either by Donaghy himself, or the other referees at the game. But, according to the article, there is no such list. It seems like the NBA and FBI are taking Donaghy's word for how big his bets were ($2,000 each) and how many he made (more than 125 over four seasons).

In that case, isn't it more plausible that the $100,000 Donaghy made came from sources other than winning 70% of his $2,000 bets? Maybe he won the lottery, or he was peddling confidential information about Tiger Woods, or he was selling illegal MRIs to Canadian patients with sore knees. Any of those seems more plausible than being able to go .700 against the spread.

In any case, I'd be willing to put Donaghy to the test. He says he can pick 70% winners. I think he can pick 50% winners. Let's set the bar at 60%. I'm willing to bet him even money that he can't go better than .600 in his choice of 30 NBA games this season.
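Here's a quick sketch of how that bet would play out under each of our beliefs. It assumes the 30 picks are independent, and it reads "better than .600" as 19 or more wins out of 30; the function name is just for illustration.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(at least k wins in n independent bets, each won with probability p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

GAMES = 30
WINS_NEEDED = 19   # better than .600 means more than 18 of 30

p_if_coin_flip = prob_at_least(WINS_NEEDED, GAMES, 0.50)  # my view
p_if_claimed = prob_at_least(WINS_NEEDED, GAMES, 0.70)    # Donaghy's claim

# His claimed streak, if each pick were really a coin flip:
p_streak = prob_at_least(15, 16, 0.50)

print(f"P(19+ of 30) for a 50% picker: {p_if_coin_flip:.3f}")
print(f"P(19+ of 30) for a 70% picker: {p_if_claimed:.3f}")
print(f"P(15+ of 16) for a 50% picker: {p_streak:.6f}")
```

So a true coin-flipper clears the bar about one time in ten, while a genuine 70% picker clears it most of the time, which makes 60% a reasonable place to split the difference.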



Thursday, December 10, 2009

The Bradbury aging study, re-explained (Part III)

Last week, J.C. Bradbury posted a response to my previous posts on his aging study.

Before I reply, I should say that I found a small error in my attempt to reproduce Bradbury’s regression. The conclusions are unaffected. Details are in small print below, if you're interested. If not, skip on by.

As it turns out, when I was computing the hitter’s age to include in the regression, I accidentally switched the month and day. (Apparently, that wasn’t a problem when the reverse date was invalid – Visual Basic was smart enough to figure out that when I said 20/5 instead of 5/20, I meant the 20th day of May and not the 5th day of Schmidtember. But when the reverse date was valid – 2/3 instead of 3/2 – it used the incorrect date.)

That means that some ages were wrong, and some seasons from 24-year-olds were left out of my study. I reran a corrected regression, and the results were very, very similar – all three peak ages I’ve recalculated so far were within .08 years of the original. So the conclusions still hold. If you’re interested in the (slightly) revised numbers, let me know and I’ll post them when I’m done rerunning everything.


Okay, now to Bradbury’s criticisms. I’ll concentrate on the most important ones, since a lot of this stuff has been discussed already.

----

First, there’s one point on which I agree with Bradbury’s critique. He writes,

" … the model, as he defines it, is impossible to estimate. He cannot have done what he claims to have done. Including the mean career performance and player dummies creates linear dependence as a player’s career performance does not change over time, which means separate coefficients cannot be calculated for both the dummies and career performance. … Something is going on here, but I’m not sure what it is."

He’s right: having both the player dummies and the career mean causes collinearity, which I eliminated by getting rid of one of the player dummies. I agree with him that the results aren’t meaningful this way. I should have eliminated the mean and gone with the dummies alone.

In any case, it doesn’t matter much: the results are similar with and without the dummies. I used the dummies because they made the results more sensible, and more consistent with what Bradbury found. Without the dummies, some of the aging curves were very, very flat. With them, the curves were closer to Bradbury’s.

In retrospect, the reason the curves make more sense with the larger model is that the dummies have the effect of eliminating any observation of only one season (since the dummy will come out to have that player match whatever curve best fits the other, more-than-one-season, players).

Regardless, the peak age is similar either way. But Bradbury’s point is well-taken.

----

Secondly, Bradbury disagrees with me that players are weighted by the number of seasons they played:

"His belief is based on a misunderstanding of how least-squares generates the estimates to calculate the peak. There is no average calculated from each player, and especially not from counting multiple observations for players who play more."

It’s possible I’m misunderstanding something, but I don’t think I am. The model specifies one row in the regression for each player-season that qualifies (players with a certain number of PA and seasons). If player A has a 12-year career that peaks at 30, and player B has a 6-year career that peaks at 27, then player A’s trajectory is represented by 12 rows in the regression matrix, and player B’s trajectory by 6 rows.

Bradbury would argue that the scenario above would result in a peak around 28.5 (the average of the two players). I would argue that the peak would be around 29 (player A weighted twice as heavily as player B). I suppose I could do a little experiment to check that, but that’s how it seems to me.
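Here is a rough sketch of what that little experiment could look like. It is not Bradbury's regression or mine: it is a noise-free toy with two made-up careers, and the starting ages, the curvature, and the helper names `pooled_peak` and `career` are all assumptions for illustration. It fits one shared quadratic in age plus a separate intercept for each player, then reports the implied peak.

```python
def pooled_peak(players):
    """Fit z = a*age + b*age^2 + (one intercept per player) by least squares
    and return the implied peak age, -a / (2*b).

    `players` is a list of (ages, values) pairs, one pair per player.
    Demeaning within each player is equivalent to including player dummies.
    """
    sxx = sxq = sqq = sxy = sqy = 0.0
    for ages, values in players:
        n = len(ages)
        squares = [age * age for age in ages]
        mean_x = sum(ages) / n
        mean_q = sum(squares) / n
        mean_y = sum(values) / n
        for x, q, y in zip(ages, squares, values):
            dx, dq, dy = x - mean_x, q - mean_q, y - mean_y
            sxx += dx * dx
            sxq += dx * dq
            sqq += dq * dq
            sxy += dx * dy
            sqy += dq * dy
    # Solve the 2x2 normal equations for a (age) and b (age^2) by Cramer's rule.
    det = sxx * sqq - sxq * sxq
    a = (sxy * sqq - sqy * sxq) / det
    b = (sqy * sxx - sxy * sxq) / det
    return -a / (2 * b)


def career(first_age, seasons, peak, curvature=1.0):
    """Hypothetical noise-free career: value = -curvature * (age - peak)^2."""
    ages = list(range(first_age, first_age + seasons))
    return ages, [-curvature * (age - peak) ** 2 for age in ages]


# Assumed careers: both start at 24; A plays 12 seasons, B plays 6.
player_a = career(first_age=24, seasons=12, peak=30)  # 12 rows
player_b = career(first_age=24, seasons=6, peak=27)   # 6 rows

peak_a_alone = pooled_peak([player_a])
pooled = pooled_peak([player_a, player_b])
print(round(peak_a_alone, 2), round(pooled, 2))
```

In this toy setup the pooled peak lands above the simple two-player average of 28.5, pulled toward the player with more rows. How far it gets pulled depends on the assumed ages and curve shapes, so treat the exact number as an artifact of the toy, not an estimate of anything.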

----

Thirdly, Bradbury says I misunderstood that he used rate statistics for home runs, not actual numbers of home runs:

"I’m estimating home-run rates, not raw home runs. All other stats are estimated as rates except linear weights. This is stated in the paper."


Right, that’s true, but that wasn’t my point. I was probably unclear in my original post.

What I was trying to say was: the model assumes that all players improve and decline at the same fixed HR rate, regardless of where they started.

So, suppose Bradbury’s equation says that players drop by .01 home run per PA (or AB) the year after age X. (That’s 6 HR per 600 PA.) That equation does NOT depend on how good a home run hitter that player was before. That is: it predicts that Barry Bonds will drop by 6 HR per 600PA, but, also, Juan Pierre will drop by 6 HR per 600PA.

As I pointed out, that doesn’t really make sense, because Juan Pierre never hit 6 HR per 600PA in the first place, much less late in his career! The model thus predicts that he will drop to a *negative* home run rate.

I continue to argue that while the curve might make sense for the *composite* player in Bradbury’s sample, it doesn’t make sense for non-average players like Bonds or Pierre. That might be lost on readers who look at Bradbury’s chart and see the decline from aging expressed as a *percentage* of the peak, rather than a subtraction from the peak.

-----

Finally, and most importantly, one of Bradbury’s examples illustrates my main criticism of the method. Bradbury cites Marcus Giles. Giles’s best seasons were at age 25 to 27, but he declined steeply and was out of the league by 30. Bradbury:

"What caused Giles to decline? Maybe he had some good luck early on, maybe his performance-enhancing drugs were taken away, or possibly several bizarre injuries took their toll on his body. It’s not really relevant, but I think of Giles’s career as quite odd, and I imagine that many players who play between 3,000 — 5,000 plate appearances (or less) have similar declines in their performances that cause them to leave the league. I’ve never heard anyone argue that what happened to Giles was aging."


Bradbury’s argument is a bit of a circular one. It goes something like:

-- The regression method shows a peak age of 29.
-- Marcus Giles didn’t peak at 29 – indeed, he was out of the league at 29.
-- Therefore, his decline couldn’t have been due to aging!

I don’t understand why Bradbury would assume that Giles’s decline wasn’t due to aging. If the decline came at, say, 35 instead of 28, there would be no reason to suspect injuries or PEDs as the cause of the decline. So why couldn’t Giles just be an early ager? Why can’t different players age at different rates? Why is a peak age of 25, instead of 29, so implausible that you don’t include it in the study?

It’s like … suppose you want to find the average age when a person gets so old they have to go to a nursing home. And suppose you look only at people who were still alive at age 100. Well, obviously, they’re going to have gone to a nursing home late in life, right? Hardly anyone is sick enough to need a nursing home at 60, but then healthy enough to survive in the nursing home for 40 years. So you might find that the average 100-year-old went into a nursing home at 93.

But that way of looking at it doesn't make sense: you and I both know that the average person who goes into a nursing home is a lot younger than 93.

But what Bradbury is saying is, "well, those people who went into a nursing home at age 65 and died at 70 … they must have been very ill to need a nursing home at 65. So they’re not relevant to my study, because they didn’t go in because of aging – they went in because of illness. And I’m not studying illness, I’m studying aging."

That one difference between us is pretty much my main argument against the findings of the study. I say that if you omit players like Giles, who peaked early, then *of course* you’re going to come up with a higher peak age!

Bradbury, on the other hand, thinks that if you include players like Giles, you’re biasing the sample too low, because it’s obvious that players who come and go young aren’t actually showing "aging" as he defines it. But, first, I don’t think it’s obvious, and, second, if you do that, you’re no longer able to use your results to predict the future of a 26-year-old player. Because, after all, he could turn out to be a Marcus Giles, and your study ignores that possibility!

All you can tell a GM is, "well, if the guy turns out not to be a Marcus Giles, and he doesn’t lose his skill at age 31 or 33 or 34, and he turns out to play in the major leagues until age 35, you’ll find, in retrospect, that he was at his peak at age 29." That’s something, but … so what?


I’m certainly willing to agree that if you look at players who were still "alive" in MLB at age 35, and played for at least 10 years, then, in retrospect, those players peaked at around 29. And I think Bradbury’s method does indeed show that. But if you look at *all* players, not just the ones who aged most gracefully, you’ll find the peak is a lot lower. There are a lot of people in nursing homes at age 70, even if Bradbury doesn't consider that to be because of "aging."




Sunday, December 06, 2009

Bloomberg enters the baseball analysis market

Bloomberg, the company whose software gives investors sophisticated real-time information on stock and financial markets, now has software to provide GMs with baseball information.

The New York Times article describing the new system is sketchy in explaining what kind of information will be provided, but my impression is that the breakthrough is in ease of use, rather than sabermetric sophistication:

"The challenge for Bloomberg is to create software that is better, faster and more visually useful than what rivals offer to help develop players and predict their performances. A demonstration of Bloomberg’s software showed dazzlingly colorful graphics and an easy way to plot statistics and compare players in complex combinations."

Not that there's anything wrong with ease of use ... it's not a lot of fun to calculate player values yourself, or even to get your team of programmers to do it, if there's something available off the shelf.

But at the same time, there's a hint that the software will adjust for park effects, and maybe even do simulations:

"For Jeff Wilpon, the chief operating officer of the Mets, the value in the software will be in evaluating free agents.

"If you take X player on another team who’s around a great cast of players," he said, "we want to look at him in our ballpark with different players around him to see how he will fit in."

In addition, it'll include PitchF/X data:

"What looks impressive are highly visual pitch charts that can be summoned for any particular period, with parameters including arm angles that can, based on diminishing performance, suggest physical injury."


But Bloomberg also makes it sound like a friendly database query engine -- in effect, a version of Baseball Reference's "Play Index":

"It’s one thing to say, I want to see how various players hit home runs over the years," said Bill Squadron, who is managing the product introduction. "But it’s another to say, I want to see home runs, on-base percentage, pitches per plate appearance, take it all together and look at 10 guys who exceed a certain level."

Maybe it's all these things together.


It would be interesting to know what kind of sabermetric analysis is included in the system ... will free-agent evaluation include a version of WAR? Will it include estimates of dollars per win? Will Bloomberg have evaluated the various run estimators and chosen the best one? Will the Bloomberg algorithms become "conventional wisdom?" If so, will it be possible for some teams to gain an advantage by exploiting flaws in Bloomberg's analysis?

My guess is that if teams start using this system, and it does include some of the newer developments, we'll know about it because team management will start internalizing it. It's easy to ignore analysis from bloggers, but harder to ignore analysis from an expensive and sophisticated system from a respected name like Bloomberg, especially when the owners have spent thousands of dollars to provide it for you.



Monday, November 30, 2009

Academic article on predicting hitting performance

A new academic article called "Hierarchical Bayesian Modeling of Hitting Performance in Baseball" attempts to beat existing prediction methods -- PECOTA, Marcel, et al -- using a more complicated model and Bayesian techniques.

It's in the new issue of the academic journal "Bayesian Analysis".

The article is accompanied by three reviews; I'm the co-author of one of them, with Jim Albert. (Disclosure: we discussed the article by e-mail, but Jim wrote most of it, except for a few paragraphs that I provided in Section 6.)

There's also an article where the authors, Shane T. Jensen, Blakeley B. McShane, and Abraham J. Wyner, respond to the reviews.

All five articles are available at the above link, near the top.




Thursday, November 26, 2009

The Bradbury aging study, re-explained (Part II)

This is a follow-up to my previous post on J.C. Bradbury's aging study ... check out that previous post first if you haven't already.

My argument was that players with shorter careers should peak earlier than players with longer careers. Bradbury disagreed. He reran his study with a lower minimum, 1000 PA instead of 5000. He found that there was "no drop".

I decided to try to run his study myself, the part where he looks at batter performance in Linear Weights. I think my results are close enough to his that they can be trusted. Skip the details unless you're really interested. I'll put them in a quote box so you can ignore them if you choose.


-----

Technical details:

Here's what I did. I took all players whose careers began in 1921 or later, and looked at their stats until the end of 2008 (even if they were still active). They had to have had a plate appearance in each of at least ten separate seasons. Across the seasons in which their age was 24 to 35 (as of July 1), they had to have had at least 5000 plate appearances in total.

Any player who did not meet the above criteria was not included in the regression. Also, the regression included only seasons from age 24-35 in which the player had at least 300 PA.

Each of those seasons was a row in the regression. The model I used was:

Z-score this season = a * age this season + b * age^2 this season + c * career average Z-score + d * player dummy + constant + error term

I didn't include dummy variables for individual seasons (Bradbury's "D" term, if you look at his paper) or park factors. I think those would change the results only slightly.

Another difference I noticed later is that when I calculated the Z-scores, I used the standard deviation only of players who were 24-35 and had 300 PA. Bradbury, I believe, used the SD of all players, regardless of PA. Again, I don't think that affects the results much (although it makes his coefficients about twice as big as mine).

Finally, I'm not 100% sure that I did exactly what Bradbury did in other respects. The study is vague about the details of the selection criteria. For instance, I'm not sure if any ten seasons qualified a player, or only ten seasons of at least 300 PA. I'm not sure if the player needed 300 PA every season between 24 and 35, or if that didn't matter as long as the total was over 5000. So I guessed. Also, for Linear Weights, I used a version that adjusts the value of the out for the specific season, whereas Bradbury used -0.25 for all seasons (and compensated somewhat by having a dummy variable for league/season).


-----

Anyway, here is my best-fit equation, followed by Bradbury's:

Mine: Z = 0.760 * age - 0.0133 * age^2 - 0.901 * mean - 10.6802 + dummies
J.C.: Z = 1.322 * age - 0.0224 * age^2 - 1.205 * mean + other stuff + dummies


These equations look different, but that's mostly because Bradbury used a different definition of the Z-score. If you look at the significance levels, they're similar: for mine, about 12 SDs; for Bradbury, about 11 SDs. Bradbury's might be smaller because his regression was more sophisticated, with certain corrections that likely brought the significance down.

More importantly, here are our estimates of peak age, which can be calculated as - ( coeff for age ) / ( 2 * coeff for age^2 ):

Mine: 28.62 peak age
J.C.: 29.41 peak age

Why the difference? My guess is that there was something different about our criteria for selecting players for the sample. Again, I don't think the difference affects the arguments to follow.
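The peak-age formula is just the vertex of the fitted quadratic: the slope, (coeff for age) + 2 * (coeff for age^2) * age, is zero at the peak. As a quick check, here is a small sketch that plugs in the coefficients as printed in the two equations. The function name `peak_age` is mine, for illustration. Because the printed coefficients are rounded, the results land near the quoted peaks rather than exactly on them.

```python
def peak_age(age_coeff, age_sq_coeff):
    """Vertex of z = age_coeff*age + age_sq_coeff*age^2 (a peak when age_sq_coeff < 0)."""
    return -age_coeff / (2 * age_sq_coeff)


# Coefficients copied from the two best-fit equations, as rounded in the post.
mine = peak_age(0.760, -0.0133)
jc = peak_age(1.322, -0.0224)
print(round(mine, 2), round(jc, 2))
```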

Now, this is where J.C. says he ran the regression again, for 1000PA and no 10-year-requirement, and got no difference in peak age. I did the same thing, and I *did* get a difference:

Mine, for 5000 PA: 28.62
Mine, for 1000 PA: 28.06

It looks like a small difference, only .56 years -- and the total of 28.06 is still above the previous studies' conclusion that the peak is in the 27s. However, as it turns out, the way the study is structured, that small difference is really a big difference. Let me show you.

First, I ran the same regression, but this time only for players with 3000-5000 PA:

3000-5000 PA: 27.61

So, these guys with shorter careers did have an earlier peak, about a year earlier than the guys with the longer careers. What if we now look at the guys with really short careers, 1000-3000 PA?

1000-3000 PA: 147.00

That's not a misprint: the peak came out to age 147! But the coefficients of the age curve were not close to statistical significance -- neither the age, nor the age-squared. Effectively, these guys performed almost the same regardless of age. They didn't peak at 29, but neither did they peak at 27. They just didn't peak.

And so, it's reasonable to conclude that one of the reasons the peak age dropped so little, when we added more players as Bradbury did, is that the regression wasn't able to find the peak for the players with the shorter careers. And so the sample still consists of mostly players with longer careers.

------

Can we solve this problem? Yes, I think so. The procedure cut off the sample of players at 24 and 35 years of age. If we eliminate the cutoff, the results start to work.

I reran the regression with no age restrictions: players had to have 5000 or 1000 PA anywhere in their careers, not just between 24 and 35. Also, I considered all seasons in which they had 300 PA, regardless of how old they were that year. The numbers are similar:

28.97 for 5000 PA+
28.66 for 1000 PA+

The difference is smaller now, 0.31 years. But the important result is the breakdown of the 1000+ group:

28.97 for 5000 PA+
27.72 for 3000-5000 PA
26.61 for 1000-3000 PA (now significant)
----------------------------------------
28.66 for the overall sample

It seems like the shorter the career, the earlier the peak.

But, still, the overall average seems to only have dropped 0.31 of a year, and it's still around 29 years. Isn't that still evidence against the 27 theory?

No, it's not.

Take a look at the above table again: we have three peaks, 28.97, 27.72, and 26.61. Those three numbers average to 27.77. Why, then, is the "overall" number so much higher, at 28.66?

It's because there were a lot more datapoints in the 5000 PA+ category than the others. And that makes sense. The more PA, the more seasons played. And each season gets a datapoint. So the top category is full of batters with 10 or more seasons, while the bottom category is full of batters with only a few seasons. In fact, some of them may have only 1-2 qualifying seasons of 300 PA or more.

If a player has a 15-year career, with a peak at age 29, he gets fifteen "29" entries in the database. If another player has a 3-year career with a peak of 27, he gets only three "27" entries. So instead of the result working out to 28, which is truly the average peak of the two players, it works out to 28.7.

Another way to look at it: Player A has a 12-year career. Player B has a 2-year career. What's the average career? It's 7 years, right? And you get that by averaging 12 and 2.

But the way Bradbury's study is designed, it would figure the average career is 10.57 years. Instead of averaging 12 and 2, it would average 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 2, and 2. That's not the result we're looking to find.
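The two ways of averaging can be written out in a few lines. This is only a sketch of the arithmetic in the two examples above, with function names of my own choosing; it is not a reproduction of the regression.

```python
def player_weighted(values):
    """Plain average: every player counts once."""
    return sum(values) / len(values)


def row_weighted(values, seasons):
    """Average in which each player counts once per season (regression row)."""
    total_rows = sum(seasons)
    return sum(v * s for v, s in zip(values, seasons)) / total_rows


# Career-length example: a 12-year career and a 2-year career.
career_years = [12, 2]
print(player_weighted(career_years))                       # 7.0
print(round(row_weighted(career_years, career_years), 2))  # 10.57

# Peak-age example: a 15-year career peaking at 29, a 3-year career peaking at 27.
print(player_weighted([29, 27]))                  # 28.0
print(round(row_weighted([29, 27], [15, 3]), 1))  # 28.7
```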

This is less of a problem in Bradbury's original study, because, by limiting players to 12 years of their career, and requiring them to play 10 seasons, most of the batters in the study would be between 10 and 12 years, so the weightings would be closer. Still, this feature of the study means that it's probably overestimating the peak at least a little bit, even for that sample of players.

So, anyway, if 28.66 is not the right average because of the wrong weights, how can we fix it? Simple: instead of weighting by the number of regression rows in each group, we weight by the number of players in each group:

28.97 for: 640 players with 5000+PA
27.72 for: 595 players with 3000-5000 PA
26.61 for: 1148 players with 1000-3000 PA
----------------------------------------
27.52 overall average

So what looked like a small drop when we added the shorter-career players -- 0.31 years -- turns into a big drop -- 1.45 years -- when we weight the data properly.
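For anyone who wants to check the reweighting, here is the same calculation as a short sketch. The peaks and player counts are copied from the table; the variable names are just for illustration.

```python
# (group label, estimated peak age, number of players) from the table above
groups = [
    ("5000+ PA", 28.97, 640),
    ("3000-5000 PA", 27.72, 595),
    ("1000-3000 PA", 26.61, 1148),
]

total_players = sum(count for _, _, count in groups)
overall = sum(peak * count for _, peak, count in groups) / total_players

# Drop relative to the 5000+ PA group alone
drop = groups[0][1] - overall

print(total_players, round(overall, 2), round(drop, 2))
```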

Now, this only works when there actually IS a drop between the 5000+ and the 1000+ groups. We found a drop of 0.31. But on his blog, Bradbury said that with his data, he found no drop at all.

How come? I'm not sure. But one reason might be random variation (if he used different selection criteria). Another might be his age restriction causing nonsensical results in the important 1000-3000 group. And there are his other variables for "missed information resulting from playing conditions". Or, of course, I may have done something wrong.

------

So we're down to 27.52. That's pretty close to the traditional estimates of 27ish. But I think we're not necessarily done: there are at least two factors I can think of that suggest that the real value is lower than even 27.52.

First, we showed that the regression overestimates the peak age by overweighting long careers relative to short careers. We were able to get the average to drop from 28.66 to 27.52 just by breaking the sample down and reweighting.

By the same logic, all three groups above must also be overestimates! In the middle group, players with 5000 PA are going to be weighted 67% higher than players with only 3000 PA. If we were to rerun the regression after breaking the group down further, into (say) 3000-4000 and 4000-5000, we'd get a lower estimate than 27.52. In fact, we could break those new groups down into smaller groups, and break those groups down into smaller groups, and so on. The problem is that the sample size would get too small to get reasonable results. But I'm betting the average would drop significantly.

Second, the study leaves out players with fewer than 1000 PA. That's probably a good thing, because with only 1 or 2 seasons, it's hard to fit a trajectory properly. Still, it seems likely that if there were a way of figuring it out, we'd find those players would peak fairly early, bringing the average down further.

------

So, in summary:

-- If we use the Bradbury model on groups of players with fewer PA, we find that those players are estimated to have lower peak age. This supports the hypothesis that choosing only 5000+ PA players biases the result too high.

-- The model used in Bradbury's study consistently overestimates peak age for another reason. That's the weighting problem -- it figures the peak for the average *season*, not for the average *player*.

-- Correcting for that shows that if we look at players with 1000 PA, instead of just players with 5000 PA, the peak age drops to the mid 27s.

-- Other corrections that we can't make, because of sample size issues, would drop the peak age even further.

-- There is good evidence that the shorter the career, the younger the peak age.

-- It doesn't seem possible, with this method, to get a precise estimate of average peak age. "Somewhere in the low 27s" is probably the best it can do, if even that.



Monday, November 23, 2009

The Bradbury aging study, re-explained

A few days ago, J.C. Bradbury responded to my recent post on his age study.

Bradbury had authored a study claiming that hitters peak at age 29.4, contradicting other studies that showed a peak around 27. His study was based on the records of all batters playing regularly between age 24 and 35. I argued that, by choosing only players with long careers progressing to a relatively advanced age, his results were biased towards players who peak late -- because, after all, someone with the same career trajectory, just starting a few years earlier, would be out of baseball by 35 and therefore not make the study.

In response, Bradbury denies that selective sampling is a problem. He writes,

"Phil Birnbaum has a new theory as to why I’m wrong (I suspect it won’t be his last)."


Actually, it's not a new theory. I mentioned it at exactly the same time and in the same post as another theory, last April. Bradbury actually linked to that post a few days ago.

Also, the reason "it won't be my last" is that, like many other sabermetricians, I am curious to find out why there's a difference between Bradbury's findings, which show a peak age of 29+, and many previous studies, which find a peak age of 27. They can't both be correct, and the way to resolve the contradiction is to suggest reasons and investigate whether they might be true.

But, Bradbury also said that I showed "a serious lack of understanding of the technique I employed." He's partially right -- I did misunderstand what he did. After rereading the paper and playing around with the numbers a bit, I think I have a better handle on it now. In this post, I'm going to try explaining it (and why I still believe it's biased). Please let me know if I've got anything wrong.

-----

Previously, I had incorrectly assumed that Bradbury's study worked like other aging studies I've seen (such as Justin Wolfers', or Jim Albert's (.pdf)). In those other studies, the authors took a player's performance over time, smoothed it out into a quadratic, and figured out the peak for each player.

Then, after doing that for a whole bunch of players, those other studies would gather all the differently shaped curves, and analyze them to figure out what was going on. They implicitly assumed that every player has his own unique trajectory.

Bradbury's study doesn't do that. Instead, Bradbury uses least-squares to estimate the best single trajectory for *every batter in the study*. That's 450 players, all with exactly the same curve, based on the average.

According to this model, the only difference between the players is that some players are more productive than others. Otherwise, every batter has exactly the same shaped curve. The only difference the model allows, between the curves of different players, is vertical movement, up for a better player, down for a worse one.

For instance: take Carlos Baerga, whose career peaks early, in his early 20s, with a short tail on the left and a long tail on the right. Then take Barry Bonds, whose career is the opposite: it peaks late, with a long tail on the left and a short tail on the right.

What Bradbury's model does is take both curves, put them in a blender, and come out with two curves that look exactly the same, peaking in the late 20s. The only difference is that Bonds' is higher, because his level of performance is better.

The model fits 450 identical curves to the actual trajectories of the 450 players. They can't be particularly good fits, because they're all the same. If you look at those 450 fitted curves, they're like a vertical stack of 450 identical boomerangs: some great hitter at the top, some really crappy hitter at the bottom, and the 448 other players in between.

I can pull a boomerang off the top, and show you, this is what Barry Bonds looks like. The best fit is that he started low, climbed until he reached 29 or so, then started a symmetrical decline (the model assumes symmetry). You'll ask, "what does Carlos Baerga look like?" I'll say, "it's exactly the same as Barry Bonds, but lower." I'll take my Barry Bonds boomerang, and lower my arm a couple of inches. Or, I can just pull the Baerga boomerang out of the middle of the stack.

(One more way of putting it. See this chart? This is how Justin Wolfers represents the careers of a bunch of great pitchers. He smoothed the actual trajectories, but modeled that every pitcher gets his own peak age, and his own steepness of curve. But for this study, they would all be the same shape, just one stacked above the other.)

-----

Now, it seems to me that the model is way oversimplified. It's obviously false that all players have the same trajectory and the same peak age. People are different. They mature at different rates, both in raw physical properties, and in how fast they learn and adapt. Indeed, this is something the study acknowledges:

"Doubles plus triples per at-bat peaks 4.5 years later for Hall-of-Famers, which indicates that elite hitters continue to improve and maintain some speed and dexterity while other players are in decline."


So, implicitly, even Bradbury admits that the model's assumptions are wrong: some players age differently than others.

However, even if the model is wrong in its assumptions and in how it predicts individual players, it's possible to argue that the composite player it spits out is still reasonable.

For instance, suppose you have three people. One is measured to be four feet tall, one five feet, and one six feet. There are two ways you can get the average. You can just average the three numbers, and get five feet.

Or, you can create a model, an unrealistic model, that says that all three are really the same height, and any discrepancies are due to uncorrelated errors by the person with the measuring tape. If you run a regression to minimize the sum of squares of those errors, you get an estimate that all three people are actually ... five feet.

The model is false. The three people aren't really of equal height, and nobody is so useless with a tape measure that their observations would be off by that much. But the regression nonetheless gives the correct number: five feet. And so you'll be OK if you use that number as the average, so long as you don't actually assume that the model matches reality, that the six-foot guy is really the same height as the four-foot guy. Because there's no evidence that they are -- it was just a model that you chose.
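To see that the "everyone is really the same height" regression gives back the plain average, here is a minimal sketch. It does a brute-force search over candidate heights instead of running an actual regression, which is my simplification; the grid of candidates is an arbitrary choice.

```python
heights = [4.0, 5.0, 6.0]  # feet, from the example above


def sum_sq_error(guess):
    """Total squared 'measurement error' if everyone's true height were `guess`."""
    return sum((h - guess) ** 2 for h in heights)


# Try every candidate height from 4.00 to 6.00 feet in steps of 0.01.
candidates = [4 + i / 100 for i in range(201)]
best = min(candidates, key=sum_sq_error)
mean = sum(heights) / len(heights)

print(best, mean)  # 5.0 5.0
```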

I think that's what's happening here. It's obvious that the model doesn't match reality, but it has the side effect of creating a composite average baseball player, whose properties can be observed. As long as you stick to those average properties, and don't try to assume anything about individual players, you should be OK. And that's what Bradbury does, for the most part, with one exception.

----

A consequence of the curves having the same shape is that declines are denominated in absolute numbers, rather than percentages of a player's level. If the model says you lose 5 home runs between age X and age Y, then it assumes *everyone* loses 5 home runs, everyone from Barry Bonds to Juan Pierre -- even if Juan Pierre didn't have 5 home runs a year to lose!

If Bonds is a 30 home run guy at age X, he's predicted to drop to 25 -- that's a 17% decline. If Juan Pierre is a 5 home run guy at age X, he's predicted to drop to 0 -- a 100% decline.

In real life, that's probably not the way it works -- players probably drop closer to the same percentage than by the same amount. Table VII of the paper says that a typical hitter would lose about half his homers (on a per PA basis) between 30 and 40. If Bradbury used a season rate of 16 homers as "typical," that's an 8 HR decline. But what about players who hit only 4 homers a year, on average? The model predicts them dropping to minus 4 home runs!

Now, that's a bit of an unfair criticism. The text of the study doesn't explicitly argue that a Bonds will drop by the same number of home runs as a Baerga, even though the study deliberately chose a model that says exactly that. Remember, the model is unrealistic, so as long as you stick to the average, you're OK. Bonds and Pierre are definitely not the average.

But, then, why does Bradbury's Table VII deal in percentages? The model deals in absolutes. Bradbury obtained the percentages by applying the absolutes to a "typical" player, presumably one close to average. So why not put "-8 HR" in that cell, rather than "-48.95%"?

By showing percentages, there's an unstated implication, that since the model shows an average player with 16 HR drops to 8, you can extrapolate to say that a player with 40 HR will drop to 20. But that would have to be backed up by evidence or argument. And the paper provides neither.
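The gap between the two readings is easy to lay out side by side. In this sketch, the 8 HR drop is the model's absolute decline for the "typical" 16 HR hitter, and the 50% figure is my rounding of the -48.95% in Table VII. The function names are made up for illustration.

```python
def absolute_decline(hr, drop=8):
    """What the model implies: everyone loses the same number of home runs."""
    return hr - drop


def proportional_decline(hr, fraction=0.5):
    """What the percentage presentation suggests: everyone loses the same share."""
    return hr * (1 - fraction)


for hr in (40, 16, 4):
    print(hr, absolute_decline(hr), proportional_decline(hr))
# 40 32 20.0
# 16 8 8.0
# 4 -4 2.0
```

The two readings agree only for the 16 HR hitter. For anyone else, they give different answers, and the paper doesn't say which one to believe.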

-----

To summarize:

-- the model assumes all players have the same peak age, and the same declines from their peak (which is another way of saying that it assumes that all players have the same shape of trajectory).

-- it does assume some players (Barry Bonds) have a higher absolute peak than others (Jose Oquendo), but still have the same shape of career.

-- it assumes that all players rise and decline annually by the same absolute amount. In the age span it takes for a 10-triple player to decline to 5 triples, a 6-triple player will decline to 1 triple, and Willie Aikens will decline to -5 triples.

What can you get out of a model like that, with its unrealistic assumptions? I think that you can reasonably look at the peak and shape as applied to some kind of hypothetical composite of the players used in the study. But I don't think you can go farther than that, and make any assumptions about other types of players.

So: when Bradbury's study comes up with the result that his sample of players peaked at 29.5 years (for Linear Weights), I think that's probably about right -- for his sample of players. When he says that the average home run hitter loses 8 home runs between 30 and 40, I think that's probably about right too -- for his sample of players.

My main argument is not that the model is unrealistic, and it's not that there's something wrong with the regression used to analyze the model. It's that the sample of players that went into the model is biased, and that's what's causing the peak to be too high.

Bradbury's model works for his sample -- but not for all baseball players, just the ones he chose. Those were the ones who, in retrospect, had long careers.

To have a long career, you have to keep up your performance for many years. To keep up your performance for many years, you need to have a slower decline than average. If you have a slower decline than average, a higher proportion of your value comes later in your career. If a higher proportion of your value comes later in your career, that means that you'll have an older-than-average peak.

So choosing players with long careers results in a peak age higher than if you looked at all players.

Bradbury disagrees. He thinks that Hall of Fame players may have a significantly different peak than non Hall-of-Fame players, but doesn't think that players with long careers might have a different peak than players with short careers.

That really doesn't make sense to me. But Bradbury has evidence. In his response to my post, he reran his study, but for all players with a minimum of 1000 PA, instead of his previous minimum 5000 PA. That is, he added players with short careers.

He found no difference in the peak age.

That's a pretty persuasive argument. I argued A, Bradbury argued B, and the evidence appears to be consistent with B. No matter how good my argument sounds, if the evidence doesn't support it, I'd better either stop arguing A, or explain why the evidence isn't consistent with B.

Still, the logic didn't seem right to me. So I spent a couple of days trying to replicate Bradbury's study. I wasn't able to duplicate his results perfectly, but many of them are close. And I'm not sure, but I think I have an idea about what's going on, why the evidence is consistent with A. That is, why Bradbury's 1000+ study comes up with a peak of 29 years, while other studies have come up with 27.

I'll get to that in the next post.




Monday, November 16, 2009

Selective sampling and peak age

Back a couple of years ago, I reviewed a paper by J.C. Bradbury on aging in baseball. J.C. found that players peak offensively around age 29, rather than the age 27 found in other studies.

I had critiqued the study on three points:

-- assuming symmetry;
-- selective sampling of long careers;
-- selective sampling of seasons.

In a blog post today, J.C. responds to my "assuming symmetry" critique. I had argued that if the aging curve in baseball has a long right tail, the median of the symmetrical best-fit curve would be at a higher age than the peak of the original curve. That would cause the estimate to be too high. But, today, J.C. says that he tried non-symmetrical curves, and he got roughly the same result.

So, I wondered, if the cause of the discrepancy isn't the poor fit of the quadratic, could selective sampling be a big enough factor? I ran a little experiment, and I think the answer is yes.

J.C. considered only players with long careers, spanning ages 24 to 35. It seems obvious that that would skew the observed peak higher than the actual peak. To see why, take an unrealistic extreme case. Suppose that half of players peak at exactly 16, and half peak at exactly 30. The average peak is 23. But what happens if you look only at players in the league continuously from age 24 to 35? Almost all those players are from the half who peak at 30, and almost none of those guys are the ones who peaked at 16. And so you observe a peak of 30, whereas the real average peak is 23.

As I said, that's an unrealistic case. But even in the real world, you expect early peakers to be less likely to survive until 35, and your sample is still skewed towards late peakers. So the estimate is still biased. Is the bias significant?

To test that, I did a little simulation experiment. I created a world where the average peak age is 27. I made two assumptions:

-- every player has his own personal peak age, which is normally distributed with mean 27 and variance 7.5 (for an SD of about 2.74).

-- I assumed that for every year after his peak, a player has an additional 1/15 chance (6.7 percentage points) to drop out of the league. So if a player peaks at 27, his chance of still being in the league at age 35 is (1 minus 8/15), since he's 8 years past his peak. That's 46.7%. If he peaks at 30, 35 is only five years past his peak, so his chance would be 66.7% (which is 1 minus 5/15).

Then, I simulated 5,000 players. Results:

27.0 -- The average peak age for all players.

28.1 -- The average observed peak age of those players who survived until age 35.

The difference between the two results is the result of selective sampling. So, with this model and these assumptions, J.C.'s algorithm overestimates the peak by 1.1 years.

We can get results even more extreme if we change some of the assumptions. Instead of longevity decaying by 1/15, suppose it decays by 1/13? Then the average observed age is 28.5. If it decays by 1/12, we get 28.9. And if it decays by 1/10, the peak age jumps to 30.9.

Of course, we can get less extreme results too: if we use a decay increment of only 1/20, we get an average of 27.6. And maybe the decay slows down as you get older, and we might have too steep a curve near the end. Still, no matter how small the increment, the estimate will still be too high. The only question is, how much too high?
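For anyone who wants to try the experiment, here is a rough Python sketch of the simulation as described above. It is an illustration, not the original code: the function name, the continuous (non-integer) peak ages, the seed, and the larger player count (used only to cut down random noise) are all assumptions.

```python
import random


def simulate(n_players=100_000, mean_peak=27.0, variance=7.5,
             dropout_per_year=1 / 15, survive_to=35, seed=1):
    """Return (average peak of all players, average peak of survivors)."""
    rng = random.Random(seed)
    sd = variance ** 0.5
    all_peaks, survivor_peaks = [], []
    for _ in range(n_players):
        peak = rng.gauss(mean_peak, sd)
        all_peaks.append(peak)
        # Years between the player's peak and the survival age (never negative).
        years_past_peak = max(0.0, survive_to - peak)
        # Chance of still being in the league at `survive_to`, floored at 0.
        p_survive = max(0.0, 1.0 - dropout_per_year * years_past_peak)
        if rng.random() < p_survive:
            survivor_peaks.append(peak)
    return (sum(all_peaks) / len(all_peaks),
            sum(survivor_peaks) / len(survivor_peaks))


if __name__ == "__main__":
    for denom in (20, 15, 13, 12, 10):
        everyone, survivors = simulate(dropout_per_year=1 / denom)
        print(f"decay 1/{denom}: all {everyone:.2f}, survivors {survivors:.2f}")
```

Exact figures will vary with the seed and the sample size, but the survivors' average peak should always come out above the average for all players.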

I don't know. But given the results of this (admittedly oversimplified) simulation, it does seem like the bias could be as high as two years, which is the difference between J.C.'s study and others.

If we want to get an unbiased estimate for the peak for all players, not just the longest-lasting ones, I think we'll have to use a different method than tracking career curves.

UPDATE: Tango says it better than I did, here.



Friday, November 13, 2009

Consumer Reports alarmist on reverse mortgages

(Warning: non-sports post.)

In their September issue, Consumer Reports issues another muddled panic about a financial product; this time it's reverse mortgages.


Basically, a reverse mortgage is a loan you take out using your house as collateral. Normally, you'd do that with a line of credit -- you borrow the money as you need it, and make at least your minimum payment every month (as interest accrues). It's like a credit card, but with a much lower interest rate because it's backed by your house.

The reverse mortgage is also a loan on your home equity, but it's meant for poorer elderly people who don't have the income to make payments on the loan. With the reverse mortgage, you still get the loan, but the interest accumulates and compounds, and you don't have to pay it back until you move out of the house (or die). The idea is that when you're no longer living in the house, you sell it and use the proceeds to pay off the loan.

What if the loan has compounded so high that the value of the house isn't enough to pay it off? In that case, the borrower is off the hook. One of the benefits of the reverse mortgage is that the borrower is never on the line for more than the house itself.

As CR points out, this benefit has a price: the borrower winds up paying for "insurance" against that happening, insurance that tops up the loan if the house is eventually not worth enough. It's government insurance, and comes with government regulations on reverse mortgages. For instance, you have to be over 62, and you have to do lots of expensive legal paperwork.

It's called a "reverse mortgage" because it's often taken out to provide a stream of payments to supplement social security. That stream of payments is backwards from a normal mortgage: instead of you paying off the mortgage every month, the mortgage pays you.

My feeling, and CR's too, is that a reverse mortgage is a reasonable thing to do if you plan to stay in your house forever, and won't need money afterwards (because either you've died, or you're so ill you move to a nursing home paid for by government). Why die with money in the bank (or equity in your house)?

Another benefit of the reverse mortgage is that it can sometimes provide the only way to get a lump sum of money in case of sudden need, like a medical emergency.

--------

So what's CR's problem with reverse mortgages? They have a few. Some of them are not completely unreasonable. CR gives stories of people being sold expensive reverse mortgages in order to use the money for inappropriate investments, which is certainly a bad thing. But that's not the fault of the reverse mortgage -- seniors are sold questionable financial products all the time, and sometimes persuaded to borrow money in other ways.

And CR gives examples of seniors who didn't really understand what they were getting into. For instance, sometimes salespeople hand customers overoptimistic projections of what their house will be worth, misleading them about the amount of equity that will be left for their children to inherit. But, again, biased salespeople are hazards of any financial transaction.

CR is also concerned that due to the housing meltdown, a lot of reverse mortgages end in the red, where the government-sponsored insurance comes into play. The payouts have started to exceed the premiums paid by borrowers, and CR is concerned about the burden on the taxpayer. It could be "the next financial fiasco."

I don't really understand their concern. In 2008, the worst year of the housing crisis, the fund only had to pay $400 million in claims. Suppose they pay at that rate for five years. That's about $6 per American. Compare that to the $7 trillion bailout, which is about $23,000 per American. Is it really worth worrying about a $6 "fiasco" while ignoring the $23,000?

Not only is it a small amount of money, but you could argue that it's money well spent. Reverse mortgage insurance money is not a gift to the irresponsible: it's part of a social program that allows senior citizens to hold on to their homes, while living better in their old age. I'm not a big fan of government spending, but such a small sum, for such a good purpose, as the result of a once-in-a-lifetime anomaly in housing prices, is probably 1000th on my list of government policy issues people should be concerned about.

---------

But the thing that *really* bugged me about the article is the lead anecdote that purports to show the human side of why reverse mortgages are harmful. But, as with their medical credit card screed last year, they got their conclusion completely backwards! Their example actually shows a reverse mortgage that handsomely rewarded the borrower.

When Ernest Minor was 61, his wife had serious medical problems. The Minors still owed $70,000 on their home. They took out a reverse mortgage loan for $176,000. $70,000 of that simply replaced the outstanding mortgage balance. $15,000 went for fees and insurance. And $92,000 went to pay the medical bills.

Because Mr. Minor was not yet 62 (but his wife was), they had to transfer the deed to Mrs. Minor's name in order to be eligible -- which they did.

Mrs. Minor died two years later. That made the loan come due. With interest, it's about $200,000. Obviously, Mr. Minor can't afford to pay that, so he will lose his home.

Sure, it's sad that Mr. Minor will lose the house he lived in for so many years. But, after all, he needed $92,000 for medical bills. Without the reverse mortgage, what would the Minors have done? They'd have sold the house, rented an apartment, and used the proceeds to pay the bills. They would have lost the home immediately. But, with the reverse mortgage, they got to live in the house a couple more years. Plus, if Mrs. Minor had gotten better, they would have been able to stay there indefinitely! Of all the Minors' options, the reverse mortgage was, in fact, the very best way to handle the situation.

CR is upset that the broker had Mr. Minor take his name off the deed to get the loan -- if he hadn't, he'd still be allowed to stay in the house until he moved out. But, unfortunately, government regulation gave him no choice! Instead of picking on the salesman, maybe CR should lobby the government to lower the minimum age, or at least allow one of the spouses to be under 62.

Anyway, Mr. Minor benefits even further. He owes $200,000. But because of the real estate meltdown, his home is only worth $130,000. You'd think he'd be in the hole to the tune of $70,000. But he's not! Remember, with the reverse mortgage, you can't owe more than the house is worth. Mr. Minor can walk away, because his $70,000 deficit is covered by the insurance he bought!

Let's do an accounting. Suppose that before the real-estate crisis, the house was worth $300,000 (which is about right, based on what CR tells us about what percentage of the home's value is loanable). If Mr. Minor had raised the $92,000 some other way, via a conventional loan, he'd be liable for that $92,000. He'd still owe $70,000 on his mortgage. And he'd have made maybe $20,000 in interest payments over the three years on that $162,000.

So his total owing would be $182,000. His house would be worth only $130,000. His net worth would be negative $52,000.

But, with the reverse mortgage, he just walks away! He loses the house, but his net worth is $0. The reverse mortgage saved him $52,000! Of course, most of that savings came from avoiding the housing crisis, but still -- rather than Mr. Minor being a reverse mortgage sob story, he should be a success story!
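The accounting above can be written out in a few lines of Python. This is only a sketch of the arithmetic in the text: the constant and function names are made up, and the $20,000 interest figure is the same rough guess used above.

```python
HOUSE_VALUE_NOW = 130_000
MEDICAL_BILLS = 92_000
OLD_MORTGAGE = 70_000
CONVENTIONAL_INTEREST = 20_000   # the "maybe $20,000" guess from the text
REVERSE_LOAN_BALANCE = 200_000   # approximate balance with interest


def net_worth_conventional():
    """House value minus everything owed under a conventional loan."""
    owed = MEDICAL_BILLS + OLD_MORTGAGE + CONVENTIONAL_INTEREST
    return HOUSE_VALUE_NOW - owed


def net_worth_reverse():
    """The borrower is never liable for more than the house itself."""
    return max(HOUSE_VALUE_NOW - REVERSE_LOAN_BALANCE, 0)


print(net_worth_conventional(), net_worth_reverse())
```

The difference between the two results is the $52,000 the reverse mortgage saved.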

CR prints a half-page photo of a sad Mr. Minor, holding a photo of his deceased wife. It's indeed very sad and moving that the illness lost him his wife and his house. But the reverse mortgage was the lone, small bright spot. It wound up saving Mr. Minor thousands and thousands of dollars. And CR didn't even notice.


Saturday, November 07, 2009

Do younger umpires call a more accurate strike zone?

In a post on the economic incentives facing would-be umpires, J.C. Bradbury has an interesting study on how older umpires are more likely to have a larger or smaller strike zone.

Bradbury ran a regression, to predict an umpire's season strikeout-to-walk ratio based on his age and who he is. He found that the older the ump, the more his strike zone size differs from the league average. He writes,

"It turns out that every year an umpire ages he increases his deviation from the league average by about 0.8% [.016 change in K/BB ratio]. That doesn’t seem like a lot—and it really isn’t a huge effect—but over a period of 12 years that pushes the umpire a full standard deviation (9.5%) [.19] above/below the average deviation. Thus, by the end of an umpire’s career, his calls are about two standard deviations from the typical deviation. This is evidence of a tenure effect or a loss of competency." [square brackets mine.]


How big is that effect? Well, 0.19 might be the difference between a K/BB ratio of 2 and a ratio of 2.19. Assuming the same number of total K+BB, that means that instead of (say) 20 strikeouts for every 10 walks, there would be 20.6 strikeouts for every 9.4 walks. That turns 0.6 walks into strikeouts per 30 K+BB events, which is about 0.35 runs.

In 2009, there were 10.33 such events per game (per team), not 30. That means a standard deviation is worth a bit over a third of .35, or .12 runs per game. So a pitcher's ERA might go up or down by .11. For two standard deviations, it would be .24 runs or .22 earned runs. Assuming both teams are equal, there would obviously be no effect on who wins the game (although it seems likely that pitchers may be affected differently).

What's that in terms of pitches? A study I did (.pdf, page 4) finds that the difference between a ball and a strike is about 0.14 runs. So, at 2 standard deviations, an umpire calls about 1.7 pitches differently per game, per team. That means that more than 95% of umpires are less than 1.7 pitches per game different from the mean.
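Here is the chain of conversions above as a short Python sketch, in case the steps are easier to follow that way. All the inputs come from the text (the 0.35 runs per 30 events, 10.33 events per team-game, and 0.14 runs per ball/strike call); the variable and function names are made up for the example.

```python
def strikeouts_among(events, k_bb_ratio):
    """Strikeouts among `events` K+BB outcomes at a given K/BB ratio."""
    return events * k_bb_ratio / (1 + k_bb_ratio)


EVENTS = 30
RUNS_PER_30_EVENTS = 0.35      # run value of the shift, as estimated above
EVENTS_PER_TEAM_GAME = 10.33   # K+BB per team per game in 2009
RUNS_PER_PITCH_CALL = 0.14     # ball-versus-strike difference

# Walks turned into strikeouts when K/BB moves from 2.00 to 2.19.
shift = strikeouts_among(EVENTS, 2.19) - strikeouts_among(EVENTS, 2.00)

# Scale from 30 events down to one team-game.
runs_per_sd = RUNS_PER_30_EVENTS * EVENTS_PER_TEAM_GAME / EVENTS

# Pitches called differently at two standard deviations.
pitches_at_2sd = 2 * runs_per_sd / RUNS_PER_PITCH_CALL

print(round(shift, 2), round(runs_per_sd, 2), round(pitches_at_2sd, 1))
```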

But since the pitchers and batters are presumably aware of the differences between umpires, they would adjust accordingly. So the effect might be more than .12 runs per SD -- it could be .12 for the actual strikeouts and walks observed, but it might cost the batter another .12 (or some other number) in having to swing at bad pitches (which would be called balls by another umpire).

And, of course, even umpires who call normal numbers of strikeouts and walks might have their own particular strike zone -- it might be the same size, but have a different shape or location.

Bradbury concludes that, as umpires gain experience, they get more confident and less reverent, and feel less of a need to stick to the league's interpretation of where the strike zone should be. That sounds reasonable, especially considering that Bradbury's regression controlled for calendar year.

It's worth reading Bradbury's entire post, which includes a list of umpires and their individual K/BB ratios.

---

(Note: post was updated in response to an e-mail from Guy, who pointed out I had misinterpreted Bradbury's percentages and got the effect being half of what it should be. I think it's correct now.)



Thursday, November 05, 2009

Do field-goal kickers do worse in the clutch?

My favorite studies are the ones that don't need any fancy math or statistics, where you can just look at the data and answer the question almost instantly.

Brian Burke, of "Advanced NFL Stats," had one of those last week. He wondered: is there an overall "choke" effect for field goal kicking? Are kickers less likely to make their kick when the game is on the line, either because of nervousness, or because defenses change their strategy?

The answer appears to be: no. Adjusted for distance, the clutch success rates track the overall success rates almost exactly, except for one blip in the data at 44 yards.

Of course, that doesn't mean that *no* kicker is different in the clutch. The data are consistent with the possibility that only one or two kickers are clutch or choke, which wouldn't be enough to show up in the graph. It's also possible that half of all kickers are clutch, and the other half are choke, and they exactly offset each other so there appears to be no effect.

But in view of the baseball evidence, which shows only very, very slight "clutch" variation among hitters, it doesn't seem likely that field-goal kickers would be significantly clutch.

Furthermore, considering how much data it took to test the clutch hypothesis for baseball, it's probably impossible to find an effect for kickers, even of the same rough size as for batters (less than 3% variation in success rate). Baseball hitters get several hundred opportunities in a season; kickers get maybe 30 or 40. If an effect for individual kickers exists in the NFL, it would have to be huge to have any chance of being detected.
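To put a rough number on that, here is a small Python sketch of the standard error of a success rate at different sample sizes. The 80% success rate is a hypothetical figure chosen only for illustration (it is not from Burke's study), and the 35 and 500 attempt counts stand in for "30 or 40" and "several hundred."

```python
import math


def standard_error(success_rate, attempts):
    """Binomial standard error of an observed success rate."""
    return math.sqrt(success_rate * (1 - success_rate) / attempts)


ASSUMED_RATE = 0.80   # hypothetical success rate, for illustration only

se_35 = standard_error(ASSUMED_RATE, 35)     # roughly one kicker-season
se_500 = standard_error(ASSUMED_RATE, 500)   # several hundred opportunities

print(round(se_35, 3), round(se_500, 3))
```

Under those assumptions, a 3-point difference is less than half a standard error for a season's worth of kicks, and clutch kicks are only a fraction of those.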



Friday, October 30, 2009

Don't use regression to calculate Linear Weights, Part II

Last post, I wrote about how using regression to estimate Linear Weights values is a poor choice, and that the play-by-play (PBP) method is better. After thinking a bit more about it, I realized that I could have made a stronger case than I did. Specifically, that the regression method gives you an *estimate* of the value of events, but the play-by-play method gives you the *actual answer*. That is: a "perfect" regression, with an unlimited amount of data, will never be able to be more accurate than the results of the PBP method.

An analogy: suppose that, a while ago, someone randomly dumped some red balls and some white balls into an urn. If you were to draw one ball out of the urn, what would be the probability it's red?

Here are two different ways of trying to figure that out. First, you can observe people coming and drawing balls, and you can see what proportion of balls drawn turned out to be red. Maybe one guy draws ten balls (with replacement) and six are red. Maybe someone else comes up and draws one ball, and it's white. A third person comes along and draws five white out of 11. And so on. Over all, maybe there are 68 balls drawn total, and 40 of them are red.


So what do you do? You figure that the most likely estimate is 40/68, or 58.8%. You then use the normal approximation to the binomial distribution to figure out a confidence interval for your estimate.

That's the first way. What's the second way?

The second way is: you just empty the urn and count the balls! Maybe it turns out that the urn contains exactly 60 red balls and 40 white balls. So we now *know* the probability of drawing a red ball is 0.6.

If the second method is available, the first is completely unnecessary. It gives you no additional information about the question once you know what's in the urn. The second method has given you an exact answer.
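The two approaches to the urn can be put side by side in a few lines of Python. This is a sketch using the numbers from the example above; the 95% level (1.96 standard errors) is an assumption, since the text doesn't say which interval to use.

```python
import math

# Method 1: estimate from observed draws.
red_drawn, total_drawn = 40, 68
p_hat = red_drawn / total_drawn
std_err = math.sqrt(p_hat * (1 - p_hat) / total_drawn)
low, high = p_hat - 1.96 * std_err, p_hat + 1.96 * std_err

# Method 2: empty the urn and count.
red_in_urn, white_in_urn = 60, 40
p_exact = red_in_urn / (red_in_urn + white_in_urn)

print(f"estimate {p_hat:.3f}, interval ({low:.3f}, {high:.3f}), exact {p_exact}")
```

The first method gives a range of plausible values; the second gives a single number with no uncertainty attached.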

That, I think, is the situation with Linear Weights. The regression is like observing people draw balls; you can then make inferences about the actual values of the events. But the PBP method is like looking in the urn -- you get the answer to the question you're asking. It's a deterministic calculation of what the regression values will converge to, if you eventually get the regression perfect.

-------

To make my case, let me start by (again) telling you what question the PBP method answers. It's this:

-- If you were to take a team made up of league-average players, and add one double to its batting line, how many more runs would it score?

That question is pretty much identical to the reverse:

-- If you were to take a team made up of league average players, and remove one double from its batting line, how many fewer runs would it score?

I'm going to show how to answer the second question, because the explanation is a bit easier. I'll do it for the 1992 American League.

Start with a listing of the play-by-play for every game (available from Retrosheet, of course). Now, let's randomly choose which double we're going to eliminate. There were 3,596 doubles hit that year; pick one at random, and find the inning in which it was hit.

Now, write out the inning. Write out the inning again, without the double. Then, see what the difference is in runs scored. (The process is almost exactly the same as how you figure earned runs: you replay the inning pretending the error never happened, and see if it saves you some runs.)

Suppose we randomly came up with Gene Larkin's double in the bottom of the fourth inning against the Angels on June 24, 1992. The inning went like this:

Actual: Walk / Double Play / DOUBLE / Fly Out -- 0 runs scored.

Without the double, our hypothetical inning would have been

Hypoth: Walk / Double Play / ------ / Fly Out -- 0 runs scored.

Pretty simple: in this case, taking away the double makes no difference, and costs the team 0 runs.

On the other hand, suppose we chose Brady Anderson's leadoff first-inning double on April 10. With and without the double:

Actual: DOUBLE / Double / Single / Double Play / HR / Ground Out -- 3 runs scored.

Hypoth: ------ / Double / Single / Double Play / HR / Ground Out -- 2 runs scored.

So, in this case, removing the double cost 1 run.

If we were to do this for each of the 3,596 doubles, we could just average out all the values, and we'd know how much a double was worth. The only problem is that sometimes it's hard to recreate the inning. For instance, Don Mattingly's double in the sixth inning on September 8:

Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.

Removing the double gives

Hypoth: Out / Single / ------ / Single / Fly Ball / Fly Ball

How many runs score in this reconstructed inning? We don't know. If the second single advanced the runner to third, and the subsequent fly ball was deep enough, one run would have scored. Otherwise, it would be 0 runs. So we don't know which it would have been. What do we do?

The problem arises because we were picking innings randomly and dividing them into halves (the half before the double, and the half after the double). That process creates an inconsistency in the hypothetical inning. The second half of the inning, in real life, started with one out and runners on second and third. The hypothetical second half started with one out and a runner on first. That created the problem.

So, since we're picking randomly anyway, why don't we throw away the *real* second half of the inning, and instead pick the second half of some *other* inning, some inning where there actually IS one out and a runner on first? That will always give us a consistent inning. And while it will give us a different result for this run of our random test, over many random tests, it'll all even out.

We might randomly choose Cleveland's fourth inning against the Royals on July 16. In that frame, Mark Whiten struck out and Glenallen Hill singled, which gives us our required runner-on-first-with-one-out. After that, Jim Thome singled, and Sandy Alomar Jr. grounded into a double play.

Grafting the end of that inning (single, double play) on to the beginning of the original inning gives us our "consistent" hypothetical inning:

Hypoth: (stuff to put a runner on first and one out) / ------ / Single / GIDP -- 0 runs scored.

Since the Yankees scored two runs in the original, unadulterated inning, and zero runs in the hypothetical inning, this run of the simulation winds up with a loss of two runs.

Now, there's nothing special about that Cleveland fourth inning: we just happened to choose it randomly. There were 6,380 cases of a runner on first with one out, and we could have chosen any of them instead.

The inning could have gone:

Out / Single / ------ / result of inning 1 of 6,380
Out / Single / ------ / result of inning 2 of 6,380
Out / Single / ------ / result of inning 3 of 6,380
Out / Single / ------ / result of inning 4 of 6,380
...
Out / Single / ------ / result of inning 6,380 of 6,380

If we run the simulation long enough, we'll choose every one of the 6,380 equally. And so, we'll wind up with just the average of those 6,380 innings. So we can get rid of the randomness in the second half of the inning just by substituting the average of the 6,380. Then our "remove the double" hypothetical becomes:

Out / Single / ------ / average of all 6,380 innings

And, with the help of Retrosheet, we find that after having a runner on first and one out, those 6,380 innings resulted in an average of 0.510 runs being scored. So now we have:

Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.

Hypoth: Out / Single / ------ / plus an additional 0.510 runs.

I'll rewrite that to make it a bit easier to see what's going on:

Actual: (stuff that put a runner on first with one out) / DOUBLE / other stuff that caused 2 runs to be scored

Hypoth: (stuff that put a runner on first with one out) / other stuff that caused 0.510 runs to be scored, on average

So, for this inning, we can say that removing the double cost 1.490 runs.
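The step of replacing a randomly chosen continuation with the average of all continuations can be checked with a quick Python sketch. The list of continuation outcomes below is made up for the illustration (it is not Retrosheet data), and the function name is hypothetical.

```python
import random

# Made-up outcomes: runs scored in the rest of 100 innings that all start
# from the same base-out state.
continuation_runs = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5

# The direct calculation: just average them.
exact_average = sum(continuation_runs) / len(continuation_runs)


def average_of_random_grafts(trials, seed=1):
    """Graft a randomly chosen continuation onto the inning, many times over."""
    rng = random.Random(seed)
    total = sum(rng.choice(continuation_runs) for _ in range(trials))
    return total / trials


print(exact_average, average_of_random_grafts(200_000))
```

With enough trials the simulated figure settles on the plain average, which is why the random draw can be skipped.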

Now, the "actual" inning was again random. We happened to choose the Yankees' 6th inning on September 8. But we might have chosen another, similar, inning where there was a runner on first with one out, and a double was hit, and the runner held at third. This particular September 8 inning led to two runs. Another such inning may have led to six runs, or three runs, or no runs (maybe there were two strikeouts after the double and the runners were stranded).

So, what we can do, is aggregate all these types of innings. If we look to Retrosheet, we would find that there were 796 times where there were runners on second and third with one out. In the remainder of those 796 innings, 1129 runs were scored. That's an average of 1.418 per inning.

So we can write:

Actual: stuff that put a runner on 1st with one out / DOUBLE leading to runners on 2nd and 3rd with one out / Other stuff leading to 1.418 runs scoring, on average.

Hypoth: stuff that put a runner on 1st with one out / ------ / Other stuff leading to 0.510 runs scoring, on average.

And so, we know that a double with a runner on first and one out, which ends with runners on 2nd and 3rd, is worth, on average, 0.908 runs.

Let's write this down this way, as an equation:

+0.908 runs = Runner on 1st and one out + Double, runner holds

We can repeat this analysis. What if the runner scores? Then, it turns out, the average inning led to 1.646 runs scoring instead of 0.510. So:

+1.136 runs = Runner on 1st and one out + Double, run scores

We can repeat this for every combination of bases and double results we want. For instance:

-0.217 runs = Runner on 1st and one out + Double, runner thrown out at home

+1.000 runs = Runner on 2nd and nobody out + Double

+1.166 runs = Runner on 1st and two out + Double, runner safe at home and batter goes to third on the throw

I'm not sure how many of these cases there are, but we can look to Retrosheet and list them all. At the end, we have a huge list of all possible combinations of doubles, and what they were worth in runs. We just have to average them, weighted by how often they happened, and we're done. We then have the answer.

As it turns out, the answer for the 1992 American League works out to 0.763 runs.

The answer is NOT an estimate based on a model with random errors that we have to eliminate. It's the exact answer to the question, the same way counting the balls in the urn gave us an exact answer.

Just to be absolutely clear, here's what we've shown:

Suppose we randomly remove one double from the 1992 American League. Then, we reconstruct the inning from the point of that double forward, by looking at the base/out situation before the double, finding a random inning with that same base/out situation, and substituting that new inning instead of what really happened.

If we do that, we should expect 0.763 runs fewer will be scored. If we were to run this same random test a trillion times, the runs lost will average out to .763 almost exactly.

If you try to answer this question by running a regression, to the extent that your estimate is different from 0.763, you got the wrong answer.

------

Anyway, the explanation above was a complicated way of describing the process. Here's a simpler description of the algorithm.

1. Using Retrosheet data, find every situation where there was a runner on second and no outs. It turns out there were 1,572 such situations in the 1992 AL. Count the total number of runs that were scored in the remainder of those innings. It turns out there were, on average, 1.095 runs scored each time that happened (1,722 runs scored in those 1,572 innings).

2. Repeat this process for the other 23 base-out states (two-outs-bases-loaded, one-out-runners-on-first-and-third, and so on). If you do that, and put the results in the traditional matrix, you get:

0 out   1 out   2 out
------------------------------
0.482   0.248   0.096   nobody on
0.853   0.510   0.211   first
1.095   0.646   0.293   second
1.494   0.907   0.423   first/second
1.356   0.940   0.377   third
1.804   1.151   0.470   first/third
2.169   1.418   0.598   second/third
2.429   1.549   0.745   loaded

3. Find every double hit in the 1992 AL. For each of those 3,596 doubles, figure (a) the run value from the above table *before* the double was hit; (b) the run value for the situation *after* the double; and (c) the number of runs that scored on the play.

The value of that double is (b) - (a) + (c). For instance, a 3-run double with the bases loaded and 2 outs is worth 0.293 minus 0.745 plus 3. That works out to 2.548 runs.

4. Average the 3,596 run values. You'll get 0.763.

It's that simple. You can repeat the above for whatever event you like: triples, stolen bases, strikeouts, whatever. Here are the values I got:

-0.272 strikeout
-0.264 other batting out
+0.178 steal
+0.139 defensive indifference
-0.421 caught stealing
+0.276 wild pitch
+0.286 passed ball
+0.277 balk
+0.307 walk (except intentional)
+0.174 intentional walk
+0.331 HBP
+0.378 interference
+0.491 reached on error
+0.460 single
+0.763 double
+1.018 triple
+1.417 home run
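
Steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration, not the code I actually ran: the function names (`run_exp`, `event_value`, `average_value`) are made up for the example, and the table is the 1992 AL matrix from above. It assumes you have already turned each play into a before-state, an after-state, and the runs that scored.

```python
# Run expectancy by bases occupied and outs (0, 1, 2), copied from the matrix above.
RUN_EXP = {
    "nobody on":    (0.482, 0.248, 0.096),
    "first":        (0.853, 0.510, 0.211),
    "second":       (1.095, 0.646, 0.293),
    "first/second": (1.494, 0.907, 0.423),
    "third":        (1.356, 0.940, 0.377),
    "first/third":  (1.804, 1.151, 0.470),
    "second/third": (2.169, 1.418, 0.598),
    "loaded":       (2.429, 1.549, 0.745),
}

def run_exp(bases, outs):
    """Expected runs in the rest of the inning; three outs means the inning is over."""
    return 0.0 if outs >= 3 else RUN_EXP[bases][outs]

def event_value(before, after, runs_scored):
    """(b) - (a) + (c), where each state is a (bases, outs) pair."""
    return run_exp(*after) - run_exp(*before) + runs_scored

def average_value(events):
    """Step 4: the weight for an event type is the mean over every occurrence."""
    values = [event_value(b, a, r) for b, a, r in events]
    return sum(values) / len(values)

# The worked example from step 3: bases loaded, 2 outs, three-run double.
example = event_value(("loaded", 2), ("second", 2), 3)
print(round(example, 3))
```

Feeding `average_value` every double in the league, instead of a single example, is all that step 4 amounts to.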

-----

Anyway, my point in the original post wasn't meant to be "regression is bad." What I really meant was, why randomly pull balls from urns when Retrosheet gives you enough data to actually count the balls? This method gives you an exact number.

One objection might be that, to do it this way, there's way too much data to use, and so regression is a more practical alternative. But is it really better to use a wrong value just because it's easier to calculate?

Besides, you don't have to calculate them yourself -- they've been calculated, repeatedly, by others who can be referenced. In the worst case, they're close to the "traditional" weights, as calculated by Pete Palmer some 25 years ago. If you need a solid reference, just use Pete's numbers. They're closer than you'll get by running a regression, even a particularly comprehensive one.



Tuesday, October 27, 2009

Don't use regression to calculate Linear Weights

About a year ago, in a post titled "Regression, Schmegression," Tom Tango argued that regression is not usually one of the better techniques to use in sabermetric research. He's right, especially for the example he used, which is using regression to find the correct Linear Weight values for the basic offensive events.

What a lot of researchers have done, and are still doing, is listing batting lines for various team-years -- singles, doubles, triples, etc. -- and running a regression to predict runs scored. It's not that bad a technique, but there are other, better ones you can use. Still, by looking a bit closer at the regression results, you can get a good idea of why regression results don't always mean what you think they mean.

Let's start with the triple. How much is a triple worth? That is: how many more runs would an average team score if you gave them exactly one extra triple?

We can run a simple regression, runs scored vs. triples hit. I used a dataset consisting of all full team-seasons from 1961 to 2008 (only for teams that played at least 159 games, to omit strike seasons). That was 1,121 teams. The result of the regression:

Runs = 731 - (0.44 * triples)

That's not a misprint: the regression tells us that every triple actually *costs* its team almost half a run!

It's not a sample size issue, either. The standard error of the -0.44 estimate is 0.27. The estimate was actually significantly different from zero (in the wrong direction!) at the 10% level.

Is it possible that a triple actually lowers your runs scored? Of course not. Our baseball knowledge tells us that's logically impossible. A triple maximizes the value of runners on base (by scoring them all), and then puts a runner on third, where he's also likely to score. It's all positive. There must be something else happening here.

It's pretty obvious, but to understand that we can't take the results at face value, we needed subject matter expertise -- we needed to know something about baseball. In this case, we didn't need to know much, just that triples have to be a good thing. But that's subject matter knowledge nonetheless.

No matter how expert you are in the technique of regression, you have to know something about the subject you're researching to be able to reach the correct conclusions from the evidence. Because, as the saying goes, correlation doesn't imply causation. But it doesn't imply *non-causation* either. It could be that triples cause fewer runs, or it could be that there's some third factor that's positively correlated with triples, but negatively correlated with runs scored. Knowing something about baseball lets us argue for which conclusion makes more sense.

Normally, when you interpret a regression result like this, you say something like: "all else being equal, one extra triple will reduce the number of runs scored by about 0.44." But that's not quite right. The "all else" doesn't refer to everything in the universe -- it only refers to everything else *you controlled for in that regression*. Which, in this case, was nothing -- we only regressed on triples.

A more accurate way to put the regression result is:

"One extra triple is associated with a reduction of the number of runs scored by about 0.44. That's either because of the triple itself, or because of something else about teams who hit more triples, something that wasn't controlled for in the regression."

Now, a possible explanation becomes apparent. Teams that hit lots of triples are usually faster teams. Fast teams tend to have fewer fat strong guys who hit for power. Therefore, maybe hitting lots of triples suggests that your team doesn't have much power, which is why triples are negatively correlated with runs scored.

Again, the regression didn't suggest that: it was our knowledge of baseball.
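
A small simulation shows how a hidden trait like speed can produce this kind of sign flip, and how adding the confounder as a second predictor undoes it. Everything here is invented for illustration: the "teams", the coefficients, and the noise levels are assumptions, not estimates from the real 1961-2008 data. The two helper functions are plain least squares written out by hand so the sketch needs nothing beyond the standard library.

```python
import random

def slope(x, y):
    """OLS slope of y on a single predictor x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def slopes2(x1, x2, y):
    """OLS slopes of y on two predictors, via the centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(0)
triples, homers, runs = [], [], []
for _ in range(5000):                                  # made-up "teams"
    speed = rng.gauss(0, 1)                            # hidden trait
    t = 30 + 8 * speed + rng.gauss(0, 3)               # fast teams hit more triples
    h = 150 - 25 * speed + rng.gauss(0, 15)            # and fewer home runs
    r = 400 + 1.0 * t + 1.4 * h + rng.gauss(0, 15)     # assumed true weights: 1.0, 1.4
    triples.append(t)
    homers.append(h)
    runs.append(r)

triples_alone = slope(triples, runs)                   # triples only, nothing controlled
triples_ctrl, homers_ctrl = slopes2(triples, homers, runs)
print(triples_alone, triples_ctrl, homers_ctrl)
```

With these made-up settings the triples-only slope comes out negative even though every simulated triple adds a full run, and the two-predictor fit lands close to the weights that were built in.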

We can test that hypothesis, and there are a couple of ways to test it. First, we can test for a correlation between triples and other hits. And, yes, the correlation between triples and home runs is -0.31: teams who hit a lot of triples do indeed hit fewer home runs than average.

Or, we can just include home runs in the regression. If we do that, we get the equation

Runs = 373 + (1.84 * triples) + (1.93 * home runs)

Which means:

"Home runs being equal, one extra triple will increase the number of runs scored by about 1.84. That's either because of the triple itself, or because of something else about teams who hit more triples (something other than home runs, which was controlled for)."


Of course, there's still something else other than home runs: our baseball knowledge tells us that teams that hit lots of triples are likely to be different in doubles power, too. And, in fact, in almost every other category: singles, outs, walks, steals, and caught stealings. So if we do a regression on all that stuff, we get:

Runs = 42
+ (0.52 * singles)
+ (0.67 * doubles)
+ (1.18 * triples)
+ (1.48 * home runs)
+ (0.33 * walks)
+ (0.18 * steals)
- (0.21 * caught stealing)
- (0.11 * batting outs (which is AB-H)).

Now we're getting values that are close to traditional Linear Weights. But not completely. For instance (and as Tango noted), we get that a double is worth only 0.67 runs, rather than the 0.8 that we're used to.

Assuming the 0.8 is actually the correct number, we ask: What's going on? Why are we getting only 0.67? It isn't just random variation, because the standard error of the 0.67 estimate in the regression is only 0.02. So what is it?

There must be something about teams that hit a lot of doubles that reduces the number of runs they score, in ways *other* than changing the number of singles, doubles, triples, home runs, walks, steals, caught stealings, and batting outs. What could that be?

I don't know the answer, but here are some possibilities:

-- maybe teams that hit a lot of doubles (relative to the other events) are more likely to be intentionally walked. Therefore, their walks are less valuable than those of other teams. Every additional double may correlate with one extra walk turning out to be an IBB, which results in the regression giving a lower coefficient for the double.

-- the regression went from 1961 to 2008. Maybe teams that hit a lot of doubles (relative to other events) played in low-offense eras (like the mid 60s). The extra doubles mark the team as being from that era, which makes all events worth less, which makes the regression adjust by giving a lower coefficient for the double.

-- maybe teams that hit a lot of doubles ground into a lot of double plays. Since double plays are extra outs that don't show up in (AB-H), that would cause runs to be overestimated. The regression adjusts for that by building the extra DPs into the value of the double.

And so on. I don't know what the true answer is; none of the suggestions above seem very likely to me. It's a bit of a mystery. I'd add IBB to the regression, but the Lahman database doesn't seem to have it for teams. Maybe I'll calculate it some other way and try again.

Anyway, it was a bit of a shock to me that the doubles estimate was so far off. I would have thought that a technique like linear regression, with over 1,000 rows of data, would be able to come up with the answer. But it didn't, almost certainly because of outside factors that we didn't control for. Not only that, but we don't even really know what those outside factors are! (Although if you have an idea, let me know in the comments.)

-----

So the accepted value for the double is 0.8 runs, using a method I will explain shortly. The regression, on the other hand, gives only 0.67 runs.

It's not that the regression answer is wrong -- it's just that the regression answers a different question than the one we want.

The 0.8 method asks: "If a team happens to hit an extra double, how many more runs will it score?"

The 0.67 method asks, "If one team hits one more double than another team, how many more runs will it score taking into account that the extra double means the team might be slightly different in other ways?"

To most analysts, the first question is more important. Why? Because we really *do* want to find the cause-and-effect relationship. Confounding variables may be interesting, but they usually get in the way of what we're really trying to find. It's interesting to note that more triples is correlated with fewer runs scored. But that information isn't very useful to our understanding of baseball -- it doesn't tell us what makes teams win. That is, we hopefully aren't about to tell the 1985 Cardinals that they would have scored more runs by hitting fewer triples, at least not unless we really believe that triples hurt offense.

What we usually want to do, using Linear Weights, is answer questions like: if you release player X, and sign player Y, where player Y hits 10 more triples than player X, how much will the team improve? And, for that question, the regression gives us the wrong answer.

So what's the method we use to find the *true* value of a triple? It's pretty simple:

-- For a certain period of baseball history, look at the play-by-plays of all the games, and divide all plate appearances (and baserunning events) according to which of the 24 base/out states (such as no outs, runners on second and third) was happening when it occurred.

-- For each state, calculate how many runs were scored after that state was achieved. (Here's the one for 2009, from Baseball Prospectus.)

-- Now, for every triple, calculate the difference between the runs before the triple, and the runs after the triple. For instance, a leadoff triple would have been worth .79 runs (there was an expectation of .52 runs before the triple, and 1.31 runs after the triple). But a triple with the bases loaded and two outs was worth 2.36 runs (before the triple, .75 runs were expected to score. After the triple, only .11 runs were expected, but 3 runs actually did score. 3.11 minus .75 equals 2.36).

-- Average out all the values of all the triples, as calculated above.

That average is how much an extra triple is worth to an average team.
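
The first two steps, building the run expectancy for each base/out state, can be sketched like this. The input format is an assumption made for the example (real Retrosheet event files need parsing first), and the two half-innings are invented purely to show the bookkeeping.

```python
from collections import defaultdict

def run_expectancy(half_innings):
    """Average runs scored from each base/out state to the end of the inning.

    Each half-inning is a list of (state, runs_on_play) tuples in order,
    where state is the (bases, outs) situation *before* the play.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for plays in half_innings:
        runs_remaining = sum(r for _, r in plays)
        for state, runs_on_play in plays:
            totals[state] += runs_remaining   # runs from this point onward
            counts[state] += 1
            runs_remaining -= runs_on_play
    return {s: totals[s] / counts[s] for s in totals}

# Two invented half-innings: a leadoff triple followed by a sacrifice fly
# and two more outs, and a 1-2-3 inning.
toy = [
    [(("empty", 0), 0), (("third", 0), 1), (("empty", 1), 0), (("empty", 2), 0)],
    [(("empty", 0), 0), (("empty", 1), 0), (("empty", 2), 0)],
]
table = run_expectancy(toy)
print(table[("empty", 0)], table[("third", 0)])
```

Once the table exists, valuing a triple is just a lookup before and after the play, plus the runs that scored on it.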

If you do this for the 1988 American League (which is the one I happen to have on hand), you get that a triple is worth 1.024 runs. A double is worth 0.775.

There are several reasons why this estimate is much better than what you could get from a regression:

-- as we pointed out, the regression estimate is influenced by superfluous other factors possessed by teams that hit triples.

-- as Tango pointed out in his link, the regression uses data aggregated into team-seasons, which means you're losing a lot of information. This method uses PA by PA, inning by inning data, for a much more reliable estimate.

-- we have a direct, logical, cause-and-effect relationship.

-- in effect, we are able to hold *everything* constant, even factors we don't know about. That's because we are not comparing team X's triples to team X's runs. We're comparing a league-average triple to the league-average runs. All other confounding factors are averaged out.

Another way to look at it: the regression looks only at inputs and outputs. So it has no idea if the input *caused* the output, or if there's some third factor that links the two. But the play-by-play method isolates the direct effect of the input. It knows for sure that the triple *caused* the change from one state (bases empty, no outs) to the next (runner on third, no outs), and so it's not fooled by outside factors.

Correlation does not imply causation, and regression can only provide correlation. Why not use this method, which is based on causation, and therefore gives you the right answer?

-----

UPDATE: Below, commenter Ted links to a paper (.pdf) where he used regression to figure Linear Weights, and found that when he added variables for GIDP and reached on error, the doubles coefficient increased (from .689 to .722). The HR coefficient also increased (by 10 points). So that, I think, explains part of the mystery: the doubles are artificially low because teams that hit a lot of doubles and home runs are so slow that they also hit into a lot of DPs and don't reach base on error as much. When these are accounted for separately, some of the true value of the double is restored.



Monday, October 19, 2009

Premature accusations of anti-French NHL racism

Another accusation of racism in sports hit the newspapers today, on the front page of Canada's "National Post." This time, it's English-speakers who are accused of discrimination, in the form of racism against French-Canadian players.

The story is about Bob Sirois, a former NHL forward from Montreal, who did some analysis on NHL demographics and concluded that there is an "anti-francophone virus" in pro hockey. The reporter also quotes Réjean Tremblay, a sportswriter for Montreal's "La Presse," who got a look at the findings, and argues that "discrimination against the frogs is absolute." (Here's an article by Tremblay on the issue.)

What is the evidence for these accusations? We don't know for sure, because the full argument is in Sirois' upcoming book. But the article gives a few statistics:

-- Forty-two percent of francophone Québeckers who played three or more years in the NHL won a trophy or were named to the All-Star team. "Only francophones at the highest level were able to have lasting careers," Sirois said.

-- Of all 16-year-old players at the midget level in Quebec, 1 in 334 anglophones was eventually drafted, but only 1 in 618 francophones.

-- Francophone players in Quebec are less likely to get drafted than anglophone players in Quebec, and they go lower in the draft.

-- Of the 763 francophones drafted since 1970, one-third of them went to four teams: the Quebec Nordiques, Montréal Canadiens, Buffalo Sabres, and Philadelphia Flyers. (The teams drafting the fewest francophones were the Dallas Stars, Nashville Predators, and Phoenix Coyotes.)

-- Sometimes, undrafted players manage to eventually make it into the NHL. That group represents 10% of players overall, but 19% of players from Québec, suggesting that more francophone players are going overlooked.

Since the Post reporter wrote that he had obtained a pre-publication copy of the book, we can probably assume these are the most damning facts behind the accusations.

But are they actually evidence of discrimination? In every case, there are other, more plausible, explanations for the results. Let's take them one by one.

1. 42% of francophone Quebeckers who played three or more years in the NHL won a trophy or were named to the All-Star team.

The idea, presumably, is that to last in the NHL as a francophone, you have to be really, really good. But where's the evidence? Maybe the figure for anglophones is even higher than that? Forty-two percent sounds like a lot, but it's meaningless without a comparison number.

But maybe the lack of a contrasting figure for English Canada is the reporter's fault. Let's suppose the book has the anglophone number, and it's less. Does that prove anything?

No, it doesn't. This is an old argument. A couple of decades ago, baseball was accused of discriminating against blacks on similar evidence: there were lots of blacks in the league leaders, but fewer blacks as marginal players. Bill James effectively rebutted the argument then, based on the characteristics of the distribution of players.

Converted to hockey, the argument goes like this. Suppose that francophones happen to be better players, on average, than anglophones. More specifically, suppose skills are normally distributed with a standard deviation of 10 "points". English players have an average skill of (say) 100 points, but French players have an average skill of 105 points. You need to be over 130 to make the NHL, and over 135 to be considered a star.

So anglophone players need to be 3 SDs above the mean to hit 130 and make the league. That's about 135 players per 100,000 candidates. Francophone players need only be 2.5 SDs above their own mean. That's about 620 players per 100,000 candidates.

To hit the superstar 135 mark, the anglophones need to be 3.5 SDs above 100; that's about 23 stars per 100,000 population, which means 23 stars per 135 players. But the Francophones only need to be 3 SDs above 105. That's 135 stars out of 100,000, or 135 stars out of 620 players.

Which means:

17% of anglophone players are stars (23/135)
22% of francophone players are stars (135/620).

So there's a larger proportion of francophone stars than anglophone stars. The difference in our contrived example is only 22% to 17%. But it would be relatively easy to come up with numbers to make the difference bigger, or smaller.

The point is that a small difference in means adds up to a big difference at the far tails of the normal distribution. That's not discrimination, it's just the way the bell curve works.
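
If you want to check the tail arithmetic, here is a short sketch. It works directly in SD units, using the cutoffs from the contrived example (3 and 3.5 SDs above the anglophone mean, 2.5 and 3 SDs above the francophone mean), and assumes skill really is normally distributed. The function names are just for illustration.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def star_share(z_league, z_star):
    """Fraction of players clearing the league cutoff who also clear the star cutoff."""
    return upper_tail(z_star) / upper_tail(z_league)

anglo = star_share(3.0, 3.5)    # cutoffs measured from the anglophone mean
franco = star_share(2.5, 3.0)   # same cutoffs measured from the higher francophone mean
print(f"{anglo:.3f} {franco:.3f}")
```

Multiplying `upper_tail(z)` by 100,000 gives the "players per 100,000 candidates" figures. The star shares work out to roughly one in six for the anglophones and a bit more than one in five for the francophones.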

Here's a more intuitive way to look at it. Suppose the anglophones and francophones were exactly equal in terms of players and stars. Now, let's make the francophones better by taking a couple of Mario Lemieux clones and throwing them into the francophone pot. Doesn't it now make sense that a larger proportion of francophones will be superstars? It's not racism -- it's just that the francophones are now BETTER.

One objection to this line of reasoning might be: if francophones are so much better than anglophones, shouldn't we see them disproportionately represented in the NHL? Yes, we probably should. And who says we don't? The Post article does NOT say that fewer francophones make the NHL, per capita, than anglophones. It says only that francophone Quebeckers were less likely to be drafted than *Anglophone Quebeckers*. I'd be willing to bet, right now, that francophones are more likely to be drafted than non-Quebec anglophones. That's based partly on this logic, and partly on my feeling that if it weren't true, Mr. Sirois would be trumpeting that fact in the article.

[ --> UPDATE: that's apparently not right. "Hawerchuk" says that Québeckers comprise 18% of Canadian NHL players (by games played). But they're 23% of the population. ]

(Oh, and why might it be that francophone players are better than anglophone players? It could be that anglophone Quebeckers live mostly in Montréal, where ice time is harder to get. Francophone Quebeckers are more likely to be in small, northern towns, where there are more rinks per capita and more frozen ponds to play on after school. That would give francophone boys more ice time and practice time, which would make them better players. It would be roughly the same reason that the Canadian Olympic team is competitive with the US team, despite having only one-tenth the population.)


2. Of all 16-year-old players at the midget level in Québec, 1 in 334 anglophones was eventually drafted, but only 1 in 618 francophones.

That could easily happen without discrimination. All it would take is for hockey to be a bigger part of francophone culture than anglophone culture.

Suppose that hockey is popular enough among French-speaking families that the top 20% of boys are still playing organized hockey when they're 16. And suppose that hockey is less popular among English-speaking families, so that only the top 11% of boys are still playing organized hockey when they're 16.

That would explain the numbers exactly. The mediocre francophone players don't get drafted. The mediocre anglophone players don't get drafted either, but they dropped out of organized hockey early enough that they don't make Sirois's survey.
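
The arithmetic behind that claim is short enough to write out. The 20% and 11% participation figures are my suppositions from above, not measured values; the 1-in-618 and 1-in-334 odds are the ones reported in the article.

```python
# Assumed shares of boys still in organized hockey at 16 (suppositions, not data).
STILL_PLAYING = {"francophone": 0.20, "anglophone": 0.11}
# Reported draft odds among those 16-year-old players (1 in N).
ONE_IN = {"francophone": 618, "anglophone": 334}

def drafted_per_boy(group):
    """Draft rate relative to ALL boys in the group, not just those still playing."""
    return STILL_PLAYING[group] / ONE_IN[group]

rates = {g: drafted_per_boy(g) for g in STILL_PLAYING}
for group, rate in rates.items():
    print(f"{group}: about 1 drafted per {1 / rate:,.0f} boys")
```

Under those assumptions, the two groups are drafted at nearly the same rate per boy in the population. The entire gap in the per-player odds comes from who is still playing at 16.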

Again, I'd be willing to bet that this is what's going on. I live in Ottawa, which is on the border with Québec, and I can tell you that the francophone families I know are much, much more hockey-mad than the anglophone families, on both sides of the border.

Sirois says,

"If you're francophone and your son is talented in minor hockey, anglicize his name and you will double his chances of being drafted."


If Sirois is basing that comment only on this particular statistic, his conclusion is premature, to say the least.


3. Francophone players in Québec are less likely to get drafted than anglophone players in Québec, and they go lower in the draft.

Same argument. The less-skilled anglophones drop out of hockey more frequently, while the less-skilled francophones drop out of hockey less frequently. So the remaining anglophones are better, on average, than the remaining francophones.

Again, I'd bet that if you looked at raw population numbers, more Québec francophones get drafted than Québec anglophones, at every level of the draft. They are less likely to be drafted, as Sirois says, if you look only at the pool of 18-year-old players. But I'd bet they are MORE likely to be drafted if you look at the pool of all 18-year-olds in Québec, whether they play hockey or not.

It's selective sampling if the mediocre francophones are more likely to be in the sample than the mediocre anglophones.


4. Of the 763 francophones drafted since 1970, one-third of them went to four teams: the Québec Nordiques, Montréal Canadiens, Buffalo Sabres, and Philadelphia Flyers. (The teams drafting the fewest francophones were the Dallas Stars, Nashville Predators, and Phoenix Coyotes.)

First, and easiest: are there fewer francophone players in the league now than in the past? Given the number of players these days being drafted from outside North America, that would seem likely. That would explain why the Stars, Predators, and Coyotes -- teams that weren't in the league in the 70s and 80s -- would have drafted fewer francophones than the more established teams.

It's the same reason you'd also find that Québec, Montréal, Buffalo and Philadelphia have had more non-helmeted players than Dallas, Nashville, and Phoenix. It's not because Nashville discriminates against bare heads, but because the Predators weren't around when it was legal to go without a helmet.

Secondly: it might just be a difference in scouting. Back in 1985, Bill James did a study of the MLB draft, and found that, in baseball, players in the southern United States were much, much more likely to be drafted than players in the cold states, even if the players were of equal talent. That wasn't racism against Minnesotans, it was just where the scouts decided to go. James wrote,

" ... the explanation seems obvious. ... The scouts spend a lot of time in the South because it gets warm down there while the North is still freezing, and they go where the baseball is. They see more of the players, see the ones they like more often, and wind up falling in love with them."


Doesn't it make sense that the same thing might apply in hockey? There isn't a weather issue, but there *is* a language issue. Doesn't it make sense that the Phoenix Coyotes are less likely to have a French-speaking scout, and are therefore less likely to send someone up to Chicoutimi in February to check out some prospect? If francophone scouts are rarer than anglophone scouts (which they obviously are), it makes sense that not every team would have one, and, as a result, francophone players would be disproportionately drafted by the teams that do. That's not discrimination, it's just rational allocation of resources.

Dallas might just be saying, "you know, we don't have a francophone scout, so we'll let the Canadiens concentrate on prospects in Trois-Rivières, and we'll send our guy to Regina."


5. Sometimes, undrafted players manage to eventually make it into the NHL. That group represents 10% of players overall, but 19% of players from Québec, suggesting that more francophone players are going overlooked.

This can easily follow from the hypothesis that there are more second-tier francophones in the draft pool than anglophones.

Again, suppose that 20% of francophone boys are still playing at age 16 (and therefore scouted for the draft), but only 11% of anglophone boys are. Scouts know that only 1% of Québec boys, of either language, will make the NHL. So they duly draft only 1% of the anglophone population, and 1% of the francophone population.

Scouts aren't omniscient, and they'll miss a few good prospects. There will be undrafted players who bloom later, and finally attract some interest from NHL teams.

Under our assumptions, 19% of francophone boys will be initially passed over, but only 10% of anglophone boys. That leaves almost twice as many francophones who might get noticed (and signed) later. Of course, those nine percentage points of extra francophones are less skilled than the top ten percent, but some players are late bloomers, and the bigger the pool, the more missed players you're going to sign later.

I think that's the obvious true explanation: more players means more late bloomers.

-----

So the points raised in the article are certainly not enough evidence to conclude discrimination -- there is a perfectly plausible, non-racist explanation of each of them.

If you want to show discrimination, you need better arguments than these, to remove the selective-sampling problem intrinsic to each of the arguments here. What you can do is this: find all anglophones drafted in position X, and all francophones drafted in position X. See how they do in the NHL. If there's discrimination, you'll find that francophone 14th picks do better than anglophone 14th picks.
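
A rough sketch of that comparison might look like the following. The record layout (draft slot, language, some career outcome such as NHL games played) and the function name are assumptions for the example, and the rows are placeholders, not real players.

```python
from collections import defaultdict
from statistics import mean

def outcome_gap_by_slot(players):
    """For each draft slot with both groups present, return the mean
    francophone outcome minus the mean anglophone outcome."""
    by_slot = defaultdict(lambda: defaultdict(list))
    for slot, language, outcome in players:
        by_slot[slot][language].append(outcome)
    gaps = {}
    for slot, groups in by_slot.items():
        if groups.get("fr") and groups.get("en"):
            gaps[slot] = mean(groups["fr"]) - mean(groups["en"])
    return gaps

# Placeholder rows (draft slot, language, NHL games played); not real players.
sample = [(14, "fr", 300), (14, "en", 250), (14, "en", 150), (60, "fr", 10)]
gaps = outcome_gap_by_slot(sample)
print(gaps)
```

Consistently positive gaps across many slots would be the pattern to look for. With real data you would also want enough players per slot (or pooled ranges of slots) for the differences to mean anything.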

And even if you find there's discrimination, it doesn't mean it's racist, or even language-specific. It might just be a scouting issue, where there are fewer scouts in Québec than elsewhere, just as there were fewer MLB scouts in Minnesota than in Georgia.

My gut says you won't find much discrimination. I guess I wouldn't be surprised if you found a little bit, that team X might be less interested in a francophone eighth-round pick because of perceived language issues with the other players, when they can't really tell him apart from a similar anglophone player who's also available. But discrimination is expensive, and every team wants to win. If you want to convince me that teams are deliberately leaving money on the table because of racism, you'll have to come up with some pretty good evidence.

These arguments, though, just don't cut it. There are many better, more plausible explanations for the apparent statistical anomalies in the article -- enough so, in fact, that, in my view, the accusations of racism are premature and irresponsible.


(Other views: Here's Tango, and here's mc79.)



Sunday, October 11, 2009

Doesn't "The Book" study pretty much settle the clutch hitting question?

The clutch hitting debate continues. For the latest, here's Tango quoting Bradbury quoting Barra. Bradbury references Bill James' essay, and Barra references Dick Cramer's 1977 study.

In Tango's post, he says,

Anyway, as for actually finding a clutch skill, Andy [Dolphin] did in fact find it, and the results are published in The Book.


Absolutely. It's time, I think, that this study be acknowledged as the most relevant to the clutch question. Cramer's study gets quoted because it's the most famous, but recent studies (like Tom Ruane's) have used a lot more data. Dolphin's study improves on Ruane's by including even more data, by correcting for various factors, and by giving an actual quantitative estimate of how much clutch hitting talent there really is.

The one fault with Dolphin's work is that it hasn't been published in full. This is understandable: "The Book" contains a huge number of studies, and if they were all run in detail, the book would be a couple of thousand pages. But this is one of the most important studies, on one of the most asked questions in sabermetrics. If we want sabermetricians, academics, and reporters to accept the results, the study should be published in full, so as to be subject to full peer review. I'm not even completely sure how the study worked. I have a pretty good idea of the outline, but not the details. Part of the reason the study needs to be published is for the technical details to be available, so others can evaluate the method and reproduce the results if they choose to.

Anyway, here's what I *think* Andy did:

-- he took every regular-season game from 1960 to 1992.
-- he considered only PAs involving RHP, to eliminate platoon bias.
-- for every player who met minimum playing time, he computed his clutch and non-clutch OBP.
-- he adjusted those OBPs to reflect the quality of the opposing pitcher, and the fact that overall clutch and non-clutch OBPs differ.
-- he computed clutch performance by subtracting non-clutch from clutch.

That gave him clutch numbers for 848 players.

-- he looked at the distribution of clutch hitting, and figured the observed variance.
-- he then figured what the variance would have been if there were no clutch hitting.

It turned out that the actual variance was higher than the predicted variance, which is what you'd expect if there were something other than just luck causing the results (such as clutch hitting talent). The difference we can presume to be clutch hitting.

If luck and talent are independent (which is a pretty reasonable assumption), then

Variance caused by talent = (Total Variance) - (Variance caused by luck)

That calculation led Andy to conclude that the talent variance was .008 squared, which meant the standard deviation of clutch talent was 8 points of OBA.

Andy phrased it like this:

"Batters perform slightly differently when under pressure. About one in six players increases his inherent "OBP" skill by eight points or more in high-pressure situations; a comparable number of players decreases it by eight points or more."


That finding, I think, is the strongest we have, and I agree with Tango 100% that we should consider Andy's .008 figure to be the best available answer to the clutch hitting question.
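
The variance subtraction and the "one in six" figure can both be checked with a few lines. Since I don't have Andy's actual observed and luck variances, the inputs below are hypothetical: a .330 OBP hitter with 400 clutch and 4,000 non-clutch PAs, and an observed variance constructed so that the answer comes out to .008. The binomial formula for the luck variance is my assumption about the general approach, not a description of his exact method.

```python
import math

def talent_sd(observed_var, luck_var):
    """SD of talent, assuming observed variance = talent variance + luck variance."""
    return math.sqrt(max(observed_var - luck_var, 0.0))

def luck_var_of_difference(p, n_clutch, n_other):
    """Binomial variance of (clutch OBP - non-clutch OBP) for one hitter."""
    return p * (1 - p) * (1 / n_clutch + 1 / n_other)

# Hypothetical inputs, chosen only to illustrate the subtraction.
luck = luck_var_of_difference(0.330, 400, 4000)
observed = luck + 0.008 ** 2          # constructed so the talent SD is .008
recovered = talent_sd(observed, luck)
print(round(recovered, 4))

# Share of a normal talent distribution more than one SD above the mean.
share = 0.5 * math.erfc(1 / math.sqrt(2))
print(round(share, 3))
```

The share above one SD is a little under 16%, which is where "about one in six" comes from. Note how small the talent variance is next to the luck variance for a single hitter, which is why so much data is needed to detect it.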

----

As I said in previous posts, however, I do have some minor reservations about what we can conclude from the analysis, so it's appropriate to add a few caveats.

1. Mostly, I'm not convinced that the .008 represents individual clutch ability in the sense in which most fans think of it -- that the player "bears down" in important situations and performs better than normal. I wonder if, instead, it might just be a matter of both hitters and pitchers using different strategies in those clutch situations.

For instance, suppose you have a power hitter and a singles hitter, and neither gets any better in the clutch. But in those situations, the relative values of offensive events might change. Maybe, with the score close in the late innings, a home run becomes more valuable relative to a single. I'm making these numbers up, but, maybe instead of the HR being three times as valuable as a 1B, it becomes four times as valuable.

Now, the pitcher's strategy changes. Fearing the home run a little more than normal, he'd be apt to pitch around the power hitter, trading fewer home runs for more walks. That would cause the power hitter's OBP to increase more than expected. Even if there's no similar effect for the singles hitter, he'll look relatively worse in the clutch than the power hitter.

So it's possible, and even plausible, that the .008 might not be a reflection of the clutch behavior of an individual hitter, but just an artifact of the strategic manoeuvering in the batter-pitcher matchup.

To find out, you could check whether certain types of hitters have better clutch performances as a group. If you did find that, it would be evidence that at least part of what Andy found as "clutch ability" is just characteristics of the player.

There is some evidence that some of this is happening: in the book, Andy says that when he used wOBA (which weights events by their value, so HRs are worth about three times what a single is worth) instead of OBP (which weights all on-base events equally), the SD dropped from 8 points to 6. That suggests that clutch performance did indeed involve a trade-off between getting on base and hitting for power.

If you went one step further, and analyzed performance in terms of win probability (instead of OBP or wOBA), you might find some other result, such as no evidence of clutch talent at all. It could be that all the clutch differences are the result of hitters adjusting their game to what the situation requires, such as (say) a power hitter trying for a single with the bases loaded, vs. a home run with two outs and nobody on.

2. Just today, Matt Swartz suggested that lefties might be more "clutch" than righties, because they hit better with runners being held at first (I always thought that was because of the hole between first and second, but Matt suggests it's because that limits the defense's ability to shift in other ways). Again, that's something that's real -- so the team would know they could benefit from it -- but not "clutch" in the sense that the hitter is actually better in some way.

3. Another quibble I have with the conclusion is that the result appears to be not that significantly different from zero. Andy says there's a 68% probability that clutch talent is between 3 and 12 points; I calculated that the 95% confidence interval easily includes zero (the p-value of zero is somewhere around .14). So even if you're only interested in whether there's an ability to have a higher OBP (in the sense that some players' clutch OBPs vary more than others), the evidence is not conclusive beyond a reasonable doubt.

4. As Andy implies in "The Book" (and Guy explicitly suggests elsewhere), there could be other explanations for the .008. It could be that some players happened to have more clutch AB at home, so what we're seeing is partly HFA. It could be that some players happened to see a starter for the third time that game (when batters start gaining an advantage) more often than expected in the clutch. It could be a lot of other things.

Guy suggests doing the same study, but choosing the PA randomly (instead of clutch and non-clutch). That would tell us how much of the .008 happens due to random clustering of factors.

(Note: just as I was about to submit this post, I found an earlier Andy Dolphin study that *does* do this kind of check. Andy found that dividing PA into other situations did not produce any false positives.)

----

Even if some of these criticisms turn out to be justified, it doesn't mean that clutch doesn't matter. Even if we find the entire effect is (say) due to lefties hitting better with runners on base, that's still something a manager or a GM should take into account. If you have two .270 hitters, but one hits .270 all the time, while the other hits .268 usually but .276 in the clutch ... well, you want the second guy. It doesn't really matter to you whether the extra performance comes from the player's gutsiness, or just from something that's inherent in the game.

But my perception is that fans who talk about "clutch" are talking about something in a player's make-up or psychology that makes him more heroic in critical situations. I'd argue that while "The Book"'s study convincingly showed that some players hit slightly better (or worse) in clutch situations, it has NOT shown that it's because the players themselves are "clutch".

----

Looking back at what I wrote, I realize I'm repeating things I said before. But the point I was trying to make is that I agree with Tango: the study in "The Book" is state of the art, and, to my mind, the question of whether players hit differently in the clutch now has an answer.

I'm not sure how to get the result accepted. Well, publication of the study would help; the media are more likely to pay attention to a result if it's a full academic-type study instead of a few pages of a book. I'm sure JQAS would be happy to run it. Even a web publication would help.

What else? Well, I suppose that the more the sabermetric community cites the result, the more it'll spread, and the more likely sportswriters will be to come across it when researching clutch.

Or maybe a press release? It works for Steven Levitt!



Monday, October 05, 2009

Stacey Brook on salary caps and competitive balance

You'd think that when a sport introduces a salary cap, it would lead to greater competitive balance in the league. That would make sense; with a cap, you won't have teams like the Yankees, who spend two-and-a-half times as much on players as the average team, and about five times as much as the Marlins. If you forced the Yankees to spend only the league average, they would have to get rid of many of their expensive star players, and they'd win fewer games.

In theory, if every team had to spend the same amount, they'd all start the year with equal expectations. I say "in theory" because, in practice, different teams would have different philosophies, some of which might work better than others. Certain teams might spend more on scouting, wind up drafting better, and win more games with the same payroll (at least until the draftees reach free agency). But, generally, you'd expect more balance among teams.

It seems that Stacey Brook, co-author of "The Wages of Wins," doesn't think that's true. He thinks that the salary cap (and floor) the NHL instituted in 2005 has had no effect on competitive balance.

Here are Brook's "Noll-Scully" measures of competitive balance for the last few years of the NHL (lower numbers = more balance):

2000-01 1.858
2001-02 1.581
2002-03 1.592
2003-04 1.633
-----------------
salary cap begins
-----------------
2005-06 1.637
2006-07 1.600
2007-08 1.037
2008-09 1.369

It does seem, Brook acknowledges, that competitive balance has improved the last couple of years. But, he says, that's part of a trend that's been going on for a long time. For one thing, there was virtually no change in the Noll-Scully the first two years after the cap. For another, balance has been improving since at least the 1970s:

1970s 2.557
1980s 1.969
1990s 1.796
2000s 1.538

Since competitive balance has been increasing even through most of hockey history that had no salary cap, he argues, it's just a continuation of the trend, and the salary cap doesn't have anything to do with the recent decline. He writes,


"As we argue in The Wages of Wins, and detail in our paper - The Short Supply of Tall People - competitive balance is declining not because of changes in league institutional rules - such as payroll caps - but rather due to the increasing pool of talent to play sports, such as hockey."


But that doesn't make logical sense. Sure, there's already a decreasing trend, for whatever reason, but that doesn't mean a change to the rules can't contribute to the trend. Does having the ability to send text messages lead to people using their phone more? Of course it does! But if you apply the same argument, you get something like, "well, cell phones were becoming more and more popular even before text messaging, so text messaging can't have anything to do with it." That's not right.

And, indeed, it contradicts their own findings in "The Wages of Wins" itself. The authors found that there was an r-squared of .16 between salary and performance in MLB. Which means that if you were to flatten out salaries, so that each team paid an equal amount, it would reduce the variance of wins by 16%. So, absent any compensating factors, "The Wages of Wins" itself argues that a salary cap MUST reduce the Noll-Scully measure!
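To put a rough number on that, here is a back-of-the-envelope sketch in Python. It leans on some loud assumptions: that the salary/wins relationship is causal, that the MLB r-squared would carry over to hockey, and that Noll-Scully scales with the standard deviation (not the variance) of wins. The function name is made up for the example.

```python
import math

def noll_scully_after_flattening(current_ratio, r_squared):
    """Scale an SD-based balance measure when a share r_squared of win variance is removed."""
    # Removing 16% of the variance leaves 84%; the SD shrinks by the square root of that.
    return current_ratio * math.sqrt(1.0 - r_squared)

before = 1.633  # the 2003-04 NHL figure from Brook's table
after = noll_scully_after_flattening(before, 0.16)
print(round(after, 3))
```

Under those assumptions, a 16% cut in variance is only about an 8% cut in the Noll-Scully ratio, because the ratio is built on standard deviations.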

----

By the way, take a look at the value of 1.037 for 2007-08. That's really, really low; the lowest you can expect Noll-Scully to be is 1.000, and that's when every team is of exactly equal talent. A value so close to 1 suggests a combination of (a) the league being really balanced that year, and (b) teams, by luck, playing closer to .500 than their talent suggested.
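For anyone who wants to see where the 1.000 floor comes from, here is a small simulation sketch in Python. It is deliberately simplified and is not Brook's calculation: every team is a true .500 team, each game is an independent coin flip, teams don't actually play each other, and there are no ties or overtime points. The function names and the league size are assumptions for illustration.

```python
import random
import statistics

def noll_scully(win_pcts, games):
    """Observed SD of winning percentage divided by the SD expected
    if every game were a coin flip, which is 0.5 / sqrt(games)."""
    observed_sd = statistics.pstdev(win_pcts)
    ideal_sd = 0.5 / games ** 0.5
    return observed_sd / ideal_sd

def simulate_equal_league(teams=30, games=82, seasons=300, seed=1):
    """Noll-Scully for many seasons of a league where every team is a true .500 team."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(seasons):
        pcts = [sum(rng.random() < 0.5 for _ in range(games)) / games
                for _ in range(teams)]
        ratios.append(noll_scully(pcts, games))
    return ratios

ratios = simulate_equal_league()
print(round(statistics.mean(ratios), 2))              # average across seasons
print(round(min(ratios), 2), round(max(ratios), 2))   # spread of individual seasons
```

The average lands close to 1, but individual seasons scatter on either side of it, which is the "by luck" part of the explanation.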

If you look at the standings, you see the usual suspects at the top of the conferences, so it doesn't really seem like all the teams were equal that year. Could it be that Brook used a formula for Noll-Scully that didn't consider the extra point for an overtime loss?

----

But what about Brook's (and Berri's) argument that balance has increased because players' skills are becoming more equal? Well, sure, that's been part of it, no question. But effects often have more than one cause. You may be earning more money because you're working overtime, but that doesn't mean winning the office hockey pool won't *also* make you richer. Whatever was causing the levelling of team talent before might still be there ... but, now, there's an additional effect, the salary cap effect.

Now, maybe I'm not interpreting Brook's argument correctly. Maybe he's thinking that the salary cap does contribute to balance, but so much less than the other effect (players getting more equally talented) that it's not worth considering. But I think it's the other way around. With a salary cap, it doesn't matter much how the players' talent is distributed.

Suppose players vary a lot in talent, 100 players equally spaced from 0 to 100, with an average of 50. A team that has lots of money might buy players with an average of 70, and a team owned by Harold Ballard might buy players with an average of 30. Big difference.

Now, suppose the talent pool gets bigger, and competition gets tougher, and now the players are all spaced between 40 and 60. Now, no matter how much you want to spend, you can't get above 60. And no matter how cheap you are, you can't get below 40. But the league average is still 50.

So, yes, Brook is correct, a narrower range of talent leads to more competitive balance.

But, now, suppose that every team has a salary cap and a floor: they all have to spend exactly the same amount of money. Now, it doesn't matter how the talent is distributed: assuming every team is equally good at evaluating players, they'll all sign a team with an average of 50. Even if the distribution of talent is like it was in the 1970s, with lots of spread, it doesn't matter -- because even if there are lots of players in the 90s and 100s, no team can afford to sign more than one or two. The more talented the player, the more likely a team who signs him will have to sign *less* talented players to stay within the cap.

Even if you have the Babe Ruth of hockey, a player who's (say) a 500 when the other players top out at 100, it won't matter, because the teams will bid up the price of his services until they pay him what he's worth. The team who gets him will have less money to spend on other players, and it all evens out in the end.

What's happening is this: in the past decades, competitive balance increased steadily for many reasons, including the increase of the talent pool that Brook cites. But, now, with a salary cap and floor, most of that stuff doesn't matter much any more!

It matters a bit, because not everyone is a free agent. The distribution of talent does matter for draft choices, because the top draft choice doesn't cost that much more than the others (but can be a whole lot better, as in Sidney Crosby).

----

Of course, NHL hockey teams are more than collections of free agents priced at market value, so we shouldn't expect competitive balance to be perfectly level. There are some factors that might cause the Noll-Scully to actually rise a bit from the theoretical bottom created by the salary cap.

For instance: the first draft choice goes to a team near the bottom of the standings. Back in the days of less competitive balance, that went to a team that was probably legitimately awful. Now, with teams closer in talent, it could go to a team that was just unlucky. If the team that gets the next Sidney Crosby is an average team, rather than a bad team, that won't improve competitive balance the way it used to.

Also, scouting: an investment in scouting now pays off more than it used to. Before, if you were a low-spending team, maybe a better draft choice might move you from .400 to .450. Now, if all teams are medium-spending, maybe it'll move you from .500 to .550, and give you a legitimate shot at the Stanley Cup. So more teams should be willing to spend the money to improve their drafting. And so, the rich teams could "buy" better players, not by spending to pay them, but by spending to identify them better.

And there are probably other ways to get around the cap: didn't companies introduce employee health plans to get around wage controls in World War II? If a superstar free agent has knee problems, and I wanted to sign that player, I'd offer to hire the best knee doctor in the business and keep him on staff. Whatever he costs, it's not going to count against my cap. That may not actually be practical, but I'm sure rich teams will figure out ways to buy better teams, one way or another.

My point is not to say that these factors will push inequality back to where it was when teams could sign all the free agents they were willing to pay for, just that there may be other theoretical reasons that Noll-Scully may bounce back up a little bit. I think all those factors will be minor, and as long as the salary cap and floor stay within roughly the same range of each other, we'll continue to see a balanced league, regardless of how the talent pool changes.



Hat tip: The Wages of Wins




Sunday, September 27, 2009

A game theory study on pitch selection

Commenter "Eddy" was kind enough to send me this link to what looks like a press release on a new study by Kenneth Kovash and Steve Levitt (of Freakonomics fame). The link is to a summary only; to get the actual study, I had to pay $5.

The study is in two parts: one baseball, and one football. I'll talk about the baseball results here; I'll send the football portion of the study to Brian Burke (of "Advanced NFL Stats") in case he wants to review it himself. I hope that's allowed under fair use and I don't have to pay another $5.

In the baseball half, the authors claim that pitchers throw too many fastballs. They would do better -- much better, in fact -- if they threw other kinds of pitches more often.

How can you tell, using game theory, whether fastballs are being overused? Simple: you just check the outcomes. If opposition hitters bat for an OPS of .850 when you throw a first-pitch fastball, but they OPS (can you use OPS as a verb?) .800 when you don't, then obviously you should cut back on the fastballs. At first glance, it might look like you should get rid of them entirely, because, that way, you could shave .050 off the opposition's OPS. But it's not that simple: as soon as the opposition realizes that you're not throwing fastballs, they'll be able to predict your pitches more accurately, and they'll wind up OPSing higher than .800 -- probably even higher than the original .850. Game theory can't tell you the right proportion, at least not without having to make assumptions that would probably be wrong. But it *can* tell you that you should adjust your strategy until the OPS-after-fastball is exactly equal to the OPS-after-non-fastball.
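To see the equal-OPS condition fall out of a toy model, here is a minimal Python sketch of a two-by-two guessing game. The OPS matrix is entirely invented (exactly the kind of assumption that would probably be wrong in real life), and the function names are hypothetical; the point is only that, once both sides mix optimally, the OPS after a fastball equals the OPS after a non-fastball.

```python
def equilibrium(ops):
    """Mixed-strategy equilibrium of a 2x2 zero-sum pitch-guessing game.

    ops[pitch][guess] is the batter's OPS; keys are "fb" and "other".
    Assumes the batter always does better when he guesses right.
    Returns (pitcher's fastball share, batter's share of fastball guesses).
    """
    a, b = ops["fb"]["fb"], ops["fb"]["other"]
    c, d = ops["other"]["fb"], ops["other"]["other"]
    denom = (a - b) + (d - c)
    p_fastball = (d - c) / denom  # leaves the batter indifferent between guesses
    q_guess_fb = (d - b) / denom  # leaves the pitcher indifferent between pitches
    return p_fastball, q_guess_fb

def ops_after(pitch, q_guess_fb, ops):
    """Expected OPS on a given pitch type when the batter guesses fastball q of the time."""
    return q_guess_fb * ops[pitch]["fb"] + (1 - q_guess_fb) * ops[pitch]["other"]

# Made-up OPS values: rows are the pitch thrown, columns are what the batter sat on.
ops = {
    "fb":    {"fb": 1.000, "other": 0.700},
    "other": {"fb": 0.650, "other": 0.900},
}

p, q = equilibrium(ops)
print(round(p, 3), round(q, 3))
print(round(ops_after("fb", q, ops), 3), round(ops_after("other", q, ops), 3))
```

Change any of the four invented numbers and the optimal fastball share moves, but the two OPS figures on the last line stay equal to each other.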

If that's what the Kovash/Levitt study did, it would be great. But it didn't. Instead, it did something that doesn't make sense, and makes almost all its conclusions invalid.

What did it do? It considered outcomes only for pitches that ended the "at bat". (The authors say "at bat", but I think they mean "plate appearance". I'll also use "at bat" to mean "plate appearance" for consistency with the paper.)

That's a huge selective sampling issue. It means that when a pitch on a 3-0 count is a ball, you count it; when it's put in play, you count it; but when it's a strike, you don't include it. That doesn't work. I can make up some data to show you why. Suppose:

-- Fastballs are 50% put in play, for an OPS of 1.000
-- Fastballs are 50% strikes, for an OPS of .800 after the 3-1 count.

-- Non-fastballs are 25% put in play, for an OPS of .900
-- Non-fastballs are 25% strikes, for an OPS of .800 after the 3-1 count
-- Non-fastballs are 50% balls, for an OPS of 1.000.


That summarizes to:

0.900 OPS for fastball
0.925 OPS for non-fastball


Clearly, you should throw a fastball, right?

But if you consider only the last pitch of the at-bat, you have to ignore those 3-1 counts. Then you get:

1.000 OPS for fastball
0.967 OPS for non-fastball


And it looks like you should throw *fewer* fastballs, not more. And that's wrong.
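Here is the same made-up 3-0 example as a short Python sketch, so the two ways of counting can be compared side by side. The tuple layout and function names are my own for illustration; the probabilities and OPS values are the invented ones from the example, and working them through, the last-pitch-only figure for non-fastballs comes to about .967.

```python
# Each outcome: (probability, eventual OPS of the plate appearance, does this pitch end the PA?)
FASTBALL = [
    (0.50, 1.000, True),   # put in play
    (0.50, 0.800, False),  # strike, count goes to 3-1
]
NON_FASTBALL = [
    (0.25, 0.900, True),   # put in play
    (0.25, 0.800, False),  # strike, count goes to 3-1
    (0.50, 1.000, True),   # ball four
]

def full_ops(outcomes):
    """Expected OPS counting every pitch, including those that only change the count."""
    return sum(prob * ops for prob, ops, _ in outcomes)

def last_pitch_only_ops(outcomes):
    """Expected OPS if you keep only the pitches that end the plate appearance."""
    kept = [(prob, ops) for prob, ops, ends_pa in outcomes if ends_pa]
    total_prob = sum(prob for prob, _ in kept)
    return sum(prob * ops for prob, ops in kept) / total_prob

for name, outcomes in (("fastball", FASTBALL), ("non-fastball", NON_FASTBALL)):
    print(name, round(full_ops(outcomes), 3), round(last_pitch_only_ops(outcomes), 3))
```

Counting every pitch, the fastball is the better choice; counting only PA-ending pitches, the ranking flips, purely because the strikes were thrown away.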

This kind of thing is exactly what Kovash and Levitt have done. They think they've shown that the fastball is a worse pitch than the non-fastball. But what they've *really* shown is that the fastball is a worse pitch than the non-fastball only if you ignore the fact that if the pitch doesn't end the at-bat, the fastball is more likely to put the count more in the pitcher's favor.

So I don't think their main regression result, the one in Table 4, holds water, and I don't think there's a way for the reader to work around it. If the authors just reran that regression, but considered the outcome even if it wasn't the last pitch of the at-bat, that would fix the problem. I'm not sure why they chose not to do that.

----

Still, there are some other aspects of the study that are interesting.

In Table 2, the authors show results for every count separately. On 3-2, every pitch is the last pitch of the AB (except for foul balls, which the authors actually included in the study, but don't affect the results). Therefore, the change in count isn't a consideration, and we can take the results at close to face value.

So what happens? There is indeed a big difference between fastballs and non-fastballs:

.769 OPS after a 3-2 fastball
.651 OPS after a 3-2 non-fastball.


This would certainly lead to a conclusion that pitchers are throwing too many 3-2 fastballs, and the results stunned me: I didn't expect this big a difference. But then it occurred to me: most of the OPS on 3-2 is walks. And walks are undervalued in OPS. If a 3-2 fastball results in more balls in play, but the 3-2 curveball (or whatever) results in more walks, the actual run values might be more even. That is: pitchers know that walks are "worse" than OPS says they are, so they're willing to tolerate a higher OPS for fastballs if it contains fewer walks. That seems quite reasonable.

Suppose walks form half of OBP for fastballs, but 60% of OBP from curveballs. That's a difference of .100 in OPS due to walks. If you assume that should "really" be .140, that closes the gap from 120 points down to 80.

That adjustment is still not enough to explain the entire gap between fastballs and non-fastballs, but it's certainly part of it. In studies like this, where you're looking for very small discrepancies, and you have non-traditional proportions of offensive events, you need to use something more accurate than OPS.

----

But here's something that makes me worry, and I wonder if there's a problem with the authors' database. Here are the overall OPS values for ABs ending on that pitch, from the authors' Table 1:

.753 fastball
.620 non-fastball

Do you see the problem? This data puts the average OPS at .709 (fastballs being twice as likely as non-fastballs). But the overall major-league OPS for the years of the study (2002-2006) was around .750. Why the discrepancy? The authors do say they left out about 6% of pitches, mostly "unknown", but with a few knuckleballs and screwballs. But there's no way 6% of the data could bring a .750 OPS down to .709. So I'm thinking something's wrong here.

There's no such problem with Table 2, which is broken down by count instead of pitch type. That table does average out to about .750.

UPDATE: in the comments, Guy reports that if you calculate SLG with a denominator of PA instead of AB, the numbers appear to work out OK. So the authors probably just miscalculated.

----

Finally, the authors argue that pitchers aren't randomizing enough. According to game theory, there should be no correlation between your choice of this pitch, and your choice of the next pitch. If you have a correlation, because you're choosing not to randomize properly, the opposition can pick up on that, guess pitches with more confidence, and take advantage.

Kovash and Levitt found that pitchers have negative correlation: after a fastball, they're more likely to throw a non-fastball, and vice-versa. They conclude that teams are not playing the optimal strategy, and it's costing them runs.
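For a sense of what that kind of check looks like, here is a rough Python sketch that measures the correlation between each pitch and the next one. This is not the authors' method (the paper's actual regression isn't reproduced here); the simulated pitchers, their probabilities, and the function names are all invented for illustration.

```python
import random

def lag1_correlation(pitches):
    """Correlation between each pitch (1 = fastball, 0 = other) and the pitch that follows."""
    xs, ys = pitches[:-1], pitches[1:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a pitcher who never varies has no measurable correlation
    return cov / (var_x * var_y) ** 0.5

def simulate(n, p_after_fastball, p_after_other, seed=0):
    """Pitch sequence where the fastball chance depends only on the previous pitch."""
    rng = random.Random(seed)
    pitches = [1]
    for _ in range(n - 1):
        p = p_after_fastball if pitches[-1] == 1 else p_after_other
        pitches.append(1 if rng.random() < p else 0)
    return pitches

randomizer = simulate(20000, 2 / 3, 2 / 3)   # ignores the previous pitch entirely
alternator = simulate(20000, 0.5, 0.8)       # backs off the fastball after throwing one

print(round(lag1_correlation(randomizer), 3))
print(round(lag1_correlation(alternator), 3))
```

The first pitcher's correlation comes out near zero, while the second shows the kind of negative correlation the authors describe.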

However: couldn't there be another factor making it beneficial to do that? It's conventional wisdom that, after seeing a fastball, it's harder to hit a breaking pitch, because your brain is still "tuned" to the trajectory of the fastball. If that's true -- and I think every pitcher and broadcaster would think it is, to some extent -- that would easily explain how the negative correlation observed in the study could actually be the optimal strategy. But the authors don't mention it at all.

----

So I don't think we learn much from this paper, but there's a tidbit I found interesting. Apparently Kovash and Levitt have access to MLB bigwigs, and did a little survey:

"Executives of Major League Baseball teams with whom we spoke estimated that there would be a .150 gap in OPS between a batter who knew for certain a fastball was coming versus the same batter who mistakenly thought that there was a 100 percent chance the next pitch would *not* be a fastball, but in fact was surprised and faced a fastball."


That's kind of interesting. I have no idea how accurate the estimate is ... anybody seen any other research on this topic?




Wednesday, September 23, 2009

How much does a "clubhouse cancer" cost his team?

This past weekend, the Cubs suspended outfielder Milton Bradley for the remainder of the 2009 season. Bradley had made some remarks to the press critical of the "negativity" he had received in Chicago. That, combined with his reputation as a complainer who apparently didn't get along with his teammates, prompted GM Jim Hendry to send him home for the rest of the year, with pay.

Do "clubhouse cancers" cost a team wins? In an excellent article at Baseball Analysts, Sky Andrecheck admits he doesn't know. But he looks at the anecdotal evidence of other oncoplayers to at least try to get a handle on how much a team is willing to pay to get rid of one.

This season, Bradley had accumulated 1.2 wins above replacement (WAR) up to the day of his suspension. Andrecheck suggests that he's probably a little better player than that because he's having an off-year. So for Hendry to be willing to lose Bradley's contribution, he must think that his continued presence would cost the team wins at at least that rate. Otherwise, he'd bite the bullet and keep him around.

That figure is in line with another recent disgruntled clubhouse influence, Shea Hillenbrand, who was projected as a 1.4 WAR player when he was released by the Blue Jays in 2007.

Finally, Tom Tango adds a third anecdote. He notes that no team was willing to sign Barry Bonds in 2008, even at minimum salary, when Bonds was projected to be around 1.5 WAR.

As for other "cancers": Albert Belle and Barry Bonds had poor clubhouse reputations, but weren't released by their teams. Those guys were substantially better than 1.5 WAR per season. That strongly suggests that the cost of keeping a player around is less than the cost of losing an all-star. Andrecheck writes, "I can't think of even a 3 or 4 WAR all-star caliber player ever having been given away or released largely due to clubhouse attitude. Instead, teams learn to deal with these players, rather than oust them."

So, it would appear, poisoning the clubhouse is worth somewhere between 1.5 and 3 wins a year.

That's very cool stuff. But I'm still wondering about a related subject, one that the article doesn't try to answer. My question is: just *how* does a clubhouse cancer cause the 1.5 win dropoff? It's unlikely that the personalities of the players affect their team's Runs Created or Pythagorean estimates (unless clutch play is affected more than non-clutch), so the dropoff must come in the performance of the player's teammates. How does one player's negative attitude cause another player's performance to suffer? Do the disheartened fellow players not try as hard? Are they less motivated to receive coaching, or stay in shape? Do they concentrate less on pitching strategy, maybe spending less time with the coach going over scouting reports on opposing batters?

And whatever it is, how do we gather evidence? I suppose we could check the performance records of pitchers while Bradley is on the team, and compare them to their records before he arrived and after he left. But 1.5 wins a year (about 15 runs), with the equivalent of 18 full-time players (nine hitters and nine pitchers), is only about a 0.8-run shortfall per player. That's not much signal to find among all the noise, is it? I suppose you could check the records in the few weeks prior to the player being kicked out, on the premise that that's when the situation became most intolerable. But you might find that the situation reached the breaking point only because the team was losing, so you might mix up cause and effect.

Or maybe it's that one guy who's on the cusp of breaking out, or having a comeback season, just gets discouraged and flames out: some 23-year-old prospect winds up a little less hungry, and gives up a bit too early. That doesn't seem like it could be 1.4 wins, but I guess it's possible.

Any suggestions? I'd even be interested in hearing plausible suggestions for how the 1.4 wins (14 runs) are lost. At least if we have some reasonable hypotheses, maybe we can think of some ways to test them.

----

My suspicion, though, is that the Milton Bradleys don't actually cost their teams 1.4 wins that way. I think there are other reasons that the Cubs might have for releasing Bradley than just a sober calculation of his effect on the team's on-field performance.

First, there's deterrence. There has to be some mechanism by which teams prevent their players from going off half-cocked and ruining team chemistry. There has to be the threat, explicit or implicit, that if the player is disrespectful towards the team, he will pay a price. For most players, who want their time with the team to be as pleasant as possible, the desire to get along with their teammates might be enough incentive. But when an anti-social player crosses the line, the punishment may have to come at a cost to the team.

For instance, suppose a world-famous surgeon commits murder. Putting him in prison might cost the hundreds of lives his skills would save over the years. But society has to jail him anyway; otherwise, they give every surgeon a license to kill.

The same thing might be happening here. Even if keeping Milton Bradley on the team would cost only a small fraction of a win, they'd have to get rid of him anyway, just to make sure the other 24 players don't get similar ideas.

Second, Bradley's presence might cost the team wins in other ways than just on the field. If the clubhouse atmosphere is poisoned, the other players are unhappy. If they are unhappy, they are less likely to want to stay on the team. And so, the Cubs would have to offer them more money to stick around as free agents. Indeed, they'd have to pay *all* free agents more money than they would otherwise. If Chicago is a crappy team to play for, but Boston is wonderful, why would anyone sign with the Cubs? (You might also get an increase in disgusted players demanding to be traded.)

If word gets out around the league that Cubs' management is not willing to enforce normal standards of civility from their players, it could cost them a lot more than 1.4 wins per year.

Third, and thinking out loud: is it not possible that while a poisonous Milton Bradley costs his team 1.4 wins a year, a poisonous player of higher ability might cost the team nothing? Whatever mechanism it is that has Bradley hurting the team on the field, there's no doubt it's because of the reaction and chemistry among the other players, right?

Now, people get upset when social norms are violated: I'm going to be more upset if you steal $20 from me than if my taxes go up $20. Is it possible that putting up with an arrogant superstar is a social norm, but putting up with a marginal player is not?

Isn't it possible that when a superstar acts like a disagreeable moron, the other players kind of shrug and accept it? If the social norm is that some superstars are a**holes, and you just have to get used to it if you want to win, then it might cause no harm at all. Where I used to work, if the manager was being a big jerk, the rest of us would talk about it over coffee, and we'd grin and bear it and get back to work. But if one of our fellow grunts was acting like an idiot, that would be different: that would upset us a lot more, because he was one of us.

Could it be the same thing happening here? When Barry Bonds was a jerk, maybe management took the players aside and said, "yeah, we know he's acting like that, but he's our best chance of winning, so try to deal with it?" That wouldn't work with Milton Bradley, and so the players are less likely to put up with it, and management would have to get rid of him.

Anyway, as I said, just thinking out loud on this one.

Finally, could it be just money? If Milton Bradley is pissing off the other players, and the fans find out, and they start booing Bradley, and the team does nothing about it ... might that not get in the way of the fans' long-term loyalty to the team? The fans are loyal and rabid. They're proud to be Cubs supporters, and many have spent their whole lives dreaming of a World Series win. Then Milton Bradley comes along, winds up in the absolute dream job of Chicago Cub outfielder, but doesn't appreciate what he's got, and starts insulting the Cubs and the fans and the tradition.

Doesn't getting rid of Bradley fulfill an obligation to those fans? Doesn't that build the brand and cement the relationship and lead to fan loyalty and revenues?

Bradley is only going to miss two weeks, and may get a chance to reform. Those two weeks are worth, what, maybe .1 wins? That's less than $1 million -- and, considering that the Cubs are out of playoff contention, it may be only a few hundred thousand. The suspension could pay for itself in no time at all.


Wednesday, September 16, 2009

You can't forecast outcomes that are random

Predictions are often wrong. In an article in the Wall Street Journal last month, "Numbers Guy" Carl Bialik points out a few that went awry. Two years ago, for instance, a government energy agency predicted that the price of oil would be between $75 and $85 in 2008. In reality, it started out the year close to $100, ran up past $140 in July, and dropped back below $40 by the end of the year. Bialik writes, "winging darts at numbers on a board might have been more accurate."

It's easy to make fun of prognosticators when they get this stuff wrong. But let's not get too hasty. The fact is, the things that are most worth predicting are things that are most unpredictable. If you want a prediction of what time the sun will rise tomorrow morning, you can get 100% accurate predictions from any competent astronomer. But what would be the point?

The price of oil varies so much because there are so many factors that influence it: wars, foreign government policies, consumer behavior, US election results, technological advances, natural disasters, and so on. These things are random. And they are very, very complex, most of them being the result of human thought and action.

Still, shouldn't some people be better skilled at making those predictions than others? Absolutely. Tancred Lidderdale, the economist quoted in Bialik's article, has an excellent understanding of the factors that impact the price of oil, much better than mine. So what's wrong with evaluating his predictions after the fact, to see if he's any good?

The problem is that no matter how much you know about the price of oil, it's random enough that the spread of outcomes is really, really wide: much wider than the effects of any knowledge you bring to the problem.

Suppose that on the basis of Miguel Tejada's career, everyone thinks he should hit .290 next year. But suppose Bob, who's a big fan of Tejada, and follows his plate appearances closely, has noticed something about his performance and thinks differently. Maybe it's some detailed observation that he swings a certain way, and other players with the same swing have declined more in their thirties than average. So, Bob figures, Tejada should hit only about .286.

That may be absolutely right, and figuring that out was an act of staggering sabermetric genius. Bob's estimate of .286 is correct, and the .290 estimates are all wrong. Bob is literally the only one in the world whose estimate is correct.

But in practice, how do you prove that? The standard deviation of batting average over 500 AB is about 20 points: so even with .286 being correct, there's still a 46% chance that Tejada will hit closer to .290 than .286 next year. There's actually about a 1 in 3 chance that Tejada's average will be below .266 or above .306. For practical purposes, it's impossible to evaluate the two predictions on this one single sample. Even if Bob is omniscient, knowing everything possible about Tejada's talent, health, and diet, it's going to take a lot of evidence to prove that he's a better estimator than the mob, so long as the results of individual at-bats are random.
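For anyone who wants to check the 46% and 1-in-3 figures, here's a minimal sketch. It assumes Tejada's season average is normally distributed around his true talent of .286 with a standard deviation of 20 points; the normal approximation and the variable names are my own assumptions for illustration.

```python
from statistics import NormalDist

TRUE_AVG = 0.286   # Bob's estimate, taken here as the true talent level
CROWD_AVG = 0.290  # the consensus estimate
SD = 0.020         # assumed SD of batting average over 500 AB

# Assumption: the observed season average is roughly normal around true talent.
season = NormalDist(mu=TRUE_AVG, sigma=SD)

# The observed average is closer to .290 than to .286 whenever it lands
# above the midpoint between the two estimates (.288).
midpoint = (TRUE_AVG + CROWD_AVG) / 2
p_crowd_closer = 1 - season.cdf(midpoint)

# Chance the observed average lands more than 20 points from true talent.
p_outside = season.cdf(TRUE_AVG - SD) + (1 - season.cdf(TRUE_AVG + SD))

print(f"P(closer to .290 than .286): {p_crowd_closer:.3f}")
print(f"P(below .266 or above .306): {p_outside:.3f}")
```

The first number is just the chance of landing a tenth of a standard deviation above the mean, which is why it's so close to a coin flip.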

The problem is the small sample size: over 1000 predictions, or 1,000,000, Bob is going to have a better record than everyone else. But who makes a million predictions, and who keeps track of them to evaluate them afterwards? And even if we do this a reasonable number of times, like 100, Bob still isn't assured of beating me. If his chance of beating me is 54%, then, if we predict 100 times each, I still have about a 21% chance of coming out the winner.

That is: an omniscient expert can beat a reasonably-informed layman only about 79% of the time. And that's after 100 trials each, 100 trials where the predictor actually has a significant edge in knowledge or analysis. In real life, if you get only one trial, and you're not even close to omniscient, and the prediction you're making may not be the one in which you have the most confidence, the public's expectations of you shouldn't be very high. Not because you're ignorant, but because life is just too random.
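The 100-trial numbers can be sketched the same way. This assumes each of the 100 head-to-head predictions is independent, that Bob comes out closer with probability 54% each time, and that a 50-50 split counts as a tie; the function name and the tie handling are my own assumptions.

```python
from math import comb

def head_to_head(n_trials, p_expert):
    """Exact binomial odds for a series of independent head-to-head predictions.

    Returns (P(expert wins more), P(exact tie), P(layman wins more)).
    """
    # Probability the expert wins exactly k of the n_trials matchups.
    pmf = [
        comb(n_trials, k) * p_expert**k * (1 - p_expert) ** (n_trials - k)
        for k in range(n_trials + 1)
    ]
    expert = sum(q for k, q in enumerate(pmf) if 2 * k > n_trials)
    tie = sum(q for k, q in enumerate(pmf) if 2 * k == n_trials)
    layman = sum(q for k, q in enumerate(pmf) if 2 * k < n_trials)
    return expert, tie, layman

bob, tie, layman = head_to_head(100, 0.54)

# Splitting ties evenly between the two forecasters.
print(f"Bob comes out ahead:    {bob + tie / 2:.3f}")
print(f"Layman comes out ahead: {layman + tie / 2:.3f}")
```

With ties split evenly, the two numbers should land close to the 79% and 21% figures. If you count a tie as a failure for Bob, his chance drops a little further.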

Of course, this is an arbitrary example, with more randomness (20 points) than knowledge (4 points). But isn't it roughly the same situation for the price of oil? The randomness in the economy is just huge. Part of the reason oil went down last year is because of the recession. The recession happened because of the credit crisis. And very few people foresaw the credit crisis, including people who had thousands, or millions, or billions of dollars on the line. For a government economist to be omniscient, he has to be omniscient about mortgage finance, and on the government's and public's reaction to every crisis that might possibly occur. That's asking a lot, isn't it? To an energy economist, the state of mortgage finance has to be taken as random.

Because life is random, and the price of oil is very sensitive to the randomness of human-caused shocks, you can't expect a single, point estimate of the price of oil to be 95% accurate within $1, or even $5. An estimate that precise is impossible, beyond the scope of human capability, and probably beyond the scope of the most powerful computers that could be imagined. An honest and competent forecaster will tell you that the best he can do is give you a *distribution* for the future price of oil: maybe that there's a 60% chance it will be between $60 and $110, and a 10% chance it will be below $60, and maybe a 5% chance it'll go over $200 (if there's a major war in the Middle East, say), and so on. That's not something the newspapers are keen to report on -- it's hard to put in a headline, and it's harder for readers to understand.

What Mr. Lidderdale's agency was probably saying, or so you'd hope, was, "we have our best guess at a probability distribution for what the price of oil will be next year. Its mean is in the $75 to $85 range." If that phrasing makes journalists uncomfortable, fine. But that doesn't change the fact that it's the best anybody can do. And it doesn't change the fact that you can't decide how good a predictor is on the basis of one, two, or even a hundred point estimates. You need a LOT of data. And if an outlier happens, all evaluations are off. I'd bet that anyone who predicted, back in 2007, that oil would jump to $140 and then drop back to $37, is a kook, not an expert. What happened in 2008 was something of an outlier, random, unpredictable, and unknowable. Anyone who came close was probably just drop-dead lucky.


