Thursday, December 10, 2009

The Bradbury aging study, re-explained (Part III)

Last week, J.C. Bradbury posted a response to my previous posts on his aging study.

Before I reply, I should say that I found a small error in my attempt to reproduce Bradbury’s regression. The conclusions are unaffected. Details are in small print below, if you're interested. If not, skip on by.

As it turns out, when I was computing each hitter’s age to include in the regression, I accidentally switched the month and day. (Apparently, that wasn’t a problem when the swapped date was invalid – Visual Basic was smart enough to figure out that when I said 20/5 instead of 5/20, I meant the 20th day of May and not the 5th day of Schmidtember. But when the swapped date was valid – 2/3 instead of 3/2 – it used the incorrect date.)
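(For the curious, here’s a Python sketch of that failure mode – the original code was Visual Basic, and the dates here are made up for illustration. Python’s dateutil parser behaves the same way the bug did: it silently accepts a swapped date whenever the swap happens to be valid.)

```python
from dateutil import parser

# When the swapped date is INVALID as month-first (there's no month 20),
# the parser falls back to day-first and gets the right answer anyway:
print(parser.parse("20/5/1975"))  # 1975-05-20 -- correct despite the swap

# When the swapped date is VALID as month-first, it parses silently
# as the wrong date -- February 3rd instead of March 2nd:
print(parser.parse("2/3/1975"))   # 1975-02-03 -- should have been 1975-03-02
```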

That means that some ages were wrong, and some seasons from 24-year-olds were left out of my study. I reran a corrected regression, and the results were very, very similar – all three peak ages I’ve recalculated so far were within .08 years of the original. So the conclusions still hold. If you’re interested in the (slightly) revised numbers, let me know and I’ll post them when I’m done rerunning everything.


Okay, now to Bradbury’s criticisms. I’ll concentrate on the most important ones, since a lot of this stuff has been discussed already.

----

First, there’s one point on which I agree with Bradbury’s critique. He writes,

" … the model, as he defines it, is impossible to estimate. He cannot have done what he claims to have done. Including the mean career performance and player dummies creates linear dependence as a player’s career performance does not change over time, which means separate coefficients cannot be calculated for both the dummies and career performance. … Something is going on here, but I’m not sure what it is."

He’s right: having both the player dummies and the career mean causes collinearity, which I eliminated by getting rid of one of the player dummies. I agree with him that the results aren’t meaningful this way. I should have eliminated the mean and gone with the dummies alone.
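(To see the linear dependence concretely, here’s a small numpy sketch – my own toy construction, not Bradbury’s actual design matrix. Because each player’s career mean is constant across his rows, the career-mean column is an exact linear combination of the player dummies, so the matrix can’t be full rank.)

```python
import numpy as np

# Toy data: two players, three seasons each.
perf = np.array([0.30, 0.32, 0.28,   # player A's seasons
                 0.20, 0.22, 0.21])  # player B's seasons
dummy_A = np.array([1.0, 1, 1, 0, 0, 0])
dummy_B = 1 - dummy_A

# Each player's career mean, repeated on each of his rows:
career_mean = dummy_A * perf[:3].mean() + dummy_B * perf[3:].mean()

X = np.column_stack([dummy_A, dummy_B, career_mean])
print(np.linalg.matrix_rank(X))  # 2, not 3 -- the career-mean column is a
                                 # linear combination of the dummy columns,
                                 # so separate coefficients can't be
                                 # estimated for both
```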

In any case, it doesn’t matter much: the results are similar with and without the dummies. The reason I used the dummies is that they made the results make more sense, and more consistent with what Bradbury found. It turns out that without the dummies, some of the aging curves were very, very flat; with the dummies, they were closer to Bradbury’s.

In retrospect, the reason the curves make more sense with the larger model is that the dummies effectively neutralize any player with only one season (since that player’s dummy will adjust to make him fit whatever curve best fits the other, more-than-one-season players).

Regardless, the peak age is similar either way. But Bradbury’s point is well-taken.

----

Secondly, Bradbury disagrees with me that players are weighted by the number of seasons they played:

"His belief is based on a misunderstanding of how least-squares generates the estimates to calculate the peak. There is no average calculated from each player, and especially not from counting multiple observations for players who play more."

It’s possible I’m misunderstanding something, but I don’t think I am. The model specifies one row in the regression for each player-season that qualifies (a player with a minimum number of PA and seasons). If player A has a 12-year career that peaks at 30, and player B has a 6-year career that peaks at 27, then player A’s trajectory is represented by 12 rows in the regression matrix, and player B’s trajectory by 6 rows.

Bradbury would argue that the scenario above would result in a peak around 28.5 (the simple average of the two players). I would argue that the peak would be around 29 (player A weighted twice as heavily as player B). I suppose I could do a little experiment to check that, but that’s how it seems to me.
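Actually, here’s roughly that experiment, sketched in Python. The trajectories are invented (clean quadratics, no luck or noise), so treat it as an illustration, not proof:

```python
import numpy as np

# Player A: 12 seasons, true peak at age 30.
ages_A = np.arange(24, 36)
perf_A = 100 - (ages_A - 30) ** 2

# Player B: 6 seasons, true peak at age 27.
ages_B = np.arange(22, 28)
perf_B = 90 - (ages_B - 27) ** 2

# Pool all 18 player-seasons and fit one quadratic in age,
# with a separate dummy (intercept) for each player.
ages = np.concatenate([ages_A, ages_B])
perf = np.concatenate([perf_A, perf_B])
dummy_A = np.concatenate([np.ones(12), np.zeros(6)])
X = np.column_stack([ages, ages ** 2, dummy_A, 1 - dummy_A])

b, *_ = np.linalg.lstsq(X, perf, rcond=None)
print(-b[0] / (2 * b[1]))  # vertex of the fitted parabola: about 30.0
```

In this made-up example the fitted peak comes out around 30 – even closer to player A’s peak than simple row-counting suggests, because A’s 12 rows also span a wider range of ages, which gives him still more leverage over the curve. Either way, the long-career player dominates.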

----

Thirdly, Bradbury says I failed to understand that he used rate statistics for home runs, not raw home-run totals:

"I’m estimating home-run rates, not raw home runs. All other stats are estimated as rates except linear weights. This is stated in the paper."


Right, that’s true, but that wasn’t my point. I was probably unclear in my original.

What I was trying to say was: the model assumes that every player’s HR rate improves and declines by the same fixed amount, regardless of where he started.

So, suppose Bradbury’s equation says that players drop by .01 home runs per PA (or AB) the year after age X. (That’s 6 HR per 600 PA.) That equation does NOT depend on how good a home run hitter the player was before. That is: it predicts that Barry Bonds will drop by 6 HR per 600 PA, but, also, that Juan Pierre will drop by 6 HR per 600 PA.

As I pointed out, that doesn’t really make sense, because Juan Pierre never hit 6 HR per 600 PA in the first place, much less late in his career! The model thus predicts that he will drop to a *negative* home run rate.

I continue to argue that while the curve might make sense for the *composite* player in Bradbury’s sample, it doesn’t make sense for non-average players like Bonds or Pierre. That might be lost on readers who look at Bradbury’s chart and see the decline from aging expressed as a *percentage* of the peak, rather than a subtraction from the peak.
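To make the additive-decline problem concrete, here’s a two-line illustration (the peak rates are invented round numbers, not either player’s actual stats):

```python
# Invented peak rates, in HR per 600 PA:
bonds_peak = 45
pierre_peak = 2

decline = 6  # the fixed drop from the example above, in HR per 600 PA

print(bonds_peak - decline)   # 39 -- plausible
print(pierre_peak - decline)  # -4 -- an impossible negative home run rate
```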

-----

Finally, and most importantly, one of Bradbury’s examples illustrates my main criticism of the method. Bradbury cites Marcus Giles. Giles’s best seasons came from ages 25 to 27, but he declined steeply and was out of the league by 30. Bradbury:

"What caused Giles to decline? Maybe he had some good luck early on, maybe his performance-enhancing drugs were taken away, or possibly several bizarre injuries took their toll on his body. It’s not really relevant, but I think of Giles’s career as quite odd, and I imagine that many players who play between 3,000 — 5,000 plate appearances (or less) have similar declines in their performances that cause them to leave the league. I’ve never heard anyone argue that what happened to Giles was aging."


Bradbury’s argument is a bit of a circular one. It goes something like:

-- The regression method shows a peak age of 29.
-- Marcus Giles didn’t peak at 29 – indeed, he was out of the league at 29.
-- Therefore, his decline couldn’t have been due to aging!

I don’t understand why Bradbury would assume that Giles’s decline wasn’t due to aging. If the decline came at, say, 35 instead of 28, there would be no reason to suspect injuries or PEDs as the cause of the decline. So why couldn’t Giles just be an early ager? Why can’t different players age at different rates? Why is a peak age of 25, instead of 29, so implausible that you don’t include it in the study?

It’s like … suppose you want to find the average age when a person gets so old they have to go to a nursing home. And suppose you look only at people who were still alive at age 100. Well, obviously, they’re going to have gone to a nursing home late in life, right? Hardly anyone is sick enough to need a nursing home at 60, but then healthy enough to survive in the nursing home for 40 years. So you might find that the average 100-year-old went into a nursing home at 93.

But that way of looking at it doesn't make sense: you and I both know that the average person who goes into a nursing home is a lot younger than 93.

But what Bradbury is saying is, "well, those people who went into a nursing home at age 65 and died at 70 … they must have been very ill to need a nursing home at 65. So they’re not relevant to my study, because they didn’t go in because of aging – they went in because of illness. And I’m not studying illness, I’m studying aging."
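You can see the selection effect in a quick simulation. Here’s a sketch with invented numbers – the entry ages and survival times are pulled out of thin air, just to show the direction of the bias:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
entry_age = rng.normal(80, 8, n)               # age entering the nursing home
death_age = entry_age + rng.exponential(5, n)  # plus years survived there

print(entry_age.mean())                    # ~80: the true average entry age
print(entry_age[death_age >= 100].mean())  # ~90+: looking only at people
                                           # still alive at 100 selects the
                                           # late entrants
```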

That one difference between us is pretty much my main argument against the findings of the study. I say that if you omit players like Giles, who peaked early, then *of course* you’re going to come up with a higher peak age!

Bradbury, on the other hand, thinks that if you include players like Giles, you’re biasing the sample too low, because it’s obvious that players who come and go young aren’t actually showing "aging" as he defines it. But, first, I don’t think it’s obvious, and, second, if you do that, you’re no longer able to use your results to predict the future of a 26-year-old player. Because, after all, he could turn out to be a Marcus Giles, and your study ignores that possibility!

All you can tell a GM is, "well, if the guy turns out not to be a Marcus Giles, and he doesn’t lose his skill at age 31 or 33 or 34, and he turns out to play in the major leagues until age 35, you’ll find, in retrospect, that he was at his peak at age 29." That’s something, but … so what?


I’m certainly willing to agree that if you look at players who were still "alive" in MLB at age 35, and played for at least 10 years, then, in retrospect, those players peaked at around 29. And I think Bradbury’s method does indeed show that. But if you look at *all* players, not just the ones who aged most gracefully, you’ll find the peak is a lot lower. There are a lot of people in nursing homes at age 70, even if Bradbury doesn’t consider that to be because of "aging."


