Saturday, July 04, 2009

Is Tiger Woods Irrational?

"Loss Aversion" is a kind of irrationality where people care more about avoiding a loss more than about securing a gain of the same amount. For instance, if one person wins $50, while another identical person loses $50, the unlucky gambler will gain more unhappiness than the winner gains happiness. Because of this, people will expend more effort to avoid incurring a loss than they will in pursuit of the identical gain.

This is irrational; if it takes three hours to save a loss of $20, but you can earn $10 an hour, you're worse off if you spend the three hours to save the $20. That's because if you accepted the loss, and instead spent those three hours earning $30, you'd be $10 ahead. But since, as it turns out, humans tend to value losses as about twice the value of the identical gain, there should be a tendency to spend time trying to save the $20 instead of earn the $30.
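Here's that arithmetic as a quick sketch. The dollar figures come from the example above; the "losses count double" weight is the rough rule of thumb just mentioned, and the variable names are only for illustration.

```python
# Figures from the example: a $20 loss, three hours to avoid it, $10/hour wage.
loss_amount = 20
hours_needed = 3
hourly_wage = 10

earnings_if_working = hours_needed * hourly_wage        # $30

# Rational bookkeeping: compare the two choices in actual dollars.
net_if_fight_loss = 0                                    # loss avoided, nothing earned
net_if_accept_loss = earnings_if_working - loss_amount   # 30 - 20 = +10

# Loss-averse bookkeeping (assumption: a loss "feels" twice its size).
loss_weight = 2
felt_if_accept_loss = earnings_if_working - loss_weight * loss_amount  # 30 - 40 = -10

print(net_if_accept_loss, felt_if_accept_loss)
```

In real dollars, accepting the loss comes out $10 ahead; on the loss-averse scale it feels $10 behind, which is why the three hours get spent chasing the $20.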

The irrational fear of losses is a generalization; it's certainly possible that some people don't have this bias, or at least are aware of it and able to compensate. Life is full of decisions and trade-offs, and you'd think the most successful people would have learned how to be more rational in these situations.

Take golf, for instance. In normal PGA tournament play, your score is just the number of strokes you took. A stroke is a stroke, and so it shouldn't matter if it's a stroke to make birdie (which, if missed, will leave your score unchanged), or a stroke to make par (which, if missed, will make your score one stroke worse). Both strokes are of equal importance. However, the stroke for par may count more in the golfer's mind. That's because, if he doesn't make it, it looks like a loss (his score gets worse). But if he misses a birdie putt, it's just a foregone gain (his score stays the same). If losses irrationally count for twice as much as gains in the golfer's mind, then he should try harder to make the par putt than the birdie putt.

And it turns out he does. According to a recent golf study (.pdf, free download) by Devin G. Pope and Maurice E. Schweitzer, PGA golfers are significantly more likely to make a par putt than the identical birdie putt. It's not even close: overall, the difference is about three percentage points. That's huge: the overall conversion rate for putts is 61 percent, and 3 points out of 61 is about five percent. In baseball terms, it's like a 96-66 team suddenly going 102-60 just by changing an irrational strategy.

Pope and Schweitzer did more than just count putt conversion rates, of course. After all, it's likely that birdie putts are different from par putts in many ways. They might be farther from the hole. They might be in less advantageous places on the green. They are less likely to have followed other putts, which means the golfer doesn't have as good a read of the green. Birdie putts might be associated with different kinds of golfers. They might be associated with different difficulties of greens. And so on, and so forth.

The authors took all these things into consideration. Indeed, their database is so extensive (over 1.6 million putts between 2004 and 2008) that the authors were even able to find thousands of "matching" pairs of putts where one was for birdie and one for par, but where both were in almost exactly the same place on the green, in the same round. The results held: par putts were holed significantly more frequently than birdie putts.

It turns out the difference is mostly length: when pros putted for birdie, they were more likely to leave the putt short than when they were putting for par. The idea is that the birdie golfers are playing conservatively: they want to make sure that if they don't sink the putt, they leave it close to the hole for an easy subsequent shot. On the other hand, the par golfers are scared to miss, because that appears to cost them the loss of a stroke, so they make sure they give it enough weight. For putts of 22.5 feet or longer, birdie putters leave the ball about two inches shorter than par putters.

From a strict strategic standpoint, the conservative play makes little sense. The authors find that after a missed (presumably conservative) birdie try, golfers make the subsequent putt only 0.2 percentage points more often than after a missed (presumably aggressive) par try. So the conservatism costs 3 percentage points, but gains only 0.2 percentage points. Moreover, the gain comes only if the putt is missed: assuming a 50% conversion rate on the original putt, the net gain from conservative play is only 0.1 percentage point!
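A back-of-the-envelope version of that trade-off, using the figures quoted above and the same assumed 50% conversion rate on the first putt (the variable names are mine, not the authors'):

```python
first_putt_cost = 3.0     # percentage points given up on the first putt
second_putt_gain = 0.2    # percentage points gained on the follow-up putt
miss_rate = 0.5           # assumed: half of first putts are missed

# The follow-up benefit only matters when the first putt misses.
expected_gain = miss_rate * second_putt_gain      # about 0.1 points
net_effect = expected_gain - first_putt_cost      # about -2.9 points

print(round(expected_gain, 2), round(net_effect, 2))
```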

So why are golfers accepting a much worse first putt for a very, very slight chance at avoiding a three-putt? It must be loss aversion. There are still two possible ways that could manifest itself. It could be that golfers are deliberately trying to hit the ball softer on birdie tries in a conscious effort to be more conservative. Or, it could be that golfers are just trying harder in general on par tries. Either way, that's something that you'd think pros would be able to control, if they realized that what they were doing makes no sense.

Has any golfer figured this out? Apparently not. The authors calculated the effect for each of 188 golfers in their data sample; they got a bell-shaped curve centered around 3.5 percentage points. The "most irrational" golfer's difference was 7 percentage points; the "least irrational" golfer was at half a percentage point. That is: every one of the 188 pro golfers exhibited this bias, to one extent or another. Just as surprising was the finding that the size of the bias was barely correlated with the ranking of the player. Better golfers had less bias, but only very slightly less. Tiger Woods, the best golfer in the world, was almost exactly average in this measure. By eliminating the bias, and shooting birdie putts the same as par putts, Pope and Schweitzer calculate that the average PGA pro will take one stroke off his score for a 72-hole tournament. If a top-20 golfer did this (and none of the other 19 did), his earnings would improve by 22% -- over one million dollars per year.

That's hard to believe, but there it is.

Read the whole study; the authors diced the data many different ways to confirm their thesis, and there are many interesting findings. A New York Times article about the study is here.

P.S. I've been getting lots of spam comments lately, so now comments on older posts (28 days old or more) are moderated. Comments on current posts will continue to appear instantly.

Hat Tip: Inside the Book

Labels:

Tuesday, June 16, 2009

Why are there so many overtime games in the NBA?

There are a lot of NBA games that end up tied after 48 minutes – almost twice as many as expected.

A post from the "Cheap Talk" blog charted score differences for every NBA game between 1997 and 2009. It found a fairly smooth curve, except at zero, where that particular outcome was about twice as frequent as expected.

As it turns out, the spike comes only in the last few seconds of the game. As the authors show in a video on their post, there is barely any spike at all with 40 seconds left. The spike starts to emerge at about 20 seconds, and then grows steadily until 0:00.

Why does this happen? The authors don't really give a hypothesis. One obvious reason, though, is that for the team that's behind, closing the deficit is worthless unless they tie or take the lead. In the first quarter, a team down by three might go for the easy two-pointer instead of the unlikely three. But, with five seconds to go, only the three has any value. Therefore, it's either a tie or nothing.

And if you look at the video again, you'll see that it's not just that zero spikes, but that the values surrounding zero drop. That makes sense – if the team behind by three fails to make it, they'll start fouling the opposition. The most likely result is that they fall further behind. So you get a spike at zero, but a drop at (say) –3, because a minus three with five seconds left gets turned into a minus six or something.

Another possibility (as mentioned by commenters to the post) is that teams are overly conservative. With a tie game and five seconds left, the team with possession might decide to run out the clock instead of risking a turnover. Or, a team behind by two might go for the field goal instead of the three, even if the three is the better strategy. Consider two equally-matched teams. A 40% chance to make two points (and tie) gives you a 20% chance of winning the game. A 30% chance to make a three (and win) gives you a 30% chance of winning the game. A conservative coach might go for the two anyway.
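The comparison is easy to sketch. This assumes, as in the example, that overtime between equally matched teams is a coin flip; the function names are just for illustration.

```python
def win_prob_going_for_two(p_make_two, p_win_overtime=0.5):
    """Down two: a made two-pointer only ties, so you still have to win overtime."""
    return p_make_two * p_win_overtime

def win_prob_going_for_three(p_make_three):
    """Down two: a made three-pointer wins the game outright."""
    return p_make_three

conservative = win_prob_going_for_two(0.40)    # 0.40 * 0.5 = 0.20
aggressive = win_prob_going_for_three(0.30)    # 0.30

print(round(conservative, 2), round(aggressive, 2))
```

With these numbers the three is better by ten percentage points; the two-pointer only catches up if the chance of making it is at least double the chance of making the three.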

That possibility needs more investigation; I wouldn't accuse teams of playing suboptimally without a bit more evidence. I have to admit, though, that it does have some intuitive plausibility.

UPDATE: follow-up post at Cheap Talk here.

Hat tip: The Sports Economist

Labels: ,

Monday, June 01, 2009

A new golf handicapping system

The Royal Canadian Golf Association (RCGA), Canada's governing body for golf, has a committee to consider updating the system by which a golfer's handicap is computed. Tim B. Swartz, the committee's statistical guy, has a paper in the most recent JQAS explaining the new proposed system.

I'm going to simplify things a bit and explain the situation as I understand it from the paper. Leaving out the technical adjustments (even at the cost of a bit of inaccuracy), I'll describe the current handicap system like this:

You start with your 20 most recent scores (with a few adjustments that I'll discuss later), relative to par. Then, you drop the 10 worst scores, leaving only the 10 best. You average those 10 best. That's your handicap.
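As a sketch, the simplified rule looks like this. It leaves out the course and stroke adjustments, so it's my stripped-down description rather than the RCGA's actual formula.

```python
def simple_handicap(scores_vs_par):
    """Average the 10 best (lowest) of the 20 most recent scores relative to par."""
    recent = scores_vs_par[-20:]        # the 20 most recent rounds
    best_ten = sorted(recent)[:10]      # lower is better in golf
    return sum(best_ten) / len(best_ten)

# Made-up example: 20 rounds ranging from +1 to +20 relative to par.
example_scores = list(range(1, 21))
print(simple_handicap(example_scores))  # 5.5, the average of +1 through +10
```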

Why change this system? For one thing, as Swartz points out, your handicap isn't a true indication of your expected score; golfers fail to shoot their handicap more often than not. As I see it, you're taking the average of the top half, so you'd expect the handicap to be at about the 25% mark. If you assume that scores are normal, then since the best half of the curve has a long tail pulling its average away from the middle, it's a bit less than 25%. As it turns out, the mean of the right half of the normal curve is about .798, and the chance of beating a Z-score of +.798 is about 21.2%. So you'd expect a golfer to beat his handicap about 21% of the time.
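The .798 and 21.2% figures can be checked in a couple of lines, assuming scores really are normal. The mean of the upper half of a standard normal curve is sqrt(2/pi), and Python's standard library supplies the normal CDF.

```python
import math
from statistics import NormalDist

# Mean of the upper half of a standard normal distribution.
half_normal_mean = math.sqrt(2 / math.pi)

# Chance that a single normal draw lands beyond that point.
p_beat_handicap = 1 - NormalDist().cdf(half_normal_mean)

print(round(half_normal_mean, 3))   # 0.798
print(round(p_beat_handicap, 3))
```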

Swartz checked, using a database of scores from a golf club in Alberta. As it turns out, golfers actually beat their handicap 36% of the time, not 21%. Maybe I made a mistake in the calculation; maybe golf scores aren't really normal; or maybe the various adjustments are causing the difference.

Another problem with the current system is that in casual head-to-head play, it favors the better golfer. Swartz generated a bunch of random matches from his database, and found that the better golfer won 55 percent of the time, rather than 50 percent.

A third problem, and an important one, is that in multi-player tournaments, the winner is likely to be a golfer with a higher handicap. That's because a bad golfer, with a handicap of (say) 20 (which represents a score of about 92), could reasonably have a very good day and shoot an 80, finishing at -12. But a scratch golfer (0 handicap, 72 average) is much less likely to match the -12 by shooting a 60 on the day.

The more players in the tournament, the more likely someone will have a much better game than normal. And those "much betters" are likely to be from the worst golfers.

In his simulation, Swartz found that the top third of golfers won only 27% of the 99-player tournaments. The middle third won 33%, and the worst third won 40%. So the current system favors the better golfer in tournaments of two players, but favors the worse golfers in tournaments of many players.
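Here's a toy simulation of the tournament effect. It is not Swartz's simulation. I'm assuming every golfer's net score (after handicap) is normal with mean zero, and that the only difference between golfers is that the worse ones are more erratic; the SD range of 2.5 to 4.5 strokes is invented. Because of that, it exaggerates the effect compared to the 27/33/40 split above, but the direction is the same.

```python
import random

def simulate_tournaments(n_players=99, n_tournaments=2000, seed=0):
    """Count tournament wins by skill tier when net scores differ only in spread."""
    rng = random.Random(seed)
    # Hypothetical SDs: player 0 is the best golfer, player n-1 the worst.
    sds = [2.5 + 2.0 * i / (n_players - 1) for i in range(n_players)]
    third = n_players // 3
    wins = {"top": 0, "middle": 0, "worst": 0}
    for _ in range(n_tournaments):
        net_scores = [rng.gauss(0.0, sd) for sd in sds]
        # Lowest net score wins the tournament.
        winner = min(range(n_players), key=net_scores.__getitem__)
        if winner < third:
            wins["top"] += 1
        elif winner < 2 * third:
            wins["middle"] += 1
        else:
            wins["worst"] += 1
    return wins

wins = simulate_tournaments()
print(wins)
```

Everyone has the same average net score here, yet the erratic golfers take most of the titles, because with 99 entrants the winner is whoever has the biggest fluke.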

So how does Swartz fix the current system? Two ways: he makes the handicap represent the player's average score, instead of his 74th percentile score. Second, he divides by a player's standard deviation (effectively converting a raw score to a Z-score), which neutralizes the luck factor in large tournaments.

Here are the details.

Like the current system, Swartz considers only the 20 most recent scores. But instead of dropping the worst 10, he drops only the worst four, leaving 16 scores. Then, instead of just averaging them, the new system uses statistical techniques to estimate the best normal curve to fit the data (keeping in mind that the four worst scores are missing). That is, it asks the question: what is the best fit normal curve that takes into account that we're looking at only the best 16 of 20 observations?

Swartz gives linear formulas (like a Linear Weights estimate of the 16 scores) to estimate the mean and SD of that best-fit curve; he says that those formulas are minimum variance linear unbiased estimators, which means you can't do better (by using different weights) unless you go to a non-linear estimator.

Those estimates of mean and SD become the player's stated handicap (so, effectively, there are two numbers for the handicap instead of one). Then, for his next (21st) round, his raw score is converted to a Z-score, and that's what gets compared to the other players' Z-scores to determine the winner.
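A minimal sketch of the Z-score comparison. The two golfers and their numbers are invented for illustration, and the mean and SD would really come from Swartz's linear formulas, which I'm not reproducing here.

```python
def z_score(raw_score, handicap_mean, handicap_sd):
    """How many of his own standard deviations a golfer shot above or below his mean."""
    return (raw_score - handicap_mean) / handicap_sd

# Hypothetical golfers: an erratic high-handicapper and a steady low-handicapper.
erratic = z_score(raw_score=87, handicap_mean=92, handicap_sd=5.0)   # -1.0
steady = z_score(raw_score=71, handicap_mean=74, handicap_sd=2.5)    # -1.2

# Lower is better, so the steady golfer wins despite beating his mean by fewer strokes.
winner = "steady" if steady < erratic else "erratic"
print(round(erratic, 2), round(steady, 2), winner)
```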

In the study's simulations, golfers shot their handicap 45% of the time with this new system (fairer than 36% with the old system); one-on-one matchups were won by the better golfer only 48 to 51 percent of the time (fairer than 55%); and in tournaments, the best golfers won 29% of the time (fairer than 27%) while the worst won 32 to 34 percent of the time (fairer than 40%).

I promised some details of the adjustments to player scores that go into the formulas. I'll outline them here, and you can see the paper (which is nicely presented and very easy to read) for the details.

First, under both systems, scores are adjusted twice for the difficulty of the course. There's the course rating, which specifies how hard the course is for excellent (scratch) golfers, and the slope rating, which specifies how hard the course is for worse (bogey) golfers after adjusting for the course rating.

Then, there's something called "equitable stroke control" (ESC). That sets a maximum possible score for each hole, so that (for instance) a bad golfer can't score more than a quadruple bogey. Even if it takes him ten strokes to finish a par-three, he can't put more than 7 down on the scorecard. (In Canada, the stroke limit varies by handicap between bogey and quadruple-bogey; in the US, it seems the limits are fixed and not based on par. Swartz says this is the only difference in the current system between the two countries.)
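In code, the cap is just a minimum. This sketch hard-codes the quadruple-bogey limit from the example; as noted, the real Canadian limit depends on the golfer's handicap, so treat `max_over_par` as a placeholder.

```python
def esc_adjusted(strokes, par, max_over_par=4):
    """Cap a hole score at par plus a fixed limit (quadruple bogey by default)."""
    return min(strokes, par + max_over_par)

print(esc_adjusted(strokes=10, par=3))  # 7
print(esc_adjusted(strokes=4, par=3))   # 4, unaffected by the cap
```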

The idea is that very high scores measure golfer frustration rather than skill, and should be discounted. Also, Swartz says, it discourages "sandbagging," which is deliberately trying to inflate your handicap, and provides a maximum if you forget to write down your score.

In this study, Swartz often gives results both with ESC and without, and the results are fairly similar.


But, after all those adjustments, I think the essence is:

-- the old system adjusts for your score relative to your own 63rd percentile.
-- the new system adjusts any outliers in your worst quintile, and gives you a Z-score relative to your own distribution.

I'm not a serious golfer, so I don't know if the added complexity of the new system is worth the advantages. It does seem to me that the new system is better, though.

Labels:

Tuesday, May 19, 2009

Don't always blindly insist on statistical significance

Suppose you run a regression, and the input you're investigating appears to have a real-life relationship to the output. But it also turns out that, despite being significant in the real-life sense, the relationship is not statistically significant. What do you do?

David Berri argues (scroll down to the second half of his post) that once you realize the variable is statistically insignificant, you stop dead:

We do not say (and this point should be emphasized) the “coefficient is insignificant” and then proceed to tell additional stories about the link between these two variables.

One of my co-authors puts it this way to her students.

“When I teach econometrics I tell my students that a sentence that begins by stating a coefficient is statistically insignificant ends with a period.” She tells her students that she never wants to see “The coefficient was insignificant, but…”


Well, I don't think that's always right. I explained why in a post two weeks ago, called "Low statistical significance doesn't necessarily mean no effect." My argument was that, if you already have some reason to believe there is a correlation between your input and your output, the result of your regression can help confirm your belief, even if it doesn't rise to statistical significance.

Here's an example with real data. I took all 30 major league teams for 2007, and I ran a regression to see if there was a relationship between the team's triples and its runs scored. It turned out that there was no statistically-significant relationship: the p-value was 0.23, far above the 0.05 that's normally regarded as the threshold.

Berri would now say that we should stop. As he writes,

"Even though we have questions, at this point it would be inappropriate to talk about the coefficient we have estimated ... as being anything else than statistically insignificant."


And maybe that would be the case if we didn't know anything about baseball. But, as baseball fans, we know that triples are good things, and we know that a triple does help teams score runs. That's why we cheer our team's players when they hit them. There is strong reason to believe there's a connection between triples and runs.

So I don't think it's inappropriate at all to look at our coefficient. It turns out that the coefficient is 1.88. On average, every additional triple a team hit was associated with an increase of 1.88 runs scored.

Of course, there's a large variance associated with that 1.88 estimate -- as you'd expect, since it wasn't statistically significantly different from zero. The standard deviation of the estimate was 1.53. That means a 95% confidence interval is approximately (-1.18, 4.94). Not only is the 1.88 not significantly different from zero, it's also not significantly different from -1, or from almost +5!
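That interval is the usual "estimate plus or minus two standard errors" approximation. A quick sketch, with a helper name of my own choosing:

```python
def approx_95_ci(estimate, std_error, width=2.0):
    """Rough 95% interval: the estimate plus or minus two standard errors."""
    return (estimate - width * std_error, estimate + width * std_error)

low, high = approx_95_ci(1.88, 1.53)
t_ratio = 1.88 / 1.53   # about 1.23 standard errors away from zero

print(round(low, 2), round(high, 2))   # -1.18 4.94
print(round(t_ratio, 2))               # 1.23
```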

But why can't we say that? Why shouldn't we write that we found a coefficient of 1.88 with a standard deviation of 1.53? Why can't we discuss these numbers and the size of the real effect, if any?

Berri and his co-author would argue that it's because we have no good evidence that the effect is different from zero. But what makes zero special? We also have no good evidence that the effect is different from 1.88, or 4.1, or -0.6. Why is it necessary to proceed as if the "real" value of the coefficient is zero, when zero is just one special case?

As I argued before, zero is considered special because, most of the time, there's no reason to believe there's any connection between the input and the output. Do you think rubbing chocolate on your leg can cure cancer? Do you think red cars go faster than black cars just by virtue of their color? Do you think standing on your head makes you smarter?

In all three of these examples, I'd recommend following Berri's advice, because there's overwhelming logic that says the relationship "should" be zero. There's no scientific reason that red makes cars go faster. If you took a thousand similarly absurd hypotheses, you'd expect at least 999 of them to be zero. So if you get something positive but not statistically significant, the odds are overwhelming that the non-zero point estimate got that way just because of random luck.

But, for triples vs. runs, that's not the case. Our prior expectation should be that the result will turn out positive. How positive? Well, suppose we had never studied the issue, or read Bill James or Pete Palmer. Then, we might naively figure, the average triple scores a runner and a half on base, and there's a 70% chance of scoring the batter eventually. That's 2.2 runs. Maybe half the runners on base would score eventually even without the triple, so subtract off .75, to give us that the triple is worth 1.45 runs. (I know these numbers are wrong, but they're reasonable for what I might have guessed pre-Bill James.)
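Spelled out, the naive guess is just this (same admittedly wrong numbers as above):

```python
runners_scored = 1.5          # runners on base driven in by the average triple
batter_scores_later = 0.7     # chance the batter eventually comes around
would_have_scored = 0.75      # half of those 1.5 runners score even without the triple

naive_triple_value = runners_scored + batter_scores_later - would_have_scored
print(round(naive_triple_value, 2))   # 1.45
```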

If our best estimate going in was that a triple should be worth 1.45 runs, and the regression gave us something close to that (and not statistically significantly different), then why should we be using zero as a basis for our decision for whether to consider this valid evidence?

Rather than end the discussion with a period, as Berri's colleague would have us do, I would suggest we do this:

-- give the regression's estimate of 1.88, along with the standard error of 1.53 and the confidence interval (-1.18, 4.94).
-- state that the estimate of 1.88 is significant in the baseball sense.
-- admit that it's not significantly different from zero.
-- BUT: argue that there's reason to think that the 1.88 is in the neighborhood of what theory predicts.

If I were writing a paper, that's exactly what I'd say. And I'd also admit that the confidence interval is huge, and we really should repeat this analysis with more years' worth of data, to reduce the standard error. But I'd argue that, even without statistical significance, the results actually SUPPORT the hypothesis that triples are associated with runs scored.

You've got to use common sense. If you got these results for a relationship between rubbing chocolate on your leg and cancer, it would be perfectly appropriate to assume that the relationship is zero. But if you get these results for a relationship between height and weight, zero is not a good option.

And, in any case: if you get results that are significant in the real world, but not statistically significant, it's a sign that your dataset is too small. Just get some more data, and run your regression again.

------

Here's another example of how you have to contort your logic if you want to blindly assume that statistical insignificance equals no effect.

I'm going to run the same regression, on the 2007 MLB teams, but I'm going to use doubles instead of triples. This time, the results are indeed statistically significant:

-- p=.0012 (significant at 99.88%)
-- each double is associated with an additional 1.50 runs scored
-- the standard error is 0.417, so a 95% confidence interval is (0.67, 2.33)

Everyone would agree that there is a connection between hitting doubles and scoring runs.

But now, Berri and his colleague are in a strange situation. They have to argue that:

-- there is a connection between doubles and runs, but
-- there is NO connection between triples and runs!

If that's your position, and you have traditional beliefs about how doubles lead to more runs (by scoring baserunners and putting the batter on second base), those two statements are mutually contradictory. It's obvious to any baseball fan that, on the margin, a triple will lead to at least as many runs scoring as a double. It's just not possible that a double is worth 1.5 runs, but the act of stretching it into a triple makes it worth 0.0 runs instead. But if you follow Berri's rule, that's what you have to do! Your paper can't even argue against it, because "the coefficient was insignificant, but ..." is not allowed!

Now, in fairness, it's not logically impossible for doubles to be worth 1.5 runs in a regression but triples 0.0 runs. Maybe doubles are worth only 0.1 runs in current run value, but they come in at 1.5 because they're associated with power-hitting teams. Triples, on the other hand, might be associated with fast singles-hitting teams who are always below average.

In the absence of other evidence, that would be a valid possibility. But, unlike the chocolate-cures-cancer case, I don't think it's a very likely possibility. If you do think it's likely, then you still have to make the argument using other evidence. You can't just fall back on the "not significantly different from zero."

Using zero as your baseline for significance is not a law in the field of statistical analysis. It's a consequence of how things work in your actual field of study, an implementation of Carl Sagan's rule that "extraordinary claims require extraordinary evidence." For silly cancer cures, for red cars going faster than black cars, saying there's a non-zero effect is an extraordinary claim. And so you need statistical significance. (Indeed, silly cancer cures are so unlikely that you could argue that 95% significance is not enough, because that would allow too many false cures (2.5%) to get through.)

But for triples being worth about the same as doubles ... well, that's not extraordinary. Actually, it's the reverse that's extraordinary. Triples being worth zero while doubles are worth 1.5 runs? Are you kidding? I'd argue that if you want to say triples are worth less than doubles, the burden is reversed. It's not enough to show that the confidence interval includes zero. You have to show that the confidence interval does NOT include anything higher than the value of the double.
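Here's the reversed test as a sketch, using the triples interval and the doubles estimate from the regressions above. The helper is only illustrative.

```python
def interval_contains(interval, value):
    """True if value lies inside the (low, high) interval."""
    low, high = interval
    return low <= value <= high

triples_ci = (-1.18, 4.94)    # 95% interval for the triples coefficient
doubles_estimate = 1.50       # point estimate for the doubles coefficient

cannot_rule_out_zero = interval_contains(triples_ci, 0.0)
cannot_rule_out_double = interval_contains(triples_ci, doubles_estimate)

print(cannot_rule_out_zero, cannot_rule_out_double)   # True True
```

The data can't rule out zero, but they also can't rule out a triple being worth as much as a double, or a good deal more. So they don't come close to supporting "triples are worth less than doubles."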


According to David Berri, the rule of thumb in econometrics is, "if you don't have significance, ignore any effect you found." But that rule of thumb has certain hidden assumptions. One of those assumptions is that on your prior beliefs, the effect is likely to be zero. That's true for a lot of things in econometrics -- but not for doubles creating runs.

-----

This doubles/triples comparison is one I just made up. But there's a real life example, one I talked about a couple of years ago.

In that one, Cade Massey and Richard Thaler did a study (.pdf) of the NFL draft. As you would expect, they found that the earlier the draft pick, the more likely the player was to make an NFL roster. Earlier choices were also more likely to play more games, and more likely to make the Pro Bowl. Draft choice was statistically significant for all three factors.

Then, the authors attempted to predict salary. Again as you'd expect, the more games you played, and the more you were selected to the Pro Bowl, the higher your salary. And, again, all these were statistically significant.

Finally, the authors held all these constant, and looked at whether draft position influenced salary over and above these factors. It did, but this factor did not reach statistical significance. Higher picks earned more money, but by somewhere between 1 and 2 SDs.

From the lack of significance, the authors wrote:

" ... we find that draft order is not a significant explanatory variable after controlling for [certain aspects of] prior performance."

I disagree. Because for that to be true, you have to argue that

-- higher draft choices are more likely to make the team
-- higher draft choices are more likely to play more games
-- higher draft choices are more likely to make the Pro-Bowl

but that

-- higher draft choices are NOT more likely to be better players in other ways than that.

That makes no sense. You have two offensive linemen on two different teams -- good enough to play every game for five years, but not good enough for the Pro Bowl. One was drafted in the first round; one was drafted in the third round. What Massey and Thaler are saying is that, despite the fact that the first round guy makes, on average, more money than the third round guy, that's likely to be random coincidence. That flies in the face of the evidence. Not statistically significant evidence, but good evidence nonetheless -- a coefficient that goes in the right direction, is significant in the football sense, and is actually not that far below the 2 SD cutoff.

That isn't logical. You've shown, with statistical significance, that higher picks perform better than lower picks in terms of playing time and stardom. The obvious explanation, which you accept, is that the higher picks are just better players. So why would you conclude that higher picks are exactly the same quality as lower picks in the aspects of the game that you chose not to measure, when the data don't actually show that?

In this case, it's not only acceptable, but required, to say "the coefficient was insignificant, but ..."



Labels: , ,

Wednesday, May 13, 2009

How many runs are created by good baserunning?

There's a nice paper on baserunning in the latest issue of JQAS, "Using Simulation to Estimate the Impact of Baserunning Ability in Baseball." It's by Ben Baumer, the New York Mets' stats guy.

Baumer set out to quantify baserunning skill, in terms of runs. Specifically, he considered these seven skills:

-- advancing first to third on a single
-- advancing first to home on a double
-- advancing second to home on a single
-- beating out a DP attempt on a ground out
-- stealing second
-- stealing third
-- tagging up on a fly ball when on second or third

He created a (Markov) simulation using 2005-2007 league-average results for each of the seven skills, and proved that his model came close to actual league runs scored.

Then, he substituted actual team lineups, and, for every player, used their actual baserunning percentages for each of the seven situations. There were two probabilities for each situation: the probability of trying for an extra base (for double plays, this is the probability of there being a force play on the runner on first with less than two outs), and the probability of success given that an attempt was made.

For each team, he then ran the same simulation, but using league-average baserunning. The difference is an estimate of how many runs the team's players gained (or lost) with their baserunning.

The top three and bottom three:

+21.1 Mets
+18.0 Yankees
+14.7 Rockies

-12.3 Marlins
-12.7 Red Sox
-20.8 White Sox

Baumer concludes that most teams should be within 25 runs of average baserunning.

But now he wants to figure out, in theory, how many runs a really great baserunning team would gain, and how much a really bad team would lose. He tries a bunch of different selection criteria for "best" and "worst." The results, simplified a bit:

+41.0 runs, -54.6 runs -- high/low attempt rates
+39.4 runs, -35.1 runs -- high/low success rates
+68.4 runs, -42.5 runs -- high/low combination of attempts/successes


The +68 lineup consisted of: Joey Gathright, Willy Taveras, Jose Reyes, Willie Harris, Chone Figgins, Nook Logan, Josh Barfield, Pablo Ozuna, and Juan Pierre. The -54 lineup was Bengie Molina, Mike Piazza, Josh Bard, Bill Mueller, Frank Thomas, Olmedo Saenz, Ryan Garko, Jay Gibbons, and Toby Hall.

As I said, I really like this paper; it asks an interesting and well-defined question and answers it well. Moreover, it's written for readers who know baseball a bit. It does use more mathematical notation than is necessary for sabermetricians, but given that it's an academic paper, and given that the notation is not overdone and clearly explained, I'd have to say that it's very well done.

The one criticism I have is that, as far as I can tell, Baumer used actual raw success rates and didn't regress to the mean at all. That means that while the results wind up accurate in terms of what the actual run contribution was, they are exaggerated estimates of the actual skill of the players involved. If you're thinking about 2010, there's probably no way to estimate, in advance, what any given set of baserunners will do. While the "combination" group added 68.4 runs a season from 2005 to 2007, they'd regress to the mean in 2009-2011 by some amount. What's that amount? We don't really know.

Oh, and one useful point that I'll use in future: for leagues that score 0.531 runs per inning, the variance of runs per inning is 1.125. I've always used 1.000 as an estimate, based on some research I did on the 1988 AL a long time ago, but I think that league scored only .5 runs/inning. Also of note: a simulation that assumes average pitching and an average lineup has a variance a bit smaller: around 1.1 runs instead of 1.125. That's obviously because the pitching doesn't vary in the simulation, only the hitting.



Friday, May 08, 2009

The regression equation versus r-squared

OK, I hope I'm not beating a dead horse here, but here's another way to think of the difference between r-squared and the regression equation.

The r-squared comes from the standpoint of stepping back and looking at the distribution of wins among teams in your dataset. Some teams have over 60 wins, some teams have under 20 wins, and some teams are in the middle. If you look at the standings, and ask yourself, "how important are differences in salary to how we got this way?", then you're asking about r-squared.

The regression equation matters more if you're interested in the future, if you care about how much you can influence wins by increasing payroll. If you ask yourself, "how much do I have to spend to get a few extra wins?", then you want the regression equation.

The r-squared looks at the past, and asks, "was salary important to how we got to this variance in wins?". The regression equation looks to the future, and says, "can we use salary to influence wins?"

It's very possible, and very easy, to have two different answers to these two questions. Here's an example.

Suppose you're trying to see what activities 25-year-olds partake in that affect their life expectancy. You might discover that the average 25-year-old lives to 80, but you want to try to figure out what factors influence that. You run a multiple regression, and you figure out that if the person smokes at 25, it appears to cut five years off his life expectancy. If he eats healthy, it adds four years. If he commits suicide at 25, it cuts off 55 years (since he dies at 25 instead of 80).

Your regression equation would look something like:

life expectancy = 80 - (5 * smoker) + (4 * eats healthy) - (55 * commits suicide).

We should all agree that committing suicide has a big effect on life expectancy, right?

Now, let's look at the r-squared. To do that, look at all the 25-year-olds in the sample (which might be several thousand). You'll see a few that live to 25, some that live to 45, a bunch that live to 65, a larger bunch that live to 80, and some that live to 100. The distribution is probably bell-shaped.

For the r-squared, ask yourself: how much did suicide contribute to the curve looking like this? The answer: very little. There are probably very few suicides at 25, and even if you adjusted for those, by taking those points out of the left side of the curve and moving them to the peak, the curve would still look roughly the same. Suicide is not a very big factor in making the curve look like it does.

And so, you get a very low r-squared for suicide. Maybe it would be .01, or even less.

See the apparent contradiction?

-- suicide has a HUGE effect on lifespan.
-- r-squared for suicide vs. lifespan is very low

And, again, that's because:

-- the regression equation tells you what effect the input has on the output;
-- the r-squared tells you how important that input was in creating the distribution you see.

The regression equations tell you that having a piano drop on your head is very dangerous. The low r-squared tells you that pianos haven't historically been a major source of death.
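
To see both numbers come out of the same data, here is a minimal Python sketch. The population is made up for illustration (9,990 people who live to 65, 80, or 95 in equal shares, plus 10 suicides at 25); none of these counts come from real mortality data, and the helper `fit` is just a hand-rolled one-variable least-squares fit.

```python
def fit(xs, ys):
    """One-variable least squares: return (slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

# Hypothetical population: non-suicides average 80, suicides die at 25.
lifespans = [65] * 3330 + [80] * 3330 + [95] * 3330 + [25] * 10
suicide = [0] * 9990 + [1] * 10   # 1 = committed suicide at 25

slope, r2 = fit(suicide, lifespans)
print("effect of suicide on lifespan:", round(slope, 1), "years")
print("r-squared:", round(r2, 3))
```

The slope comes out at the full 55 lost years, while the r-squared stays down around a couple of percent, because only 10 people in 10,000 are affected.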

----

Here's a different way to explain this, which might make more sense to gamblers:

Suppose that you had to predict the lifespan of a random 25-year-old. Obviously, the more information you have, the more accurate your estimate will be. And, imagine the amount you lose is the square of the error in your guess. So if you guess 80, and the random person dies at 60, you lose $400 (the square of 80 minus 60).

Without any information, your best strategy is to guess the average, which we said was 80. Your average loss will be the variance, which is the square of the SD. Suppose that SD is 15. Then, your average loss would be $225.
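
A quick sanity check of that claim, as a Python sketch. The two-person "population" is a toy assumption chosen so that the mean is 80 and the SD is 15; `average_loss` is a hypothetical helper name.

```python
import statistics

# Toy population (assumed for illustration): mean 80, SD 15.
lifespans = [65, 95]

def average_loss(guess, outcomes):
    """Average squared-error payout if you always make the same guess."""
    return sum((guess - x) ** 2 for x in outcomes) / len(outcomes)

mean = statistics.mean(lifespans)
variance = statistics.pvariance(lifespans)

loss_at_mean = average_loss(mean, lifespans)   # equals the variance
loss_at_75 = average_loss(75, lifespans)       # any other guess costs more

print(mean, variance, loss_at_mean, loss_at_75)
```

Guessing the mean costs exactly the variance ($225 here); moving the guess off the mean, to 75, pushes the average loss up to $250.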

Now, how valuable is it to know whether or not the guy committed suicide? Probably not very. Most of the time, the answer will be "no", and you're only slightly better off than when you started (maybe you guess 80.05 now instead of 80). A tiny, tiny proportion of the time, the answer will be "yes," and you can safely guess 25 and be right on. On balance, you're a little better off, but not much.

On average, how much less will you lose given the extra information? The answer is given by the r-squared. If the r-squared of the suicide vs. lifespan regression is .01, as estimated above, then your loss will be reduced by 1%. Instead of losing $225, on average, you'll lose only about $222.75.

Again: the r-squared doesn't tell you that suicide is dangerous. It just tells you that, because of *some combination of dangerousness of suicide and historical frequency of suicide*, you can shave 1% off your error by taking it into account.

----

Now, let's reapply this to basketball. The r-squared for salary vs. wins was .2561. The SD of wins was 14.1, so the variance was the square of that, or 199.

If you took a bet where you had to guess a random team's wins, and had to pay the square of the difference, you'd pick "41" and, on average, owe $199. But let's suppose someone tells you the team's payroll. Now, you can adjust your guess, to predict higher if the team has a high payroll, or lower if the team has a low payroll. If you adjust your guess optimally -- by using the results of the regression equation -- you'll cut your average loss by 25.61%. So, on average, you'd lose only 74.39% as much as before. That works out to $148.11.
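
The arithmetic for both bets is the same: the average loss left over is the variance times (1 minus r-squared). Here it is as a small Python sketch. The function name is mine, and the NBA figure uses the rounded SD of 14.1, so it lands slightly under the $148.11 quoted above.

```python
def remaining_loss(sd, r_squared):
    """Average squared-error loss after using the regression to adjust the guess."""
    variance = sd ** 2
    return variance * (1 - r_squared)

# Lifespan bet: SD 15, r-squared .01 for suicide.
lifespan_loss = remaining_loss(15, 0.01)

# NBA bet: SD of wins 14.1 (rounded), r-squared .2561 for salary.
nba_loss = remaining_loss(14.1, 0.2561)

print(round(lifespan_loss, 2))
print(round(nba_loss, 2))
```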

What Berri, Brook and Schmidt are saying, in "The Wages of Wins," is, "look, if you can only cut your losses by 25.61% by knowing salary, then money can't be that important in buying wins." But that's wrong. What they should conclude is that "how important money is, combined with how often it's been used to buy wins," isn't that important.

And, really, if you look at the full results of the regression, it turns out that money IS important in buying wins, but that not too many teams took advantage of that fact in 2008-09.

The equation shows that every $1.6 million in additional salary will buy you a win -- so if you want to go 61-21, it should only cost you $32 million more than the league-average payroll of $68.5 million.

That's pretty important, and so the low r-squared must mean that not many teams varied much in salary. If you look at the salary chart, there's a huge group bunched near the average: there are 18 teams between $62mm and $75mm, within $6.5 million of the average. Those teams are so close together that there's not much difference in their expected wins.

If you have to bet, and the random team you pick turns out to be the lowest-spending in the league, you'll reduce your estimate. You would have lost a lot of money guessing 41, so the information that you picked a low-spending team will cut your losses a lot. If it turns out to be one of the highest-spending in the league, same thing. But if it turns out to be one of the 18 teams in the middle, the salary information won't help you much. And that's why the r-squared is only about 25% -- for many of the teams in the sample, knowing the salary doesn't help you cut your losses much.


What if we take out those 18 teams, and regress only on the remaining 12? Well, the regression equation stays almost the same -- $1.5 million per win instead of $1.6. But the r-squared increases to .4586. Why does the r-squared increase? Because salary is much more significant a factor for those 12 teams than for the ones in the middle. Before, knowing the salary might not do you much good for your estimate if it's one of the teams bunched in the middle. But, now, those teams are gone. Your random team is much more likely to be the Cavaliers or the Clippers, so knowing the salary is a much bigger help, and it lets you cut your betting losses by almost half.
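
Here's a toy Python sketch of the same effect. The four-team "leagues" are invented for illustration (they are not the real 2008-09 payrolls), with wins built from the same slope of 0.6 wins per $million and the same luck in both cases. The only difference is how spread out the payrolls are. The check at the end uses the standard identity for a one-variable regression, r-squared = slope squared times var(x) / var(y).

```python
def fit(xs, ys):
    """One-variable least squares: return (slope, r_squared, var_x, var_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, (sxy * sxy) / (sxx * syy), sxx / n, syy / n

luck = [10, -10, 10, -10]            # same random noise in both leagues

bunched_payroll = [66, 66, 71, 71]   # hypothetical: everyone near the average
spread_payroll = [45, 45, 100, 100]  # hypothetical: big and small spenders

def wins(payrolls):
    return [0.6 * p + e for p, e in zip(payrolls, luck)]

b_slope, b_r2, b_vx, b_vy = fit(bunched_payroll, wins(bunched_payroll))
s_slope, s_r2, s_vx, s_vy = fit(spread_payroll, wins(spread_payroll))

print("bunched:", round(b_slope, 2), round(b_r2, 3))
print("spread: ", round(s_slope, 2), round(s_r2, 3))
```

Both fits recover the same 0.6 wins per $million, but the r-squared is tiny for the bunched league and large for the spread-out one.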

----

One last summary:

1. The regression equation tells you how powerful the input is in affecting output -- is it a nuclear weapon, or a pea-shooter?

2. The r-squared tells you how powerful the input is, "multiplied by" how extensively the input was historically used. That is: a nuclear weapon used once might give you the same r-squared as a pea-shooter used a billion times.

So a low r-squared might mean

-- an input that doesn't have much effect on the output (e.g., shoe size probably doesn't affect lifespan much);

-- an input that has a big effect on output but doesn't happen much (e.g., suicide curtails 100% of lifespan but happens rarely); or

-- an input that doesn't affect output and also doesn't happen much. (e.g., fluorescent purple shoes' effect on lifespan).

In the case of the 2008-09 NBA, the regression equation shows that salary is a fairly powerful bomb. And the moderate r-squared shows that not every team uses it to its full potential.

Bottom line: salary can indeed very effectively buy wins. The r-squared is as small as it is because, in 2008-09, NBA teams differed only moderately in how they chose to vary their spending.



Why r-squared doesn't tell you much, revisited

In a blog post I wrote about yesterday, "Wages of Wins" author Stacey Brook ran a regression to try to figure out what kind of relationship there is between an NBA team's payroll and its success on the court.

The regression gives you several pieces of information. Which ones should you use to best explain the relationship?

Brook says it's the r-squared. He writes,

"We use R2 since we are interested in the proportion of variance that is in common between NBA team payroll and NBA team performance."


But is that truly what we're interested in? I don't think so.

I do agree with Brook when he says that R-squared gives you "the proportion of variance that is in common between NBA team payroll and NBA team performance." But what does that mean? Almost nothing, unless you're a statistician.

When you do research like this, there's a question that you want to answer. In this case, if your question is "what proportion of variance is in common between NBA team payroll and NBA team performance?," well, then, there's your answer. But that's not the question. It's not even Brook's real question. His real question is implied by the first paragraph of his post:

"I have to disagree that NBA (or for that matter NHL, MLB or NFL) teams that have high payrolls result in higher winning percentages; nor am I the first to say this."


The question is: do teams with higher payrolls do better on the court? And that question is different from "what proportion of variance is in common between NBA team payroll and NBA team performance?"

If you want to see what payroll does to performance, what you want to see is the regression equation. The way regression works, of course, is to plot all the datapoints on a graph, then draw the best fit straight line among those points. That line represents the best-fit relationship between payroll and wins.

If you do that for the 2008-09 NBA teams, you get

Wins = 0.61 (millions of $ spent) - 0.76

This, basically, answers your question, in several ways:

-- every extra million dollars you spend on salaries gives you three-fifths of a win.
-- every extra $1.64 million you spend gives you an extra win.
-- if you spend $100 million, like the Knicks, you should win about 60 games.
-- if you spend only $45 million, like the Grizzlies, you should win only about 27 games.

Not that complicated, right? If you want to know about the direct relationship between salary and wins, the regression equation does it.
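
For anyone who wants to plug in their own payroll, the equation above is easy to turn into a few lines of Python. This is just a sketch; the function and constant names are mine, and the coefficients are the rounded ones quoted above.

```python
SLOPE = 0.61        # wins per $million of payroll
INTERCEPT = -0.76

def predicted_wins(payroll_millions):
    """Expected wins from the best-fit line, given payroll in $millions."""
    return SLOPE * payroll_millions + INTERCEPT

cost_per_win = 1 / SLOPE                # $millions per extra win
knicks_like = predicted_wins(100)       # a $100 million payroll
grizzlies_like = predicted_wins(45)     # a $45 million payroll

print(round(cost_per_win, 2), round(knicks_like), round(grizzlies_like))
```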

Of course, you want to check the statistical significance; it's possible that while the best-fit straight line says $1.64 million per win, that might not be significantly different from zero. (As it turns out, it IS significant, at the 99.5% level. In fairness to Brook, it appears his data source had incorrect information, and because of that, his results were not, in fact, significant.)

I think we can all agree, from these results, that it certainly does appear that spending leads to winning. When the highest-spending team is expected to go 60-22, and the lowest-spending team is expected to go 27-55, you can't really claim that payroll is irrelevant. (Again, in fairness to Brook, he didn't get results this extreme. With the incorrect data, the regression suggests the highest-spending team should only be 45-37.)

So if the regression equation is the gold standard for making these kinds of calculations, what's with the r-squared? Well, the r-squared answers a different question.

Let's suppose that you had no idea what makes teams win basketball games. You see the Cavs go 66-16, and you see the Clippers go 19-63, and you think, what causes the difference?

What you could do is list as many plausible things as you could think of. Payroll would be one of them. Maybe average days of rest. Maybe whether they're an offensive or defensive team. Maybe average age. Maybe pace of play. Just list them all, as many as you want. Then, run a regression, and look at the r-squared.

What the r-squared will do is tell you, in a certain mathematical sense, after correcting for all those variables, what percentage of all the variation in wins have you explained? What you're trying to do is get as close to 100% as you can. The closer you get, the more you've explained what makes teams win and what makes teams lose. Maybe, if you actually ran this regression, you'd get to something like 40%. If you adjusted team wins for all those variables, as best you could, your variance would decrease by 40%.

In this particular case, our regression didn't include all that other stuff, like pace of play or average age. We only had one variable, payroll. And it turned out that the r-squared was .256, which means that 25.6% of the variation is "explained" by payroll.

It doesn't sound like a lot. In "The Wages of Wins," Brook (and co-authors David Berri and Martin Schmidt) did that for MLB, and came up with only 18%. That doesn't sound like a very big number either, and those authors decide that means that payroll isn't very important.

But that doesn't follow.

The r-squared, the seemingly-low 25.6% number, does NOT tell you about the relationship between payroll and wins. It just tells you that payroll is 25.6% of the total variance, and other factors are 74.4%. But, if the total variance is large, 25.6% of it would be substantial.

When you go into the car dealership and ask for a price, you want the amount in dollars. If you ask "how much for that Camry," and the salesman says, "it's 700% of your monthly pay," it may sound like a lot. If he says, "it's 9.5% of your net worth," it may sound cheaper. And if he says, "it's less than 0.01% of Bill Gates' disposable income for the week," it may sound cheaper still. But those all represent the same number of dollars. The fact that one percentage is a large number, and one percentage is a small number, doesn't change that fact.

It's the same thing for r-squared. The size of the percentage number depends what it's a percentage of -- which happens to be the total variance of wins in the league. Do you know, intuitively, what that variance is? I don't. But I know that a lot of it is random chance. And random variation depends on sample size. You could have exactly the same relationship between salary and wins, but, in one case, the r-squared is .25, and in another case, it's .04, and in another case, it's .5.

I wrote before about one example of how that can happen. But I can do another right now.


Want to see how you can use the same data to get a larger r-squared? Easy. I'm going to take the actual data for the 30 teams, but group them into threes according to payroll. So instead of the three data points "$100 million, 32 wins" (Knicks), "$90.1 million, 66 wins" (Cavs), and "$86 million, 50 wins" (Mavericks), I'm going to add them all up into the one data point "$276.1 million, 148 wins". Then I'm going to repeat for the other 27 teams, until I have 10 sums of three teams. Then, I'm going to run a regression on those 10 data points.

What happens? The r-squared now goes up to .497 -- almost double what it was!

But while I was able to arbitrarily double the r-squared, the regression line stayed almost the same -- which makes sense, since the actual relationship between salary and wins shouldn't change just because we arranged the data differently. Using all 30 teams, we got 0.61 wins per million dollars. Using the 10 groups of three teams, we get 0.68 wins per million dollars. Pretty close.

Here, let me give you everything in one place:

30 teams.... r-squared = 0.256
10 groups... r-squared = 0.497

30 teams.... Wins = 0.61 ($millions) - 0.76
10 groups... Wins = 0.68 ($millions) - 5.5
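
I can't reproduce the real numbers here without the payroll table, but the effect is easy to see on made-up data. In this Python sketch, the 30 payrolls, the slope of 0.6, and the repeating "luck" pattern are all invented for illustration. The teams are already sorted by payroll, and I sum them in threes just as described above.

```python
def fit(xs, ys):
    """One-variable least squares: return (slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

# Hypothetical league: payrolls from $45M to $103M, plus a fixed luck pattern.
luck_cycle = [12, -6, 0, 6, -12]
payroll = [45 + 2 * i for i in range(30)]
wins = [0.6 * p + luck_cycle[i % 5] for i, p in enumerate(payroll)]

# Add up each run of three teams into one data point.
grouped_payroll = [sum(payroll[i:i + 3]) for i in range(0, 30, 3)]
grouped_wins = [sum(wins[i:i + 3]) for i in range(0, 30, 3)]

team_slope, team_r2 = fit(payroll, wins)
group_slope, group_r2 = fit(grouped_payroll, grouped_wins)

print("30 teams: ", round(team_slope, 3), round(team_r2, 3))
print("10 groups:", round(group_slope, 3), round(group_r2, 3))
```

The slope barely moves between the two runs, but the r-squared jumps, because some of the luck cancels inside each group while the payroll differences between groups stay just as large.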


If Stacey Brook did the analysis his way, using all 30 teams, he'd say "salary explains 25.6% of the variance in wins." If I do the analysis my way, using groups of three teams, I'd say "salary explains 49.7% of the variance in wins." Which one of us would be right? Both of us! Because we are using different denominators, different variances. The same Toyota Camry can be a smaller percentage of Brook's salary than of my salary, because our salaries are different.

And so saying "payroll explains 25.6% of the variance of wins" is like saying "a Camry costs 35% of salary." Whose salary, and how much does he earn? Unless you know that, the "35%" figure is useless.

But, again, despite the fact that Brook and I did our regression differently, the equation should come out very similar. It won't come out exactly the same, because of random fluctuation, but you should *expect* it to come out the same, in the same sense as you expect a coin to come up heads 50% of the time. 0.61 wins per $million and 0.68 wins per $million are pretty close.

The regression equation is meaningful, it requires less information to interpret, and its expected value is the same regardless of your sample size. Most importantly, it answers the exact question that you want to know.

The r-squared, on the other hand, is unintuitive, can be made to come out to almost anything you like by tweaking the sample size to get a different total variance, and requires you to know how the study was done in order to interpret what it means. In terms of answering real-life questions, it's not very useful at all.



Thursday, May 07, 2009

USA Today's NBA salary data is flawed

My previous post pointed to a blog entry by sports economist Stacey Brook, in which Brook found a low correlation between team payroll and wins. Specifically, he found that for the 30 teams in the 2008-09 NBA, the r-squared was only .0410.

I think Brook used incorrect data. The article he pointed to in turn pointed to the USA Today basketball salary page. However, the USA Today database misallocates salaries. When a player was with more than one team in 2008-09, it counts his entire salary for only one of those teams. That throws everything off.

For instance, mid-season, the Raptors traded Jermaine O'Neal and Jamario Moon to the Heat for Marcus Banks and Shawn Marion. All four of those players are listed on the Raptors page. This (and probably other similar situations) causes the Raptors payroll to come out to $95.3 million, compared to only $67.4 million at other sources (like this one). And since all four of those players are absent from the Heat page, Miami comes out with a payroll of only $50 million instead of $68.6 million.

If you use the more standard numbers, you wind up with a solid positive relationship between salary and wins, an r-squared of .2561 instead of .0410. (That's an r of .5061.)

This doesn't affect any of the comments I made in the last post (or plan to make in future posts), but I thought I should report it anyway.


Wednesday, May 06, 2009

Low statistical significance doesn't necessarily mean no effect

The "Wages of Wins" blog is written mostly by David Berri, but, as it turns out, co-author Stacey Brook also blogs. Recently, Brook had a post on the relationship between salary and wins.

He says there is none. Seriously. Not that the relationship is weak, not that money doesn't help much. Brook seems to honestly believe that salary doesn't buy wins at all. Read the full post to see if I'm interpreting him correctly, but here's a quote:

"So not only the proportion of variance that is common between the two tiny, but here I am able to show that the correlation coefficient between the two populations (NBA payroll and NBA performance) for the 2008-2009 season is statistically zero."


I have several problems with this analysis. The first one is not unique to Brook, and it drives me nuts. It's the idea that if you do a regression, and the significance level is less than 95%, it's OK to claim that there is no relationship between the variables.

That's not always right. It's often right; I suppose you could even say it's *usually* right. But this is one of those exceptions where it's not right at all.

Let's suppose that somehow you get it into your head that rubbing chocolate on your legs can help cure cancer. So you set up a double-blind experiment, where one set of patients gets the chocolate rub, and the other set gets a rub with fake chocolate. It turns out that the first group actually improves more than the second group -- by a small amount, maybe 1%. But the result is not statistically significant. Maybe, instead of the 95% you were looking for, you only have 80% significance.

In this case, I agree with Brook -- it would be wrong to argue that the 1% improvement you saw was real. It's probably just random chance, and you'd be justified in saying that there's no reason to believe that a chocolate rub has any therapeutic value at all.

But, now, let's turn to salary and wins. Suppose you study actual NBA payrolls and records, and you find a similar small effect: every $1 million gives you 0.1 extra wins. Again, suppose that's significant at only the 80% level.

In this case, can you draw the same conclusion, that money has no effect on wins at all? No, you can't. In this case, it's likely that the effect is real, despite the low significance level.

Why the difference? Because in the first case, there was absolutely no reason to believe that chocolate can have any effect on cancer. There's no previous scientific evidence for it, and there isn't a plausible mechanism for how the effect might work.

Suppose that, going in to the study, you (generously) thought there couldn't be more than a one in a million chance that chocolate helps treat cancer. So imagine a million different universes where you run the experiment. One time, you'll get a real effect. 200,000 times, you'll get 80% significance just by chance. So the chance that the chocolate actually works in this universe is roughly 1 in 200,001. That's still no reason to believe.

But the salary case is very different. There's no basis to believe that chocolate can cure cancer, but there's very good reason to believe that spending money buys better players and leads to more wins. In fact, every serious basketball fan in the world (except maybe Stacey Brook) believes that you can buy wins. When the Celtics pay Kevin Garnett some $25 million, does anyone really believe that the signing won't help the team? That if the Celtics instead paid $500,000 for some mediocre guy, they'd be doing just as well?

In the salary case, when you run regressions and get only 80% significance, the calculation works out differently. Suppose that going into the study, you figured there was a 99% chance that money helped buy performance (which is again conservative). Then, in a million different universes, you'd get 2,000 where the 80% significance came up just by chance; and you'd get 990,000 universes where the effect is real. The chance, then, that salary actually does buy wins in this particular universe is 99.8% (990,000 divided by 992,000). The effect that Brook found is probably a real one.

(The above argument can be put into more formal mathematics using Bayesian probability, but I won't bother -- first, because it makes more sense to explain it in plain English, and, second, because I don't remember all the terminology and notation from the one Bayesian course I took in 1996.)
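
Without the notation, the universe-counting boils down to a few lines of Python. This sketch keeps the same simplifications as the argument above: a real effect is assumed to show up every time, and 80% significance is treated as a 20% chance of a fluke when there's no effect. The function name is mine.

```python
def chance_effect_is_real(prior, fluke_rate):
    """Share of 'universes' showing the result in which the effect is real.

    prior      -- belief, before the study, that the effect exists
    fluke_rate -- chance of seeing the result when there is no effect
    """
    real = prior                      # assumed: a real effect always shows up
    fluke = (1 - prior) * fluke_rate  # no effect, but the result appears anyway
    return real / (real + fluke)

chocolate = chance_effect_is_real(1 / 1_000_000, 0.20)
salary = chance_effect_is_real(0.99, 0.20)

print("chocolate: about 1 in", round(1 / chocolate))
print("salary:", round(salary, 3))
```

Same significance level, same formula; the only thing that changes between the two cases is the prior.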

-----

Here's another way to look at it, if you don't like the "multiple universes" approach.

There are two possible reasons you might get a non-significant correlation between two variables:

1. There really is no relationship between the variables; or

2. There *is* a relationship, but you haven't looked at enough data to get a high enough significance level.

Almost any relationship, no matter how strong, will give you low significance if your sample size is too small. If you look at one random Ted Williams game, and one random Mario Mendoza game, what kind of significance level will you get? Pretty low. Even if Ted goes 2-for-5, and Mario goes 1-for-5 -- both of which are more extreme than their career averages -- you won't find the difference to be significant at the 95% level. One game is just not enough.
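
To put a number on "pretty low," here is one way to test the one-game comparison in Python. Fisher's exact test is my choice here, not something from the discussion above, and the function name is hypothetical. It asks: given three hits in ten at-bats in total, how often would Ted end up with at least two of them purely by chance?

```python
from math import comb

def fisher_one_sided(hits_a, ab_a, hits_b, ab_b):
    """One-sided Fisher exact p-value: chance that player A gets at least
    hits_a of the combined hits if both players are really the same."""
    total_hits = hits_a + hits_b
    total_ab = ab_a + ab_b
    ways_total = comb(total_ab, total_hits)
    p = 0.0
    for k in range(hits_a, min(ab_a, total_hits) + 1):
        # k hits land in A's at-bats, the rest in B's
        p += comb(ab_a, k) * comb(ab_b, total_hits - k) / ways_total
    return p

# Ted goes 2-for-5, Mario goes 1-for-5.
p_value = fisher_one_sided(2, 5, 1, 5)
print(p_value)
```

The p-value is 0.5, nowhere near the 0.05 you'd need for significance at the 95% level.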

That doesn't mean this particular experiment is useless. You can still show the effect that you found, and invite further investigation. In this case, the difference between Williams and Mendoza is huge in the baseball sense -- .400 vs. .200. As a general rule, when you find an effect that's significant in the real-life sense, but not in the statistical sense, that's an indication that you might need more data. If the observed effect does have real-life importance, you are NOT entitled to conclude that there is no relationship between the variables. You are only entitled to conclude that you need more data.

And, in my opinion, you MUST show the size of the effect you found, not just the significance level. Brook doesn't do that in his blog post. He gives us significance levels, and r, and r-squared, but the purpose of the study was to estimate the relationship between payroll and wins. Is it $5 million per win? $10 million per win? $15 million per win? Because, regardless of the significance level, the slope of the best-fit line is still the best estimate of that relationship. And I suspect that the results are reasonable, very close to what other analysts have estimated as the rate at which you can buy wins.

I suspect that if we were able to look more closely at Brook's study, we'd find that:

-- he got an estimate of wins per dollar that's close to conventional wisdom;
-- but he didn't have enough data to get statistical significance;
-- so he claims that the proper estimate of wins per dollar is zero.

That ain't right.

-----

P.S. Probably more on this topic in the next post -- for a preview, this is why I think Brook got such low correlation.

UPDATE: Actually, I think Brook got a low correlation because the data was flawed. Details in my next post here.






Monday, May 04, 2009

NBA's debunking of referee bias flawed, says researcher

A couple of years ago, Joseph Price and Justin Wolfers came out with a study (.pdf) that found a bit of racial bias among NBA referees. The more white referees on the court, the more fouls were called against black players (relative to white players). And vice-versa: more black referees meant relatively more fouls against whites. (The vice-versa has to be true by definition, since the white referees can only be judged relative to their black peers.)

At the time, I summarized the Price/Wolfers study here, here, and here.

When that study came out, the NBA wasn't pleased, and David Stern commissioned a counter-study to refute it. I haven't seen that NBA study, but David Berri has, and he recently wrote about it on his blog. Apparently, it's very amateurish: according to Berri, the researchers estimated the results twice, once with a dummy variable for black refs, and again with a dummy variable for white refs. But, as Berri points out, that will just give the same results -- it doesn't matter which way you define the dummy variable, and "that suggests that the person doing the work for the NBA didn't understand dummy variables."

I wish I had a copy of the NBA study ... I haven't been able to find it online, and I think I read somewhere that it was only distributed to a select group of readers. One of them was Joseph Price, the author of the original study. Berri's post was based on a visit Price made to Berri's class. In addition to the dummy variable issue, Berri reports that Price said "that was just the beginning" of the problems with the NBA study, but gives no further details.

Anyone know where the NBA response can be found? I'd love to take a look at it firsthand.

P.S. Berri gets in a shot at non-academic researchers:

"Unfortunately, the quality of work offered by the consulting firm [the NBA hired] was consistent with what you sometimes see in on-line studies. In other words, it wasn’t very good. In fact, much of it consisted of mistakes you would not expect an undergraduate in econometrics to make."


Wednesday, April 22, 2009

Has Runs Created stopped working?

Does Runs Created not work any more?

The reason I ask is that, if you take a look at the 2008 AL and NL pages on Baseball Reference, you'll see that RC overestimated actual runs for 28 of the 30 teams. The average discrepancy was a huge +58 runs in the NL, and +19 runs in the NL.

To emphasize: that's not the average after removing the signs, that's the average *including* the signs. If half the teams had been +58 and the other half had been –58, the average would have been zero. It wasn't.
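
In Python terms, the distinction looks like this. The list is the hypothetical half-plus, half-minus league from the paragraph above, not the real 2008 discrepancies.

```python
# Hypothetical league: 15 teams overestimated by 58, 15 underestimated by 58.
discrepancies = [58] * 15 + [-58] * 15

# Average including the signs: this measures bias.
signed_mean = sum(discrepancies) / len(discrepancies)

# Average after removing the signs: this measures typical error size.
absolute_mean = sum(abs(d) for d in discrepancies) / len(discrepancies)

print(signed_mean, absolute_mean)
```

An unbiased estimator can still have a large absolute error; it's the signed average being well above zero that points to bias.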

So what I'm saying is, Runs Created now appears to be biased too high.

This has been happening since the mid 90s. Here is the average team discrepancy by season:

1985 –4
1986 –1
1987 +2
1988 –5
1989 –5
1990 +4
1991 –7
1992 +7
1993 +0
1994 +8
1995 +7
1996 +7
1997 +19
1998 +15
1999 +19
2000 +19
2001 +18
2002 +19
2003 +19
2004 +27
2005 +25
2006 +27
2007 +24
2008 +26

Now, we know that Runs Created is biased too high for higher run environments, so that might be part of it. But it's not all of it. In the three seasons 1994 to 1996, there were 4.92, 4.84, and 5.03 runs per game respectively. But in 2005 there were only 4.59 runs per game, and in 2008, only 4.65 runs per game.

Could it be that the pattern of offensive events is different from what it used to be (maybe more walks per single, or something?), and Runs Created doesn't work well when that happens?

By the way, I tried Base Runs, using the first version found on page 18 here (.pdf) with X=.535; the results weren't as extreme, but they were similar.

Anyone know what's going on? Is this a well-known problem and I just missed it?

P.S. For the record, I think I'm using the "technical" version of Runs Created found on this Wikipedia page.


Monday, April 20, 2009

A Diamond Mind simulation as baseball strategy research

A science column from Alan Schwarz a couple of weeks ago investigates the effects of various baseball strategies, using a simulation.

To check out batting orders, Schwarz got Luke Kraemer at Diamond Mind to simulate two sets of 100 seasons of the 2008 Yankees. In one set, A-Rod batted fourth; in the other set, he batted ninth. The difference was 42 runs; the regular Yanks scored 789 runs, while the A-Rod-at-the-bottom-of-the-order Yanks scored only 747.

Schwarz doesn't tell us how he checked intentional walks, but finds that they are a bad strategy, costing five runs per season. That's not a very useful result; there are times when the IBB makes more sense, and times when it makes less sense. Which did Diamond Mind simulate?

Stolen Bases: Diamond Mind took the 2008 Rays and the 2008 A's, and reversed their respective propensities to steal ("switched their mind-sets," is what the article says). The A's dropped by 20 runs, but the Rays *improved* by 47 runs, "suggesting that perhaps the Rays were running too often in real life."

As it turns out, the real Tampa Bay team stole 142 bases and were caught only 50 times, for a 74% success rate; that should put them well in the black, compared to the rule of thumb that you need to be successful 67% of the time to break even. So I'm at a loss to explain the 47 run difference.
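As a quick check of that arithmetic, here is a minimal Python sketch. The steal totals and the 67% break-even rule of thumb are the ones quoted above; the variable names are just for illustration.

```python
# Rays' 2008 stolen-base numbers, as quoted above.
steals = 142
caught = 50
break_even = 0.67  # rule-of-thumb break-even success rate

attempts = steals + caught
success_rate = steals / attempts

# Steals the Rays would have needed just to break even on these attempts.
steals_needed = break_even * attempts

print(f"success rate: {success_rate:.1%}")
print(f"steals above break-even: {steals - steals_needed:.1f}")
```

By this rough measure the real Rays come out ahead of break-even, which is why the simulated 47-run improvement from running less is puzzling.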

The only thing I can think of is a sample size issue. I think the SD of a team's runs scored in a single game is about 3. So the SD of a season's worth of runs is 3 times the square root of 162, or about 38 runs. The SD of the average of 100 seasons' worth is one-tenth of that, or about 4 runs. The SD of the difference between two 100-season averages is the square root of 2 times that, or about 5.4 runs.

But 47 runs is almost 9 standard deviations. So I'm still not sure what's going on.
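The back-of-the-envelope calculation above can be sketched in a few lines of Python. The per-game SD of 3 runs is my guess from the text, and the sketch assumes games and seasons are independent; nothing here comes from Diamond Mind itself.

```python
import math

sd_game = 3.0   # assumed SD of a team's runs scored in one game
games = 162     # games per season
seasons = 100   # simulated seasons in each set

sd_season = sd_game * math.sqrt(games)     # SD of one season's run total
sd_mean = sd_season / math.sqrt(seasons)   # SD of the 100-season average
sd_diff = sd_mean * math.sqrt(2)           # SD of the difference of two averages

z = 47 / sd_diff  # the 47-run Rays difference, in standard deviations

print(round(sd_season, 1), round(sd_mean, 1), round(sd_diff, 1), round(z, 1))
```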

Finally, the sacrifice bunt. When the simulation forced the bunt-avoiding Red Sox (27 SH in 2008, compared to the league-average 34) to do it more often, they lost 19 runs. But when they got the bunt-loving Mets (73, league average 66) to do it less, the result was also a loss – 15 runs. Schwarz concludes that the Mets' real-life bunting was better than the Red Sox's, in that they chose to bunt in more favorable situations. But weren't both these numbers based on the simulation? If so, the real-life situations should make no difference.

If the comparisons, however, *were* based on real life, then we have sample size issues based on the real-life sample, which is only 162 games, with an SD of about 38 runs. Maybe the 2008 Mets and Red Sox scored more or fewer than the simulation because of luck? We should be able to tell by looking at Runs Created – but, for some reason, almost all teams undershot their RC estimate in 2008 (and their Base Runs estimate too, at least for the versions I tried).

Anyway, while I like the simulation method, I wish the results had been presented more clearly. As it stands, I'll stick to "The Book"'s conclusions on these issues of baseball strategy.

P.S. Here's what Tony LaRussa thinks of these results:

“There’s way too much importance given to what you can produce from a machine,” he said. “These are human beings, and I don’t think any computer is going to model that close to what we deal with at this level.”

Hat Tip: Daniel Hamermesh at Freakonomics



Wednesday, April 15, 2009

New issues of "By the Numbers"

Two new issues of "By the Numbers," the SABR baseball research newsletter I edit, are now available for download at my website.


Sunday, April 12, 2009

The Utah Jazz have been playing much better when rested

This year, the Utah Jazz are 3-16 (.158) when they've played the night before, compared to 44-17 (.721) when they had the night off.

That's obviously a huge difference, close to 5 standard deviations. In two articles this week, Carl Bialik suggests that playing better in back-to-back games is a characteristic of a team, and may have predictive value in the playoffs:

"Two years ago, the Dallas Mavericks' 15-1 record in second legs of back-to-back games helped them earn the Western conference's top seed. Conversely, the Golden State Warriors were 5-17 without a day off. When the two teams met in the opening round of the playoffs, Golden State showed they were better than their No. 8 seed by sending the Mavericks home."


But: a quick check over at Basketball Reference shows that there might be a simple reason the Jazz haven't played well in those 19 games -- they're predominantly road games. Only three of the 19 games were at home (although Utah did lose all three).

The home-field advantage in basketball is by far the highest of the four major sports (the home team wins 60.8% of games, according to this presentation (.pdf)). If the average team is only .392 on the road, that works out to 7-12 (actually, close to 7-and-a-half wins). The Jazz still undershot, especially if you consider that they're a better-than-average team -- but not by as much as if you thought they should have been .721.
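Here is that expected-wins arithmetic as a small Python sketch. The 60.8% home winning percentage is the figure cited above; the second calculation, which credits the three home games separately, is my own refinement and assumes the Jazz are an exactly average team.

```python
home_win_pct = 0.608              # league-wide home winning percentage, as cited
road_win_pct = 1 - home_win_pct   # .392 for an average team on the road

back_to_back_games = 19
home_games = 3                    # only three of the 19 were at home
road_games = back_to_back_games - home_games

# Treating all 19 games as road games, as in the text.
expected_all_road = back_to_back_games * road_win_pct

# Crediting the three home games separately (an added assumption).
expected_split = road_games * road_win_pct + home_games * home_win_pct

print(f"all road: {expected_all_road:.1f} wins")
print(f"16 road + 3 home: {expected_split:.1f} wins")
```

Either way, an average team would be expected to win seven or eight of those 19 games, not the 13 or 14 that a .721 record would imply.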

Basketball is also the sport in which the better team wins most often, and you could probably close the gap even more by accounting for the quality of the Jazz's opposition in those 19 games. I haven't looked, though. And, of course, the Jazz were cherry-picked for the article because of their extreme split -- Bialik says it's the largest in the past five seasons. If you assume the Jazz "should have" been, say, 9-10, then 3-16 doesn't seem that weird for being the worst outlier.

---

As for the Mavericks in 2006-07 ... they also played most of their second of back-to-back games on the road -- 10 out of 14 (I must have missed a couple from the game logs). Again, you'd have to check the quality of opposition to see if they played particularly weak opponents.

By the way, Bialik says that over the past five seasons, teams win 44% of their played-the-night-before games. That seems unremarkable, considering that it looks like those are predominantly road games.

Finally, I almost forgot about this study on how NBA teams play when rested.



Friday, April 03, 2009

J.C. Bradbury on aging in baseball

J.C. Bradbury is on vacation from blogging, but is still posting occasionally. This week, he wrote that his article on baseball aging patterns has been published. Here's the link to the published version (gated), and here's a link to a freely-available version from last August.

Here's what JC did. He took every player with at least 5000 PA (4000 batters faced for pitchers) who debuted in 1921 or later. Then, for those players, he considered every season in which they had at least 300 PA (or 200 batters faced). That left a total of 4,627 player-seasons for hitters, and 4,145 for pitchers.

Then, for each season, he ran a regression for various measures of performance, such as linear weights batting runs. The regression predicted a single season number (actually, a Z-score), based on:

-- the player's career average
-- the player's age that year, and age-squared (that is, quadratic on age)
-- a dummy variable for the league-season
-- a "player-specific error term".

Numbers are park adjusted.

After running the regression, Bradbury calculates the implied "peak age" for each metric:

29.41 linear weights
29.13 OPS
30.04 OBP
28.58 SLG
28.35 AVG
32.30 BB
28.26 DPT (doubles plus triples rate)
29.89 HR
29.16 ERA
29.05 RA
23.56 Strikeouts (for pitchers)
32.47 Walks (allowed)
27.39 Home Runs (allowed)

For most of the hitting categories, the peak age is above the conventional wisdom of 27 – most are around 29. After quoting various studies that have found younger peaks, Bradbury writes,

"The results indicate that both hitters and pitchers peak around 29. This is older than some estimates of peak performance ..."


Bradbury also notes that the results are consistent with the idea that the more raw athleticism is required, the earlier the skill peaks; strikeouts, for instance, which require raw arm speed, peak the earliest, and walks, which are largely mental, peak the latest:

"Consistent with studies of ageing in specific athletic skills, baseball players peak earlier (later) in abilities that require more (less) physical stress."


I agree with Bradbury on this last point, but I don't think his actual age estimates can be relied upon. Specifically, I think peak ages are really closer to 27 than to 29.

One reason for this is that the model specifically requires the curve to be a quadratic – that is, symmetrical before and after the peak. But are careers really symmetrical? Suppose they are not – suppose the average player rises sharply when he's young, then falls gradually until old age. The curve, then, would be skewed, with a longer tail to the right.

Now, suppose you try to fit a symmetrical curve to a skewed curve, as closely as you can. If you pull out a sheet of paper and try it, you'll see that the peak of the symmetrical curve will wind up to the right of the actual curve. The approximation peaks later than the actual, which is exactly what JC found.
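If you'd rather not pull out a sheet of paper, here is a minimal Python sketch of the same experiment. The skewed curve is completely made up (a sharp rise of 2 units per year to a peak at 27, then a gentle decline of 0.5 units per year through age 40); it is not Bradbury's data or his regression, just an illustration of what a least-squares quadratic does to a curve of that shape. The helper `fit_quadratic` is a hypothetical name of my own.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x**2 via the normal equations."""
    # Build the 3x3 system A c = b, where A[i][j] = sum of x**(i+j).
    a = [[sum(x ** (i + j) for x in xs) for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, 3):
            factor = a[row][col] / a[col][col]
            for k in range(col, 3):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    c = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        c[i] = (b[i] - sum(a[i][k] * c[k] for k in range(i + 1, 3))) / a[i][i]
    return c


TRUE_PEAK = 27
ages = list(range(20, 41))


def skewed_curve(age):
    # Made-up shape: rises 2 units/year to the peak, then falls 0.5 units/year.
    if age < TRUE_PEAK:
        return -2.0 * (TRUE_PEAK - age)
    return -0.5 * (age - TRUE_PEAK)


xs = [age - 30 for age in ages]        # center the ages for numerical stability
ys = [skewed_curve(age) for age in ages]
c0, c1, c2 = fit_quadratic(xs, ys)
fitted_peak = 30 - c1 / (2 * c2)       # vertex of the parabola, back in years

print(f"true peak: {TRUE_PEAK}, fitted quadratic peak: {fitted_peak:.1f}")
```

With this particular shape the fitted parabola peaks several years to the right of the true peak. How far it moves depends entirely on the shape and the age range you choose, so treat the size of the shift as illustrative only.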

I have no proof that the actual aging curve is asymmetrical in this exact way, but players' careers are not as regular as the orbits of asteroids. There's no particular reason that you'd expect players to fall at exactly the same rate as they rise, especially when you factor in playing time and injuries. The quadratic is a reasonable approximation, but that's all it is.

Another reason is selective sampling. By choosing only players with long careers, Bradbury left out any player who flames out early. And so, his sample is overpopulated with players who aged particularly gracefully. That would tend to overestimate the age at which players peak.

(He limited his data to players between 24 and 35, which he says is done to minimize selection bias, but I'm not sure how that would help.)

There is perhaps some evidence that there's a real effect. JC ran the same regression again, but this time including only players with Hall of Fame careers. For hitters, the peak age dropped by almost an entire year, from 29.41 to 28.51. That might make sense; HOFers are the best players ever, and were more likely to have had long careers even if they aged less gracefully. That is, they'd still be good enough to stay in the league after a substantial drop, and would be much more likely to hit the 5000 PA cutoff even if they peaked early and dropped sharply.

(In fairness, you could argue that HOFers were less likely to be injured, and therefore more likely to peak later. But I think the "good enough to stay in the league" effect is larger than that, although I have no proof. Also, the HOF pitchers' peak age dropped only 0.08 years from the non-HOFers, so the effect I cite seems to hold only for hitters.)

Finally, there's selective sampling on individual seasons. A player who falls sharply and suddenly won't get enough playing time to qualify for Bradbury's study that year. And so, a plot of his career will be gentler at the right side. He'd be nearly vertical between his next-to-last season and his last season. But, since Bradbury doesn't consider his last season, the study won't see that vertical drop, and the quadratic will be gentler, with its peak to the right of where it would be otherwise.

Try this yourself: draw an aging curve that peaks, drops a bit, then falls off vertically. Draw the best fit symmetrical curve on it.

Now, draw the same curve again, but, instead of the vertical line, have it just end before the vertical line starts. Draw the best-fit symmetrical curve on this second one. You'll see it peaks later than when the vertical line was there.

(Again, in fairness: Bradbury ran a version of his study in which there was no season minimum for plate appearances or batters faced – just the career minimums -- and the results were similar. I've explained why I think, in theory, the minimums should skew the results, but I have to admit that, in real life, they didn't. There are perhaps some other reasons it didn't happen – perhaps a lot of the effect comes from the "vertical" players released in spring training, so they didn't make the study at all – but still, the results do seem to contradict this third theory of mine.)

So you've got three ways in which the study may have made assumptions or simplifications that forced the peak age to be higher than it should be:

-- assuming symmetry;
-- selective sampling of long careers;
-- selective sampling of seasons.

In that light, my conclusion would be that Bradbury's methodology might yield a reasonable approximation, but not much more than that. I think the study can correctly identify the basic trend, and is probably correct within a couple of years, but I wouldn’t bet on it being any closer than that.





Thursday, March 26, 2009

Cricket: the "nightwatchman" strategy

I don't understand cricket all that well, but I think I get the gist of this New Zealand article about the "nightwatchman."

Cricket fans reading this, please correct me if I'm wrong, and forgive my use of the wrong terminology. (For instance, when a cricketer bats, can you also say he "hits"? I hope so.)

The idea, I think, is this:

In cricket, a batter will hit until he makes an out, at which point he is replaced and will not bat for the remainder of the innings. Batters hit in pairs. Once ten of the eleven men are out, that leaves only one, who can't bat alone, and the innings ends.

Outs can be infrequent; typically, a batter can hit for 25 runs or more before making out, and good batters occasionally hit for 100 runs or more. Therefore, an innings (or game) can go on for several days.

The best batters normally come to bat first. Sometimes, one of the better batters will go out late in the day. When that happens, the team will sometimes send up a worse batter to end the day. That batter is called the "nightwatchman."

Why would they do this? According to Wikipedia, the idea is that the end of the day is a bad period in which to hit – the next batter may be tired, or the light may not be good. Also, if they do send up the good batter, and he is quickly put out, the psychological effect might hurt the team.

And so, they sometimes put in an inferior batter, who can waste some time between now and dusk, so the better batter can be saved until tomorrow.

Now, if all this is correct, what would be the strategic advantage? Every batter has to hit eventually, and there is no inherent benefit of putting good batters together as in baseball, because every batter comes up in the same situation (the equivalent of "bases empty"). And the psychological rationale seems weak to me.

That leaves the "hard to hit in the dark" hypothesis. If the dim light causes all players to drop by the same percentage, then it makes sense to put in the batter who normally bats for 10 runs rather than the one who normally bats for 35 runs. Better to lose X percent of 10 than X percent of 35. But isn't it also possible that it's the other way around? Maybe the better the batter, the more able he is to handle the adverse conditions.

Also, you have to keep in mind that every batter gets the same chance to bat, except the one who's left after ten men have gone out. The longer you wait before putting in your best batters, the greater the chance it'll be one of those good ones who doesn't get to finish. So, generally, you'd want your better batters first.

So which is the better strategy? This seems like a good problem for cricket sabermetrics. The original article points to a study by Charles Davis, who (I get the impression) is cricket's foremost sabermetrician.

In that study, Davis finds that teams who used the nightwatchman strategy (late in the day after two men had gone out) undershot expectations by 25 runs over teams who didn't. It wasn't because the nightwatchmen didn't do well – they did about the same as their career average lower in the "batting order." So it must have been ... what? Maybe stranding a better batter after the last out? That still seems like a lot; the difference between a good batter and a bad batter might be ... what, 50 runs? And there are still 8 outs (wickets) left in the match. So the fraction 25/50 seems too large under the circumstances.

But look at Davis's graph: an increase of 100 runs scored in the first two wickets leads to a final score only about 35 runs higher. That shouldn't be the case, should it? Wickets are independent except for the identities of the players involved. Consider a baseball analogy: if the Houston Astros score three runs in the first two innings, wouldn't you expect their final score to be three runs higher than if they scored zero runs in the first two innings? Why isn't that happening in Davis's study? The only thing I can think of is that if you score more runs in the first two wickets, it's because you've used up your very best batters, and all that's left is your weaker ones. In that case, it means that team strategy is a huge factor in the distribution of scoring. And so, when you divide innings into "nightwatchman" and "non-nightwatchman," you can't assume the two groups are identical, as Davis did.

Again, please correct me if I've assumed something incorrectly, and I'll update this post.

P.S. Here's one intro to how cricket works. There are lots of others.

Hat Tip: Rod Nelson of SABR






Saturday, March 21, 2009

NCAA overtime probabilities

In NCAA basketball on March 12, Syracuse beat UConn 127-117 -- after six overtime periods.

At "The Daily Fix," Carl Bialik analyzes the frequency of NCAA multiple-overtime games, and finds that they happen pretty much as you'd expect from a simple probability calculation.



Friday, March 13, 2009

The Gini coefficient

Note: non-sports post.

-------------

Income inequality is rising in the United States. If you measure inequality by the Gini Coefficient, which I guess is as good a measure as any, you get a fairly steady increase from 0.399 in 1967 to 0.466 in 2001. (A higher number means a more unequal distribution; 0.000 means everyone has the same income; 1.000 means one person has all the income.)

Inequality has increased in Canada, too; in fact, what got me interested in this topic was a recent newspaper article complaining about the rise in Canadian income dispersion. These kinds of articles come around all the time, in discussions about how the rich are getting richer.

The stated or unstated assumption is that more inequality is worse, and more equality is better. I don't think that's true. Rather, I think that rising inequality can indeed be worse, but it can also be neutral, or it might actually be a sign of improvement for everyone.

The problem isn't with the measurement itself; as I said above, the Gini Coefficient is a reasonable way to measure what it's supposed to measure. The problem is with the interpretation of movement in the measurement. Even if you think that, all things being equal, more equality is better than less equality, it doesn't necessarily follow that a higher Gini is bad -- because all things are never equal.

A baseball analogy, perhaps a mediocre one, might be stolen bases. Yes, stolen bases are good for an offense; every additional steal creates about 0.2 runs. But it doesn't follow that an increase in the number of steals is good for offense (in the 60s, steals were up but offense was down). It doesn't follow that bad offensive teams should try to steal more (they'll just get caught stealing more and score even fewer runs). And it doesn't mean that teams with more steals are somehow better than teams with fewer steals (stolen bases don't correlate well to winning).

The number of steals is just a measurement of – the number of steals. If you want to go beyond that, and draw conclusions from changes in the stolen base rate, you have to justify those conclusions. The naive view -- steals are good so it's bad when teams steal less! -- just isn't true.

It's the same for measures of income inequality. There are dozens of reasons why an increase in income inequality can be a neutral thing, or even a good thing, and you don't have to think very hard to come up with them. I have thirty of them here. I could probably think of ten more. (Some of them, of course, have been thought of before; this Wikipedia page includes several.)

1. First, absolute income matters a lot more than relative income. Would you rather live in the USA where the Gini is 0.46, or in a country where the Gini is zero but everyone earns exactly $5? That's a contrived example, but there are real-life examples too. Albania has a Gini of 0.27, but a per-capita income only one-eighth that of the USA. Would you be willing to sacrifice 87% of your pay in order to be more equal to everyone else?

Obviously, other factors can be more important than income inequality. If you think it would be bad for the USA's Gini to go from 0.46 to (say) 0.48, would you change your mind if a 10% increase in income came with it? How about 20%, or 50%?

The Gini rose from 0.42 in 1991 to 0.47 in 2001. But per-capita income, in inflation-adjusted 2006 dollars, rose 23%, from $21,102 to $26,024. Is that a fair trade-off? If you think it is -- and even if you think it isn't -- then complaining about income inequality without mentioning the trade-off isn't very reasonable.

2. The above argument wouldn't be all that persuasive if there weren't a link between the level of equality and the level of income. And there is. If you want to lower the Gini, you have to take income from the rich and give it to the poor. (You could raise income to the poor in other ways, but if those ways were easy, the poor would be doing it themselves already.) Taking income from the rich reduces the incentive to get rich, which in turn reduces overall wealth. Economists could probably tell you what the trade-off actually is -- how much an increase in upper-income tax rates would slow down economic growth. But I think they'd all tell you that there *is* a significant trade-off. More on this later.

3. Suppose that tomorrow, someone comes up with a cure for an untreatable cancer. He sells a million cures at $1,000 each, and makes a billion dollars. Inequality goes up. But everyone is better off! Everyone has just as much income as before, but now we have the opportunity to buy a cure for cancer -- should we ever get it -- for only $1,000.

So it's good when inequality goes up for reasons like this ... and I don't think it's that contrived an example. Bill Gates got rich creating Windows ... which costs only, what, about $40 a copy? I get way more than my $40 worth, and I'm happy that Microsoft succeeded and increased inequality on my behalf. The same is true for most of the products I use. I hope that there are more breakthroughs that make other people rich, even if I don't get richer at all.

4. As a general principle, the way you earn a high income is by doing things that benefit other people. You can cure cancer or create an operating system, but you can also perform appendectomies or write computer programs quickly or give a better haircut. A higher Gini might just mean that more people are figuring out better ways to benefit more other people. Or, it could just be a matter of population growth. Fifty years ago, movies played to only millions of people. Now, a movie of the same quality would be seen by billions. That means directors who would have merely been millionaires in 1970 might be billionaires now. That's a good thing, not a bad thing, that those who enrich others' lives have many more "others" to benefit. I'm glad that someone in China can now enjoy "Airplane!", the greatest movie ever made, even if it means that the Zucker brothers get a little richer and the US gets a little more unequal.

5. What's the right value for the Gini coefficient? Is .46 too high? Too low? What should it be?

We know that for incomes, higher is better. We know that for golf scores, lower is better. We know that for room temperature, the closer to 72 degrees Fahrenheit, the better.

What's ideal for the Gini? It's not zero; if everyone had to have equal incomes, that would be more Marxist than Marx -- nobody would want to work, and everyone's income would be very low. It's not 1 -- that means one person has all the income, and the rest of us starve.

So what's the point we're shooting for? Nobody knows. And if nobody knows, how can you draw any conclusions at all about whether 0.46 is too high or too low?

In fairness, there is a counterargument here: that we don't know what the Gini should be, but we know there's too much inequality, so we need to go lower. To which I have two answers: first, suppose we reduce inequality. How will you know when we're done? If there's a way to tell, why not tell us now? And, second, some people may *not* think there's too much inequality now. To those people, the fact that the Gini is rising is irrelevant -- you have to convince them that more equality is essential, in order that they see the rise as a bad thing.

The issue shouldn't be that inequality is rising -- the issue should be that inequality is too high. Is it?

6. The Gini doesn't consider that people could be equal over a lifetime, but temporarily unequal due to age differences. Even if everyone had exactly equal income patterns -- say, zero for the first three years of adulthood while they're in school, then escalating equal salaries until retirement, then equal pensions in old age -- the Gini would be non-zero, because, in any given year, you'd be looking at some zeroes, some low-income, some high-income, and some retirees.

Part of the Gini is simply the effect of income varying over any given individual's lifetime. If it turns out that people are voluntarily changing their work patterns to have more low-income years early and higher-income years later, the Gini increases even if lifetime inequality hasn't.

7. There's a trend to more years of higher education, which would cause part of that increase. If the long-term trend is for more schooling (meaning more years at zero income) and higher incomes after, that would increase within-lifetime inequality, and therefore the Gini, even if everyone remains exactly equal over their lifetime.

Look at it this way: suppose everyone has three years of school at $0, then forty years of work at $50K. They're all equal over their lifetime. But what happens any given year? Of every 43 people, then, three have $0 and forty have $50K, which looks unequal.

Now, suppose everyone decides to go to school for five years to make more money: now you have five people at $0 and thirty-eight people at $70K. This is more unequal, resulting in a higher Gini -- but it's actually better for everyone.
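To put numbers on that, here is a minimal Python sketch that computes the Gini for the two made-up populations above. The `gini` function is my own helper, using the standard sorted-rank formula; it is not how any statistical agency necessarily computes the official figure.

```python
def gini(incomes):
    """Gini coefficient via the sorted-rank formula (0 means perfectly equal)."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    # Weight each income by its rank (1 = poorest).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# One person at each year of a 43-year adult life, everyone identical over a lifetime.
three_years_school = [0] * 3 + [50_000] * 40
five_years_school = [0] * 5 + [70_000] * 38

print(f"3 years of school: Gini = {gini(three_years_school):.3f}")
print(f"5 years of school: Gini = {gini(five_years_school):.3f}")
```

In this toy setup the Gini works out to the fraction of people currently earning nothing (3/43, then 5/43), so it rises even though everyone's lifetime income went up from $2 million to $2.66 million.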

8. People are living longer. If the trend is to make lots of money between 25 and 65, and take a low pension later, then adding more low-pension years will appear to increase inequality, even if everyone is still the same.

To oversimplify: the trend used to be that you start work at 18, then earn a fairly low salary until you die (because you can't afford a retirement pension). Now, the trend is: make no money for five years while you're at school. Then make good money until 65. Then get a lower pension. The old way, everyone is fairly equal each year. The new way, everyone is still fairly equal over a lifetime, but income varies considerably each year. The new way is less equal when comparing individuals during any given year, but just as equal over a lifetime. And, of course, more desirable.

9. People have different propensities to work; that's just human nature.

Co-workers Bob and Joe are each offered 50 hours of overtime work. Bob accepts, and makes $60K that year. Joe declines, preferring more time with his family, and makes $50K that year. Inequality of income has increased, but that's because inequality of WORK has increased. That's perfectly fair and desirable.

Isn't it fair for inequality of income to match the inequality of work or effort that created it? And isn't it good that Bob and Joe both have more choices, even though the Gini goes up?

10. People have different ambitions. Ann and Cathy both make $40K a year and are of equal proficiency at their job. Ann makes an extra effort to climb the corporate ladder into management. Cathy doesn't -- she doesn't like kissing butt and prefers to avoid the hassles of supervising other people. Ann jumps to $60K a year; Cathy stays at $40K. Both are happy. Inequality of income goes up, but, again, this is a good thing, as both Ann and Cathy got what they wanted.

11. People have different propensities for risk-taking. Right now, T-bills are paying less than 1%, while other, riskier investments yield 10% and more. Suppose Tom and Jerry have equal jobs and salaries, but Tom is more conservative than Jerry. Tom invests in a CD paying 1.5%, while Jerry invests in real estate yielding 10%. Their incomes suddenly become unequal, but, again, both Tom and Jerry got what they wanted, so the increase in inequality is again a good thing.

12. Even when people have the same propensity for risk-taking, they might just get different results. Jerry and Kevin might be the same in every other respect, but if Jerry invests in mutual fund A, and Kevin invests in mutual fund B, their returns will be different, and their incomes will start to become unequal. Why is that necessarily bad?

Or, take lottery tickets. Everyone who buys a lottery ticket knows that the distribution of incomes among ticket-buyers will be unequal. But they don't mind, and they do it anyway. Is it a bad thing when lotteries increase the Gini? Ticket-buyers probably don't think so.

13. There are other ways to buy security. I work in the IT field, and several of my co-workers, who were on contract, voluntarily took a large pay cut, in some cases close to 50%, to become government employees. Other of my co-workers did not. That created income inequality between the two groups. But did it really make us unequal? Government jobs are very secure. Isn't it reasonable to assume that my friends who took the pay cut simply bought their security, and we're just as equal as we were before? Half of us take our pay in money, and the other half take it in security.

The moral: you can't just consider monetary income; you have to consider things that money can "buy" in the non-traditional sense.

14. There are lots of examples of non-traditional income-substitutes. Suppose you have two identical couples who are neighbors. Helen works, earns $40,000 a year, and takes her kids to day care. Iris quit her $40,000 job to take care of her kids at home. Again, is this unequal? I don't think so. It looks unequal to the Gini, which only measures monetary income. But if Iris chose to stay home and forgo her $40,000 a year, she's obviously benefiting by at least $40,000 -- or she wouldn't stay home! To me, Helen and Iris are exactly equal, despite what the paychecks say.

15. Here's another one: home ownership. Kevin and Laura both make $50,000 a year, but Kevin owns his home, and Laura pays $1,000 a month to rent an identical one. They are equal in money income, but, really, Kevin has more total income -- he "earns" an extra $12,000 a year by having the use of his home rent-free. In this particular Kevin-and-Laura case, the Gini actually *underestimates* inequality, unlike all the previous examples, where it *overestimates* it. However, for the country as a whole, it might again overestimate it, if low-income earners are more likely to own their homes outright. And retirees are likely to have paid-off mortgages.

16. Here's still another one: children. Oscar has two kids, and has to accept a lower income in order to stay in their town and raise them in a stable environment. Paul has no kids, and can move around looking for higher-paying jobs. Their incomes are unequal, but are they really? Oscar has chosen to have kids and pay the price; he values being a father more than he values the bit of extra money he could earn if he were childless. While it's politically incorrect to compare children to money, the fact remains that children are hugely important to people, and they're not free -- there are real costs and opportunity costs to having children. It seems weird to feel sorry for Oscar because he chose children over money, just as it would be weird to feel sorry for someone who was poorer because they chose to buy a fancier car. You have to count everything that affects income, not just the actual income.

These cases of trading money for non-money might be more frequent now than ever, because, as we get richer, they become more affordable -- it's easier to stay home with the kids when your spouse makes $80K a year as a computer programmer in 2009 than when she made $40K (inflation-adjusted) as a programmer back in 1974. As overall wealth increases, the ability to choose to earn less increases. In turn, the diversity of choices increases, and the distribution of monetary income gets wider. Again, this is a most excellent development, in my opinion, and I hope it continues.

17. I quit full-time work a couple of years ago. Right now, I'm not working much, just occasional jobs. If I go back to work, I will earn an above-average income and increase the Gini coefficient. Is that really a bad thing?

18. Small-business owners have incomes that fluctuate from year to year. This year, Al might make $100K and Bruce $20K, but the next year it's reversed. Since the Gini is calculated per year, it looks like Al and Bruce are unequal, even though they had exactly the same income over a two-year span.
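To put rough numbers on the Al-and-Bruce case, here's a minimal Python sketch. The `gini` helper is my own illustration (it uses the standard mean-absolute-difference definition, not any official agency's method), and the two-person "economy" is just the example above.

```python
def gini(incomes):
    """Gini coefficient from the mean absolute difference over all pairs."""
    n = len(incomes)
    mean = sum(incomes) / n
    total_diff = sum(abs(a - b) for a in incomes for b in incomes)
    return total_diff / (2 * n * n * mean)

# Hypothetical two-person economy from the example: Al and Bruce.
year1 = [100_000, 20_000]   # Al, Bruce
year2 = [20_000, 100_000]   # reversed the following year
two_year = [a + b for a, b in zip(year1, year2)]

gini_year1 = gini(year1)
gini_year2 = gini(year2)
gini_two_year = gini(two_year)

print(f"Year 1 Gini:   {gini_year1:.3f}")
print(f"Year 2 Gini:   {gini_year2:.3f}")
print(f"Two-year Gini: {gini_two_year:.3f}")
```

Each single year shows the same nonzero Gini, but the two-year totals are identical, so the Gini over the longer window is zero.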

19. Part of what's measured by "income" is capital gains. But the stock market has up years and down years. If inequality fell last year because of all the capital losses realized in the stock-market collapse, is that a good thing? Should we rejoice because Warren Buffett is poorer, even if we're poorer too? If that's not worth celebrating, then why is it worth complaining when the reverse happens, when Warren Buffett gets richer but inequality goes up?

20. Do the measurements of income inequality take taxes into account? I don't think they do. That defeats the purpose, doesn't it? A large part of what we expect government to do is help out the poor who need it. And they do that by disproportionately taking money from the rich.

So if the point of the Gini is to measure how much money people actually have to spend -- which is the number that actually means something -- you certainly have to adjust for taxes! It's very possible that the pre-tax Gini is up but the after-tax Gini is down.

21. It's not just cash benefits that we get from government; it's services too. Suppose it costs $10 million to run the public library, and that works out to $10 per person. Since everyone has equal access to the library, shouldn't we add $10 to everyone's (after-tax) income before we calculate the Gini? I think we should. (You could argue that the poor don't benefit as much from the library as the rich (or that hungry poor people can't eat books!), but I'm not sure that's the case. Because, if it were, we'd give the library tax to the poor, and pay for the library with membership fees. That would make everyone better off. Since we don't, it's fair to assume that giving the poor free access to the library benefits them more.)

It's not just libraries -- it's police, and fire protection, too -- the poor are served just as well as the rich by police protection. And don't forget public transit (which certainly *does* benefit the poor more than the rich), and all the other services that government provides.

Part of the reason we pay for these services out of taxes is that they're so essential that we want to make sure even the poorest members of society benefit from them. But in that case, we need to figure them into the calculation.

My perception is that the level of government services has exploded over the last few decades, while the Gini has increased only moderately. Figure in the value of those services -- and taxation, if you haven't already done so -- and the increase would be substantially lower.

22. It's spending that matters, not income. If I gave you an income of $1,000,000, but told you you could never spend it, it would be useless. And if I let you have a million dollars worth of merchandise (of your choice), instead of cash, the merchandise would be just as good.

And it's a principle of economics that people work to smooth out their consumption (spending) over their lifetime. That means they borrow money when they're young, to pay for cars and TVs and houses, and pay it back when they're older. And so, consumption is much less unequal than income. (It's theoretically possible for everyone to consume exactly the same, despite differing incomes over time; imagine an insurance company that offers you $50,000 worth of consumption every year in exchange for your salary every year. That would make everyone equal. Then, imagine doing this yourself -- you borrow in years where you're below $50,000, and pay back the debt in years when you're above $50,000.)

I remember seeing a few articles that mentioned that low-income people actually spend a lot more than their income, almost twice as much in some cases. (This might be through tax credits, or welfare, or such.) In that case, measuring income inequality is a pretty crappy way of measuring the differences in how people actually live. We should measure *consumption* inequality.

23. And if we DO measure inequality of consumption, the number goes way, way down. The way the Gini index is constructed, the more money you make, the more effect you have on the Gini. But suppose a highest-income earner makes $1 billion per year. There's no way that person can, or would, spend that much money. If he's really good at spending, he might consume (say) $10 million. (I couldn't consume anything close to $10 million in a year, but let's be conservative.) So if you want to measure spending, rather than income, the Gini is going to be overestimated by the effect of that $990 million.

As the very-rich get very-richer, they have the potential to stretch the Gini way out of proportion; but, in the most meaningful sense, consumption, inequality won't have changed much at all.

24. The super-rich, the ones that account for so much of the Gini, are also the biggest charitable donors. Bill Gates gives a lot of money to philanthropic causes. Even if they're in Africa, rather than the US, that actually increases worldwide equality, doesn't it? If you gave a million dollars each to 100 random people, they'd spend it on themselves. If you gave that $100 million to Bill Gates, he'd spend it on some of the poorest people in the world. So greater inequality among the super-rich becomes greater equality overall!

25. Bernie Madoff lowered the Gini a little bit last year.

26. A substantial portion of income comes from savings and investments. And, because of the power of compounding, a small increase in savings today can add up to a huge difference in income in the future.

Ruth orders a pizza every week for a year, at a total cost of $1,000. Sarah, having seen those commercials about saving money and living better, buys her pizza at Wal-Mart, for $400. She invests the $600 difference at 5% after inflation.

Forty years later, her $600 has grown into about $4200. At the same 5%, she now pulls in an extra $200 or so a year in income from that $4200. The Gini rises, but there was never any inequality there, just differences in saving patterns.
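The compounding here is easy to check. This is just a sketch of the arithmetic, assuming annual compounding at the 5% real return from the example; the variable names are mine.

```python
# Sarah's one-time saving: $1,000 delivered pizza vs. $400 at Wal-Mart.
saved = 1000 - 400
rate = 0.05      # 5% after inflation, as in the example
years = 40

# Compound the one-time saving for forty years.
balance = saved * (1 + rate) ** years

# Yearly income the balance produces if it keeps earning the same 5%.
annual_income = balance * rate

print(f"Balance after {years} years: ${balance:,.0f}")
print(f"Annual income at {rate:.0%}: ${annual_income:,.0f}")
```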

And differences in savings are HUGE. I know older couples who made incomes I couldn't live on myself, who managed to save enough that they make more in retirement than they did when they were working. And I know people with six-figure incomes who are deep in debt and can't afford to pay their bills. The difference is not inequality of opportunity to earn income -- it's inequality of savings rates.

27. As we get richer and richer, it becomes easier to save. A 20" (tube) color TV now costs $100; I bought one thirteen years ago for my Dad at $500. In terms of TVs, you can save $400 with no change of lifestyle from 1996.

Or, of course, you could use the $400 to upgrade to a flat panel LCD, which many people do. But there's a choice there now that wasn't there only a few years ago. And because people are different, they make different choices. More choices means more diversity. More diversity means more dispersion in savings rates. More dispersion in savings rates means higher income inequality.

I think all this is a good thing.

28. Again because of the power of compounding, the effect of savings grows the more years you can save. So as life expectancies rise, you'd expect income inequality to rise. Imagine saving $1000 at age 25, again at 5%. By 65, you'll have $7,040. But if you can live to 95, you'll have $30,426. So it follows that the Gini should rise as people live longer and save longer.

If people lived to 200, inequality would be absolutely huge: save $1000 at 25 and you'll have $1.5 million by age 175. It wouldn't mean that society was unequal, just that it naturally takes time to build wealth. We don't worry about "inequality of knowledge" between a 25-year-old doctor and a 65-year-old doctor -- why should it be any different for savings?
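Here's a small Python sketch of those savings figures, assuming a single $1,000 deposit at age 25 and annual compounding at 5%. The function name `value_at_age` is just something I made up for illustration.

```python
def value_at_age(age, deposit=1000, start_age=25, rate=0.05):
    """Value of a one-time deposit made at start_age, compounded yearly."""
    return deposit * (1 + rate) ** (age - start_age)

value_at_65 = value_at_age(65)
value_at_95 = value_at_age(95)
value_at_175 = value_at_age(175)

for age, value in [(65, value_at_65), (95, value_at_95), (175, value_at_175)]:
    print(f"Age {age}: ${value:,.0f}")
```

The jump from 65 to 95 is the point: thirty extra years more than quadruples the balance, with no additional saving at all.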

Or, looked at another way: a 65-year-old has worked, over his lifetime, 40 times as much as a 25-year-old. Why would you expect their incomes to be equal?

29. It's pretty much accepted that we could create more equal incomes if we wanted to, by increasing tax rates on the rich. And it's also accepted that it would slow down economic growth -- say, by reducing it from 3% a year to 1.5% a year -- because of reduced incentive to work or take risks.

Now, suppose that country A decides to follow that route, and grow by 1.5% a year. Country B decides to let the inequality be, and grow by 3% a year.

100 years later, average income in country A has grown from $30,000 a year to $132,961. In country B, income has grown from $30,000 to $576,559.

So B is more than four times as rich as A.
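The two-country comparison is plain compound growth. A quick sketch, using the $30,000 starting income and the 1.5% and 3% growth rates from the example (variable names are my own):

```python
start_income = 30_000
years = 100

# Country A taxes for equality and grows at 1.5%; country B grows at 3%.
income_a = start_income * 1.015 ** years
income_b = start_income * 1.03 ** years
ratio = income_b / income_a

print(f"Country A after {years} years: ${income_a:,.0f}")
print(f"Country B after {years} years: ${income_b:,.0f}")
print(f"B / A: {ratio:.2f}")
```

The ratio comes out a bit above four, which is what a 1.5-point gap does over a century.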

If you were to combine country A and B into one country, you'd almost certainly measure more inequality than in B alone! Now, if there was a moral obligation in A to increase equality even at the expense of total wealth, then is there also a moral obligation to increase equality *between A and B*? If there is, how would you do it? By force? And wouldn't that be unfair to B, whose population decided that the trade-off favored higher incomes over equal incomes?

30. Suppose that we -- the US and Canada -- had decided to increase equality back in 1908, and raised tax rates on the rich. And suppose that had lowered our growth rate by 1.5 percentage points over the last century. Then, we'd have only 23% as much income as we have now.

Would it have been worth it? Would you be willing, in retrospect, to take a 75% pay cut (and a 75% cut in government services, and probably a huge cut in medical advances) so that previous generations would have been more equal?

I wouldn't, and I doubt if anyone else would.

So then, don't you also have to think it would be a bad thing to make your great-great-grandchildren take a 75% pay cut 100 years into the future? Because that's the trade-off. At 1.5 percentage points difference, we'd seriously be costing the next generations hundreds of thousands of dollars each.

Now, 1.5 percentage points might be a bit high. If we assume only 1.0 percentage points, then instead of a 75% pay cut, it's a 60% pay cut. If we assume 0.5 percentage points, it's a 39% pay cut. At what point are we happy with the trade-off?
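To see how sensitive the "pay cut" is to the assumed growth penalty, here's a rough Python sketch. It assumes the simplest possible model: 100 years of compounding, with the share of income remaining approximated as 1 / (1 + gap)^100. That approximation is my assumption, not something stated in the post, and the results only roughly line up with the rounded figures above.

```python
def share_remaining(gap_points, years=100):
    """Approximate share of today's income left after growing `gap_points`
    percentage points slower per year for `years` years (assumed model)."""
    return 1 / (1 + gap_points / 100) ** years

pay_cuts = {}
for gap in (1.5, 1.0, 0.5):
    remaining = share_remaining(gap)
    pay_cuts[gap] = 1 - remaining
    print(f"{gap} points slower: {remaining:.0%} of income left, "
          f"a {pay_cuts[gap]:.0%} pay cut")
```

Even the smallest gap, half a point, compounds into a cut of well over a third.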

It's a legitimate question, but I don't think the Gini coefficient factors into the answer very easily. At least not unless you have an argument about what the Gini *should* be, and how much it's worth paying to get it there.

Absent that, the Gini is not very useful at all.

-------

UPDATE

Brian Burke reminds me in the comments of another problem with the Gini, an important one.

31. New immigrants to Canada and the US tend to be poorer than average, if only because most countries in the world are poorer than we are. This distorts the Gini. To paraphrase Brian's theatre analogy in the comments:

Suppose that there are five people in the country, with incomes of $10K, $20K, $30K, $40K, and $50K. Over the next 10 years, everyone's salary rises $10K, substantially reducing inequality. But a new immigrant arrives, who earns $10K.

What do we now have? Six people, earning $10K, $20K, $30K, $40K, $50K, and $60K. This is a higher Gini than before -- but only because of the new immigrant! Within the country, equality is actually increasing.
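You can verify this with a few lines of Python. The `gini` function below is my own minimal version of the standard mean-absolute-difference formula; incomes are in thousands of dollars, straight from the example.

```python
def gini(incomes):
    """Gini coefficient from the mean absolute difference over all pairs."""
    n = len(incomes)
    mean = sum(incomes) / n
    total_diff = sum(abs(a - b) for a in incomes for b in incomes)
    return total_diff / (2 * n * n * mean)

before = [10, 20, 30, 40, 50]            # original five residents ($K)
after_raise = [x + 10 for x in before]   # everyone gets a $10K raise
with_immigrant = [10] + after_raise      # plus one newcomer earning $10K

gini_before = gini(before)
gini_after_raise = gini(after_raise)
gini_with_immigrant = gini(with_immigrant)

print(f"Before:              {gini_before:.3f}")
print(f"After raises only:   {gini_after_raise:.3f}")
print(f"After raises + immigrant: {gini_with_immigrant:.3f}")
```

Among the original five, the Gini falls; add the one newcomer and the measured Gini ends up higher than where it started.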

This is huge, and I'm kicking myself for forgetting to include it. I remember several economists mentioning it in the context of income -- that the reason lower-quintile income hasn't seemed to increase is simply because new immigrants replace the lower-income people who moved up and out of the low-income group. (Here's a post by Arnold Kling, as an example.)

--------

UPDATE: I'm just going to add more as I think of them.

32. Immigration, which mostly brings in people who start out at low incomes, increases the Gini, as I noted above. However, it *reduces* inequality for the world as a whole -- immigrants usually do much better here than they would in their home countries. Doesn't that suggest that looking at the Gini for a particular country, in isolation, could be misleading?

33. Activists decry the low wages American firms pay employees in poor countries. But if those US companies paid higher wages, they would be well above-market for those countries, increasing inequality there. Does that mean a higher Gini in a poorer country is OK?







Saturday, March 07, 2009

The "Verducci Effect" revisited

The Wall Street Journal's new sports blog, "The Daily Fix," has revisited the "Verducci Effect."

That's the forecasting principle, invented by Sports Illustrated writer Tom Verducci, that when a young pitcher throws 30 more innings than in his previous season, he's due for a comedown next season.

But as I said before, I think the effect is simply regression to the mean. When a pitcher throws more innings than before, it's usually because he had a better year (since they don't normally let lousy pitchers throw a lot of innings). And when a pitcher has a better year, it's usually because he's somewhat lucky. And so he'll slide back to his normal level of performance the next year.

While I have no argument with the truth of Verducci's finding, I think it's not a matter of the innings, but, rather, a matter of the good performance.

In fact, I think that all things being equal, a pitcher with more innings is LESS likely to regress. Consider two 23-year-old pitchers: each has a career ERA of 4.50, and each pitched 100 innings in 2007. In 2008, both pitchers improved to 4.00. But pitcher A threw 105 innings, and pitcher B threw 150 innings.

According to Verducci, pitcher B is due for a comedown, while pitcher A is not. I disagree. I think pitcher A is more likely to drop back to his 4.50 career average. That's because there's less luck in pitcher B's record, and so his improvement from 4.50 to 4.00 is more likely to have been real.

I could be wrong.

P.S. Here's a piece by David Gassko, who did a control-group study and found no Verducci effect.



Thursday, March 05, 2009

Why hasn't foul shooting improved?

Free-throw shooting percentages haven't changed much over the past 50 years, according to this New York Times article. Between 1950 and 1970, the conversion rate was around 72 percent. Since 1970, it's fluctuated between 72 and 77%.

Here's the NYT graph:



So it looks like free-throw shooting hasn't really improved over the decades. That makes foul shooting an anomaly, because most other skills have improved: marathon times are better, football kicking is better, and "swimming records seemingly fall at each international event."

Why hasn't foul shooting improved? According to the article:

Ray Stefani, a professor emeritus at California State University, Long Beach, is an expert in the statistical analysis of sports. Widespread improvement over time in any sport, he said, depends on a combination of four factors: physiology (the size and fitness of athletes, perhaps aided by performance-enhancing drugs), technology or innovation (things like the advent of rowing machines to train rowers, and the Fosbury Flop in high jumping), coaching (changes in strategy) and equipment (like the clap skate in speedskating or fiberglass poles in pole vaulting). ...

“There are not a lot of those four things that would help in free-throw shooting,” Stefani said.


And that's fair enough. But what about, say, bowling? The article says explicitly that "bowling a 300 game is not as unlikely as it once was," and there are strong similarities between bowling and foul shooting. Physiology doesn't seem like it would help either way; technology and innovation don't seem like issues; and it's hard to see how coaching would be of more help in bowling than in foul shooting.

I'd propose another explanation: foul shooting is an ancillary skill in basketball – players are chosen for their overall ability, not just their free-throw potential. And so "natural selection" won't weed out mediocre shooters or reward the best shooters, at least not very much compared to other skills.

Compare this to other sports: bowling strikes is the primary goal of the game, the most important skill of all. And, in football, field-goal kickers are chosen for one thing: their ability to kick field goals. Any kicker below average in accuracy is out of the league instantly. But any NBA player who can't hit free throws can make it up in other aspects of the game (like Shaq). (A version of this argument was also made in the first comment of a discussion on Tango's blog, here.) And coaches don't force their players to shoot underhand, which would make many players more accurate; that provides support for the idea that the NBA thinks free throw percentage doesn't matter that much.

If you want to *really* see if the skill is improving, don't look to NBA players, who may not be the best in the world at the skill. You'd have to look at free-throw specialists. I Googled "free throw shooting contest results," and got a link to an Iowa State contest where the winner made 49 out of 50 throws. That's 98%, and about 4 standard deviations away from the NBA average of 75%. Even considering that the contest had 72 entries, that's pretty significant.
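Here's a rough Python check of the "4 standard deviations" figure, treating each throw as an independent 75% shot (a binomial model). The extra step, which asks how likely it is that at least one of 72 entrants would hit 49 or more, assumes every entrant is exactly an NBA-average shooter; that's purely my assumption for illustration.

```python
import math

p = 0.75        # NBA-average free-throw rate from the post
n = 50          # throws in the contest
made = 49       # winner's total
entrants = 72

expected = n * p
sd = math.sqrt(n * p * (1 - p))
z = (made - expected) / sd

# Exact binomial chance that a 75% shooter makes 49 or 50 out of 50.
p_tail = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in (49, 50))

# Chance that at least one of 72 such shooters does it (assumed model).
p_any = 1 - (1 - p_tail) ** entrants

print(f"Expected makes: {expected:.1f}, SD: {sd:.2f}, z: {z:.2f}")
print(f"P(49+ of 50) for one 75% shooter: {p_tail:.2e}")
print(f"P(at least one of {entrants} does it): {p_any:.2e}")
```

Under those assumptions, even with 72 tries, a field of NBA-average shooters would almost never produce a 49-for-50.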

And here's another argument: if foul shooting isn't considered a major skill, young players won't practice it as much, and it stands to reason that you won't get as much improvement over time if there's not as much energy expended to get better at it.

One last point: if you consider the graph's increase from 71 to 77 percent to be real, then that's actually pretty good evidence of an increase in skill. When you're already at a 71% level, it's harder to improve than if you start from, say, a 34% level (as field-goal percentage did). In 1950, players were missing 29% of their foul shots. In 2008, they were missing only 23%. That means that over the past 58 years, players learned to convert 20% of their misses into hits. That's pretty good. The field goal percentage improvement, from 34% to 46%, looks more impressive, but results from converting 18% of misses into hits – almost an identical improvement (although they probably shouldn't be compared directly, because field goals are influenced by where they're taken from, and the quality of the defense).
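The "share of misses converted" calculation is simple enough to sketch in Python. The function name is mine; the percentages are the ones quoted above.

```python
def share_of_misses_converted(old_pct, new_pct):
    """Fraction of the old misses that became makes at the new rate."""
    old_misses = 100 - old_pct
    return (new_pct - old_pct) / old_misses

free_throws = share_of_misses_converted(71, 77)   # foul shooting
field_goals = share_of_misses_converted(34, 46)   # field-goal percentage

print(f"Free throws: {free_throws:.1%} of misses converted")
print(f"Field goals: {field_goals:.1%} of misses converted")
```

Measured this way, the two improvements land within a few points of each other.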

In summary:

-- there are good reasons you wouldn’t expect foul-shooting to improve as much as other skills over time;
-- if you look at the numbers more closely, there actually *is* a significant amount of improvement.

So I don't think there's as huge a mystery there as the Times does.

