Friday, November 13, 2009

Consumer Reports alarmist on reverse mortgages

(Warning: non-sports post.)

In their September issue, Consumer Reports issues another muddled panic about a financial product; this time it's reverse mortgages.


Basically, a reverse mortgage is a loan you take out using your house as collateral. Normally, you'd do that with a line of credit -- you borrow the money as you need it, and make at least your minimum payment every month (as interest accrues). It's like a credit card, but with a much lower interest rate because it's backed by your house.

The reverse mortgage is also a loan on your home equity, but it's meant for poorer elderly people who don't have the income to make payments on the loan. With the reverse mortgage, you still get the loan, but the interest accumulates and compounds, and you don't have to pay it back until you move out of the house (or die). The idea is that when you're no longer living in the house, you sell it and use the proceeds to pay off the loan.

What if the loan has compounded so high that the value of the house isn't enough to pay it off? In that case, the borrower is off the hook. One of the benefits of the reverse mortgage is that the borrower is never on the line for more than the house itself.

As CR points out, this benefit has a price: the borrower winds up paying for "insurance" against that happening, insurance that tops up the loan if the house is eventually not worth enough. It's government insurance, and comes with government regulations on reverse mortgages. For instance, you have to be over 62, and you have to do lots of expensive legal paperwork.

It's called a "reverse mortgage" because it's often taken out to provide a stream of payments to supplement social security. That stream of payments is backwards from a normal mortgage: instead of you paying off the mortgage every month, the mortgage pays you.

My feeling, and CR's too, is that a reverse mortgage is a reasonable thing to do if you plan to stay in your house forever, and won't need money afterwards (because either you've died, or you're so ill you move to a nursing home paid for by government). Why die with money in the bank (or equity in your house)?

Another benefit of the reverse mortgage is that it can sometimes provide the only way to get a lump sum of money in case of sudden need, like a medical emergency.

--------

So what's CR's problem with reverse mortgages? They have a few. Some of them are not completely unreasonable. CR gives stories of people being sold expensive reverse mortgages in order to use the money for inappropriate investments, which is certainly a bad thing. But that's not the fault of the reverse mortgage -- seniors are sold questionable financial products all the time, and sometimes persuaded to borrow money in other ways.

And CR gives examples of seniors who didn't really understand what they were getting into. For instance, sometimes salespeople hand customers overoptimistic projections of what their house will be worth, misleading them about the amount of equity that will be left for their children to inherit. But, again, biased salespeople are hazards of any financial transaction.

CR is also concerned that due to the housing meltdown, a lot of reverse mortgages end in the red, where the government-sponsored insurance comes into play. The payouts have started to exceed the premiums paid by borrowers, and CR is concerned about the burden on the taxpayer. It could be "the next financial fiasco."

I don't really understand their concern. In 2008, the worst year of the housing crisis, the fund only had to pay $400 million in claims. Suppose they pay at that rate for five years. That's about $6 per American. Compare that to the $7 trillion bailout, which is $13,000 per American. Is it really worth worrying about a $6 "fiasco" while ignoring the $13,000?

Not only is it a small amount of money, but you could argue that it's money well spent. Reverse mortgage insurance money is not a gift to the irresponsible: it's part of a social program that allows senior citizens to hold on to their homes, while living better in their old age. I'm not a big fan of government spending, but such a small sum, for such a good purpose, as the result of a once-in-a-lifetime anomaly in housing prices, is probably 1000th on my list of government policy issues people should be concerned about.

---------

But the thing that *really* bugged me about the article is the lead anecdote, which purports to show the human side of why reverse mortgages are harmful. As with their medical credit card screed last year, they got their conclusion completely backwards! Their example actually shows a reverse mortgage that handsomely rewarded the borrower.

When Ernest Minor was 61, his wife had serious medical problems. The Minors still owed $70,000 on their home. They took out a reverse mortgage loan for $176,000. $70,000 of that simply replaced the outstanding mortgage balance. $15,000 went for fees and insurance. And $92,000 went to pay the medical bills.

Because Mr. Minor was not yet 62 (but his wife was), they had to transfer the deed to Mrs. Minor's name in order to be eligible -- which they did.

Mrs. Minor died two years later. That made the loan come due. With interest, it's about $200,000. Obviously, Mr. Minor can't afford to pay that, so he will lose his home.

Sure, it's sad that Mr. Minor will lose the house he lived in for so many years. But, after all, he needed $92,000 for medical bills. Without the reverse mortgage, what would the Minors have done? They'd have sold the house, rented an apartment, and used the proceeds to pay the bills. They would have lost the home immediately. But, with the reverse mortgage, they got to live in the house a couple more years. Plus, if Mrs. Minor had gotten better, they would have been able to stay there indefinitely! Of all the Minors' options, the reverse mortgage was, in fact, the very best way to handle the situation.

CR is upset that the broker had Mr. Minor take his name off the deed to get the loan -- if he hadn't, he'd still be allowed to stay in the house until he moved out. But, unfortunately, government regulation gave him no choice! Instead of picking on the salesman, maybe CR should lobby the government to lower the minimum age, or at least allow one of the spouses to be under 62.

Anyway, Mr. Minor benefits even further. He owes $200,000. But because of the real estate meltdown, his home is only worth $130,000. You'd think he'd be in the hole to the tune of $70,000. But he's not! Remember, with the reverse mortgage, you can't owe more than the house is worth. Mr. Minor can walk away, because his $70,000 deficit is covered by the insurance he bought!

Let's do an accounting. Suppose that before the real-estate crisis, the house was worth $300,000 (which is about right, based on what CR tells us about what percentage of the home's value is loanable). If Mr. Minor had raised the $92,000 some other way, via a conventional loan, he'd be liable for that $92,000. He'd still owe $70,000 on his mortgage. And he'd have made maybe $20,000 in interest payments over the three years on that $162,000.

So his total owing would be $182,000. His house would be worth only $130,000. His net worth would be negative $52,000.

But, with the reverse mortgage, he just walks away! He loses the house, but his net worth is $0. The reverse mortgage saved him $52,000! Of course, most of that savings came from avoiding the housing crisis, but still -- rather than Mr. Minor being a reverse mortgage sob story, he should be a success story!
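The accounting above is simple enough to check in a few lines. This is only a sketch of the comparison as I've described it; the variable names are mine, and the $20,000 interest figure is the rough guess from the text, not a number from CR.

```python
# Figures taken from the post; the interest figure is a rough guess.
medical_loan = 92_000       # conventional loan to cover the medical bills
mortgage_balance = 70_000   # still owed on the original mortgage
interest_paid = 20_000      # approximate interest over the period
house_value = 130_000       # value after the real-estate meltdown

# Conventional route: he owes everything, and the house doesn't cover it.
conventional_owing = medical_loan + mortgage_balance + interest_paid
net_worth_conventional = house_value - conventional_owing

# Reverse mortgage route: he walks away, and insurance covers the shortfall.
net_worth_reverse = 0

savings = net_worth_reverse - net_worth_conventional
print(conventional_owing, net_worth_conventional, savings)
```

The comparison counts interest payments as part of the total owing, just as the accounting in the text does.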

CR prints a half-page photo of a sad Mr. Minor, holding a photo of his deceased wife. It's indeed very sad and moving that the illness lost him his wife and his house. But the reverse mortgage was the lone, small bright spot. It wound up saving Mr. Minor thousands and thousands of dollars. And CR didn't even notice.


Saturday, November 07, 2009

Do younger umpires call a more accurate strike zone?

In a post on the economic incentives facing would-be umpires, J.C. Bradbury has an interesting study on how older umpires are more likely to have a larger or smaller strike zone.

Bradbury ran a regression, to predict an umpire's season strikeout-to-walk ratio based on his age and who he is. He found that the older the ump, the more different his strike zone size. He writes,

"It turns out that every year an umpire ages he increases his deviation from the league average by about 0.8% [.16 change in K/BB ratio]. That doesn’t seem like a lot—and it really isn’t a huge effect—but over a period of 12 years that pushes the umpire a full standard deviation (9.5%) [.19] above/below the average deviation. Thus, by the end of an umpire’s career, his calls are about two standard deviations from the typical deviation. This is evidence of a tenure effect or a loss of competency." [square brackets mine.]


How big is that effect? Well, 0.19 might be the difference between a K/BB ratio of 2 and a ratio of 2.19. Assuming the same number of total K+BB, that means that instead of (say) 20 strikeouts for every 10 walks, there would be 20.6 strikeouts for every 9.4 walks. That turns 0.6 walks into strikeouts per 30 K+BB events, which is about 0.35 runs.

In 2009, there were 10.33 such events per game (per team), not 30. That means a standard deviation is worth a bit over a third of .35, or .12 runs per game. So a pitcher's ERA might go up or down by .11. For two standard deviations, it would be .24 runs or .22 earned runs. Assuming both teams are equal, there would obviously be no effect on who wins the game (although it seems likely that pitchers may be affected differently).

What's that in terms of pitches? A study I did (.pdf, page 4) finds that the difference between a ball and a strike is about 0.14 runs. So, at 2 standard deviations, an umpire calls about 1.7 pitches differently per game, per team. That means that more than 95% of umpires are less than 1.7 pitches per game different from the mean.
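Here's the arithmetic from the last few paragraphs in one place, as a rough sketch. The 0.58 runs for turning a walk into a strikeout is my assumption (it's roughly what the 0.35 figure implies), and the two-standard-deviation figure is just double the one-SD figure, as in the text.

```python
def split_events(ratio, events):
    """Split a fixed number of K+BB events into (strikeouts, walks)."""
    strikeouts = events * ratio / (1 + ratio)
    return strikeouts, events - strikeouts

EVENTS = 30                # K+BB events used in the example
EVENTS_PER_GAME = 10.33    # K+BB per team per game in 2009
RUNS_PER_SWAP = 0.58       # assumed cost of a walk becoming a strikeout
RUNS_PER_PITCH = 0.14      # ball vs. strike, from the study cited above

k_base, bb_base = split_events(2.00, EVENTS)   # 20 K, 10 BB
k_sd, bb_sd = split_events(2.19, EVENTS)       # about 20.6 K, 9.4 BB

walks_turned_to_k = k_sd - k_base
runs_per_30_events = walks_turned_to_k * RUNS_PER_SWAP
runs_per_game_1sd = runs_per_30_events * EVENTS_PER_GAME / EVENTS
runs_per_game_2sd = 2 * runs_per_game_1sd
pitches_per_game_2sd = runs_per_game_2sd / RUNS_PER_PITCH

print(round(runs_per_game_1sd, 2), round(pitches_per_game_2sd, 1))
```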

But since the pitchers and batters are presumably aware of the differences between umpires, they would adjust accordingly. So the effect might be more than .12 runs per SD -- it could be .12 for the actual strikeouts and walks observed, but it might cost the batter another .12 (or some other number) in having to swing at bad pitches (which would be called balls by another umpire).

And, of course, even umpires who call normal numbers of strikeouts and walks might have their own particular strike zone -- it might be the same size, but have a different shape or location.

Bradbury concludes that, as umpires gain experience, they get more confident and less reverent, and feel less of a need to stick to the league's interpretation of where the strike zone should be. That sounds reasonable, especially considering that Bradbury's regression controlled for calendar year.

It's worth reading Bradbury's entire post, which includes a list of umpires and their individual K/BB ratios.

---

(Note: post was updated in response to an e-mail from Guy, who pointed out I had misinterpreted Bradbury's percentages and got the effect being half of what it should be. I think it's correct now.)



Thursday, November 05, 2009

Do field-goal kickers do worse in the clutch?

My favorite studies are the ones that don't need any fancy math or statistics, where you can just look at the data and answer the question almost instantly.

Brian Burke, of "Advanced NFL Stats," had one of those last week. He wondered: is there an overall "choke" effect for field goal kicking? Are kickers less likely to make their kick when the game is on the line, either because of nervousness, or because defenses change their strategy?

The answer appears to be: no. Adjusted for distance, the clutch success rates track the overall success rates almost exactly, except for one blip in the data at 44 yards.

Of course, that doesn't mean that *no* kicker is different in the clutch. The data are consistent with the possibility that only one or two kickers are clutch or choke, which wouldn't be enough to show up in the graph. It's also possible that half of all kickers are clutch, and the other half are choke, and they exactly offset each other so there appears to be no effect.

But in view of the baseball evidence, which shows only very, very slight "clutch" variation among hitters, it doesn't seem likely that field-goal kickers would be significantly clutch.

Furthermore, considering how much data it took to test the clutch hypothesis for baseball, it's probably impossible to find an effect for kickers, even of the same rough size as for batters (less than 3% variation in success rate). Baseball hitters get several hundred opportunities in a season; kickers get maybe 30 or 40. If an effect for individual kickers exists in the NFL, it would have to be huge to have any chance of being detected.
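To get a feel for the sample-size problem, here's a back-of-the-envelope sketch of the binomial standard error of a single season's success rate. The success rates (80% for kicks, .270 for hitters) and the 500 hitting opportunities are illustrative assumptions of mine, not figures from Burke's study.

```python
import math

def standard_error(success_rate, attempts):
    """Binomial standard error of an observed success rate."""
    return math.sqrt(success_rate * (1 - success_rate) / attempts)

# Assumed, illustrative inputs.
kicker_se = standard_error(0.80, 35)    # a season of field-goal attempts
hitter_se = standard_error(0.27, 500)   # a season of batting opportunities

print(round(kicker_se, 3), round(hitter_se, 3))
```

Under these assumptions, one season of kicks is several times noisier than one season of hitting, and clutch attempts are only a fraction of a kicker's total, so the real situation would be noisier still.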



Friday, October 30, 2009

Don't use regression to calculate Linear Weights, Part II

Last post, I wrote about how using regression to estimate Linear Weights values is a poor choice, and that the play-by-play (PBP) method is better. After thinking a bit more about it, I realized that I could have made a stronger case than I did. Specifically, that the regression method gives you an *estimate* of the value of events, but the play-by-play method gives you the *actual answer*. That is: a "perfect" regression, with an unlimited amount of data, will never be able to be more accurate than the results of the PBP method.

An analogy: suppose that, a while ago, someone randomly dumped some red balls and some white balls into an urn. If you were to draw one ball out of the urn, what would be the probability it's red?

Here are two different ways of trying to figure that out. First, you can observe people coming and drawing balls, and you can see what proportion of balls drawn turned out to be red. Maybe one guy draws ten balls (with replacement) and six are red. Maybe someone else comes up and draws one ball, and it's white. A third person comes along and draws five white out of 11. And so on. Over all, maybe there are 68 balls drawn total, and 40 of them are red.


So what do you do? You figure that the most likely estimate is 40/68, or 58.8%. You then use the normal approximation to the binomial distribution to figure out a confidence interval for your estimate.
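For the record, here's a minimal sketch of that first method, using the 40-out-of-68 example. The 1.96 multiplier gives the usual 95% interval; the interval type is my choice for illustration.

```python
import math

reds, draws = 40, 68
p_hat = reds / draws                           # point estimate, about 0.588

# Standard error of the proportion, using the normal approximation.
se = math.sqrt(p_hat * (1 - p_hat) / draws)

z = 1.96                                       # two-sided 95% interval
low, high = p_hat - z * se, p_hat + z * se

print(round(p_hat, 3), round(low, 3), round(high, 3))
```

The interval is wide, roughly 12 percentage points either side of the estimate, which is the point: observing draws only gets you an estimate with error bars.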

That's the first way. What's the second way?

The second way is: you just empty the urn and count the balls! Maybe it turns out that the urn contains exactly 60 red balls and 40 white balls. So we now *know* the probability of drawing a red ball is 0.6.

If the second method is available, the first is completely unnecessary. It gives you no additional information about the question once you know what's in the urn. The second method has given you an exact answer.

That, I think, is the situation with Linear Weights. The regression is like observing people draw balls; you can then make inferences about the actual values of the events. But the PBP method is like looking in the urn -- you get the answer to the question you're asking. It's a deterministic calculation of what the regression values will converge to, if you eventually get the regression perfect.

-------

To make my case, let me start by (again) telling you what question the PBP method answers. It's this:

-- If you were to take a team made up of league-average players, and add one double to its batting line, how many more runs would it score?

That question is pretty much identical to the reverse:

-- If you were to take a team made up of league average players, and remove one double from its batting line, how many fewer runs would it score?

I'm going to show how to answer the second question, because the explanation is a bit easier. I'll do it for the 1992 American League.

Start with a listing of the play-by-play for every game (available from Retrosheet, of course). Now, let's randomly choose which double we're going to eliminate. There were 3,596 doubles hit that year; pick one at random, and find the inning in which it was hit.

Now, write out the inning. Write out the inning again, without the double. Then, see what the difference is in runs scored. (The process is almost exactly the same as how you figure earned runs: you replay the inning pretending the error never happened, and see if it saves you some runs.)

Suppose we randomly came up with Gene Larkin's double in the bottom of the fourth inning against the Angels on June 24, 1992. The inning went like this:

Actual: Walk / Double Play / DOUBLE / Fly Out -- 0 runs scored.

Without the double, our hypothetical inning would have been

Hypoth: Walk / Double Play / ------ / Fly Out -- 0 runs scored.

Pretty simple: in this case, taking away the double makes no difference, and costs the team 0 runs.

On the other hand, suppose we chose Brady Anderson's leadoff first-inning double on April 10. With and without the double:

Actual: DOUBLE / Double / Single / Double Play / HR / Ground Out -- 3 runs scored.

Hypoth: ------ / Double / Single / Double Play / HR / Ground Out -- 2 runs scored.

So, in this case, removing the double cost 1 run.

If we were to do this for each of the 3,596 doubles, we could just average out all the values, and we'd know how much a double was worth. The only problem is that sometimes it's hard to recreate the inning. For instance, Don Mattingly's double in the sixth inning on September 8:

Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.

Removing the double gives

Hypoth: Out / Single / ------ / Single / Fly Ball / Fly Ball

How many runs score in this reconstructed inning? We don't know. If the second single advanced the runner to third, and the subsequent fly ball was deep enough, one run would have scored. Otherwise, it would be 0 runs. So we don't know which it would have been. What do we do?

The situation arises because we were picking innings randomly, and dividing them into halves (the half before the double, and the half after the double). The problem is that the process creates an inconsistency in the hypothetical inning. The second half of the inning, in real life, started with one out and runners on second and third. The hypothetical second half started with one out and a runner on first. That created the problem.

So, since we're picking randomly anyway, why don't we throw away the *real* second half of the inning, and instead pick the second half of some *other* inning, some inning where there actually IS one out and a runner on first? That will always give us a consistent inning. And while it will give us a different result for this run of our random test, over many random tests, it'll all even out.

We might randomly choose Cleveland's fourth inning against the Royals on July 16. In that frame, Mark Whiten struck out and Glenallen Hill singled, which gives us our required runner-on-first-with-one-out. After that, Jim Thome singled, and Sandy Alomar Jr. grounded into a double play.

Grafting the end of that inning (single, double play) on to the beginning of the original inning gives us our "consistent" hypothetical inning:

Hypoth: (stuff to put a runner on first and one out) / ------ / Single / GIDP -- 0 runs scored.

Since the Yankees scored two runs in the original, unadulterated inning, and zero runs in the hypothetical inning, this run of the simulation winds up with a loss of two runs.

Now, there's nothing special about that Cleveland fourth inning: we just happened to choose it randomly. There were 6,380 cases of a runner on first with one out, and we could have chosen any of them instead.

The inning could have gone:

Out / Single / ------ / result of inng 1 of 6,380
Out / Single / ------ / result of inng 2 of 6,380
Out / Single / ------ / result of inng 3 of 6,380
Out / Single / ------ / result of inng 4 of 6,380
...
Out / Single / ------ / result of inng 6,380 of 6,380

If we run the simulation long enough, we'll choose every one of the 6,380 equally. And so, we'll wind up with just the average of those 6,380 innings. So we can get rid of the randomness in the second half of the inning just by substituting the average of the 6,380. Then our "remove the double" hypothetical becomes:
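Here's a toy sketch of why the substitution works. The list of continuation results below is made up for illustration (it is NOT Retrosheet data); the point is only that repeatedly grafting on a randomly chosen continuation converges to the plain average of all the continuations.

```python
import random

# Hypothetical runs scored in the rest of ten innings that all started
# from the same base/out state. Made-up numbers, for illustration only.
continuation_runs = [0, 0, 0, 0, 1, 0, 2, 0, 0, 3]

# What we'd get by just averaging every available continuation.
exact_average = sum(continuation_runs) / len(continuation_runs)

# What we'd get by picking a continuation at random, over and over.
rng = random.Random(0)   # fixed seed so the sketch is repeatable
trials = 100_000
simulated_average = sum(rng.choice(continuation_runs)
                        for _ in range(trials)) / trials

print(exact_average, round(simulated_average, 2))
```

The simulation only ever approaches the exact average, so there's no reason to simulate: take the average directly.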

Out / Single / ------ / average of all 6,380 innings

And, with the help of Retrosheet, we find that after having a runner on first and one out, those 6,380 innings resulted in an average of 0.510 runs being scored. So now we have:

Actual: Out / Single / DOUBLE (runner holds at 3rd) / Single (scores both runners) / Fly Ball / Fly Ball -- 2 runs scored.

Hypoth: Out / Single / ------ / plus an additional 0.510 runs.

I'll rewrite that to make it a bit easier to see what's going on:

Actual: (stuff that put a runner on first with one out) / DOUBLE / other stuff that caused 2 runs to be scored

Hypoth: (stuff that put a runner on first with one out) / other stuff that caused 0.510 runs to be scored, on average

So, for this inning, we can say that removing the double cost 1.490 runs.

Now, the "actual" inning was again random. We happened to choose the Yankees' 6th inning on September 8. But we might have chosen another, similar, inning where there was a runner on first with one out, and a double was hit, and the runner held at third. This particular September 8 inning led to two runs. Another such inning may have led to six runs, or three runs, or no runs (maybe there were two strikeouts after the double and the runners were stranded).

So, what we can do, is aggregate all these types of innings. If we look to Retrosheet, we would find that there were 796 times where there were runners on second and third with one out. In the remainder of those 796 innings, 1129 runs were scored. That's an average of 1.418 per inning.

So we can write:

Actual: stuff that put a runner on 1st with one out / DOUBLE leading to runners on 2nd and 3rd with one out / Other stuff leading to 1.418 runs scoring, on average.

Hypoth: stuff that put a runner on 1st with one out / ------ / Other stuff leading to 0.510 runs scoring, on average.

And so, we know that a double with a runner on first and one out, which ends with runners on 2nd and 3rd, is worth, on average, 0.908 runs.

Let's write this down this way, as an equation:

+0.908 runs = Runner on 1st and one out + Double, runner holds

We can repeat this analysis. What if the runner scores? Then, it turns out, the average inning led to 1.646 runs scoring instead of 0.510. So:

+1.136 runs = Runner on 1st and one out + Double, run scores

We can repeat this for every combination of bases and double results we want. For instance:

-0.217 runs = Runner on 1st and one out + Double, runner thrown out at home

+1.000 runs = Runner on 2nd and nobody out + Double

+1.212 runs = Runner on 1st and two out + Double, runner safe at home and batter goes to third on the throw

I'm not sure how many of these cases there are, but we can look to Retrosheet and list them all. At the end, we have a huge list of all possible combinations of doubles, and what they were worth in runs. We just have to average them, weighted by how often they happened, and we're done. We then have the answer.

As it turns out, the answer for the 1992 American League works out to 0.763 runs.

The answer is NOT an estimate based on a model with random errors that we have to eliminate. It's the exact answer to the question, the same way counting the balls in the urn gave us an exact answer.

Just to be absolutely clear, here's what we've shown:

Suppose we randomly remove one double from the 1992 American League. Then, we reconstruct the inning from the point of that double forward, by looking at the base/out situation before the double, finding a random inning with that same base/out situation, and substituting that new inning instead of what really happened.

If we do that, we should expect 0.763 runs fewer will be scored. If we were to run this same random test a trillion times, the runs lost will average out to .763 almost exactly.

If you try to answer this question by running a regression, to the extent that your estimate is different from 0.763, you got the wrong answer.

------

Anyway, the explanation above was a complicated way of describing the process. Here's a simpler description of the algorithm.

1. Using Retrosheet data, find every situation where there was a runner on second and no outs. It turns out there were 1,572 such situations in the 1992 AL. Count the total number of runs that were scored in the remainder of those innings. It turns out there were, on average, 1.095 runs scored each time that happened (1,722 runs scored in those 1,572 innings).

2. Repeat this process for the other 23 base-out states (two-outs-bases-loaded, one-out-runners-on-first-and-third, and so on). If you do that, and put the results in the traditional matrix, you get:

                0 out    1 out    2 out
---------------------------------------
nobody on       0.482    0.248    0.096
first           0.853    0.510    0.211
second          1.095    0.646    0.293
first/second    1.494    0.907    0.423
third           1.356    0.940    0.377
first/third     1.804    1.151    0.470
second/third    2.169    1.418    0.598
loaded          2.429    1.549    0.745

3. Find every double hit in the 1992 AL. For each of those 3,596 doubles, figure (a) the run value from the above table *before* the double was hit; (b) the run value for the situation *after* the double; and (c) the number of runs that scored on the play.

The value of that double is (b) - (a) + (c). For instance, a 3-run double with the bases loaded and 2 outs is worth 0.293 minus 0.745 plus 3. That works out to 2.548 runs.

4. Average out each of the 3,596 run values. You'll get 0.763.
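The four steps above can be sketched in code. The run expectancy numbers are copied from the 1992 AL matrix; the function names and the (before, after, runs) representation of a play are my own illustrative choices, and a real run would feed in every double parsed from Retrosheet rather than the two sample plays here.

```python
# Expected runs for the rest of the inning, indexed by outs (0, 1, 2).
RUN_EXPECTANCY = {
    "nobody on":    (0.482, 0.248, 0.096),
    "first":        (0.853, 0.510, 0.211),
    "second":       (1.095, 0.646, 0.293),
    "first/second": (1.494, 0.907, 0.423),
    "third":        (1.356, 0.940, 0.377),
    "first/third":  (1.804, 1.151, 0.470),
    "second/third": (2.169, 1.418, 0.598),
    "loaded":       (2.429, 1.549, 0.745),
}

def run_expectancy(bases, outs):
    """Expected runs from this base/out state; zero once the inning is over."""
    return 0.0 if outs >= 3 else RUN_EXPECTANCY[bases][outs]

def event_value(before, after, runs_scored):
    """Step 3: (b) - (a) + (c) for a single play."""
    return run_expectancy(*after) - run_expectancy(*before) + runs_scored

def average_value(plays):
    """Step 4: average the run values over every occurrence of the event."""
    return sum(event_value(*play) for play in plays) / len(plays)

# A 3-run double with the bases loaded and 2 outs leaves a runner on second.
bases_clearing = event_value(("loaded", 2), ("second", 2), 3)

# A double with a runner on second and nobody out scores that runner.
rbi_double = event_value(("second", 0), ("second", 0), 1)

print(round(bases_clearing, 3), round(rbi_double, 3))
```

Passing all 3,596 doubles to `average_value` as (before, after, runs) tuples should reproduce the 0.763 figure, assuming the plays are parsed the same way I did it.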

It's that simple. You can repeat the above for whatever event you like: triples, stolen bases, strikeouts, whatever. Here are the values I got:

-0.272 strikeout
-0.264 other batting out
+0.178 steal
+0.139 defensive indifference
-0.421 caught stealing
+0.276 wild pitch
+0.286 passed ball
+0.277 balk
+0.307 walk (except intentional)
+0.174 intentional walk
+0.331 HBP
+0.378 interference
+0.491 reached on error
+0.460 single
+0.763 double
+1.018 triple
+1.417 home run

-----

Anyway, my point in the original post wasn't meant to be "regression is bad." What I really meant was, why randomly pull balls from urns when Retrosheet gives you enough data to actually count the balls? This method gives you an exact number.

One objection might be that, to do it this way, there's way too much data to use, and so regression is a more practical alternative. But is it really better to use a wrong value just because it's easier to calculate?

Besides, you don't have to calculate them yourself -- they've been calculated, repeatedly, by others who can be referenced. In the worst case, they're close to the "traditional" weights, as calculated by Pete Palmer some 25 years ago. If you need a solid reference, just use Pete's numbers. They're closer than you'll get by running a regression, even a particularly comprehensive one.



Tuesday, October 27, 2009

Don't use regression to calculate Linear Weights

About a year ago, in a post titled "Regression, Schmegression," Tom Tango argued that regression is not usually one of the better techniques to use in sabermetric research. He's right, especially for the example he used, which is using regression to find the correct Linear Weight values for the basic offensive events.

What a lot of researchers have done, and are still doing, is listing batting lines for various team-years -- singles, doubles, triples, etc. -- and running a regression to predict runs scored. It's not that bad a technique, but there are other, better ones you can use. Still, by looking a bit closer at the regression results, you can get a good idea of why regression results don't always mean what you think they mean.

Let's start with the triple. How much is a triple worth? That is: how many more runs would an average team score if you gave them exactly one extra triple?

We can run a simple regression, runs scored vs. triples hit. I used a dataset consisting of all full team-seasons from 1961 to 2008 (only for teams that played at least 159 games, to omit strike seasons). That was 1,121 teams. The result of the regression:

Runs = 731 - (0.44 * triples)

That's not a misprint: the regression tells us that every triple actually *costs* its team almost half a run!

It's not a sample size issue, either. The standard error of the -0.44 estimate is 0.27. The estimate was actually significantly different from zero (in the wrong direction!) at the 10% level.

Is it possible that a triple actually lowers your runs scored? Of course not. Our baseball knowledge tells us that's logically impossible. A triple maximizes the value of runners on base (by scoring them all), and then puts a runner on third, where he's also likely to score. It's all positive. There must be something else happening here.

It's pretty obvious, but to understand that we can't take the results at face value, we needed subject matter expertise -- we needed to know something about baseball. In this case, we didn't need to know much, just that triples have to be a good thing. But that's subject matter knowledge nonetheless.

No matter how expert you are in the technique of regression, you have to know something about the subject you're researching to be able to reach the correct conclusions from the evidence. Because, as the saying goes, correlation doesn't imply causation. But it doesn't imply *non-causation* either. It could be that triples cause fewer runs, or it could be that there's some third factor that's positively correlated with triples, but negatively correlated with runs scored. Knowing something about baseball lets us argue for which conclusion makes more sense.

Normally, when you interpret a regression result like this, you say something like: "all else being equal, one extra triple will reduce the number of runs scored by about 0.44." But that's not quite right. The "all else" doesn't refer to everything in the universe -- it only refers to everything else *you controlled for in that regression*. Which, in this case, was nothing -- we only regressed on triples.

A more accurate way to put the regression result is:

"One extra triple is associated with a reduction of the number of runs scored by about 0.44. That's either because of the triple itself, or because of something else about teams who hit more triples, something that wasn't controlled for in the regression."

Now, a possible explanation becomes apparent. Teams that hit lots of triples are usually faster teams. Fast teams tend to have fewer fat strong guys who hit for power. Therefore, maybe hitting lots of triples suggests that your team doesn't have much power, which is why triples are negatively correlated with runs scored.

Again, the regression didn't suggest that: it was our knowledge of baseball.
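Here's a small illustration of that mechanism on made-up data. Everything in it is an assumption chosen for the demonstration: the synthetic teams, the "true" values of 1.0 runs per triple and 1.4 per home run, and the strength of the triples/home-run tradeoff. It is not the 1961-2008 dataset. It regresses runs on triples alone, then on triples and home runs together.

```python
import random

def simple_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_var_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2, via the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

rng = random.Random(0)
TRUE_TRIPLE, TRUE_HR = 1.0, 1.4      # assumed "true" run values

triples, homers, runs = [], [], []
for _ in range(2000):                # 2,000 synthetic team-seasons
    hr = rng.gauss(150, 30)
    tr = 50 - 0.1 * hr + rng.gauss(0, 8)   # power teams hit fewer triples
    triples.append(tr)
    homers.append(hr)
    runs.append(300 + TRUE_TRIPLE * tr + TRUE_HR * hr + rng.gauss(0, 20))

naive_triple = simple_slope(triples, runs)
controlled_triple, controlled_hr = two_var_slopes(triples, homers, runs)
print(round(naive_triple, 2), round(controlled_triple, 2), round(controlled_hr, 2))
```

In this setup the triples-only regression comes out negative even though every triple is worth a full run by construction, and the estimate moves back toward the true value once home runs are in the regression.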

We can test that hypothesis, and there are a couple of ways to test it. First, we can test for a correlation between triples and other hits. And, yes, the correlation between triples and home runs is -0.31: teams who hit a lot of triples do indeed hit fewer home runs than average.

Or, we can just include home runs in the regression. If we do that, we get the equation

Runs = 373 + (1.84 * triples) + (1.93 * home runs)

Which means:

"Home runs being equal, one extra triple will increase the number of runs scored by about 1.84. That's either because of the triple itself, or because of something else about teams who hit more triples (something other than home runs, which was controlled for)."


Of course, there's still something else other than home runs: our baseball knowledge tells us that teams that hit lots of triples are likely to be different in doubles power, too. And, in fact, in almost every other category: singles, outs, walks, steals, and caught stealings. So if we do a regression on all that stuff, we get:

Runs = 42
+ (0.52 * singles)
+ (0.67 * doubles)
+ (1.18 * triples)
+ (1.48 * home runs)
+ (0.33 * walks)
+ (0.18 * steals)
- (0.21 * caught stealing)
- (0.11 * batting outs (which is AB-H)).

Now we're getting values that are close to traditional Linear Weights. But not completely. For instance (and as Tango noted), we get that a double is worth only 0.67 runs, rather than the 0.8 that we're used to.

Assuming the 0.8 is actually the correct number, we ask: What's going on? Why are we getting only 0.67? It isn't just random variation, because the standard error of the 0.67 estimate in the regression is only 0.02. So what is it?

There must be something about teams that hit a lot of doubles that reduces the number of runs they score, in ways *other* than changing the number of singles, doubles, triples, home runs, walks, steals, caught stealings, and batting outs. What could that be?

I don't know the answer, but here are some possibilities:

-- maybe teams that hit a lot of doubles (relative to the other events) are more likely to be intentionally walked. Therefore, their walks are less valuable than those of other teams. Every additional double may correlate with one extra walk turning out to be an IBB, which results in the regression giving a lower coefficient for the double.

-- the regression went from 1961 to 2008. Maybe teams that hit a lot of doubles (relative to other events) played in low-offense eras (like the mid 60s). The extra doubles mark the team as being from that era, which makes all events worth less, which makes the regression adjust by giving a lower coefficient for the double.

-- maybe teams that hit a lot of doubles ground into a lot of double plays. Since double plays are extra outs that don't show up in (AB-H), that would cause runs to be overestimated. The regression adjusts for that by building the extra DPs into the value of the double.

And so on. I don't know what the true answer is; none of the suggestions above seem very likely to me. It's a bit of a mystery. I'd add IBB to the regression, but the Lahman database doesn't seem to have it for teams. Maybe I'll calculate it some other way and try again.

Anyway, it was a bit of a shock to me that the doubles estimate was so far off. I would have thought that a technique like linear regression, with over 1,000 rows of data, would be able to come up with the answer. But it didn't, almost certainly because of outside factors that we didn't control for. Not only that, but we don't even really know what those outside factors are! (Although if you have an idea, let me know in the comments.)

-----

So the accepted value for the double is 0.8 runs, using a method I will explain shortly. The regression, on the other hand, gives only 0.67 runs.

It's not that the regression answer is wrong -- it's just that the regression answers a different question than the one we want.

The 0.8 method asks: "If a team happens to hit an extra double, how many more runs will it score?"

The 0.67 method asks, "If one team hits one more double than another team, how many more runs will it score taking into account that the extra double means the team might be slightly different in other ways?"

To most analysts, the first question is more important. Why? Because we really *do* want to find the cause-and-effect relationship. Confounding variables may be interesting, but they usually get in the way of what we're really trying to find. It's interesting to note that more triples is correlated with fewer runs scored. But that information isn't very useful to our understanding of baseball -- it doesn't tell us what makes teams win. That is, we hopefully aren't about to tell the 1985 Cardinals that they would have scored more runs by hitting fewer triples, at least not unless we really believe that triples hurt offense.

What we usually want to do, using Linear Weights, is answer questions like: if you release player X, and sign player Y, where player Y hits 10 more triples than player X, how much will the team improve? And, for that question, the regression gives us the wrong answer.

So what's the method we use to find the *true* value of a triple? It's pretty simple:

-- For a certain period of baseball history, look at the play-by-plays of all the games, and divide all plate appearances (and baserunning events) according to which of the 24 base/out states (such as no outs, runners on second and third) was happening when it occurred.

-- For each state, calculate how many runs were scored after that state was achieved. (Here's the one for 2009, from Baseball Prospectus.)

-- Now, for every triple, calculate the difference between the runs before the triple, and the runs after the triple. For instance, a leadoff triple would have been worth .79 runs (there was an expectation of .52 runs before the triple, and 1.31 runs after the triple). But a triple with the bases loaded and two outs was worth 2.36 runs (before the triple, .75 runs were expected to score. After the triple, only .11 runs were expected, but 3 runs actually did score. 3.11 minus .75 equals 2.36).

-- Average out all the values of all the triples, as calculated above.

That average is how much an extra triple is worth to an average team.
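The steps above can be sketched in a few lines. This uses only the two example triples quoted in the text (the run expectancies .52, 1.31, .75 and .11 come from there); the function name `event_run_value` is my own, and a real calculation would loop over every triple in the play-by-play data rather than two.

```python
def event_run_value(re_before, re_after, runs_scored):
    """Run value of one event: runs that scored on the play, plus the
    change in run expectancy from the state before to the state after."""
    return runs_scored + re_after - re_before

# (run expectancy before, run expectancy after, runs scored on the play)
example_triples = [
    (0.52, 1.31, 0),  # leadoff triple: bases empty, no outs -> runner on third, no outs
    (0.75, 0.11, 3),  # bases loaded, two outs -> runner on third, two outs, 3 runs in
]

values = [event_run_value(*t) for t in example_triples]
average_value = sum(values) / len(values)

print([round(v, 2) for v in values])  # [0.79, 2.36]
print(round(average_value, 3))        # average of these two examples only
```

With the full season of triples in the list instead of two, `average_value` is the Linear Weights value of the triple.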

If you do this for the 1988 American League (which is the one I happen to have on hand), you get that a triple is worth 1.024 runs. A double is worth 0.775.

There are several reasons why this estimate is much better than what you could get from a regression:

-- as we pointed out, the regression estimate is influenced by superfluous other factors possessed by teams that hit triples.

-- as Tango pointed out in his link, the regression uses data aggregated into team-seasons, which means you're losing a lot of information. This method uses PA by PA, inning by inning data, for a much more reliable estimate.

-- we have a direct, logical, cause-and-effect relationship.

-- in effect, we are able to hold *everything* constant, even factors we don't know about. That's because we are not comparing team X's triples to team X's runs. We're comparing a league-average triple to the league-average runs. All other confounding factors are averaged out.

Another way to look at it: the regression looks only at inputs and outputs. So it has no idea if the input *caused* the output, or if there's some third factor that links the two. But the play-by-play method isolates the direct effect of the input. It knows for sure that the triple *caused* the change from one state (bases empty, no outs) to the next (runner on third, no outs), and so it's not fooled by outside factors.

Correlation does not imply causation, and regression can only provide correlation. Why not use this method, which is based on causation, and therefore gives you the right answer?

-----

UPDATE: Below, commenter Ted links to a paper (.pdf) where he used regression to figure Linear Weights, and found that when he added variables for GIDP and reached on error, the doubles coefficient increased (from .689 to .722). The HR coefficient also increased (by 10 points). So that, I think, explains part of the mystery: the doubles are artificially low because teams that hit a lot of doubles and home runs are so slow that they also hit into a lot of DPs and don't reach base on error as much. When these are accounted for separately, some of the true value of the double is restored.



Monday, October 19, 2009

Premature accusations of anti-French NHL racism

Another accusation of racism in sports hit the newspapers today, on the front page of Canada's "National Post." This time, it's English-speakers who are accused of discrimination, in the form of racism against French-Canadian players.
The story is about Bob Sirois, a former NHL forward from Montreal, who did some analysis on NHL demographics and concluded that there is an "anti-francophone virus" in pro hockey. The reporter also quotes Réjean Tremblay, a sportswriter for Montreal's "La Presse," who got a look at the findings, and argues that "discrimination against the frogs is absolute." (Here's an article by Tremblay on the issue.)

What is the evidence for these accusations? We don't know for sure, because the full argument is in Sirois' upcoming book. But the article gives a few statistics:

-- Forty-two percent of francophone Québeckers who played three or more years in the NHL won a trophy or were named to the All-Star team. "Only francophones at the highest level were able to have lasting careers," Sirois said.

-- Of all 16-year-old players at the midget level in Quebec, 1 in 334 anglophones was eventually drafted, but only 1 in 618 francophones.

-- Francophone players in Quebec are less likely to get drafted than anglophone players in Quebec, and they go lower in the draft.

-- Of the 763 francophones drafted since 1970, one-third of them went to four teams: the Quebec Nordiques, Montréal Canadiens, Buffalo Sabres, and Philadelphia Flyers. (The teams drafting the fewest francophones were the Dallas Stars, Nashville Predators, and Phoenix Coyotes.)

-- Sometimes, undrafted players manage to eventually make it into the NHL. That group represents 10% of players overall, but 19% of players from Québec, suggesting that more francophone players are going overlooked.

Since the Post reporter wrote that he had obtained a pre-publication copy of the book, we can probably assume these are the most damning facts behind the accusations.

But are they actually evidence of discrimination? In every case, there are other, more plausible, explanations for the results. Let's take them one by one.

1. 42% of francophone Quebeckers who played three or more years in the NHL won a trophy or were named to the All-Star team.

The idea, presumably, is that to last in the NHL as a francophone, you have to be really, really good. But where's the evidence? Maybe the figure for anglophones is even higher than that? Forty-two percent sounds like a lot, but it's meaningless without a comparison number.

But maybe the lack of a contrasting figure for English Canada is the reporter's fault. Let's suppose the book has the anglophone number, and it's less. Does that prove anything?

No, it doesn't. This is actually an old argument. A couple of decades ago, baseball was accused of discriminating against blacks on similar evidence: there were lots of blacks in the league leaders, but fewer blacks as marginal players. Bill James effectively rebutted the argument then, based on the characteristics of the distribution of players.

Converted to hockey, the argument goes like this. Suppose that francophones happen to be better players, on average, than anglophones. More specifically, suppose skills are normally distributed with a standard deviation of 10 "points". English players have an average skill of (say) 100 points, but French players have an average skill of 105 points. You need to be over 130 to make the NHL, and over 135 to be considered a star.

So anglophone players need to be 3 SDs above the mean to hit 130 and make the league. That's about 135 players per 100,000 candidates. Francophone players need only be 2.5 SDs above their own mean. That's about 620 players per 100,000 candidates.

To hit the superstar 135 mark, the anglophones need to be 3.5 SDs above 100; that's about 23 stars per 100,000 population, which means 23 stars per 135 players. But the Francophones only need to be 3 SDs above 105. That's 135 stars out of 100,000, or 135 stars out of 620 players.

Which means:

17% of anglophone players are stars (23/135)
21% of francophone players are stars (135/620).

So there's a larger proportion of francophone stars than anglophone stars. The difference in our contrived example is only 21% to 17%. But it would be relatively easy to come up with numbers to make the difference bigger, or smaller.

The point is that a small difference in means adds up to a big difference at the far tails of the normal distribution. That's not discrimination, it's just the way the bell curve works.
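If you want to check the tail arithmetic, here's a short sketch using the SD multiples from the example (3 and 3.5 SDs for anglophones, 2.5 and 3 for francophones). The helper name `upper_tail` is mine; it's just the standard normal upper-tail probability written with the standard library's `math.erfc`.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def per_100k(z):
    """Expected count per 100,000 candidates who are more than z SDs above their mean."""
    return upper_tail(z) * 100_000

anglo_players = per_100k(3.0)    # roughly 135
anglo_stars = per_100k(3.5)      # roughly 23
franco_players = per_100k(2.5)   # roughly 620
franco_stars = per_100k(3.0)     # roughly 135

anglo_star_share = anglo_stars / anglo_players
franco_star_share = franco_stars / franco_players

print(f"anglophone:  {anglo_stars:.0f} stars / {anglo_players:.0f} players = {anglo_star_share:.0%}")
print(f"francophone: {franco_stars:.0f} stars / {franco_players:.0f} players = {franco_star_share:.0%}")
```

Changing the gap between the two means, or the cutoffs, is a one-line edit, which makes it easy to see how quickly the star share moves.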

Here's a more intuitive way to look at it. Suppose the anglophones and francophones were exactly equal in terms of players and stars. Now, let's make the francophones better by taking a couple of Mario Lemieux clones and throwing them into the francophone pot. Doesn't it now make sense that a larger proportion of francophones will be superstars? It's not racism -- it's just that the francophones are now BETTER.

One objection to this line of reasoning might be: if francophones are so much better than anglophones, shouldn't we see them disproportionately represented in the NHL? Yes, we probably should. And who says we don't? The Post article does NOT say that fewer francophones make the NHL, per capita, than anglophones. It says only that francophone Quebeckers were less likely to be drafted than *Anglophone Quebeckers*. I'd be willing to bet, right now, that francophones are more likely to be drafted than non-Quebec anglophones. That's based partly on this logic, and partly on my feeling that if it weren't true, Mr. Sirois would be trumpeting that fact in the article.

[ --> UPDATE: that's apparently not right. "Hawerchuk" says that Québeckers comprise 18% of Canadian NHL players (by games played). But they're 23% of the population. ]

(Oh, and why might it be that francophone players are better than anglophone players? It could be that anglophone Quebeckers live mostly in Montréal, where ice time is harder to get. Francophone Quebeckers are more likely to be in small, northern towns, where there are more rinks per capita and more frozen ponds to play on after school. That would give francophone boys more ice time and practice time, which would make them better players. It would be roughly the same reason that the Canadian Olympic team is competitive with the US team, despite having only one-tenth the population.)


2. Of all 16-year-old players at the midget level in Québec, 1 in 334 anglophones was eventually drafted, but only 1 in 618 francophones.

That could easily happen without discrimination. All it would take is for hockey to be a bigger part of francophone culture than anglophone culture.

Suppose that hockey is popular enough among French-speaking families that the top 20% of boys are still playing organized hockey when they're 16. And suppose that hockey is less popular among English-speaking families, so that only the top 11% of boys are still playing organized hockey when they're 16.

That would explain the numbers exactly. The mediocre francophone players don't get drafted. The mediocre anglophone players don't get drafted either, but they dropped out of organized hockey early enough that they don't make Sirois's survey.
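Here's the arithmetic behind that, as a quick sketch. The 20% and 11% participation rates are my made-up assumptions from above, the 1-in-334 and 1-in-618 figures are the ones quoted from Sirois, and the function name is just for illustration.

```python
def drafted_per_boy(share_still_playing, players_per_draftee):
    """Draft rate among ALL 16-year-old boys, not just those still playing."""
    return share_still_playing / players_per_draftee

anglo = drafted_per_boy(0.11, 334)   # assumed 11% still playing, 1 in 334 of those drafted
franco = drafted_per_boy(0.20, 618)  # assumed 20% still playing, 1 in 618 of those drafted

print(f"anglophone boys drafted per 100,000:  {anglo * 100_000:.1f}")
print(f"francophone boys drafted per 100,000: {franco * 100_000:.1f}")
print(f"ratio: {anglo / franco:.2f}")
```

Under those assumptions the two per-capita draft rates come out within a couple of percent of each other, even though the per-player rates differ by almost a factor of two.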

Again, I'd be willing to bet that this is what's going on. I live in Ottawa, which is on the border with Québec, and I can tell you that the francophone families I know are much, much more hockey-mad than the anglophone families, on both sides of the border.

Sirois says,

"If you're francophone and your son is talented in minor hockey, anglicize his name and you will double his chances of being drafted."


If Sirois is basing that comment only on this particular statistic, his conclusion is premature, to say the least.


3. Francophone players in Québec are less likely to get drafted than anglophone players in Québec, and they go lower in the draft.

Same argument. The less-skilled anglophones drop out of hockey more frequently, while the less-skilled francophones drop out of hockey less frequently. So the remaining anglophones are better, on average, than the remaining francophones.

Again, I'd bet that if you looked at raw population numbers, more Québec francophones get drafted than Québec anglophones, at every level of the draft. They are less likely to be drafted, as Sirois says, if you look only at the pool of 18-year-old players. But I'd bet they are MORE likely to be drafted if you look at the pool of all 18-year-olds in Québec, whether they play hockey or not.

It's selective sampling if the mediocre francophones are more likely to be in the sample than the mediocre anglophones.


4. Of the 763 francophones drafted since 1970, one-third of them went to four teams: the Québec Nordiques, Montréal Canadiens, Buffalo Sabres, and Philadelphia Flyers. (The teams drafting the fewest francophones were the Dallas Stars, Nashville Predators, and Phoenix Coyotes.)

First, and easiest: are there fewer francophone players in the league now than in the past? Given the number of players these days being drafted from outside North America, that would seem likely. That would explain why the Stars, Predators, and Coyotes -- teams that weren't in the league in the 70s and 80s -- would have drafted fewer francophones than the more established teams.

It's the same reason you'd also find that Québec, Montréal, Buffalo and Philadelphia have had more non-helmeted players than Dallas, Nashville, and Phoenix. It's not because Nashville discriminates against bare heads, but because the Predators weren't around when it was legal to go without a helmet.

Secondly: it might just be a difference in scouting. Back in 1985, Bill James did a study of the MLB draft, and found that, in baseball, players in the southern United States were much, much more likely to be drafted than players in the cold states, even if the players were of equal talent. That wasn't racism against Minnesotans, it was just where the scouts decided to go. James wrote,

" ... the explanation seems obvious. ... The scouts spend a lot of time in the South because it gets warm down there while the North is still freezing, and they go where the baseball is. They see more of the players, see the ones they like more often, and wind up falling in love with them."


Doesn't it make sense that the same thing might apply in hockey? There isn't a weather issue, but there *is* a language issue. Doesn't it make sense that the Phoenix Coyotes are less likely to have a French-speaking scout, and are therefore less likely to send someone up to Chicoutimi in February to check out some prospect? If francophone scouts are rarer than anglophone scouts (which they obviously are), it makes sense that not every team would have one, and, as a result, francophone players would be disproportionately drafted by the teams that do. That's not discrimination, it's just rational allocation of resources.

Dallas might just be saying, "you know, we don't have a francophone scout, so we'll let the Canadiens concentrate on prospects in Trois-Rivières, and we'll send our guy to Regina."


5. Sometimes, undrafted players manage to eventually make it into the NHL. That group represents 10% of players overall, but 19% of players from Québec, suggesting that more francophone players are going overlooked.

This can easily follow from the hypothesis that there are more second-tier francophones in the draft pool than anglophones.

Again, suppose that 20% of francophone boys are still playing at age 16 (and therefore scouted for the draft), but only 11% of anglophone boys are. Scouts know that only 1% of Québec boys, of either language, will make the NHL. So they duly draft only 1% of the anglophone population, and 1% of the francophone population.

Scouts aren't omniscient, and they'll miss a few good prospects. There will be undrafted players who bloom later, and finally attract some interest from NHL teams.

Under our assumptions, 19% of francophone boys will be initially passed over, but only 10% of anglophone boys. That leaves almost twice as many francophones who might get noticed (and signed) later. Of course, those nine percentage points of extra francophones are less skilled than the top ten percent, but some players are late bloomers, and the bigger the pool, the more missed players you're going to sign later.

I think that's the obvious true explanation: more players means more late bloomers.

-----

So the points raised in the article are certainly not enough evidence to conclude discrimination -- there is a perfectly plausible, non-racist explanation of each of them.

If you want to show discrimination, you need better arguments than these, to remove the selective-sampling problem intrinsic to each of the arguments here. What you can do is this: find all anglophones drafted in position X, and all francophones drafted in position X. See how they do in the NHL. If there's discrimination, you'll find that francophone 14th picks do better than anglophone 14th picks.

And even if you find there's discrimination, it doesn't mean it's racist, or even language-specific. It might just be a scouting issue, where there are fewer scouts in Québec than elsewhere, just as there were fewer MLB scouts in Minnesota than in Georgia.

My gut says you won't find much discrimination. I guess I wouldn't be surprised if you found a little bit, that team X might be less interested in a francophone eighth-round pick because of perceived language issues with the other players, when they can't really tell him apart from a similar anglophone player who's also available. But discrimination is expensive, and every team wants to win. If you want to convince me that teams are deliberately leaving money on the table because of racism, you'll have to come up with some pretty good evidence.

These arguments, though, just don't cut it. There are many better, more plausible explanations for the apparent statistical anomalies in the article -- enough so, in fact, that, in my view, the accusations of racism are premature and irresponsible.


(Other views: Here's Tango, and here's mc79.)



Sunday, October 11, 2009

Doesn't "The Book" study pretty much settle the clutch hitting question?

The clutch hitting debate continues. For the latest, here's Tango quoting Bradbury quoting Barra. Bradbury references Bill James' essay, and Barra references Dick Cramer's 1977 study.

In Tango's post, he says,

Anyway, as for actually finding a clutch skill, Andy [Dolphin] did in fact find it, and the results are published in The Book.


Absolutely. It's time, I think, that this study be acknowledged as the most relevant to the clutch question. Cramer's study gets quoted because it's the most famous, but recent studies (like Tom Ruane's) have used a lot more data. Dolphin's study improves on Ruane's by including even more data, by correcting for various factors, and by giving an actual quantitative estimate of how much clutch hitting talent there really is.

The one fault with Dolphin's work is that it hasn't been published in full. This is understandable: "The Book" contains a huge number of studies, and if they were all run in detail, the book would be a couple of thousand pages. But this is one of the most important studies, on one of the most asked questions in sabermetrics. If we want sabermetricians, academics, and reporters to accept the results, the study should be published in full, so as to be subject to full peer review. I'm not even completely sure how the study worked. I have a pretty good idea of the outline, but not the details. Part of the reason the study needs to be published is for the technical details to be available, so others can evaluate the method and reproduce the results if they choose to.

Anyway, here's what I *think* Andy did:

-- he took every regular-season game from 1960 to 1992.
-- he considered only PAs involving RHP, to eliminate platoon bias.
-- for every player who met minimum playing time, he computed his clutch and non-clutch OBP.
-- he adjusted those OBPs to reflect the quality of the opposing pitcher, and the fact that overall clutch and non-clutch OBPs differ.
-- he computed clutch performance by subtracting non-clutch from clutch.

That gave him clutch numbers for 848 players.

-- he looked at the distribution of clutch hitting, and figured the observed variance.
-- he then figured what the variance would have been if there were no clutch hitting.

It turned out that the actual variance was higher than the predicted variance, which is what you'd expect if there were something other than just luck causing the results (such as clutch hitting talent). The difference we can presume to be clutch hitting.

If luck and talent are independent (which is a pretty reasonable assumption), then

Variance caused by talent = (Total Variance) - (Variance caused by luck)

That calculation led Andy to conclude that the talent variance was .008 squared, which meant the standard deviation of clutch talent was 8 points of OBP.

Andy phrased it like this:

"Batters perform slightly differently when under pressure. About one in six players increases his inherent "OBP" skill by eight points or more in high-pressure situations; a comparable number of players decreases it by eight points or more."
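A small sketch of the two calculations involved. The luck SD of .030 below is a number I made up purely so the subtraction has something to work on (I don't know the actual variances in the study), and `talent_sd` is my own name for the formula above. The "one in six" part is just the normal-curve share more than one SD above the mean.

```python
import math

def talent_sd(total_var, luck_var):
    """SD of talent, assuming luck and talent are independent so variances add."""
    return math.sqrt(max(total_var - luck_var, 0.0))

# Hypothetical inputs, chosen only to illustrate the subtraction.
luck_var = 0.030 ** 2                # assumed variance from binomial luck alone
total_var = luck_var + 0.008 ** 2    # observed variance implied by an 8-point talent SD

sd = talent_sd(total_var, luck_var)

# Share of players more than one talent SD above average, if talent is normal.
share_above_one_sd = 0.5 * math.erfc(1 / math.sqrt(2))

print(f"talent SD: {sd:.3f}")
print(f"share at +1 SD or better: {share_above_one_sd:.3f} (about 1 in {1 / share_above_one_sd:.1f})")
```

The second number is where "about one in six" comes from: roughly 16% of a normal distribution sits more than one SD above the mean, and the same share sits more than one SD below.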


That finding, I think, is the strongest we have, and I agree with Tango 100% that we should consider Andy's .008 figure to be the best available answer to the clutch hitting question.

----

As I said in previous posts, however, I do have some minor reservations about what we can conclude from the analysis, so it's appropriate to add a few caveats.

1. Mostly, I'm not convinced that the .008 represents individual clutch ability in the sense in which most fans think of it -- that the player "bears down" in important situations and performs better than normal. I wonder if, instead, it might just be a matter of both hitters and pitchers using different strategies in those clutch situations.

For instance, suppose you have a power hitter and a singles hitter, and neither gets any better in the clutch. But in those situations, the relative values of offensive events might change. Maybe, with the score close in the late innings, a home run becomes more valuable relative to a single. I'm making these numbers up, but, maybe instead of the HR being three times as valuable as a 1B, it becomes four times as valuable.

Now, the pitcher's strategy changes. Fearing the home run a little more than normal, he'd be apt to pitch around the power hitter, trading fewer home runs for more walks. That would cause the power hitter's OBP to increase more than expected. Even if there's no similar effect for the singles hitter, he'll look relatively worse in the clutch than the power hitter.

So it's possible, and even plausible, that the .008 might not be a reflection of the clutch behavior of an individual hitter, but just an artifact of the strategic manoeuvring in the batter-pitcher matchup.

To find out, you could check whether certain types of hitters have better clutch performances as a group. If you did find that, it would be evidence that at least part of what Andy found as "clutch ability" is just characteristics of the player.

There is some evidence that some of this is happening: in the book, Andy says that when he used wOBA (which weights events by their value, so HRs are worth about three times what a single is worth) instead of OBP (which weights all on-base events equally), the SD dropped from 8 points to 6. That suggests that clutch performance did indeed involve a trade-off between getting on base and hitting for power.

If you went one step further, and analyzed performance in terms of win probability (instead of OBP or wOBA), you might find some other result, such as no evidence of clutch talent at all. It could be that all the clutch differences are the result of hitters adjusting their game to what the situation requires, such as (say) a power hitter trying for a single with the bases loaded, vs. a home run with two outs and nobody on.

2. Just today, Matt Swartz suggested that lefties might be more "clutch" than righties, because they hit better with runners being held at first (I always thought that was because of the hole between first and second, but Matt suggests it's because that limits the defense's ability to shift in other ways). Again, that's something that's real -- so the team would know they could benefit from it -- but not "clutch" in the sense that the hitter is actually better in some way.

3. Another quibble I have with the conclusion is that the result does not appear to be significantly different from zero. Andy says there's a 68% probability that clutch talent is between 3 and 12 points; I calculated that the 95% confidence interval easily includes zero (the p-value for the hypothesis of zero clutch talent is somewhere around .14). So even if you're only interested in whether there's an ability to have a higher OBP (in the sense that some players' clutch OBPs vary more than others), the evidence is not conclusive beyond a reasonable doubt.

4. As Andy implies in "The Book" (and Guy explicitly suggests elsewhere), there could be other explanations for the .008. It could be that some players happened to have more clutch AB at home, so what we're seeing is partly HFA. It could be that some players happened to see a starter for the third time that game (when batters start gaining an advantage) more often than expected in the clutch. It could be a lot of other things.

Guy suggests doing the same study, but choosing the PA randomly (instead of clutch and non-clutch). That would tell us how much of the .008 happens due to random clustering of factors.

(Note: just as I was about to submit this post, I found an earlier Andy Dolphin study that *does* do this kind of check. Andy found that dividing PA into other situations did not produce any false positives.)

----

Even if some of these criticisms turn out to be justified, it doesn't mean that clutch doesn't matter. Even if we find the entire effect is (say) due to lefties hitting better with runners on base, that's still something a manager or a GM should take into account. If you have two .270 hitters, but one hits .270 all the time, while the other hits .268 usually but .276 in the clutch ... well, you want the second guy. It doesn't really matter to you whether the extra performance comes from the player's gutsiness, or just from something that's inherent in the game.

But my perception is that fans who talk about "clutch" are talking about something in a player's make-up or psychology that makes him more heroic in critical situations. I'd argue that while "The Book"'s study convincingly showed that some players hit slightly better (or worse) in clutch situations, it has NOT shown that it's because the players themselves are "clutch".

----

Looking back at what I wrote, I realize I'm repeating things I said before. But the point I was trying to make is that I agree with Tango: the study in "The Book" is state of the art, and, to my mind, the question of whether players hit differently in the clutch now has an answer.

I'm not sure how to get the result accepted. Well, publication of the study would help; the media are more likely to pay attention to a result if it's a full academic-type study instead of a few pages of a book. I'm sure JQAS would be happy to run it. Even a web publication would help.

What else? Well, I suppose that the more the sabermetric community cites the result, the more it'll spread, and the more likely sportswriters will be to come across it when researching clutch.

Or maybe a press release? It works for Steven Levitt!



Monday, October 05, 2009

Stacey Brook on salary caps and competitive balance

You'd think that when a sport introduces a salary cap, it would lead to greater competitive balance in the league. That would make sense; with a cap, you won't have teams like the Yankees, who spend two-and-a-half times as much on players as the average team, and about five times as much as the Marlins. If you forced the Yankees to spend only the league average, they would have to get rid of many of their expensive star players, and they'd win fewer games.

In theory, if every team had to spend the same amount, they'd all start the year with equal expectations. I say "in theory" because, in practice, different teams would have different philosophies, some of which might work better than others. Certain teams might spend more on scouting, wind up drafting better, and win more games with the same payroll (at least until the draftees reach free agency). But, generally, you'd expect more balance among teams.

It seems that Stacey Brook, co-author of "The Wages of Wins," doesn't think that's true. He thinks that the salary cap (and floor) the NHL instituted in 2005 has had no effect on competitive balance.

Here are Brook's "Noll-Scully" measures of competitive balance for the last few years of the NHL (lower numbers = more balance):

2000-01 1.858
2001-02 1.581
2002-03 1.592
2003-04 1.633
-----------------
salary cap begins
-----------------
2005-06 1.637
2006-07 1.600
2007-08 1.037
2008-09 1.369

It does seem, Brook acknowledges, that competitive balance has improved the last couple of years. But, he says, that's part of a trend that's been going on for a long time. For one thing, there was virtually no change in the Noll-Scully the first two years after the cap. For another, balance has been improving since at least the 1970s:

1970s 2.557
1980s 1.969
1990s 1.796
2000s 1.538

Since competitive balance has been increasing even through most of hockey history that had no salary cap, he argues, it's just a continuation of the trend, and the salary cap doesn't have anything to do with the recent decline. He writes,


"As we argue in The Wages of Wins, and detail in our paper - The Short Supply of Tall People - competitive balance is declining not because of changes in league institutional rules - such as payroll caps - but rather due to the increasing pool of talent to play sports, such as hockey."


But that doesn't make logical sense. Sure, there's already a decreasing trend, for whatever reason, but that doesn't mean a change to the rules can't contribute to the trend. Does having the ability to send text messages lead to people using their phone more? Of course it does! But if you apply the same argument, you get something like, "well, cell phones were becoming more and more popular even before text messaging, so text messaging can't have anything to do with it." That's not right.

And, indeed, it contradicts their own findings in "The Wages of Wins" itself. The authors found that there was an r-squared of .16 between salary and performance in MLB. Which means that if you were to flatten out salaries, so that each team paid an equal amount, it would reduce the variance of wins by 16%. So, absent any compensating factors, "The Wages of Wins" argues that a salary cap MUST reduce the Noll-Scully measure!
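To put a rough number on that: Noll-Scully is a ratio of standard deviations, so removing 16% of the variance of wins shrinks it by the square root of 0.84. The sketch below just does that arithmetic. Pairing the MLB r-squared with the 2003-04 NHL figure from the table above is an assumption made purely for illustration, not something taken from the book.

```python
import math

R_SQUARED = 0.16             # salary-vs-performance r-squared quoted for MLB
NOLL_SCULLY_BEFORE = 1.633   # 2003-04 NHL value from the table (illustrative pairing only)

# Removing 16% of the variance leaves 84%; a standard deviation scales by the square root.
sd_factor = math.sqrt(1 - R_SQUARED)
noll_scully_after = NOLL_SCULLY_BEFORE * sd_factor

print(f"SD shrinks by a factor of {sd_factor:.3f}")
print(f"Noll-Scully would move from {NOLL_SCULLY_BEFORE} to about {noll_scully_after:.2f}")
```

So a 16% drop in variance is only about an 8% drop in the Noll-Scully itself, which is worth keeping in mind when eyeballing year-to-year changes.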

----

By the way, take a look at the value of 1.037 for 2007-08. That's really, really low; the lowest you can expect Noll-Scully to be is 1.000, and that's when every team is of exactly equal talent. A value so close to 1 suggests a combination of (a) the league being really balanced that year, and (b) teams, by luck, playing closer to .500 than their talent suggested.

If you look at the standings, you see the usual suspects at the top of the conferences, so it doesn't really seem like all the teams were equal that year. Could it be that Brook used a formula for Noll-Scully that didn't consider the extra point for an overtime loss?
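For anyone who wants to check a season themselves, here is a minimal sketch of the usual Noll-Scully calculation: the observed standard deviation of winning percentage, divided by the standard deviation you'd expect if every game were a coin flip. The six-team league is made up, and how you turn NHL points (with the overtime-loss point) into a winning percentage is left open here, since I don't know which convention Brook used.

```python
import math
from statistics import pstdev

def noll_scully(win_pcts, games_per_team):
    """Observed SD of winning percentage divided by the SD expected
    if every team were a coin flip: 0.5 / sqrt(games)."""
    ideal_sd = 0.5 / math.sqrt(games_per_team)
    return pstdev(win_pcts) / ideal_sd

# Hypothetical six-team league playing 82 games each (invented records).
example = [0.620, 0.560, 0.510, 0.490, 0.440, 0.380]
print(round(noll_scully(example, 82), 3))
```

If the winning percentages are inflated or compressed by the way overtime losses are counted, the numerator changes and so does the ratio, which is one way a value like 1.037 could come out lower than the standings suggest.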

----

But what about Brook's (and Berri's) argument that balance has increased because players' skills are becoming more equal? Well, sure, that's been part of it, no question. But effects often have more than one cause. You may be earning more money because you're working overtime, but that doesn't mean winning the office hockey pool will *also* make you richer. Whatever was causing the levelling of team talent before might still be there ... but, now, there's an additional effect, the salary cap effect.

Now, maybe I'm not interpreting Brook's argument correctly. Maybe he's thinking that the salary cap does contribute to balance, but so much less than the other effect (players getting more equally talented) that it's not worth considering. But I think it's the other way around. With a salary cap, it doesn't matter much how the players' talent is distributed.

Suppose players vary a lot in talent, 100 players equally spaced from 0 to 100, with an average of 50. A team that has lots of money might buy players with an average of 70, and a team owned by Harold Ballard might buy players with an average of 30. Big difference.

Now, suppose the talent pool gets bigger, and competition gets tougher, and now the players are all spaced between 40 and 60. Now, no matter how much you want to spend, you can't get above 60. And no matter how cheap you are, you can't get below 40. But the league average is still 50.

So, yes, Brook is correct, a narrower range of talent leads to more competitive balance.

But, now, suppose that every team has a salary cap and a floor: they all have to spend exactly the same amount of money. Now, it doesn't matter how the talent is distributed: assuming every team is equally good at evaluating players, they'll all sign a team with an average of 50. Even if the distribution of talent is like it was in the 1970s, with lots of spread, it doesn't matter -- because even if there are lots of players in the 90s and 100s, no team can afford to sign more than one or two. The more talented the player, the more likely a team who signs him will have to sign *less* talented players to stay within the cap.

Even if you have the Babe Ruth of hockey, a player who's (say) a 500 when the other players top out at 100, it won't matter, because the teams will bid up the price of his services until they pay him what he's worth. The team who gets him will have less money to spend on other players, and it all evens out in the end.

What's happening is this: in the past decades, competitive balance improved steadily for many reasons, including the increase of the talent pool that Brook cites. But, now, with a salary cap and floor, most of that stuff doesn't matter much any more!

It matters a bit, because not everyone is a free agent. The distribution of talent does matter for draft choices, because the top draft choice doesn't cost that much more than the others (but can be a whole lot better, as in Sidney Crosby).

---

Of course, NHL hockey teams are more than collections of free agents priced at market value, so we shouldn't expect competitive balance to be perfectly level. There are some factors that might cause the Noll-Scully to actually rise a bit from the theoretical bottom created by the salary cap.

For instance: the first draft choice goes to a team near the bottom of the standings. Back in the days of less competitive balance, that went to a team that was probably legitimately awful. Now, with teams closer in talent, it could go to a team that was just unlucky. If the team that gets the next Sidney Crosby is an average team, rather than a bad team, that won't reduce competitive balance the way it used to.

Also, scouting: an investment in scouting now pays off more than it used to. Before, if you were a low-spending team, maybe a better draft choice might move you from .400 to .450. Now, if all teams are medium-spending, maybe it'll move you from .500 to .550, and give you a legitimate shot at the Stanley Cup. So more teams should be willing to spend the money to improve their drafting. And so, the rich teams could "buy" better players, not by spending to pay them, but by spending to identify them better.

And there are probably other ways to get around the cap: didn't companies introduce employee health plans to get around wage controls in World War II? If a superstar free agent has knee problems, and I wanted to sign that player, I'd offer to hire the best knee doctor in the business and keep him on staff. Whatever he costs, it's not going to count against my cap. That may not actually be practical, but I'm sure rich teams will figure out ways to buy better teams, one way or another.

My point is not to say that these factors will push inequality back to where it was when teams could sign all the free agents they were willing to pay for, just that there may be other theoretical reasons that Noll-Scully may bounce back up a little bit. I think all those factors will be minor, and as long as the salary cap and floor stay within roughly the same range of each other, we'll continue to see a balanced league, regardless of how the talent pool changes.



Hat tip: The Wages of Wins




Sunday, September 27, 2009

A game theory study on pitch selection

Commenter "Eddy" was kind enough to send me this link to what looks like a press release on a new study by Kenneth Kovash and Steve Levitt (of Freakonomics fame). The link is to a summary only; to get the actual study, I had to pay $5.

The study is in two parts: one baseball, and one football. I'll talk about the baseball results here; I'll send the football portion of the study to Brian Burke (of "Advanced NFL Stats") in case he wants to review it himself. I hope that's allowed under fair use and I don't have to pay another $5.

In the baseball half, the authors claim that pitchers throw too many fastballs. They would do better -- much better, in fact -- if they threw other kinds of pitches more often.

How can you tell, using game theory, whether fastballs are being overused? Simple: you just check the outcomes. If opposition hitters bat for an OPS of .850 when you throw a first-pitch fastball, but they OPS (can you use OPS as a verb?) .800 when you don't, then obviously you should cut back on the fastballs. At first glance, it might look like you should get rid of them entirely, because, that way, you could shave .050 off the opposition's OPS. But it's not that simple: as soon as the opposition realizes that you're not throwing fastballs, they'll be able to predict your pitches more accurately, and they'll wind up OPSing higher than .800 -- probably even higher than the original .850. Game theory can't tell you the right proportion, at least not without having to make assumptions that would probably be wrong. But it *can* tell you that you should adjust your strategy until the OPS-after-fastball is exactly equal to the OPS-after-non-fastball.
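To see why the two OPS figures have to be equal at the optimum, here is a toy two-by-two game. The four payoffs are invented, which is exactly the kind of assumption I just said would probably be wrong, so treat it as a sketch of the principle and not an estimate of anything.

```python
# Hypothetical OPS allowed, indexed by (pitch thrown, pitch the batter is sitting on).
# All four numbers are invented for illustration.
OPS = {
    ("fastball", "fastball"): 1.000,   # batter guessed right
    ("fastball", "other"):    0.700,   # batter fooled
    ("other",    "fastball"): 0.650,   # batter fooled
    ("other",    "other"):    0.900,   # batter guessed right
}

def equilibrium(ops):
    """Mixed-strategy solution of the 2x2 zero-sum game.
    The closed form assumes neither side has a dominant pure strategy."""
    a = ops[("fastball", "fastball")]
    b = ops[("fastball", "other")]
    c = ops[("other", "fastball")]
    d = ops[("other", "other")]
    denom = a - b - c + d
    p_fastball = (d - c) / denom   # pitcher's mix that leaves the batter indifferent
    q_sit_fb = (d - b) / denom     # batter's mix that leaves the pitcher indifferent
    return p_fastball, q_sit_fb

def ops_after(pitch, q_sit_fb, ops):
    """Expected OPS when `pitch` is thrown and the batter sits fastball with probability q."""
    return q_sit_fb * ops[(pitch, "fastball")] + (1 - q_sit_fb) * ops[(pitch, "other")]

p, q = equilibrium(OPS)
print(f"pitcher throws fastball {p:.1%}, batter sits fastball {q:.1%}")
print(f"OPS after fastball:     {ops_after('fastball', q, OPS):.3f}")
print(f"OPS after non-fastball: {ops_after('other', q, OPS):.3f}")
```

Whatever payoffs you plug in, the last two lines come out equal at the equilibrium. If they differ in real data, the pitcher can gain by shifting toward the pitch with the lower OPS.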

If that's what the Kovash/Levitt study did, it would be great. But it didn't. Instead, it did something that doesn't make sense, and makes almost all its conclusions invalid.

What did it do? It considered outcomes only for pitches that ended the "at bat". (The authors say "at bat", but I think they mean "plate appearance". I'll also use "at bat" to mean "plate appearance" for consistency with the paper.)

That's a huge selective sampling issue. It means that when a pitch on a 3-0 count is a ball, you count it; when it's put in play, you count it; but when it's a strike, you don't include it. That doesn't work. I can make up some data to show you why. Suppose:

-- Fastballs are 50% put in play, for an OPS of 1.000
-- Fastballs are 50% strikes, for an OPS of .800 after the 3-1 count.

-- Non-fastballs are 25% put in play, for an OPS of .900
-- Non-fastballs are 25% strikes, for an OPS of .800 after the 3-1 count
-- Non-fastballs are 50% balls, for an OPS of 1.000.


That summarizes to:

0.900 OPS for fastball
0.925 OPS for non-fastball


Clearly, you should throw a fastball, right?

But if you consider only the last pitch of the at-bat, you have to ignore those 3-1 counts. Then you get:

1.000 OPS for fastball
0.967 OPS for non-fastball


And it looks like you should throw *fewer* fastballs, not more. And that's wrong.
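Here is the made-up example worked through in code, in case you want to play with the numbers. It computes each pitch type's OPS twice: once over every pitch thrown, and once counting only the pitches that end the plate appearance. The data structure and function names are just for illustration.

```python
# Made-up outcome tables for a pitch thrown on a 3-0 count, from the example above.
# Each entry: (outcome, probability, eventual OPS, does this pitch end the plate appearance?)
fastball = [
    ("in play", 0.50, 1.000, True),
    ("strike",  0.50, 0.800, False),   # count goes to 3-1, plate appearance continues
]
non_fastball = [
    ("in play", 0.25, 0.900, True),
    ("strike",  0.25, 0.800, False),   # count goes to 3-1, plate appearance continues
    ("ball",    0.50, 1.000, True),    # ball four
]

def true_ops(outcomes):
    """Expected OPS over every pitch thrown."""
    return sum(prob * ops for _, prob, ops, _ in outcomes)

def last_pitch_ops(outcomes):
    """Expected OPS counting only pitches that end the plate appearance."""
    ending = [(prob, ops) for _, prob, ops, ends in outcomes if ends]
    total = sum(prob for prob, _ in ending)
    return sum(prob * ops for prob, ops in ending) / total

for name, table in (("fastball", fastball), ("non-fastball", non_fastball)):
    print(f"{name:13s} all pitches: {true_ops(table):.3f}   "
          f"last pitch only: {last_pitch_ops(table):.3f}")
```

The ranking of the two pitch types flips between the two columns, and the only thing that changed is which pitches were allowed into the sample.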

This kind of thing is exactly what Kovash and Levitt have done. They think they've shown that the fastball is a worse pitch than the non-fastball. But what they've *really* shown is that the fastball is a worse pitch than the non-fastball only if you ignore the fact that if the pitch doesn't end the at-bat, the fastball is more likely to put the count more in the pitcher's favor.

So I don't think their main regression result, the one in Table 4, holds water, and I don't think there's a way for the reader to work around it. If the authors just reran that regression, but considered the outcome even if it wasn't the last pitch of the at-bat, that would fix the problem. I'm not sure why they chose not to do that.

----

Still, there are some other aspects of the study that are interesting.

In Table 2, the authors show results for every count separately. On 3-2, every pitch is the last pitch of the AB (except for foul balls, which the authors actually included in the study, but don't affect the results). Therefore, the change in count isn't a consideration, and we can take the results at close to face value.

So what happens? There is indeed a big difference between fastballs and non-fastballs:

.769 OPS after a 3-2 fastball
.651 OPS after a 3-2 non-fastball.


This would certainly lead to a conclusion that pitchers are throwing too many 3-2 fastballs, and the results stunned me: I didn't expect this big a difference. But then it occurred to me: most of the OPS on 3-2 is walks. And walks are undervalued in OPS. If a 3-2 fastball results in more balls in play, but the 3-2 curveball (or whatever) results in more walks, the actual run values might be more even. That is: pitchers know that walks are "worse" than OPS says they are, so they're willing to tolerate a higher OPS for fastballs if it contains fewer walks. That seems quite reasonable.

Suppose walks form half of OBP for fastballs, but 60% of OBP from curveballs. That's a difference of .100 in OPS due to walks. If you assume that should "really" be .140, that closes the gap from 120 points down to 80.

That adjustment is still not enough to explain the entire gap between fastballs and non-fastballs, but it's certainly part of it. In studies like this, where you're looking for very small discrepancies, and you have non-traditional proportions of offensive events, you need to use something more accurate than OPS.

----

But here's something that makes me worry, and I wonder if there's a problem with the authors' database. Here are the overall OPS values for ABs ending on that pitch, from the authors' Table 1:

.753 fastball
.620 non-fastball

Do you see the problem? This data puts the average OPS at .709 (fastballs being twice as likely as non-fastballs). But the overall major-league OPS for the years of the study (2002-2006) was around .750. Why the discrepancy? The authors do say they left out about 6% of pitches, mostly "unknown", but with a few knuckleballs and screwballs. But there's no way 6% of the data could bring a .750 OPS down to .709. So I'm thinking something's wrong here.
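The .709 is just a weighted average of the two Table 1 figures. A quick check, taking "twice as likely" literally as a two-thirds share (that exact share is an assumption):

```python
FASTBALL_OPS = 0.753
NON_FASTBALL_OPS = 0.620
FASTBALL_SHARE = 2 / 3   # "twice as likely" taken literally

blended = FASTBALL_SHARE * FASTBALL_OPS + (1 - FASTBALL_SHARE) * NON_FASTBALL_OPS
print(f"{blended:.3f}")
```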

There's no such problem with Table 2, which is broken down by count instead of pitch type. That table does average out to about .750.

UPDATE: in the comments, Guy reports that if you calculate SLG with a denominator of PA instead of AB, the numbers appear to work out OK. So the authors probably just miscalculated.

----

Finally, the authors argue that pitchers aren't randomizing enough. According to game theory, there should be no correlation between your choice of this pitch, and your choice of the next pitch. If you have a correlation, because you're choosing not to randomize properly, the opposition can pick up on that, guess pitches with more confidence, and take advantage.

Kovash and Levitt found that pitchers have negative correlation: after a fastball, they're more likely to throw a non-fastball, and vice-versa. They conclude that teams are not playing the optimal strategy, and it's costing them runs.
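The kind of correlation being described can be sketched as a lag-1 correlation on a sequence of pitches coded 1 for fastball and 0 for anything else. This is a minimal illustration on an invented sequence, not the authors' actual method, which presumably controls for count, pitcher, and so on.

```python
def lag1_correlation(pitches):
    """Pearson correlation between each pitch and the next one
    (1 = fastball, 0 = anything else)."""
    x, y = pitches[:-1], pitches[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical sequence, not real pitch data.
sequence = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(round(lag1_correlation(sequence), 3))
```

A value near zero is what perfect randomizing would look like; a negative value means the pitcher tends to switch pitch types from one pitch to the next.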

However: couldn't there be another factor making it beneficial to do that? It's conventional wisdom that, after seeing a fastball, it's harder to hit a breaking pitch, because your brain is still "tuned" to the trajectory of the fastball. If that's true -- and I think every pitcher and broadcaster would think it is, to some extent -- that would easily explain how the negative correlation observed in the study could actually be the optimal strategy. But the authors don't mention it at all.

----

So I don't think we learn much from this paper, but there's a tidbit I found interesting. Apparently Kovash and Levitt have access to MLB bigwigs, and did a little survey:

"Executives of Major League Baseball teams with whom we spoke estimated that there would be a .150 gap in OPS between a batter who knew for certain a fastball was coming versus the same batter who mistakenly thought that there was a 100 percent chance the next pitch would *not* be a fastball, but in fact was surprised and faced a fastball."


That's kind of interesting. I have no idea how accurate the estimate is ... anybody seen any other research on this topic?




Wednesday, September 23, 2009

How much does a "clubhouse cancer" cost his team?

This past weekend, the Cubs suspended outfielder Milton Bradley for the remainder of the 2009 season. Bradley had made some remarks to the press critical of the "negativity" he had received in Chicago. That, combined with his reputation as a complainer who apparently didn't get along with his teammates, prompted GM Jim Hendry to send him home for the rest of the year, with pay.

Do "clubhouse cancers" cost a team wins? In an excellent article at Baseball Analysts, Sky Andrecheck admits he doesn't know. But he looks at the anecdotal evidence of other oncoplayers to at least try to get a handle on how much a team is willing to pay to get rid of one.

This season, Bradley had accumulated 1.2 wins above replacement (WAR) up to the day of his suspension. Andrecheck suggests that he's probably a little better player than that because he's having an off-year. So for Hendry to be willing to lose Bradley's contribution, he must think that his continued presence would cost the team wins at least at that rate. Otherwise, he'd bite the bullet and keep him around.

That figure is in line with another recent disgruntled clubhouse influence, Shea Hillenbrand, who was projected as a 1.4 WAR player when he was released by the Blue Jays in 2007.

Finally, Tom Tango adds a third anecdote. He notes that no team was willing to sign Barry Bonds in 2008, even at minimum salary, when Bonds was projected to be around 1.5 WAR.

As for other "cancers": Albert Belle and Barry Bonds had poor clubhouse reputations, but weren't released by their teams. Those guys were substantially better than 1.5 WAR per season. That strongly suggests that the cost of keeping a player around is less than the cost of losing an all-star. Andrecheck writes, "I can't think of even a 3 or 4 WAR all-star caliber player ever having been given away or released largely due to clubhouse attitude. Instead, teams learn to deal with these players, rather than oust them."

So, it would appear, poisoning the clubhouse is worth somewhere between 1.5 and 3 wins a year.

That's very cool stuff. But I'm still wondering about a related subject, one that the article doesn't try to answer. My question is: just *how* does a clubhouse cancer cause the 1.5 win dropoff? It's unlikely that the personalities of the players affect their team's Runs Created or Pythagorean estimates (unless clutch play is affected more than non-clutch), so the dropoff must come in the performance of the player's teammates. How does one player's negative attitude cause another player's performance to suffer? Do the disheartened fellow players not try as hard? Are they less motivated to receive coaching, or stay in shape? Do they concentrate less on pitching strategy, maybe spending less time with the coach going over scouting reports on opposing batters?

And whatever it is, how do we gather evidence? I suppose we could check the performance records of pitchers while Bradley is on the team, and compare them to their records before he arrived and after he left. But 1.5 wins a year, with the equivalent of 18 full-time players (nine hitters and nine pitchers), is only about an 0.8 run shortfall per player. That's not much signal to find among all the noise, is it? I suppose you could check the records in the few weeks prior to the player being kicked out, on the premise that that's when the situation became most intolerable. But you might find that the situation reached the breaking point only because the team was losing, so you might mix up cause and effect.
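The 0.8 figure comes from the usual rule of thumb of about 10 runs per win, which is an assumption of the sketch below and not a measured value:

```python
WINS_LOST = 1.5
RUNS_PER_WIN = 10         # common sabermetric rule of thumb (assumed here)
FULL_TIME_PLAYERS = 18    # nine hitters plus nine pitchers

runs_per_player = WINS_LOST * RUNS_PER_WIN / FULL_TIME_PLAYERS
print(f"{runs_per_player:.2f} runs per player per season")
```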

Or maybe it's that one guy who's on the cusp of breaking out, or having a comeback season, just gets discouraged and flames out: some 23-year-old prospect winds up a little less hungry, and gives up a bit too early. That doesn't seem like it could be 1.4 wins, but I guess it's possible.

Any suggestions? I'd even be interested in hearing plausible suggestions for how the 1.4 wins (14 runs) are lost. At least if we have some reasonable hypotheses, maybe we can think of some ways to test them.

----

My suspicion, though, is that the Milton Bradleys don't actually cost their teams 1.4 wins that way. I think there are other reasons that the Cubs might have for suspending Bradley than just a sober calculation of his effect on the team's on-field performance.

First, there's deterrence. There has to be some mechanism by which teams prevent their players from going off half-cocked and ruining team chemistry. There has to be the threat, explicit or implicit, that if the player is disrespectful towards the team, he will pay a price. For most players, who want their time with the team to be as pleasant as possible, the desire to get along with their teammates might be enough incentive. But when an anti-social player crosses the line, the punishment may have to have a negative cost to the team.

For instance, suppose a world-famous surgeon commits murder. Putting him in prison might cost the hundreds of lives his skills would save over the years. But society has to jail him anyway; otherwise, they give every surgeon a license to kill.

The same thing might be happening here. Even if keeping Milton Bradley on the team would cost only a small fraction of a win, they'd have to get rid of him anyway, just to make sure the other 24 players don't get similar ideas.

Second, Bradley's presence might cost the team wins in other ways than just on the field. If the clubhouse atmosphere is poisoned, the other players are unhappy. If they are unhappy, they are less likely to want to stay on the team. And so, the Cubs would have to offer them more money to stick around as free agents. Indeed, they'd have to pay *all* free agents more money than they would otherwise. If Chicago is a crappy team to play for, but Boston is wonderful, why would anyone sign with the Cubs? (You might also get an increase in disgusted players demanding to be traded.)

If word gets out around the league that Cubs' management is not willing to enforce normal standards of civility from their players, it could cost them a lot more than 1.4 wins per year.

Third, and thinking out loud: is it not possible that while a poisonous Milton Bradley costs his team 1.4 wins a year, a poisonous player of higher ability might cost the team nothing? Whatever mechanism it is that has Bradley hurting the team on the field, there's no doubt it's because of the reaction and chemistry among the other players, right?

Now, people get upset when social norms are violated: I'm going to be more upset if you steal $20 from me than if my taxes go up $20. Is it possible that putting up with an arrogant superstar is a social norm, but putting up with a marginal player is not?

Isn't it possible that when a superstar acts like a disagreeable moron, the other players kind of shrug and accept it? If the social norm is that some superstars are a**holes, and you just have to get used to it if you want to win, then it might cause no harm at all. Where I used to work, if the manager was being a big jerk, the rest of us would talk about it over coffee, and we'd grin and bear it and get back to work. But if one of our fellow grunts was acting like an idiot, that would be different: that would upset us a lot more, because he was one of us.

Could it be the same thing happening here? When Barry Bonds was a jerk, maybe management took the players aside and said, "yeah, we know he's acting like that, but he's our best chance of winning, so try to deal with it?" That wouldn't work with Milton Bradley, and so the players are less likely to put up with it, and management would have to get rid of him.

Anyway, as I said, just thinking out loud on this one.

Finally, could it be just money? If Milton Bradley is pissing off the other players, and the fans find out, and they start booing Bradley, and the team does nothing about it ... might that not get in the way of the fans' long-term loyalty to the team? The fans are loyal and rabid. They're proud to be Cubs supporters, and many have spent their whole lives dreaming of a World Series win. Then Milton Bradley comes along, winds up in the absolute dream job of Chicago Cub outfielder, but doesn't appreciate what he's got, and starts insulting the Cubs and the fans and the tradition.

Doesn't getting rid of Bradley fulfill an obligation to those fans? Doesn't that build the brand and cement the relationship and lead to fan loyalty and revenues?

Bradley is only going to miss two weeks, and may get a chance to reform. Those two weeks are worth, what, maybe .1 wins? That's less than $1 million -- and, considering that the Cubs are out of playoff contention, it may be only a few hundred thousand. The suspension could pay for itself in no time at all.


Wednesday, September 16, 2009

You can't forecast outcomes that are random

Predictions are often wrong. In an article in the Wall Street Journal last month, "Numbers Guy" Carl Bialik points out a few that went awry. Two years ago, for instance, a government energy agency predicted that the price of oil would be between $75 and $85 in 2008. In reality, it started out the year close to $100, ran up past $140 in July, and dropped back below $40 by the end of the year. Bialik writes, "winging darts at numbers on a board might have been more accurate."

It's easy to make fun of prognosticators when they get this stuff wrong. But let's not get too hasty. The fact is, the things that are most worth predicting are things that are most unpredictable. If you want a prediction of what time the sun will rise tomorrow morning, you can get 100% accurate predictions from any competent astronomer. But what would be the point?

The price of oil varies so much because there are so many factors that influence it: wars, foreign government policies, consumer behavior, US election results, technological advances, natural disasters, and so on. These things are random. And they are very, very complex, most of them being the result of human thought and action.

Still, shouldn't some people be better skilled at making those predictions than others? Absolutely. Tancred Lidderdale, the economist quoted in Bialik's article, has an excellent understanding of the factors that impact the price of oil, much better than mine. So what's wrong with evaluating his predictions after the fact, to see if he's any good?

The problem is that no matter how much you know about the price of oil, it's random enough that the spread of outcomes is really, really wide: much wider than the effects of any knowledge you bring to the problem.

Suppose that on the basis of Miguel Tejada's career, everyone thinks he should hit .290 next year. But suppose Bob, who's a big fan of Tejada, and follows his plate appearances closely, has noticed something about his performance and thinks differently. Maybe it's some detailed observation that he swings a certain way, and other players with the same swing have declined more in their thirties than average. So Tejada should be only about .286.

That may be absolutely right, and figuring that out was an act of staggering sabermetric genius. Bob's estimate of .286 is correct, and the .290 estimates are all wrong. Bob is literally the only one in the world whose estimate is correct.

But in practice, how do you prove that? The standard deviation of batting average over 500 AB is about 20 points: so even with .286 being correct, there's still a 46% chance that Tejada will hit closer to .290 than .286 next year. There's actually about a 1 in 3 chance that Tejada's average will be below .266 or above .306. For practical purposes, it's impossible to evaluate the two predictions on this one single sample. Even if Bob is omniscient, knowing everything possible about Tejada's talent, health, and diet, it's going to take a lot of evidence to prove that he's a better estimator than the mob, so long as the results of individual at-bats are random.
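Those figures can be checked with a binomial standard deviation and a normal approximation. A rough sketch, with names chosen just for illustration:

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

TRUE_AVG = 0.286     # Bob's (correct) estimate
CROWD_AVG = 0.290    # everyone else's estimate
AT_BATS = 500

# Binomial standard deviation of a batting average over 500 AB.
sd = math.sqrt(TRUE_AVG * (1 - TRUE_AVG) / AT_BATS)

# The crowd's estimate is closer whenever the actual average lands above the midpoint, .288.
midpoint = (TRUE_AVG + CROWD_AVG) / 2
p_crowd_closer = 1 - normal_cdf(midpoint, TRUE_AVG, sd)

# Chance of landing more than 20 points away from .286 in either direction.
p_outside = normal_cdf(0.266, TRUE_AVG, sd) + (1 - normal_cdf(0.306, TRUE_AVG, sd))

print(f"SD of batting average: {sd:.4f}")
print(f"P(closer to .290 than .286): {p_crowd_closer:.1%}")
print(f"P(below .266 or above .306): {p_outside:.1%}")
```

The midpoint is only a tenth of a standard deviation above the true mean, which is why the wrong estimate still "wins" almost half the time.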

The problem is the small sample size: over 1000 predictions, or 1,000,000, Bob is going to have a better record than everyone else. But, who makes a million predictions, and who keeps track of them to evaluate them afterwards? And even if we do this a reasonable number of times, like 100, Bob still isn't assured of beating me. If his chance of beating me is 54%, then, if we predict 100 times each, I still have about a 21% chance of coming out the winner.

That is: an omniscient expert can beat a reasonably-informed layman only about 79% of the time. And that's after 100 trials each, 100 trials where the predictor actually has a significant edge in knowledge or analysis. In real life, if you get only one trial, and you're not even close to omniscient, and the prediction you're making may not be the one in which you have the most confidence, the public's expectations of you shouldn't be very high. Not because you're ignorant, but because life is just too random.
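The 100-trial calculation is a plain binomial. Here is a sketch of it; splitting 50-50 ties evenly between the two of us is an assumption about how to score a draw.

```python
from math import comb

P_BOB = 0.54    # chance Bob's prediction beats mine on any single trial
TRIALS = 100

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_i_win = sum(binom_pmf(k, TRIALS, P_BOB) for k in range(0, 50))            # Bob wins 49 or fewer
p_tie = binom_pmf(50, TRIALS, P_BOB)                                         # 50-50 split
p_bob_win = sum(binom_pmf(k, TRIALS, P_BOB) for k in range(51, TRIALS + 1))  # Bob wins 51 or more

# Ties are split evenly (an assumption).
print(f"I come out ahead:   {p_i_win + p_tie / 2:.1%}")
print(f"Bob comes out ahead: {p_bob_win + p_tie / 2:.1%}")
```

Changing `TRIALS` shows how slowly Bob's edge turns into anything close to certainty.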

Of course, this is an arbitrary example, with more randomness (20 points) than knowledge (4 points). But isn't it roughly the same situation for the price of oil? The randomness in the economy is just huge. Part of the reason oil went down last year is because of the recession. The recession happened because of the credit crisis. And very few people foresaw the credit crisis, including people who had thousands, or millions, or billions of dollars on the line. For a government economist to be omniscient, he has to be omniscient about mortgage finance, and on the government's and public's reaction to every crisis that might possibly occur. That's asking a lot, isn't it? To an energy economist, the state of mortgage finance has to be taken as random.

Because life is random, and the price of oil is very sensitive to the randomness of human-caused shocks, you can't expect a single, point estimate of the price of oil to be 95% accurate within $1, or even $5. An estimate that precise is impossible, beyond the scope of human capability, and probably beyond the scope of the most powerful computers that could be imagined. An honest and competent forecaster will tell you that the best he can do is give you a *distribution* for the future price of oil: maybe that there's a 60% chance it will be between $60 and $110, and a 10% chance it will be below $60, and maybe a 5% chance it'll go over $200 (if there's a major war in the Middle East, say), and so on. That's not something the newspapers are keen to report on -- it's hard to put in a headline, and it's harder for readers to understand.

What Mr. Lidderdale's agency was probably saying was, "we have our best guess at a probability distribution for what the price of oil will be next year. Its mean is in the $75 to $85 range." If that phrasing makes journalists uncomfortable, fine. But that doesn't change the fact that it's the best anybody can do. And it doesn't change the fact that you can't decide how good a predictor is on the basis of one, two, or even a hundred point estimates. You need a LOT of data. And if an outlier happens, all evaluations are off. I'd bet that anyone who predicted, back in 2007, that oil would jump to $140 and then drop back to $37, is a kook, not an expert. What happened in 2008 was something of an outlier, random, unpredictable, and unknowable. Anyone who came close was probably just drop-dead lucky.




Sunday, September 13, 2009

SABR journal looking for sabermetrics submissions

SABR's "Baseball Research Journal" is looking for submissions.

BRJ is a large format paperback book, published twice a year by SABR and sent to all several thousand of its members. It used to have crappy statistical articles in it -- stuff that wasn't peer reviewed, from authors who may never have read Bill James. I am happy to report that, recently, under former editor Jim Charlton, and current editor Nicholas Frankovich, the quality is much higher. I may be biased, because they've run a few articles of mine, but it really is getting a lot better. BRJ is also the place where Bill first ran his "Underestimating the Fog" article (pdf).

But Nick Frankovich is getting more aggressive about pursuing even better stuff, and he asked me to post this bleg. SABR needs your research, and he's asking you to consider submitting an article to BRJ.

It doesn't matter if you're a member of SABR or not. It doesn't matter if you've already published your research on a website. All that matters is if it's a good article, suitable for readers who may not know a whole lot of sabermetrics. That doesn't necessarily mean it has to be dumbed down; it does mean you may have to explain all of your acronyms and start at the beginning rather than the middle.

Nick is especially interested in articles that explain the current state of a topic in sabermetrics. He (actually, someone in SABR) suggested an article summarizing the current state of the DIPS theory, which I think would be a very good idea. I've always been looking for articles that explain something in sabermetrics from the bottom up, because that way I have somewhere to refer people who contact me or submit articles to "By the Numbers". DIPS would be a very good candidate.

Anyway, any reasonable topic will do, and any submission would be appreciated. If you're accepted, you don't get paid, but you get three copies of the book, and you get full rights to do whatever you want with the article afterwards (although you grant SABR the right to use it too). You also help improve the quality of the sabermetric research in SABR, which, perhaps surprisingly, is something that's really needed.

You can contact Nick at frankovich@sabr.org. Or, feel free to e-mail me with any questions.


Monday, September 07, 2009

Matt Swartz on home field advantage

Baseball Prospectus's Matt Swartz has completed a nice five-part series on home-field advantage (HFA) in major league baseball. I've always thought HFA was one of the biggest unresolved issues in sabermetrics. So does Swartz, and he said it better than I could:

"[HFA] should surprise us as analysts more than it does. Nearly every study of psychology with respect to baseball has come up revealing either small effects or no effect. We all know that players are human, but the numbers do not seem to indicate many obvious psychological aspects. Hundreds of researchers have tried to discover clutch hitting, but few have found any evidence of its being a repeatable skill. ... We have attempted all kinds of ways to splice the data to reveal a large psychological effect within baseball to show that baseball players don’t behave like statistical models, and there seems to be little evidence of any strong, detectable effects, even if we know they exist and occasionally can discover smaller ones. ...

"However, home-field advantage is perhaps the most obvious area where we see something resembling a psychological effect, or at least an effect that is not captured by our typical models of baseball players and ballgames. It is clear that something about being the home team trumps talent in a way that is mathematically equivalent to benching an average player on the road team."


Swartz proceeds to look at various aspects of HFA. Many of the findings are unremarkable, but there are a couple that are kind of interesting.

First, let me quickly summarize the other stuff that Matt found in each of his five parts.

Part 1: HFA has been very steady over the decades, at around 40 points (.540 to .460). It shows up in almost every statistical category for hitters and pitchers, except those related to errors.

Part 2: There doesn't seem to be a team-specific HFA, except for the Rockies, whose HFA is an outlier and much higher than most.

Part 3: There appears to be a "familiarity" effect. HFA is highest for interleague games, next highest for games between teams in different divisions, and lowest for intradivisional games (where presumably the teams face each other most often). Also, the farther apart the teams, the higher the HFA.

Part 4: The second-last game of a series seems to have a larger HFA than any other game. This apparently only holds for teams who are geographically close together. Lots of other breakdowns show no significant effect.

Part 5: Individual players do appear to show stable HFAs from year to year, suggesting that they can be more or less suited to their home park.

Most of this is roughly in line with what we knew already. But here's the thing I found most interesting: a lot more of HFA comes in the first three innings than in the rest. Here's Swartz's chart; for each inning, the percentages are the difference in runs scored for the home team vs. the visiting team:

1 16.2%
2 9.3%
3 10.1%
4 6.0%
5 7.8%
6 8.1%
7 8.7%
8 6.5%

The overall difference appears to be about 8%. By Pythagoras, if a team scores 8% more runs than their opponents, they'll win a little over 16% more games, which works out to about a .540 winning percentage, exactly as observed (.540 divided by .460 equals 1.17). But the first inning number is huge! If the home team outscored the visiting team by 16.2% overall, its winning percentage would be .575 (Pythagoras with exponent 2).
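For anyone who wants to check that arithmetic, here is a minimal sketch of the Pythagorean calculation. The function name is mine, and the exponent of 2 is the one the paragraph above uses; nothing here comes from Swartz's own code.

```python
def pythagorean_win_pct(run_ratio: float, exponent: float = 2.0) -> float:
    """Expected winning percentage for a team that scores `run_ratio`
    times as many runs as its opponents (Pythagorean formula)."""
    powered = run_ratio ** exponent
    return powered / (powered + 1.0)


# Home team outscoring the visitors by 8% over the whole game.
overall = pythagorean_win_pct(1.08)
# Hypothetical: the first-inning edge of 16.2% applied to the whole game.
first_inning_rate = pythagorean_win_pct(1.162)

print(f"8% run edge:    {overall:.3f}")
print(f"16.2% run edge: {first_inning_rate:.3f}")
```

The two results land at roughly .538 and .575, which is where the .540 and .575 figures above come from.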

What could cause this? It could just be that the first inning is higher-scoring overall, and the difference isn't linear. But the difference is still huge. Could this be a real finding, that HFA diminishes later in the game? If it's a question of familiarity, that might make sense, except for one thing: why would the visiting team be less familiar with the park in the first inning of Game 3 than in the eighth inning of Game 2?

Still, this is something I haven't seen before, and I wonder if you'd find the same thing if you looked at other sports.

---

One thing that might be good is to break down HFA into its component parts. The articles show us that HFA appears in almost every statistical category, but the categories overlap. For instance, the home team strikes out less and walks more. This indicates that the visiting pitchers are throwing fewer strikes and more balls. Is that enough to be the entire effect? That is, if the road pitchers are getting behind in the count, the batters will do better, even if batting skill is completely unaffected by HFA. On 2-0, the batters will be seeing juicier pitches, and that alone could account for their extra doubles, triples, and home runs.

Does it? To find out, you'd want to compare batting lines based on count (and controlling for pitcher, if you really wanted to be thorough). As it stands now, we still don't really know where HFA comes from, whether it's evenly balanced between batter and pitcher, or what.

The home team scores, on average, about 0.4 runs per game more than the visiting team. Using Swartz's numbers and assuming 40 PA per game per team, the home team gets about 0.4 fewer strikeouts and 0.25 more walks. That adds up to about .18 runs. That's almost half the entire effect. Is it possible that just the different (favorable) counts account for the home team's remaining .22 run advantage? Seems possible to me.

Or, looking at it another way: a study I did a few years ago (.pdf, page 4) came up with the figure that turning a ball into a strike is worth about .14 runs. That's a three pitch per game difference between the two teams. Would a three pitch difference (three extra strikes and three fewer balls) be consistent with 0.4 extra strikeouts and 0.25 fewer walks? I don't know, but you could try looking at it that way.
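Here's the back-of-envelope version of those two paragraphs as a quick sketch. All three inputs are the figures quoted above; the variable names are just mine.

```python
HOME_RUN_EDGE = 0.40            # runs per game, home minus visitor
RUNS_FROM_K_AND_BB = 0.18       # value of the strikeout and walk differences
RUNS_PER_BALL_TO_STRIKE = 0.14  # value of turning one ball into a strike

# What's left of the home edge after strikeouts and walks are accounted for.
remaining_edge = HOME_RUN_EDGE - RUNS_FROM_K_AND_BB

# How many ball-to-strike flips per game would cover the whole edge.
pitches_needed = HOME_RUN_EDGE / RUNS_PER_BALL_TO_STRIKE

print(f"unexplained edge: {remaining_edge:.2f} runs per game")
print(f"ball-to-strike flips to cover the full edge: {pitches_needed:.1f}")
```

The second number comes out just under three, which is the "three pitch per game" figure.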

If you went about it that way, you might wind up with a breakdown of HFA something like:

30% pitchers throwing more strikes
15% batters putting the ball in play more often
10% batters hitting a different LD/GB/FB mix
20% higher BABIP on a given type of ball in play
15% more HRs

I'm making these numbers up, of course. And for some of this stuff, you wouldn't be able to tell if it was the pitcher or the hitter; for instance, fewer strikes might just mean that the batter makes contact better, as opposed to the pitcher improving. And for a higher BABIP (which Swartz found), is it the hitters doing better, or the defense doing worse? We don't know. But still, a breakdown like that would be a start.

---

Another thing I'd like to see is just raw performance data. Do pitchers throw harder at home than on the road? Do their pitches have more break or movement, all else being equal? That might be hard to study, because all else is never equal, and Pitch F/X recorders might be different at different parks. Although, if the Braves' pitchers show 2 MPH more than their opponents at home, but 1 MPH less on the road ... that does indeed tell you something, although the caliber of the opposition might not even out in your two samples.

My guess is that you'd find that HFA goes right down to the most base level imaginable: the home team would have higher bat speeds and pitch velocities. Their players would run faster at home, and they'd have faster reaction times. I suspect that HFA is something universal, and both psychological and physiological. I'd bet that within a few years, evolutionary psychologists will be studying this stuff and have some theories about how we evolved to be physically more competent in familiar surroundings.

But I'm just guessing.




Tuesday, September 01, 2009

Re-estimating an NHL team's Picasso value

Sports franchises are different from "regular" businesses in one important way -- they're a lot more fun. If you own a team, you get any profit it makes, but you get lots of perks in addition to that. You get to be on TV a lot. You get the best seat in the house. You get to hire and fire staff. You get quoted in the paper any time you want. You get to be a hero in your local community. And so on.

Because of this, you'd expect team owners to be willing to pay more for a team than its future earnings are worth; they want the "consumption value" in addition to the investment value. You can call this the "Picasso effect," because owning an expensive sports team is a bit like owning an expensive painting; you do it partly for the pride of ownership.

In the previous post, I tried to estimate the Picasso value this way: I ran a regression to predict team market value from team earnings (both values as estimated by Forbes). The equation came out

Market Value = 4 * annual earnings + $200 million

From that, I suggested that Picasso value was $200 million: that is, since the $200MM term didn't have anything to do with the success of the business, it must be the value that owners are willing to pay just to own the team.

But, following a post by Dackle over at "The Book" blog, I realized that isn't quite right.

The problem is that team value -- at least that portion that has to do with earnings -- is based on *future* prospects. And future prospects don't correlate 100% with current prospects. In effect, some of today's earnings is random noise -- the economy might be good in that particular city, or a promotion works well, or the team is just having a good year.

The more random noise, the higher the Picasso estimate. For instance, suppose that profits were completely random, and had nothing to do with any particular attribute of the team. Then, all teams would be valued equally, and the equation would be

Market Value = 0 * annual earnings + $220 million

And it would look like the entire value was Picasso, when, in reality, it could be that the value is driven entirely by earnings.
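To see the effect in action, here is a small simulation sketch with entirely made-up numbers: a hypothetical league where team value is exactly 40 times true earnings (so zero Picasso value), and where observed earnings are true earnings plus an equal dose of one-season noise. The league size, the multiple, and the spreads are all assumptions chosen to make the effect easy to see, not anything from the Forbes data.

```python
import random


def ols(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(2009)
N_TEAMS = 5000    # hypothetical; far more than 30, to keep sampling wobble small
MULTIPLE = 40.0   # hypothetical: value = 40 x true earnings, no Picasso value

true_earnings = [rng.gauss(5.0, 10.0) for _ in range(N_TEAMS)]
values = [MULTIPLE * e for e in true_earnings]
# Observed earnings = true earnings + noise with the same spread.
observed = [e + rng.gauss(0.0, 10.0) for e in true_earnings]

clean_slope, clean_intercept = ols(true_earnings, values)
noisy_slope, noisy_intercept = ols(observed, values)

print(f"true earnings:     value = {clean_slope:.1f} * earnings + {clean_intercept:.1f}")
print(f"observed earnings: value = {noisy_slope:.1f} * earnings + {noisy_intercept:.1f}")
```

With noise equal in size to the real differences, the fitted multiple should drop to about half its true value, and a large constant term appears out of nowhere. That constant is exactly the kind of thing I was calling Picasso value.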

So to do the calculation right, you have to remove the noise from the earnings.

To try to figure out how to do that, I started by running a regression on Forbes 2008 earnings vs. 2007 earnings. If earnings were completely random, the correlation coefficient would be zero. Of course, it wasn't zero; the Leafs were profitable not because they were lucky that year, but because there are millions of loyal idiots like me who worship the team even though it continues to suck. The correlation coefficient was actually a very high .93. I'll put that in courier font:

One-year earnings correlation coefficient = .93

An r of .93 doesn't suggest a lot of noise, so it won't change things much. But maybe the .93 is still too high. Remember, the economic value of the team is the present value of *all* future earnings, not just next year. And earnings might change more in future years. For instance, between one year and the next, team performance is usually similar. Good teams stay good teams, and poor teams stay poor teams. Maybe that all evens out after, say, five years.

If we take .93 to the fifth power, in effect "compounding" the regression to the mean, we get about .70. This seems reasonably generous to me; a correlation of .7 is an r-squared of .5, which implies that the "fixed" component of a team's earnings has the same variance as the "variable" component.

That means that to get a team's "true" earnings from its 2008 earnings, we regress the number 30% towards the mean. To take one example: the Rangers had earnings of $30.7MM in 2008. The mean is $4.7MM. Regressing $30.7MM thirty percent towards $4.7MM gives $22.9 MM. So we assume that the expected value of the Rangers' "real" earnings was $22.9MM, and the remaining $7.8MM was due to random factors specific to that season.
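A short sketch of those two steps, using the numbers above. The function name is mine, and treating five years of drift as .93 multiplied by itself five times is the same rough assumption made in the text.

```python
ONE_YEAR_R = 0.93
YEARS = 5

# "Compounding" the one-year correlation over five years.
five_year_r = ONE_YEAR_R ** YEARS


def regress_to_mean(observed: float, mean: float, r: float) -> float:
    """Keep a fraction r of the distance from the mean; give back the rest."""
    return mean + r * (observed - mean)


LEAGUE_MEAN_MM = 4.7     # mean 2008 earnings, $MM
RANGERS_2008_MM = 30.7   # Rangers' 2008 earnings, $MM

# The text rounds the five-year r to .70, i.e. 30% regression to the mean.
rangers_true = regress_to_mean(RANGERS_2008_MM, LEAGUE_MEAN_MM, 0.70)
rangers_noise = RANGERS_2008_MM - rangers_true

print(f"five-year r:        {five_year_r:.3f}")
print(f"regressed earnings: ${rangers_true:.1f}MM")
print(f"attributed to luck: ${rangers_noise:.1f}MM")
```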

If we do that for all 30 teams, and rerun the analysis using our regressed estimates of earnings, we now get

Market Value = 5.6 * annual earnings + $193 million

Not much different ... but better, I think. I'm more comfortable with a higher earnings multiple (5.6, in this case, rather than 4.0), since, for publicly traded securities, ratios (I think) tend to range between 7 and 11.

So this reduces our estimate of "Picasso value" from $200 million to $193 million. Not much. And it's easy to see why not much: according to the Forbes data, the money-losing teams are worth an average of about $160 million. If you believe these teams will continue to lose money, then, obviously, the Picasso value must be at least $160MM, since they're worth zero as a going concern.

I believe some of the $193 million is Picasso value, and some of it is hopes that the team will eventually be profitable: either by moving it to a city where it can make money, or by making more money in other ways (like a better TV deal).

Anyway, getting back to the Zimbalist/Balsillie question of how much more a team is worth in Hamilton ... if we run the revised numbers, we get an even bigger difference -- which makes sense, since the more profits matter, the more a team is worth in a money-making city as compared to a money-losing city.

The regressed estimate for Phoenix earnings is a loss of $5.4MM. For Hamilton, we continue to use Balsillie's own estimate of $11MM (we don't regress that since it's an estimate and not an actual observation).

That means, by this method,

$163MM market value for Phoenix
$255MM market value for Hamilton

Still, about the same as in the previous analysis. The benefit to the move is around $90MM, and three-quarters of the value of the Hamilton franchise is Picasso value.
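The numbers above can be reproduced with a few lines. This is only a sketch of the revised equation applied to the two earnings figures in the text; the function name is my own.

```python
def market_value_mm(earnings_mm: float) -> float:
    """Revised regression: value = 5.6 * regressed earnings + $193MM."""
    return 5.6 * earnings_mm + 193.0


PICASSO_MM = 193.0                 # the constant term in the equation
phoenix = market_value_mm(-5.4)    # regressed Phoenix earnings (a loss)
hamilton = market_value_mm(11.0)   # Balsillie's own Hamilton estimate

gain_from_move = hamilton - phoenix
picasso_share = PICASSO_MM / hamilton

print(f"Phoenix:  ${phoenix:.0f}MM")
print(f"Hamilton: ${hamilton:.0f}MM")
print(f"Gain from moving: ${gain_from_move:.0f}MM")
print(f"Picasso share of Hamilton value: {picasso_share:.0%}")
```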

(Thanks again to Dackle for the comment that led to this post.)


Monday, August 31, 2009

Is a Hamilton NHL team worth as little as Andrew Zimbalist thinks?

I've just finished reading Andrew Zimbalist's sworn submission to the courts with regard to the Phoenix Coyotes situation (hat tip to blogger James Mirtle for posting the entire legal document), and there's lots in it I don't agree with. It could be that I don't understand the antitrust or economics issues, which, of course, are Zimbalist's specialties. I'll list my issues and maybe someone can explain.

First, a quick summary of the situation, as I understand it.

The Phoenix Coyotes are bankrupt. Jim Balsillie wants to buy the team and move it to Hamilton, Ontario, a hockey-mad city 45 miles from Toronto. The NHL doesn't like the idea. First, it believes that it, not the courts, has the right to decide where a team plays. Second, it seems to want to protect the Toronto Maple Leafs from competition. And, third, it doesn't like Balsillie, who is being combative with the NHL rather than cooperating with it.

Zimbalist's written testimony, written at the request of the Balsillie team, argues that

(a) a franchise in Hamilton is worth only $12 million more than if the bankrupt franchise was left in Phoenix, at $175 million versus $163 million;

(b) the effect on the Toronto Maple Leafs would be minimal;

(c) the price Balsillie is offering to pay for the team, $212 million, is therefore more than the team is worth, and the difference is "Picasso value," the price Balsillie is willing to pay for the consumption pleasure of owning the team;

(d) the Hamilton expansion opportunity does not "belong" to the NHL.

Maybe there's something about the economics I don't understand, but I don't see it the same way. I'll deal with (b) and (d) in a future post, but for now, let me concentrate on (a) and (c). I think the team is worth substantially more than $175 million, and I think the "Picasso value" is huge, much more than the $37 million that Zimbalist thinks it is.

First, doesn't it seem strange that a hockey team in Hamilton, so close to the best hockey market in the world, would be worth only 7 percent more than the same, bankrupt team in a non-hockey market in the desert? The way Zimbalist gets his numbers is to multiply gross revenue by 2.4. That's based on Forbes Magazine's estimates of team revenue and market value (Zimbalist doesn't justify the 2.4 figure separately).

That seems strange to me, valuing a team by its revenues rather than its profits. It would kind of make sense in comparing "normal" businesses, companies of different sizes in the same industry. Suppose you have two widget manufacturers; Acme sells $10 million worth of widgets a year, and Consolidated sells $100 million worth. You'd expect Consolidated to be worth about 10 times as much as Acme. After all, Consolidated probably has 10 times as many employees, and 10 times as many machines, and 10 times the bill for raw materials, and 10 times the shipping costs, and so on. All else being equal, Consolidated should make 10 times the profit.

But that's not the case in the NHL. With the salary cap, you could argue that team expenses are roughly the same, whether the team is in Glendale or Hamilton. Most of the expense is salaries, and those are now fixed in the range of $41 to $57 million. Forbes has the Coyotes at revenue of $68 million, meaning that if they paid $50 million in player salaries, that would leave only $18 million for other expenses and profit. On the other hand, a team like Vancouver, with $107 million in revenue, has $57 million left for other expenses and profit.

For both teams, it looks like those "other expenses" are around $30 million: because Forbes has Vancouver turning a profit of $19 million, whereas Phoenix *lost* $10 million. Vancouver is a profitable enterprise, whereas Phoenix would struggle just to break even. Profits are much less proportional to revenues in hockey than they are in a "regular" business. So why use revenues as your measure?
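Spelled out as a quick sketch, using the Forbes revenue and profit figures quoted above and the same round $50 million payroll assumption for both teams (the payroll figure is an assumption, not a reported number):

```python
ASSUMED_PAYROLL_MM = 50.0  # round payroll figure assumed for both teams

# (revenue, profit) in $MM, as quoted from Forbes in the text.
teams = {
    "Coyotes": (68.0, -10.0),
    "Canucks": (107.0, 19.0),
}

breakdown = {}
for name, (revenue, profit) in teams.items():
    left_after_payroll = revenue - ASSUMED_PAYROLL_MM
    # Whatever is left after payroll and isn't profit must be other expenses.
    other_expenses = left_after_payroll - profit
    breakdown[name] = (left_after_payroll, other_expenses)
    print(f"{name}: ${left_after_payroll:.0f}MM after payroll, "
          f"${other_expenses:.0f}MM implied other expenses")
```

The implied "other expenses" come out in the same general neighborhood for both teams, even though one has about 57% more revenue than the other.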

As a verification, I ran a regression to predict team value based on revenues. The results:

Market Value = 3.2 * annual revenue - $73 million

or, rephrased,

Market Value = 3.2 * (annual revenue - $22.8 million)

The correlation coefficient was .965.

So the value of a team isn't a multiple of revenues: it's a multiple of revenues *above $22.8 million*. Suppose the Coyotes make $68 million revenue, and the Hamiltons twice that. Hamilton won't be worth twice Phoenix, then: it'll be worth two-and-a-half times. Apparently you need at least $22.8 million in revenue to make the team desirable even at $0. Only revenue after that translates into market value.

If you look at Forbes' chart, you can see that: the top 6 teams have a little less than twice the revenues of the Coyotes: and they're worth a little less than 250% as much.

Anyway, if you use this formula for the Coyotes instead of just 2.4 times revenue, you get $145 million, not $163 million. That makes sense, since the regression was based on Forbes data, which values the Coyotes at $142 million. However, Zimbalist did consider subsidies from the city of Glendale, which might make up part of the difference.

As for Zimbalist's Hamilton estimate ... well, he takes Balsillie's own estimate, which assumes revenues would be $73 million. That, to me, seems *way* too low. It would put Hamilton last among the other Canadian teams:

$160MM Leafs
$139MM Habs
$107MM Canucks
$ 96MM Senators
$ 97MM Flames
$ 85MM Oilers

I think an estimate of $100 million would be much more appropriate, given the size of the market. Based on the results of the regression, that would make the new Hamilton franchise worth $247 million -- not $175 million.
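Here is a sketch of the revenue regression applied to the cases discussed above. The coefficients are the rounded ones from the equation; the function name is mine.

```python
def value_from_revenue_mm(revenue_mm: float) -> float:
    """Regression above: market value = 3.2 * annual revenue - $73MM."""
    return 3.2 * revenue_mm - 73.0


# Revenue at which the fitted value hits zero.
zero_value_revenue = 73.0 / 3.2

coyotes = value_from_revenue_mm(68.0)
double_coyotes = value_from_revenue_mm(2 * 68.0)  # hypothetical team with twice the revenue
hamilton_at_100 = value_from_revenue_mm(100.0)    # my $100MM revenue guess

print(f"zero-value revenue:    ${zero_value_revenue:.1f}MM")
print(f"Coyotes at $68MM:      ${coyotes:.0f}MM")
print(f"twice the revenue:     {double_coyotes / coyotes:.2f} times the value")
print(f"Hamilton at $100MM:    ${hamilton_at_100:.0f}MM")
```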


---------

Now, Zimbalist also calculates team value another way, a better way: by estimating actual future profits, and calculating their present value. He doesn't do that for Phoenix, which I think is because he can't predict the Coyotes to ever make a profit in the future (in which case, shouldn't its value by this method be zero? But I digress). However, he does it for the proposed Hamilton franchise. Here's how: he starts with Balsillie's own projections of earnings the first five years of the franchise. Then, he assumes earnings will grow steadily for the next 25 years. He then discounts all 30 years' profit into today's dollars.

Zimbalist performs this calculation for five different 25-year growth rates (from 3 to 7 percent) and for three different discount rates (from 8 to 12 percent). He winds up with a franchise value ranging from $70 million to $177 million, with a typical value of $150 million.

This is all quite reasonable, although you have to keep in mind that Balsillie is probably being very conservative in his earnings projections in order to keep his price down. Still, it doesn't seem like this is how other teams are valued, probably because of the "Picasso factor." Looking at the Forbes chart, the market value of teams is much, much flatter than their earnings. The top three teams (Leafs, Rangers, Habs) make an average of about $45 million a year, and their market value is about $400 million -- an earnings/price ratio of about 11%. But the teams in the middle, who look like they make an average of about $3 million a year, are worth about $200 million -- an earnings/price ratio of about 1.5%. And the teams at the bottom are all losing money -- but their market values are still around $160 million.

Why are the values so flat relative to profits, where a team that makes $1 million a year is worth almost half as much as a team that makes $40 million a year? It could be the Picasso effect. I ran a regression to predict market value based on earnings. The results, rounded:

Market value = $200 million + 4 times annual earnings

The correlation coefficient? 0.88. Not as high as for revenues, but still huge.

What that tells us is that, regardless of earnings, there's a value of $200 million dollars to owning a team, even if it only breaks even every year. That might be Picasso value. Or, it might partially reflect the value of the right to move the team if it starts losing money. It might reflect the fact that owners think that earnings will jump soon -- maybe they think a new TV contract will someday be worth a present value of $30 million each, and that's part of the $200 million. But I think it's consumption value, Picasso value.

$200 million does seem reasonable in terms of consumption value. At today's low interest rates of (say) 4%, the opportunity cost of locking up $200 million is only $8 million a year. Most of these owners are billionaires -- what's a tiny $8 million a year? Jim Balsillie's own willingness to pay is no doubt much more than $8 million. He likes publicity. He's making much of the fact that he wants to bring the NHL to more Canadian cities, making him something of a hero in some circles. He might have some ambitions beyond NHL owner, ambitions which being in the limelight will further.

Using the regression results puts the Coyotes at $161 million, which is about where Zimbalist has them in his revenue model (he can't use the earnings model because the Coyotes have negative earnings).

So, let's say we use this same regression equation to value the Hamilton team. Balsillie claims that five years from now, the team will be making $11 million. That means it'll be worth about $244 million then. Discounting that to today's dollars, at 4%, gives $209 million today. Adding in $35 million of Picasso value ($40 million discounted) for the next five years takes us back to $244 million.

And, again, that's conservative because it uses Balsillie's own estimates of his profits. Here are the earnings of the six Canadian teams last year, according to Forbes:

$66.4 million (Leafs)
$39.6 million (Canadiens)
$19.2 million (Canucks)
$ 4.7 million (Senators)
$ 7.4 million (Flames)
$11.8 million (Oilers)

Judging by this, I'd say that, for a Hamilton franchise, $11 million five years from now is pretty conservative. Even so, the Picasso value drives so much of franchise valuation that it doesn't matter much: even if the Hamiltons made as much as the Canucks, it would only raise the franchise value from $244 million to $276 million.
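As a sketch, here is the earnings regression applied to those two scenarios. It uses the rounded coefficients ($200 million plus 4 times earnings), so the results can differ by a million or so from figures computed with the unrounded ones.

```python
def value_from_earnings_mm(earnings_mm: float) -> float:
    """Rounded regression: market value = $200MM + 4 * annual earnings."""
    return 200.0 + 4.0 * earnings_mm


hamilton_balsillie = value_from_earnings_mm(11.0)     # Balsillie's year-five estimate
hamilton_like_canucks = value_from_earnings_mm(19.2)  # if Hamilton earned what Vancouver did

print(f"Hamilton at $11.0MM earnings: ${hamilton_balsillie:.0f}MM")
print(f"Hamilton at $19.2MM earnings: ${hamilton_like_canucks:.0f}MM")
print(f"Difference: ${hamilton_like_canucks - hamilton_balsillie:.0f}MM")
```

Nearly doubling the earnings moves the value by only about $33 million, because the $200 million constant dominates.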

So I think the team in Hamilton is worth about $250 million. Not only is this substantially higher than its worth in Phoenix, but it's even more than Balsillie has offered. So I bet Balsillie is willing to spend a whole lot more than his $212 million offer, if necessary, to achieve his dream of a team in Hamilton.

------

So, a summary of our respective market value estimates:

Zimbalist thinks:

-- Phoenix $163MM by revenues
-- Hamilton $175MM by revenues (based on $73MM in revenues)
-- Hamilton $150MM by earnings

I think:

-- Phoenix $145MM by revenues, plus government subsidies
-- Phoenix $161MM by earnings
-- Hamilton $247MM by revenues (based on $100MM in revenues)
-- Hamilton $250MM by earnings

Zimbalist thinks the difference between Phoenix and Hamilton is maybe $12 million at most. I think the difference is close to $100 million.

Am I missing something?



Saturday, August 22, 2009

Changing my mind on "The Book" and clutch hitting

My last post talked about the clutch study in "The Book." It turns out that study was written by Andy Dolphin, who responds in a comment at "The Book" blog here, as does co-author Mitchel Lichtman (mgl). The comments are definitely worth reading.

I had two arguments, one about statistical significance, and one about walks. To summarize them (perhaps more clearly than in the original post):

-- previous studies found no evidence of clutch hitting talent.
-- Andy's study found evidence of clutch hitting (OBA) talent with variance .008.
-- The .008 is statistically significant only at p=.14 (14% rather than the traditional 5%). It therefore constitutes fairly weak evidence.
-- Combine that weak evidence of .008 with the previous studies that found zero, and there's still a fair bit of doubt on whether clutch hitting exists.

And also:

-- if you include intentional walks, it seems obvious that the best hitters will appear to be "clutch".
-- there is such a thing as a "semi-intentional walk".
-- generally, the players who receive IBBs will be the same ones who receive "semi" IBBs.
-- so it seems like the best hitters will appear to be "clutch" just because of those semi-intentional walks.
-- but extra semi-intentional walks is not what "clutch hitting" traditionally means;
-- and so Andy's study may not answer the same question that's being asked.

To clarify: I have no objection to anything in Andy's study itself, just as to the conclusions you can draw from its results.

Anyway, I've changed my mind; I now think that we can draw firmer conclusions from Andy's study, and lean towards his result that clutch hitting exists. I stand by my original logic; but I did another simulation, and my view of the facts has changed.

Specifically: I no longer believe that the previous studies necessarily found evidence of zero clutch hitting. I thought they did, but, on further examination, I think the Tom Ruane study gives results that are perfectly consistent with what Andy found: clutch hitting variance of .008 points of OBA.

Here's what Tom did. He found 727 players who met his cutoff for plate appearances. For each player, he computed the difference between the player's "clutch" BPS (batting average plus slugging) and his "non-clutch" BPS. Then, he broke the 727 differences into categories -- 0 to 15 points clutch, 45-60 points choke, and so on.

Then, he did the same thing, but, instead of using "clutch" and "non-clutch" AB, he divided the AB in each group randomly. And so, if there is no such thing as clutch talent -- if clutch hitting is, in effect, random -- the two groups should break down exactly the same.

And they pretty much did. Here, from Tom's study, are the two groups:

-J -I -H -G -F -E -D -C -B -A A B C D E F G H I J
Real 1 3 6 5 15 21 46 76 77 115 105 88 69 41 31 13 10 1 3 1
Fake 1 2 3 7 14 26 45 70 94 109 109 92 68 45 26 14 7 3 2 1


They're very, very close. It's hard to tell because of the columns not lining up, so let me leave out the middle columns and make things easier to read:

      -J  -I  -H  -G  ...   G   H   I   J
Real   1   3   6   5  ...  10   1   3   1
Fake   1   2   3   7  ...   7   3   2   1



Taking groups G to J, which comprise players who were at least 105 points better in the clutch, we see that there were 15 in real life, and 13 random. On the choke side, groups -G to -J, there were again 15 in real life, and 13 random.
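Those tail counts are easy to verify from the two rows of Tom's table. A quick sketch, with the counts typed in from the table above and the bucket labels running from -J (most choke) to J (most clutch):

```python
labels = ["-J", "-I", "-H", "-G", "-F", "-E", "-D", "-C", "-B", "-A",
          "A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
real = [1, 3, 6, 5, 15, 21, 46, 76, 77, 115,
        105, 88, 69, 41, 31, 13, 10, 1, 3, 1]
fake = [1, 2, 3, 7, 14, 26, 45, 70, 94, 109,
        109, 92, 68, 45, 26, 14, 7, 3, 2, 1]

# The four most extreme buckets on each side: -J..-G and G..J.
choke_real, choke_fake = sum(real[:4]), sum(fake[:4])
clutch_real, clutch_fake = sum(real[-4:]), sum(fake[-4:])

print(f"choke tail  ({labels[0]} to {labels[3]}): real {choke_real}, fake {choke_fake}")
print(f"clutch tail ({labels[-4]} to {labels[-1]}): real {clutch_real}, fake {clutch_fake}")
```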


Here's where I made my wrong assumption: I figured that if there were any real difference between the two groups, even a small one, we'd see a much larger dispersion in the "real" row. I thought we'd see a lot more extreme values -- more than a ratio of 15:13.

I was wrong. I ran a simulation, where I ran a "fake" row, then added an extra variance of .006 points (which is what Andy found for wOBA) to simulate what a "fake" row would look like if Andy's number were real. The results were indistinguishable. Indeed, I think you could add a lot more than .006 and still not be able to see any difference in the two rows. There is just so much randomness there that any difference in talent gets washed out in this kind of comparison.
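This isn't the simulation I actually ran, but here is a minimal sketch of the same idea, under assumptions I've picked for illustration: every player has the same .340 base OBA, 300 clutch and 1,500 non-clutch plate appearances, and each plate appearance is a simple on-base/out coin flip. It uses the .008 OBA figure rather than the .006 wOBA figure, because a yes/no outcome is an OBA-style measure. It compares the spread of clutch-minus-non-clutch gaps with and without real clutch talent.

```python
import random
import statistics


def clutch_gap_spread(talent_sd, n_players=727, pa_clutch=300,
                      pa_other=1500, base_oba=0.340, seed=1):
    """SD across players of (clutch OBA - non-clutch OBA).

    All defaults except n_players are illustrative assumptions.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_players):
        # Each player's true clutch OBA is shifted by his clutch "talent".
        clutch_oba = base_oba + rng.gauss(0.0, talent_sd)
        on_base_clutch = sum(rng.random() < clutch_oba for _ in range(pa_clutch))
        on_base_other = sum(rng.random() < base_oba for _ in range(pa_other))
        gaps.append(on_base_clutch / pa_clutch - on_base_other / pa_other)
    return statistics.pstdev(gaps)


no_talent = clutch_gap_spread(talent_sd=0.0)
with_talent = clutch_gap_spread(talent_sd=0.008)

print(f"spread with no clutch talent:  {no_talent:.4f}")
print(f"spread with .008 clutch talent: {with_talent:.4f}")
```

Under these assumptions, binomial luck alone gives a spread of about 30 points, and adding 8 points of real talent only nudges that to about 31, since the variances add rather than the standard deviations. That's the washing-out I'm describing.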

Also, Tom's results are consistent with my simulation of Andy's result. My simulation found a p-value of .14. Tom's study found that the "real" data were at the 11th percentile of the distribution of "fake" data -- a p of .11. So it seems that Tom and Andy are consistent with each other. That makes sense, because some of the data they used overlapped. Also, Tom's data didn't include walks, which calls into question my argument that the walks might be causing a large proportion of the effect.

So what we have is now:

-- Andy found an effect of .008 of OBA;
-- That's completely consistent with Tom Ruane's study;
-- I think it's also consistent with other studies I've seen;
-- So maybe .008 should indeed be our best estimate of the variance of clutch talent, given all the available evidence.

I have to say, though, that I'm still not completely satisfied about the walks thing. In his reply, Andy said he checked the results without including walks, and he got approximately the same result. I guess this should satisfy me, but I'm still a bit dubious, perhaps irrationally. I wouldn't mind if, as commenter Guy suggested, we checked to see whether the clutch hitters also tended to be the better hitters (or the guys who get the most IBBs). That would help me feel better about the walk issue.

Guy also points out that Andy's .008 result doesn't actually represent the variance of talent alone -- rather, it represents all the variance other than luck. The implicit assumption in Andy's study is that it's all talent; but some of it might be other factors: park, non-random distribution of pitchers, etc. Guy suggests running the same study, but dividing the AB by day of the month instead of clutch. Assuming that day of the month is irrelevant to hitting, if we get the same .008 result, that would suggest that what Andy found was something other than clutch talent. More likely, we'd get something between .000 and .008, and we could calculate how much of the .008 is really clutch talent, and how much is other, random, things.

Both those tests would make me happier. But, until then, I guess I have to agree that the current state of the evidence is that the most reasonable estimate for the extent of clutch talent is closer to .008 than to .000.



Monday, August 17, 2009

Did "The Book" really find evidence for clutch hitting?

For a long time, the most thorough sabermetric studies showed no evidence for the idea that "clutch hitting" exists -- that some players can "turn it on" more than others when the situation is particularly important. Dick Cramer's 1977 study, which compared batters' 1969 clutch performances to those in 1970, found only a very slight tendency for clutch hitters to repeat. That conclusion was criticized by Bill James in his recent "Underestimating the Fog," but better analyses have existed for many years. Pete Palmer's study in 1990 (.pdf, page 6) compared the actual distribution of players' clutch stats to what would be observed if clutchiness were completely random; it found almost an exact match. Then, in 2005, Tom Ruane did the same thing, but for a much larger population of batters, and came up with a similar result.

But three years ago, in "The Book," authors Tom Tango, Mitchel Lichtman, and Andy Dolphin used a different technique (and, I think, even more data), and came up with a different answer. They found that a tendency to clutch hitting does exist, and has a standard deviation of .008 points of OBA. That is, one out of every six batters will hit more than .008 (8 points) better in the clutch than overall; and, by symmetry, one in six players will hit 8 points *worse* in the clutch.

As far as I know, the authors never published their study in full, and their book gives only an outline of how they did it. But, still, I think I was able to figure out their method -- or at least a method that's probably close to what they did -- and I don't have the same confidence in their conclusions that their book does.

I have two disagreements with their study. First, they used OBA instead of batting average; second, and more seriously, their result of .008 is significant only at the 14% level, which is only moderate evidence against the competing view that clutch talent does not exist.

First, OBA. The difference between OBA and BA is mostly a matter of including walks. Walks are certainly important, and if you're trying to measure a player's ability or performance, on-base percentage is a much better measure than batting average. But when it comes to clutch, the traditional question is about *hitting* in the clutch, not *walking* in the clutch.

To my knowledge, ability to draw a base on balls in clutch situations has not been studied. But, unlike hitting, it wouldn't be surprising to find that some players are "better" at it than others. Take Barry Bonds, for example. In clutch situations, Bonds was more likely to be walked. (Here are his career splits.)

Of course, Bonds' walks were mostly intentional, and "The Book" omitted the IBB from its totals. But, still, if Bonds was much more likely to be walked, you'd think he'd also have been more likely to be pitched around; and so he'd draw more unintentional walks in clutch situations as well. Maybe there weren't as many "semi-intentional" bases on balls as intentional ones, but, still, a small number would be enough to account for a chunk of a standard deviation of .008.

For instance: suppose on every team the best hitter increases his OBA by about 17 points (.017) in the clutch, because of the semi-intentional walk, and the worst hitter decreases his OBA by the same 17 points. If the other 7 batters are exactly the same in clutch situations, and only these two are different, that's enough to give you an SD of almost exactly .008.
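The arithmetic in that example is easy to check. Here is a minimal sketch of the hypothetical nine-man lineup just described (one hitter at +17 points, one at -17, seven at zero); the variable names are mine, and the lineup is the made-up one from the paragraph above, not real data.

```python
import statistics

# Hypothetical lineup from the example above: each batter's clutch effect
# on OBA, in points (1 point = .001). Best hitter +17, worst hitter -17,
# the other seven batters unchanged.
clutch_effects = [17, -17, 0, 0, 0, 0, 0, 0, 0]

# Population SD across the nine batters (the mean effect is zero here).
sd_points = statistics.pstdev(clutch_effects)

print(f"SD of clutch effect: {sd_points:.2f} points")
```

That comes out to about 8.01 points, which is where the "almost exactly .008" comes from.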

What's 17 points in practice? It's an increase of about 10 walks per 600 PA. And if a typical hitter gets 60 clutch PA a season, you're talking about one extra walk for one player on the team, and one fewer walk for a second player. That difference of about two walks total is enough to give you the SD of .008 that the authors found.

That seems pretty realistic, and reasonable, doesn't it? Well, maybe not; I've artificially decided that only two players on the team are affected, which makes the variance move a lot more for two walks than it would if every player had some tendency. But, still, intuitively, it does seem like a small effect for walks could explain the whole thing.

And that means:

-- several studies have found no clutch ability in batting average;
-- "The Book" found clutch ability in on-base percentage;
-- intuitively, "clutch walking" would seem to be able to account for everything "The Book" found.

So, with that being the state of the evidence, I am inclined to believe that the evidence still suggests that clutch hitting skill doesn't exist, but "clutch walking" skill does.

----------

But even if the authors had used batting average instead of OBP, and got the same result, the result isn't statistically significant. That's not just my conclusion, but also theirs; they say, on page 102,


"... we can merely state that there is a 68% probability that [the clutch talent SD] is between 3 and 12 points."


Since a 68% probability is 1 SD each way, the authors seem to be implying a standard error of about 4.5 points. That means a 95% confidence interval is about 9 points either way -- which includes zero.

Actually, I get an even wider confidence interval using my method (which might actually be the same as theirs). Let me go through it. For those of you who don't care about the math, you can skip this smaller print.

-- Math/details start here --

The study said that it included 848 players, with an average 2450 PA in non-clutch situations, and 200 in clutch situations. So I created 848 identical players with those numbers, and gave each player exactly zero clutch ability. Every player had an OBA of .340.

From the binomial distribution, the SD of each player's OBA over the non-clutch 2450 PA is .00957. The SD of each player's OBA over the clutch 200 PA is .0335. The SD of the difference between the two is the square root of the sum of the squares, which is .03484. That's 34.84 points of OBA.

That's the variance only due to randomness, or luck. If there truly is variance in players' *talent* for clutch hitting, the observed variance would be higher. How much higher? Well, if you assume that talent and luck are independent, then, as the authors often point out on their blog,

Variance (observed) = variance (talent) + variance (luck)

Since the authors concluded a talent variance of 8 points squared, we can assume that

Variance (observed) = 8 points squared + 34.84 points squared

Which means that

Variance (observed) = 35.75 points squared

Since the SD is the square root of the variance, we get

SD(observed) = 35.75 points

So, presumably, in their population of 848 players, the authors observed the SD of the clutch difference was 35.75 points.
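For anyone who wants to reproduce the numbers so far, here is a short sketch of the calculation. It assumes the same 848 identical .340 players described above; the function and variable names are my own, not anything from "The Book."

```python
import math

OBA = 0.340            # every simulated player's true on-base average
NON_CLUTCH_PA = 2450   # average non-clutch PA per player, from the study
CLUTCH_PA = 200        # average clutch PA per player, from the study
TALENT_SD = 8.0        # clutch talent SD reported in "The Book", in points

def binomial_sd(p, n):
    """SD of an observed rate over n independent trials with success chance p."""
    return math.sqrt(p * (1 - p) / n)

# Convert to points of OBA (1 point = .001).
sd_non_clutch = binomial_sd(OBA, NON_CLUTCH_PA) * 1000
sd_clutch = binomial_sd(OBA, CLUTCH_PA) * 1000

# SD of (clutch OBA - non-clutch OBA) from luck alone:
# square root of the sum of the squares.
sd_luck = math.hypot(sd_non_clutch, sd_clutch)

# Variance(observed) = variance(talent) + variance(luck).
sd_observed = math.hypot(TALENT_SD, sd_luck)

print(f"non-clutch SD: {sd_non_clutch:.2f} points")
print(f"clutch SD:     {sd_clutch:.2f} points")
print(f"luck SD:       {sd_luck:.2f} points")
print(f"observed SD:   {sd_observed:.2f} points")
```

Carrying full precision instead of rounding at each step moves the last figure by less than a hundredth of a point, which doesn't change anything that follows.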

Now, if there really was no such thing as clutch ability, how often would we observe an SD of more than 35.75 points due to luck alone, when the expected number is only 34.84? To check, I ran a simulation, and the answer was: about 14% of the time.


That's obviously not significant, 14%.
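Here is a rough sketch of that kind of check. It is not the simulation I actually ran: to keep it short and fast, it assumes each player's clutch difference is normally distributed with the luck-only SD of 34.84 points, rather than simulating every plate appearance. The seed, trial count, and names are arbitrary choices of mine.

```python
import math
import random

N_PLAYERS = 848
SD_LUCK = 34.84       # luck-only SD of the clutch difference, in points
SD_OBSERVED = 35.75   # SD implied by an 8-point talent SD, in points
N_TRIALS = 2000

rng = random.Random(2009)  # fixed seed so the run is repeatable

def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

exceed = 0
for _ in range(N_TRIALS):
    # One simulated "league": 848 players with zero clutch talent,
    # using the normal approximation (an assumption of this sketch).
    diffs = [rng.gauss(0.0, SD_LUCK) for _ in range(N_PLAYERS)]
    if sample_sd(diffs) > SD_OBSERVED:
        exceed += 1

p_value = exceed / N_TRIALS
print(f"Share of luck-only leagues with SD above {SD_OBSERVED}: {p_value:.3f}")
```

With only a couple of thousand trials the answer wobbles by a point or so from seed to seed, but it should land in the same neighborhood as the 14% figure.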

Another way to check: the SD of the simulated SDs was about .88 of a point. The difference between 35.75 and 34.84 is about .91 of a point. So the observed difference was almost exactly 1 SD from zero. Again, that's not significant.

If we look for a 68% confidence interval like the authors had, 1 SD on each side, we get (34.87, 36.63). That means a 68% confidence interval for clutch talent is 0.1 to 11.3 points. That's different than what the authors gave -- 3 to 12 points -- but I'm not sure why.

Either way, the observed effect is certainly not statistically significant.


-- math/details end here --

To restate my conclusions for those who skipped the math:

The effect "The Book" found is about 1 SD from zero, which is certainly not statistically significant. It's at the 14% level, not the required traditional 5%. This doesn't mean it can be ignored, but that it constitutes fairly weak evidence.


-------

So, to sum up:

-- two previous studies found no evidence of clutch talent in batting average;

-- Tango/mgl/Dolphin found a small measure of clutch talent, but it wasn't statistically significant.

From that alone, I'd say our conclusion still has to be: not enough evidence to assume clutch talent. But if you add:

-- Tango/mgl/Dolphin's non-significant result included clutch walks, which common sense strongly suggests *do* vary by player,

Then, to me, that removes most of the last bit of doubt. I think that even if the effect they found is real, there's a really good chance it's caused by walks.

Hey, guys, how about running the study again using batting average?


(UPDATE: some statements on statistical significance replaced by something more accurate.)



Wednesday, August 12, 2009

Testing Tim Donaghy's allegations of NBA playoff manipulation

Here's another academic study trying to find bias in the NBA. From the latest issue of JQAS, it's called "Testing For Bias and Manipulation in the National Basketball Association Playoffs," by Timothy Zimmer and Todd H. Kuethe.

Last year, Tim Donaghy, the NBA referee convicted of betting on games, suggested that there is a conspiracy between the referees and the NBA. The league, Donaghy alleged, wants large-market teams to advance in the playoffs, and it wants series to go the maximum number of games to maximize excitement and TV revenue. He accused certain referees, "company men," of calling the critical games differently to try to achieve the league's desired result.

In this study, Zimmer and Kuethe attempt to look at the evidence for Donaghy's charge. Is there a "big city" bias in playoff series? And do the underdog teams have an increased chance of winning games when they are behind in the series?

The authors ran a regression, trying to predict margin of victory based on (a) what game in the series it was, (b) the difference in conference seed position between the two teams (so that the #2 team playing the #7 team would be a 5-seed difference), (c) which team was at home, and (d) a couple of other factors that proved unimportant.

They ran the regression only on the first three rounds of the playoffs; they ignored the finals, due to concerns that "seeding" didn't really make sense when the teams are from different conferences. The regression covered 2003 to 2008; I have no idea why they chose to use only six seasons, when there's so much more data available and they could have got much more reliable results. (Gratuitous link to basketball-reference.com.)

Anyway, the results were that the stronger team's margin of victory is roughly:

-4.94 points
plus 1.55 points * the seed difference
plus 10.19 points if they're at home
plus 0.02 points for every extra 100,000 population
minus 2.67 points if it's game 1
minus 1.48 points if it's game 2
minus 4.68 points if it's game 3
minus 0.00 points if it's game 4
minus 4.04 points if it's game 5
minus 0.25 points if it's game 6
minus 1.59 points if it's game 7
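To make the table concrete, here is a small sketch that plugs those coefficients into a prediction. The function name and argument names are hypothetical, and I'm assuming the population term applies to the stronger team's market-size advantage, in units of 100,000 people; the paper itself may define it differently.

```python
# Coefficients as listed above (points of margin for the stronger team).
INTERCEPT = -4.94
PER_SEED = 1.55        # per seed of difference between the two teams
HOME = 10.19           # added when the stronger team is at home
PER_100K_POP = 0.02    # per extra 100,000 population (assumed: favorite minus underdog)
GAME_EFFECT = {1: -2.67, 2: -1.48, 3: -4.68, 4: 0.00, 5: -4.04, 6: -0.25, 7: -1.59}

def predicted_margin(seed_diff, at_home, pop_diff_100k, game):
    """Predicted margin of victory for the stronger team, in points."""
    margin = INTERCEPT + PER_SEED * seed_diff + PER_100K_POP * pop_diff_100k
    if at_home:
        margin += HOME
    return margin + GAME_EFFECT[game]

# Example: a #2 seed hosting a #7 seed, equal-sized markets.
for game in (3, 4):
    print(f"Game {game}: favorite by {predicted_margin(5, True, 0, game):.2f}")
```

In that example the favorite is predicted to win Game 4 by 13.00 points and Game 3 by 8.32; the gap between them is the 4.68 in the table.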


The seed and the home field advantage were significant, as you would expect. But so was the population difference (about 2 SDs), and the Game 3 difference (2.6 SDs).

The authors conclude that there is some evidence for Donaghy's claims; the large-market teams have an advantage, and the significantly increased performance of the underdog in Game 3 shows something funny is going on there.

I don't think that's right, in either case.

First, for population: Zimmer and Kuethe figured the difference in team quality only by standings position; so the #1 team facing the #8 team was scored as 7, no matter how good those teams actually were. But isn't it possible that when a #1 team is a big-market team, they're a better team than a typical #1 team from a smaller market? It seems likely to me. The authors acknowledge that big cities might have better teams than small cities, but they argue that

"If large-market teams attract better players, either through pay or lifestyle, the regular-season winning percentage will reflect this disparity in pay."


Yes, but the authors don't use winning percentage -- they use standings position. When the highest-paid resident of your street is a CEO, he's probably going to make more money than if the highest-paid resident of your street is just a professor. Looking at the ranking captures a lot of the information of salary, or team quality, but not all of it. I'd bet that what's being measured by the regression is just that leftover: that #1 teams from large markets are just better than #1 teams from small markets.

I haven't done any work to prove that. But still, I wonder why the authors chose standings rank instead of season wins, when wins is just as easy to collect and would likely be more accurate.

Also, I'm not sure if it's reasonable to calculate the standard error the way the authors did, as if every observation is independent. Suppose one large-market team has an unlucky season, finishing (say) fifth in its conference when its talent was really good enough for second. If that team goes tearing through three playoff rounds, winning all three as the apparent underdog, those results are certainly not independent. And so the SE of the "population" coefficient is understated; since it was only 2 standard deviations from zero anyway, it's very likely that it's no longer significant once you adjust for the fact that teams with "inaccurate" seedings would be more likely to appear in subsequent rounds.

So, I don't think the population results mean much. What about the Game 3 result of significance?

First, the result of a 4.68 point differential isn't relative to every other game; it's only relative to Game 4. That is, the underdog performed 4.68 points better in Game 3 than in Game 4. Is that consistent with referee bias? I suppose it could be; when the favorite goes 2-0, the referees try to have them lose Game 3, for a longer series. But why not have them lose Game 4 instead? If the idea is to prolong the series without affecting who wins, going from 3-0 to 3-1 is much safer than from 2-0 to 2-1.

But you can't predict the NBA's methods of cheating, I suppose, so let's assume that they do shoot for a Game 3 underdog win. Still, the favorite is going to win some of those games anyway, and go 3-0. Wouldn't the NBA want to see the underdog win Game 4, then? You'd think so: but Game 4 actually shows the *best* performance by the favorite; every other game coefficient is negative, meaning the favorite loses points in those games relative to the fourth game. So why would that be? Why would the underdog do best in Game 3, but worst in Game 4, if the NBA is trying to orchestrate a longer series? That doesn't make a lot of sense to me.

Second: If you take the results as presented, Game 3 is the only one of the six games that shows statistically significant results. But it's only 2.6 SDs away. That's significant at almost exactly the 1% level (assuming a two-tailed test, 2.6 SDs on either end of the curve). The chance that at least one of six variables would show that kind of 1% significance is ... about 6%. So, really, unless you had good reason to suspect Game 3 in the first place, the result isn't really significant enough by the typical 5% standard for these sorts of things.
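Those two numbers can be checked with a few lines. This sketch assumes the six game coefficients are independent and normally distributed, which is a simplification on my part (they all share Game 4 as the reference); the names are mine.

```python
import math

def two_tailed_p(z):
    """Chance a standard normal lands at least |z| SDs from zero, either side."""
    return math.erfc(abs(z) / math.sqrt(2))

p_single = two_tailed_p(2.6)   # one coefficient at 2.6 SDs
n_games = 6                    # six game dummies, Game 4 being the reference

# Chance that at least one of six independent coefficients looks this
# extreme by luck alone (independence is assumed here).
p_any = 1 - (1 - p_single) ** n_games

print(f"single test:        {p_single:.4f}")
print(f"at least one of {n_games}:  {p_any:.4f}")
```

The single-test figure is a shade under 1%, so the six-game figure comes out between 5% and 6%, in line with the rough number above.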

It's even less significant when you look a little deeper. Game 3 is only significant when compared to Game 4: and Game 4 just happens to be the most extreme observation in the other direction! So you're looking at the difference of the two extremes out of seven.

The chance that the *most positive* of seven normal variables will be more than 2 SDs (of itself) away from the *least positive* of seven normal variables is pretty high. There are actually 21 pairs of the seven variables; if every pair has a 1% chance of showing a result, then, even though the 21 pairs aren't independent, on average you'll find 0.21 apparently significant results. That is: if you run the experiment 100 times, with different sets of data, you'll find 21 significant results. It's therefore not all that surprising -- and certainly not statistically significant at any reasonable level -- when this study finds exactly 1.
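A quick simulation gives a feel for how often the two extremes end up that far apart by luck. This is only a sketch under assumptions of my own: seven independent game effects, all with the same standard error, and "significant" meaning the highest and lowest differ by more than 2.6 SDs of a pairwise difference. The real regression coefficients aren't independent like this, so treat the output as a ballpark.

```python
import math
import random

N_GAMES = 7
Z_THRESHOLD = 2.6     # how many SDs apart the pair must be to look "significant"
N_TRIALS = 20000

rng = random.Random(4)  # fixed seed so the run is repeatable

# The difference of two independent unit-variance normals has SD sqrt(2).
sd_of_difference = math.sqrt(2)

hits = 0
for _ in range(N_TRIALS):
    # Seven game effects with no true differences at all.
    effects = [rng.gauss(0.0, 1.0) for _ in range(N_GAMES)]
    spread = max(effects) - min(effects)
    if spread / sd_of_difference > Z_THRESHOLD:
        hits += 1

p_extremes = hits / N_TRIALS
print(f"Chance the two extremes look 'significant': {p_extremes:.3f}")
```

Under those assumptions the answer comes out far above the nominal 1% you'd get by treating Game 3 vs. Game 4 as a single planned comparison.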

It *looks* significant, sure, but that's because the authors of the study happened, luckily, to have randomly chosen Game 4 as their reference point. Had they chosen, say, Game 1, they would have found no significant-looking effect at all.

So, in summary:

-- the population effect is probably (in my judgment) due to the study not adjusting for the fact that big-market teams are better than small-market teams;

-- even if that turns out not to be the case, the effect found is probably not significant anyway, due to underestimation of the SE;

-- the Game 3 effect is significant ONLY when compared to Game 4;

-- there are 21 possible significant game vs. game effects, so the fact that exactly one of those 21 was found to be 2.6 SDs away from zero is not a very low-probability event;

-- the observation that Game 3 most favors the underdog and Game 4 most favors the favorite does not, on its face, appear to be very consistent with Donaghy's conspiracy theory.

So I don't think there's much here at all. Of course, the authors only analyzed six seasons, so there's lots more data if someone wants to investigate further.


UPDATE: had the numbers wrong in the table above. Now fixed.






Thursday, August 06, 2009

Evidence on whether teams "own" other teams

Here's some new evidence on momentum, as well as on whether teams "own" certain other teams.

For instance: as of right now, the Yankees are 0-8 against the Red Sox this season. Does that mean they should be expected to continue losing to Boston, at least more than you'd expect from the teams' relative talent levels?

Apparently not. "Numbers Guy" Carl Bialik crunched the numbers (disclosure: I produced the raw data for him from Retrosheet game logs), and found that, when one team starts out the season 8-0 against the other:

-- the team with the 8 wins went about .530 that season against other teams.

-- the team with the 0 wins went about .450 that season against other teams.

-- in all remaining games that season between the two teams (of which there were 545 total), the "8" team went about .600 against the "0" team.


What does that mean? Well, the .530 team probably played about .560 for the season when you include the missing games (the eight consecutive games it won, plus the additional games against that team where it went .600). The .450 team probably went around .420.

Regressing to the mean a bit, the .560 team is probably truly .545 or something. The .420 team is probably around .440.

How often will a .545 team beat a .440 team? I'm guessing about 61% of the time -- pretty close to the 60% observed.
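One standard way to turn two winning percentages into a head-to-head estimate is Bill James's "log5" formula. I'm not claiming that's exactly how I arrived at my guess; this is just a sketch showing that a formal method lands in the same place. The function name is mine.

```python
def log5(p_a, p_b):
    """Estimated chance that team A beats team B, given each team's
    winning percentage against a .500 league (Bill James's log5 method)."""
    a_wins = p_a * (1 - p_b)
    b_wins = p_b * (1 - p_a)
    return a_wins / (a_wins + b_wins)

p = log5(0.545, 0.440)
print(f".545 team vs .440 team: {p:.3f}")
```

That works out to about .604, ignoring home field and everything else.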

Again, my calculations are only as good as my estimates, and don't take all factors into account (for instance: you'd expect the 8-0 team to have had an above-average number of home games out of those eight, since they wound up winning them all. That means you'd expect more road games in the remaining head-to-head matchups, which should reduce the .610 estimate a bit). Still, I'm confident it's all close enough that if you studied the issue in more detail, the results wouldn't be much different.

Conclusion: no evidence of streakiness, momentum, or "owning" another team.



Sunday, August 02, 2009

Pitchers targeting 20 wins -- followup and slides

Last year, I ran a study on why there are more pitchers who win 20 games in a season than 19. I updated that study slightly for my presentation at last week's SABR convention, and the Powerpoint slides (.ppt) are now available on my website, or by direct click here.

