Tuesday, January 19, 2010

Evaluating scientific debates: some ramblings

Last week's renewed debate on JC Bradbury's aging study (JC posted a new article to Baseball Prospectus, and comments followed there and on "The Book" blog) got me thinking about some things that are tangential to the study itself ... and since I have nothing else to write about at the moment, I thought I'd dump some of those random thoughts here.


1. Peer review works much better after publication than before.

When there's a debate between academics and non-academics, some observers argue that the academics are more likely to be correct, because their work was peer reviewed, while the critics' work was not.

I think it's the other way around. I think post-publication reaction, even informally on the internet, is a much better way to evaluate the paper than academic peer review.

Why? Because academic peer reviewers hear only one side of the question -- the author's. At best, they might have access to the comments of a couple of other referees. That's not enough.

After publication, on the internet, there's a back and forth between people on one side of the question and people on the other. That's the best way to get at the truth -- to have a debate about it.

Peer review is like the police deciding there's enough evidence to lay charges. Post-publication debate is like two lawyers arguing the case before a jury. It's when all the evidence is heard, not just the evidence on one side.

More importantly, no peer reviewer has as good a mastery of previous work on a subject as the collective mastery of the public. I may be an OK peer reviewer, but you know who's a better peer reviewer? The combination of me, and Tango, and MGL, and Pizza Cutter, and tens of other informed sabermetricians, some of whom I might only meet through the informal peer review process of blog commenting.

If you took twelve random sabermetricians whom I respect, and they unanimously came to the verdict that paper X is flawed, I would be at least 99% sure they were right and the peer reviewer was wrong.


2. The scientific consensus matters if you're not a scientist.

It's a principle of the scientific method that only evidence and argument count -- the identity of the arguer is irrelevant.

Indeed, there's a fallacy called "argument from authority," where someone argues that a particular view must be correct because the person espousing it is an expert on the subject. That's wrong because even experts can be wrong, and even the expertest expert has to bow to logic and evidence.

But that's a formal principle that applies to situations where you're trying to judge an argument on its merits. Not all of us are in a position to be able to do that all the time, and it's a reasonable shortcut in everyday life to base your decision on the expertise of the arguer.

If my doctor tells me I have disease X, and the guy who cleans my office tells me he saw my file and he thinks I really have disease Y ... well, it's perfectly legitimate for me to dismiss what the office cleaner says, and trust my doctor.

It only becomes "argument from authority" where I assert that I am going to judge the arguments on their merits. Then, and only then, am I required to look seriously at the office cleaner's argument, without being prejudiced by the fact that he has zero medical training.

Indeed, we make decisions based on authority all the time. We have to. There are many claims that are widely accepted, but still have a following of people who believe the opposite. There are people who believe the government is covering up UFO visits. There are people who believe the world is flat. There are people who believe 9/11 was an inside job.

If you're like me, you don't believe 9/11 was an inside job. And, again, if you're like me, you can't actually refute the arguments of those who do believe it. Still, your disbelief is rational, and based solely on what other people have said and written, and your evaluations of their credibility.

Disbelieving solely because of experts is NOT the result of a fallacy. The fallacy only happens when you try to use the experts as evidence. Experts are a substitute for evidence.

You get your choice: experts or evidence. If you choose evidence, you can't cite the experts. If you choose experts, you can't claim to be impartially evaluating the evidence, at least that part of the evidence on which you're deferring to the experts.

The experts are your agents -- if you look to them, it's because you are trusting them to evaluate the evidence in your stead. You're saying, "you know, your UFO arguments are extraordinary and weird. They might be absolutely correct, because you might have extraordinary evidence that refutes everyone else. But I don't have the time or inclination to bother weighing the evidence. So I'm going to just defer to the scientists who *have* looked at the evidence and decided you're wrong. Work on convincing them, and maybe I'll follow."

The reason I bring this up is that, over at BPro, MGL made this comment:

"I think that this is JC against the world on this one. There is no one in his corner that I am aware of, at least that actually does any serious baseball work. And there are plenty of brilliant minds who thoroughly understand this issue who have spoken their piece. Either JC is a cockeyed genius and we (Colin, Brian, Tango, me, et. al.) are all idiots, or..."


Is that comment relevant, or is it a fallacious argument from authority? It depends. If you're planning on reading all the studies and comments, and reaching a conclusion based on that, then you should totally ignore it -- whether an argument is correct doesn't depend on how many people think it is.

But if you're just reading casually and trying to get an intuitive grip on who's right, then it's perfectly legitimate.

And that's how MGL meant it. What he's saying is something like: "I've explained why I think JC is wrong and I'm right. But if you don't want to wade through all that, and if you're basing your unscientific decision on which side seems more credible -- which happens 99% of the time that we read opposing opinions on a question of scientific fact -- be aware that the weight of expert opinion is on my side."

Put that way, it's not an appeal to authority. It's a true statement about the scientific consensus.


3. Simple methods are often more trustworthy than complex ones.

There are lots of studies out there that have found that the peak age for hitters in MLB is about 27. There is one study, JC Bradbury's, that shows a peak of 29.

But it seems to me that there is a perception, in some quarters, that because JC's study is more mathematically sophisticated than the others, it's therefore more trustworthy. I think the opposite: that the complicated methods JC used make his results *less* believable, not more.

I've written before about simpler methods, in the context of regression and linear weights. Basically, there are two different methods that have been used to calculate the coefficients for the linear weights formula. One involves doing a regression. Another involves looking at play-by-play data and doing simple arithmetic. The simple method actually works better.

More importantly, for the argument I'm making here, the simple method is easily comprehensible, even without stats classes. It can be explained in a few sentences to any baseball fan of reasonable intelligence. And if you're going to say you know a specific fact, like that a single is worth about .46 runs, it's always nicer to know *why* than to have to trust someone else, who used a mathematical technique you don't completely understand.
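If it helps to see the simple method spelled out, here's a minimal sketch in Python. The run-expectancy values and the handful of plays are invented for illustration -- the real calculation is the same arithmetic run over a full season of play-by-play data: the weight of an event is the average change in run expectancy it causes, plus any runs that score on the play.

from collections import defaultdict

# Toy sketch of the "simple arithmetic" method for linear weights.
# Each play: (event, run expectancy before, run expectancy after, runs scored).
# The run-expectancy numbers below are made up; a real calculation would use an
# empirically measured run-expectancy table and a full season of plays.
plays = [
    ("single", 0.52, 0.90, 0),   # leadoff single, bases empty -> runner on first
    ("single", 0.69, 0.53, 1),   # single that scores the runner from second
    ("single", 0.11, 0.22, 0),   # two-out single, bases empty
    ("out",    0.90, 0.53, 0),   # batter retired, runner holds at first
    ("out",    0.22, 0.00, 0),   # inning-ending out: run expectancy drops to zero
]

totals = defaultdict(float)
counts = defaultdict(int)
for event, re_before, re_after, runs in plays:
    totals[event] += (re_after - re_before) + runs   # change in expectancy, plus runs that scored
    counts[event] += 1

for event in sorted(totals):
    print(event, round(totals[event] / counts[event], 3))

# Run over real play-by-play data, the average for "single" comes out around +0.46 runs.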

Another advantage of the simple technique is that, because so many more people understand it, its pros and cons are discovered early. A complex method can have problems that don't get found out until much later, if ever.

For instance, how much do hitters lose in batting skill between age 28 and age 35? Well, one way to find out is to average the performance of 28-year-olds, and compare it to the averaged performance of 29-year-olds, 30-year-olds, and so on, up to 35-year-olds. Pretty simple method, right, and easy to understand? If you do it, you'll find there's not much difference among the ages. You might conclude that players don't lose much between 28 and 35.

But there's an obvious flaw: the two groups don't comprise the same players. Only above-average hitters stay in the league at 35, so you're comparing good players at 35 to all players at 28. That's why they look similar: the average of a young Joe Morgan and a young Roy Howell looks similar to the average of an old Joe Morgan and a retired, zero-at-bat Roy Howell, even though Morgan and Howell each declined substantially in the intervening seven years.
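If you want to see that selection effect for yourself, here's a rough simulation sketch in Python. Every number in it -- the true peak age, the decline rate, the survival cutoff -- is invented for illustration.

import numpy as np

np.random.seed(0)
talent = np.random.normal(100, 10, 100000)   # each player's true talent at his peak (made-up scale)
PEAK, DECLINE, CUTOFF = 27, 1.5, 95          # invented parameters

def performance(age):
    # every player loses DECLINE points per year past the peak
    return talent - DECLINE * max(0, age - PEAK)

for age in (28, 35):
    perf = performance(age)
    survivors = perf > CUTOFF                # only good-enough players stay in the league
    print("age", age,
          "| survivor average:", round(perf[survivors].mean(), 1),
          "| fraction still playing:", round(survivors.mean(), 2))

# The survivor averages at 28 and 35 come out much closer together than the true
# decline of 1.5 points a year (10.5 points over the seven years), because the
# weakest 35-year-olds have dropped out of the sample entirely.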

Now that flaw ... it's easy to spot, and the reason it's easy to spot is that the method is simple enough to understand. It's also easy to explain, and the reason it's easy to explain is again that the method is simple enough to understand.

If I use the more complicated method of linear regression (and a not very complicated regression), and describe it mathematically, it looks something like this:

"I ran an ordinary least squares regression, using the model P(it) = ax(it) + b[x(it)^2] + e, where P(it) is the performance of player i at age t, x(it) is the age of player i at age t, and all player-seasons of less than 300 PA were omitted. The e is an error term, assumed iid normal with mean 0."


The flaw is actually the same as in the original, simpler case -- the fact that the sample of players is different at each age. But it's harder to see the flaw that way, isn't it? It's also harder to describe where the flaw resides -- there's no easy one-sentence explanation about Morgan and Howell like there was before.
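Here's a companion sketch, again with invented numbers, that feeds a survivor-only sample into a quadratic age model like the one in the quote above. The regression inherits exactly the same flaw; it just doesn't announce it.

import numpy as np

np.random.seed(1)
rows = []
for talent in np.random.normal(100, 10, 2000):          # one true-talent level per player
    for age in range(22, 39):
        perf = talent - 0.3 * (age - 27) ** 2 + np.random.normal(0, 5)   # true peak at age 27
        if perf > 95:                                    # only good-enough seasons make the sample
            rows.append((age, perf))

age_v, perf_v = np.array(rows).T
b2, b1, b0 = np.polyfit(age_v, perf_v, 2)                # P = b2*age^2 + b1*age + b0

def fitted(a):
    return np.polyval([b2, b1, b0], a)

print("fitted decline, 28 to 35:", round(fitted(28) - fitted(35), 1))
print("true decline, 28 to 35:", round(0.3 * (35 - 27) ** 2 - 0.3 * (28 - 27) ** 2, 1))

# The fitted curve recovers only a fraction of the true 18.9-point decline between
# 28 and 35 -- the same selective-sampling flaw as in the simple comparison, just
# buried inside the regression machinery where it's harder to spot.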

So why would you trust the complicated method more than the simple one?

Now, I'm not saying that complexity is necessarily bad. A complex method might be more precise, and give you better results, assuming that there aren't any flaws. But, you still have to check for flaws. If the complex method gives you substantially different results (peak age 29) from the simple methods (peak age 27), that's a warning sign. And so you have to explain the difference. Something must be wrong, either with the complex method, or with all the simple methods. It's not enough to just explain why the complex method is right. You also have to explain why the simple methods, which came up with 27, came out so wrong.

In the absence of a convincing explanation, all you have are different methods, and no indication which is more reliable. In that case, why would you choose to trust the complicated method that you don't understand, but reject the simple methods that you *do* understand? The only reason for doing so is that you have more faith that whoever introduced the complicated method actually got everything right: the method, the calculations, and the logic.

I don't think that's justified. My experience leads me to think that it's very, very risky to give that kind of blind trust without understanding the method pretty darn well.


14 Comments:

At Tuesday, January 19, 2010 11:30:00 AM, Blogger Hawerchuk said...

Phil,

Very good commentary.

One thing I'd add about the peer review process: having reviewed journal articles myself, I can say that even at best I am just making a cursory assessment of a paper. I am not going to have time to repeat the authors' experiment (usually they don't submit their dataset - and if I can repeat it in a short period of time, then the work isn't original enough to justify publication). So I have to make a lot of leaps of faith in reading their paper.

But put the paper out there for thousands of people with different competencies to analyze, and you will get a much better review. It's like playing Jeopardy by yourself or with 100 people on your team!

It would make for better papers, but it exposes flaws in people's work and delays publication - even though the peer reviewers don't care. That's understandably why some people would avoid this process.

 
At Tuesday, January 19, 2010 8:03:00 PM, Anonymous Nick Steiner said...

You get your choice: experts or evidence. If you choose evidence, you can't cite the experts. If you choose experts, you can't claim to be impartially evaluating the evidence, at least that part of the evidence on which you're deferring to the experts.

I somewhat disagree with this. In a Bayesian sense, having an expert agree with your opinion makes it more likely to be correct than if the expert disagreed with it.

The statement would only be true if you were 100% sure of your opinion, and I don't think anyone ever is.

But overall, I agree with the post. Good job.

 
At Tuesday, January 19, 2010 9:35:00 PM, Blogger Phil Birnbaum said...

Nick,

I agree. If you're trying to use Bayesian methods to decide how likely you are to be correct, then the opinions of experts do indeed matter.

But that's not usually the case. Usually, we pretend we can be 100% sure. If my math exam asks me to demonstrate that there's no highest prime, I have to give a proof. I can't substitute "I don't know, but I know Euclid proved it, and he's an expert."

And if I do give the proof, it's not typical for someone to say, "yeah, it looks good, but the Bayesian probability that it's correct is only 99.999%, because we might both be wrong in thinking the proof is valid. Let's ask another professor, to get it to 99.9999%."

You probably understand what I'm saying here ... there's probably a better way of expressing what I'm trying to get across. I think I'm saying that this just isn't a case where Bayesian methods can be used, because, by convention, we're assuming we can get close enough to 100% that it doesn't matter. I think.

 
At Tuesday, January 19, 2010 11:24:00 PM, Anonymous Anonymous said...

Very good post (as usual).

I agree with Nick (and we probably don't really disagree with you, Phil). In practice, evaluating a thesis usually involves a combination of "appealing to" authority and an independent analysis of evidence. It is usually not a strict dichotomy ("either/or" as you wrote) and I am not sure I would have characterized it as you have.

Also, the thing about peer review is that with the advent of the internet (and other means of rapid and widespread communication and sharing of information), it is a whole new ballgame. It is understandable that academia is lagging behind. At one time, traditional peer review was the best and only way to ensure the integrity of scientific research. Not so any more, but nothing formal has really taken its place - yet.

In addition, we are talking about something fairly unique (sports economists writing papers about sports analytics). Traditionally, for example, a medical scientist writes a paper about the efficacy of a certain drug and it is peer reviewed by other medical scientists. Whether that takes place in a traditional manner or through the internet and whether it is done before or after publication doesn't make much difference. The important thing is that similar experts are able to review the material.

Here (sports economists) we are talking about crossovers in disciplines. Having other economists review sports analysis by economists is clearly leaving out a piece of the pie. Most economists are not experts in sports analysis. It just so happens that those who are, are not yet entrenched in academics. If they were (say, for example, if there were a department of sabermetrics at some or most Universities), then these papers by the economists could be reviewed by other economists and sabermetricians in an academic setting and the problem (of incomplete or imperfect peer review) would be partially solved. But because there are no bona fide sabermetricians in academia, we are left with only one venue to provide proper review of a sports analysis by an economist - and that is the internet. Again, this is a relatively rare phenomenon, although it may become more commonplace.

MGL

 
At Tuesday, January 19, 2010 11:39:00 PM, Blogger Phil Birnbaum said...

I'll try to figure out a way to better explain what I mean about the "either/or".

Agreed that we have a bigger peer review problem in sabermetrics because of the factors you mention. But I think the same factors hold outside our field. There are papers challenged all the time, and, even when the paper is not challenged, there are debates over what the results mean and what weight should be given to them.

In those cases, it remains true that the consensus after public debate is a more accurate guide than whether or not the paper passed peer review.

Remember the controversial paper a few years ago that showed a statistically significant difference in prognosis for patients who were prayed over, versus patients who were not? That's the perfect example of the debate after the fact doing more for the cause of science than the paper that sparked it, no?

 
At Tuesday, January 19, 2010 11:53:00 PM, Blogger Phil Birnbaum said...

Okay, how's this?

Good:

A: That paper is incorrect.
B: I don't believe it is, because it was peer reviewed.
A: OK, I'll take up my beef with the peer reviewers if you don't want to discuss the paper itself.


Good:

A: That paper is incorrect.
B: What's wrong with it?
A: It does such and such.
B: But such and such is OK!
A: No, you can't do such and such under these circumstances ...
B: Yes, you can, because ...
(and so forth)

Bad:

A: That paper is incorrect.
B: What's wrong with it?
A: It does such and such.
B: But such and such is OK!
A: No, you can't do such and such under these circumstances. Look, if you do that, you come up with 1+1=3.
B: Er ... the peer reviewers said it was OK, so I'm just going to say you're wrong.

That's what I mean by "either/or". You can't pretend that you're arguing on the facts, then defer to authority when the evidence goes against you.

This Bayesian version might be OK:

A: That paper is incorrect.
B: What's wrong with it?
A: It does such and such.
B: But such and such is OK!
A: No, you can't do such and such under these circumstances. Look, if you do that, you come up with 1+1=3.
B: Er ... hmmm, you're right. Before you said that, my subjective probability of being right was .9. But now it's .2. But wait! I remind you that the peer reviewers thought it was OK! And they're more expert than you. So I modify my probability back up to .7.

But in this case, you're not trying to understand the paper: you're trying to estimate the probability that it's correct. Maybe that's the difference I mean.

It's like in court: "Bob Smith thinks he's guilty" is not permitted as evidence. The jury has to decide based on the "base" evidence only, not on the "meta" evidence of what other people think about the "base evidence."

Anyway, I've probably stopped making sense at this point, if not before. :)

 
At Wednesday, January 20, 2010 12:18:00 AM, Anonymous Nick said...

I don't think your Bayesian example is what I (and possibly MGL) mean. It's more like:

A) I think my aging method is better than MGL's because it doesn't have a bias towards players who got unlucky one year and weren't allowed to play the next.

B) I think that my aging method is better because it controls for that bias by creating a projection for such players, even if the numbers in the projection aren't "real". Your study is also biased towards players who peaked late and biased against those who peaked early, which will artificially raise the peak age.

A) I think my bias is less than your bias.

B) I disagree. Furthermore, almost every serious baseball analyst agrees with my POV and disagrees with yours, so mine is more likely to be right.

If A and B are both equally confident in their opinions, which I assume was the case (although we can never know, because nobody is going to reveal weakness when arguing their case!), then the fact that most other analysts (who carry a lot of weight because they are "experts") agree with B's point of view puts it over the top.

 
At Wednesday, January 20, 2010 12:30:00 AM, Anonymous Nick said...

This would be easier to explain if JC and MGL were completely up front with each other on how confident they were in their beliefs. (I'm not saying that anyone was being disingenuous; it's just that nobody is going to say, "I feel there is a 75% chance I'm right".)

If, say, MGL estimated his odds of being correct, based on the merits of the argument, at 40% and JC at 60%, and JC agreed, that would be our prior.

Then suppose that, after conducting a survey of all sabermetricians on their opinions on the matter, and weighting them by their reputation or what have you, they collectively feel that there is a 70% chance that MGL is right.

You would do the proper Bayesian calculations, which I have no idea how to do, and get a final probability for MGL and JC.
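One toy way to set up that calculation is sketched in Python below. The 0.40 prior, the vote counts, and especially the assumption that each expert independently backs the correct side 65% of the time are all invented for illustration -- real experts read each other's work, so their votes are far from independent. Treat it as a sketch, not a method.

PRIOR_MGL = 0.40          # agreed-on prior that MGL is right, from the merits alone
EXPERT_ACCURACY = 0.65    # assumed chance that any one expert backs whichever side is correct
votes_mgl, votes_jc = 14, 6   # hypothetical survey of 20 analysts

# Likelihood of this vote split under each hypothesis (the binomial coefficient
# is the same in both, so it cancels out of the ratio and is omitted).
lik_if_mgl_right = EXPERT_ACCURACY ** votes_mgl * (1 - EXPERT_ACCURACY) ** votes_jc
lik_if_jc_right = EXPERT_ACCURACY ** votes_jc * (1 - EXPERT_ACCURACY) ** votes_mgl

posterior_mgl = (PRIOR_MGL * lik_if_mgl_right) / (
    PRIOR_MGL * lik_if_mgl_right + (1 - PRIOR_MGL) * lik_if_jc_right)
print("posterior probability that MGL is right:", round(posterior_mgl, 3))

# With these made-up numbers the expert verdict swamps the 0.40 prior (the posterior
# comes out near 0.99), which is exactly why the independence assumption matters so much.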

 
At Wednesday, January 20, 2010 12:32:00 AM, Blogger Phil Birnbaum said...

Nick (the A/B post),

OK, I see what you're getting at. It makes sense. You're right, your Bayesian example does allow for both types of argument.

Let me try to figure out what it is I'm trying to say that's different.

 
At Wednesday, January 20, 2010 12:37:00 AM, Blogger Phil Birnbaum said...

The probability calculation is fine for someone who only cares about who's right. But we want to care about *why* one argument is right and why the other is wrong, no?

If a thousand randomly picked experts, including you and MGL and Tango and everyone, unanimously told me I was wrong and JC was right, I might accept that the probability was .9999. But that would be unsatisfying unless I knew why I was wrong.

I'd use the Bayesian method for that dispute between Stephen Hawking and whoever it was ... because I have no other way of figuring out who was right except by meta-evidence. But if it's a dispute about something I can do, like sabermetrics or a simple five-digit prime factorization, I'd rather figure it out myself.

 
At Wednesday, January 20, 2010 12:39:00 AM, Anonymous Nick said...

Ok, I just realized my AB post was exactly the same as yours. I thought you were saying something different for some reason, my mistake.

 
At Wednesday, January 20, 2010 12:47:00 AM, Anonymous Nick said...

Ok, I see what you mean. So in the case with JC, you think there is a very high probability that selective sampling is biasing his study. So high, in fact, that there is no real point in doing a Bayesian type calculation (IE, the prior is so strong that you'd need Tango, MGL, Nate Silver and David Gassko to convince you that you were wrong).

Is that what you mean?

 
At Wednesday, January 20, 2010 12:53:00 AM, Blogger Phil Birnbaum said...

Hmmm ... no, that's not really what I mean. What I mean is that even if you did convince me, somehow, that I should re-estimate my probability of being correct down to (say) .01, I still wouldn't be happy.

That's because I would be basing my probability ONLY on the meta-evidence of what you guys had to say, and not on my own, first-hand knowledge of the actual problem.

I would know I was wrong, but not why. My logic would still look good to me.

Are you a programmer? Have you ever stared at code that you swear should work, but doesn't, and you just can't find the flaw? You KNOW the probability that you're right is zero (or close to zero -- there could be a bug in the compiler), but that doesn't fix your frustration, until you finally find the bug you were so sure wasn't there, but was.

It's like that. You and Tango and everyone would be like the compiler. You'd be telling me my logic is wrong, but I still wouldn't know how to fix it.

 
At Wednesday, January 20, 2010 1:09:00 AM, Anonymous Nick said...

That makes sense Phil. The programmer analogy is perfect! I'm not a programmer, but I use a lot of SQL for Pitch f/x work. I was writing a query the other day and it wasn't working, even though it looked perfect. It turns out it was a ( in the wrong place or something.

So yeah, I agree with that. There's no point in knowing that you're wrong without knowing why.

 
