Andrew Gelman:
Again, I’m not saying that Heckman and his colleagues are doing this. I can only assume they’re reporting what, to them, are their best estimates. Unfortunately these methods are biased. But a lot of people with classical statistics and econometrics training don’t realize this: they think regression coefficients are unbiased estimates, but nobody ever told them that the biases can be huge when there is selection for statistical significance.
And, remember, selection for statistical significance is not just about the “file drawer” and it’s not just about “p-hacking.” It’s about researcher degrees of freedom and forking paths that researchers themselves don’t always realize until they try to replicate their own studies. I don’t think Heckman and his colleagues have dozens of unpublished papers hiding in their file drawers, and I don’t think they’re running their data through dozens of specifications until they find statistical significance. So it’s not the file drawer and it’s not p-hacking as it is usually understood. But these researchers do have nearly unlimited degrees of freedom in their data coding and analysis, they do interpret “non-significant” differences as null and “significant” differences at face value, they have forking paths all over the place, and their estimates of magnitudes of effects are biased in the positive direction. It’s kinda funny but also kinda sad that there’s so much concern for rigor in the design of these studies and in the statistical estimators used in the analysis, but lots of messiness in between, lots of motivation on the part of the researchers to find success after success after success, and lots of motivation for scholarly journals and the news media to publicize the results uncritically. These motivations are not universal—there’s clearly a role in the ecosystem for critics within academia, the news media, and in the policy community—but I think there are enough incentives for success within Heckman’s world to keep him and his colleagues from seeing what’s going wrong.
Again, it’s not easy—it took the field of social psychology about a decade to get a handle on the problem, and some are still struggling. So I’m not slamming Heckman and his colleagues. I think they can and will do better. It’s just interesting, when considering the mistakes that accomplished people make, to ask, How did this happen?
via andrewgelman.com
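A quick simulation makes Gelman's point about selection bias concrete. This is only an illustration, not anything from the studies he discusses: the true effect, standard error, and two-sided 5% cutoff below are made-up numbers chosen to represent a low-power setting.

```python
# Illustrative sketch (made-up numbers): an estimator that is unbiased
# unconditionally becomes badly biased once we condition on significance.
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.1    # small true effect (assumed)
std_error = 0.1      # standard error of each estimate (assumed, low power)
n_sims = 100_000

# Each replication yields an unbiased estimate of the true effect.
estimates = rng.normal(loc=true_effect, scale=std_error, size=n_sims)
significant = np.abs(estimates / std_error) > 1.96   # two-sided 5% test

print(f"mean of all estimates:         {estimates.mean():.3f}")   # close to 0.10
print(f"share significant (power):     {significant.mean():.1%}")
print(f"mean of significant estimates: {estimates[significant].mean():.3f}")
print(f"exaggeration ratio (type M):   "
      f"{estimates[significant].mean() / true_effect:.1f}x")
```

Unconditionally the estimates average out to the true effect; conditional on clearing the significance threshold, the reported magnitudes are a multiple of it.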
Thinking about this in my small world, searching for the one willingness to pay (WTP) estimate that delivers a statistically significant result in favor of your hypothesis is the equivalent of p-hacking. The state-of-the-art, NOAA-Panel-endorsed referendum vote question can be quite sensitive to the nonparametric or parametric estimator chosen to summarize the votes into a WTP estimate. This is especially true when the data are "difficult" (Haab and McConnell). When a WTP estimate varies by a factor of 3 or 4 depending on whether a nonparametric or a parametric estimator is employed, researchers should interpret significance tests conducted with these data very carefully, if not reject them outright. I've become very worried that there is "lots of motivation for scholarly journals ... to publicize the results uncritically." Researchers with "difficult" CVM data need to present the full range of WTP estimates.
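As a sketch of what I mean, not code from any particular study: the snippet below computes a nonparametric Turnbull lower-bound WTP and a parametric linear-logit mean WTP (-alpha/beta) from the same single-bounded referendum responses. The bid design, sample sizes, and lognormal WTP distribution are all invented for illustration.

```python
# Hypothetical sketch: two standard WTP summaries of the same referendum data.
import numpy as np

rng = np.random.default_rng(0)

# Simulated single-bounded dichotomous-choice data: each respondent sees one
# bid and votes yes/no. True WTP drawn from a skewed (lognormal) distribution.
bids = np.array([5.0, 10.0, 25.0, 50.0, 100.0])   # assumed bid design
n_per_bid = 200
bid = np.repeat(bids, n_per_bid)
true_wtp = rng.lognormal(mean=3.0, sigma=1.2, size=bid.size)
yes = (true_wtp >= bid).astype(float)

# --- Nonparametric: Turnbull lower bound (Haab and McConnell) ---
# F_j = share of "no" at bid t_j; lower-bound mean = sum_j t_j * (F_{j+1} - F_j)
# with t_0 = 0 and F = 1 above the highest bid.
no_share = np.array([1.0 - yes[bid == t].mean() for t in bids])
F = np.maximum.accumulate(no_share)   # crude monotone fix; the full Turnbull
                                      # estimator pools adjacent violating cells
t_full = np.concatenate(([0.0], bids))
F_full = np.concatenate(([0.0], F, [1.0]))
wtp_turnbull = np.sum(t_full * np.diff(F_full))

# --- Parametric: linear logit, Pr(yes) = logistic(alpha + beta * bid) ---
# Mean (and median) WTP under this specification is -alpha/beta.
X = np.column_stack((np.ones_like(bid), bid))
coef = np.zeros(2)
for _ in range(25):                   # Newton-Raphson for the logit MLE
    p = 1.0 / (1.0 + np.exp(-X @ coef))
    grad = X.T @ (yes - p)
    hess = X.T @ (X * (p * (1.0 - p))[:, None])
    coef += np.linalg.solve(hess, grad)
wtp_logit = -coef[0] / coef[1]

print(f"Turnbull lower-bound mean WTP: {wtp_turnbull:.1f}")
print(f"Linear-logit mean WTP (-a/b):  {wtp_logit:.1f}")
```

With well-behaved data the two numbers tend to sit close together. With a flat or fat-tailed response curve the parametric mean hinges on the estimated slope, while the Turnbull bound ignores everything above the highest bid, and the gap can grow large—which is exactly why the full range of estimates should be on the table.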