What you are about to read is a look inside the sausage factory that is academic publishing. Unfortunately, this is going to be long and technical, but John and I feel that airing our side in public might be useful in the hopes of putting a debate to rest.
First, a little background
In 1997, I was two years into a career as an assistant professor in the Department of Economics at East Carolina University. Most of my research at the time focused on the valuation of environmental goods and services, and I had a particular interest in a fairly young survey-based method known as Contingent Valuation (CV): CV involves asking survey respondents whether they would hypothetically be willing to pay a certain price for a change in an environmental amenity. I was fortunate enough to have two colleagues with equal interest in CV: Ju-Chin Huang and John Whitehead--yeah, that John Whitehead. Around since the 1970s, CV gained popularity in the early 1990s when Exxon was kind enough to run one of its oil tankers aground in Alaska, creating the need for readily applicable techniques for valuing damages to pristine environments. After the Exxon Valdez spill, the debate over the use of hypothetical value elicitation questions raged.
A hypothetical problem
In the June 1997 issue of the Journal of Political Economy (one of the big three (or four) economics journals), Ron Cummings, Steven Elliot, Glenn Harrison and James Murphy (CEHM) published an article titled: Are Hypothetical Referenda Incentive Compatible? In layman's terms, the article attempts to answer the question: do people respond differently to hypothetical yes/no referendum-style value elicitation questions--questions of the form, "Would you vote for a program that cost you $10? If more than half the people in the room vote yes, then everyone has to pay"--than they do when real money is at stake? Are the incentives in the question compatible with truth revelation?
To test for differences between hypothetical and real questions, CEHM perform an experiment wherein half of their test groups are asked the question but are told the question is hypothetical, and the other half are asked to put their money where their mouth is (actually pay).
CEHM find, somewhat unsurprisingly in hindsight, that people tend to overstate their willingness to pay (say yes more often) in a hypothetical setting relative to a real one. As they put it:
The experiments we report represent one of possibly many ways by which the hypothesis that hypothetical referenda are incentive compatible might be empirically tested. Results from our experiments, which involve a relatively simple good that is valued in a simple and controlled setting, suggest that this hypothesis be rejected.
Hypothetical bias was well-known, but this study dealt a damning blow to contingent valuation, mainly because the CEHM test was simple, clean and published in a highly respected general economics journal. Unfortunately, CEHM's experiments had a potentially fatal flaw, and we were about to point it out.
The fatal flaw
At the time, we knew of two related, but possibly competing, hypotheses floating around as to why hypothetical responses might differ from real responses (many more have emerged since):
- the average value of respondents' willingness to pay is fundamentally different between real and hypothetical questions, and
- the variability of respondents' willingness to pay is different between real and hypothetical questions (that is, people pay less attention to hypothetical questions).
In simplest form, the debate boiled down to whether the means and variances of the distributions of willingness to pay are the same between the two types of questions.
CEHM seem to argue in favor of the first explanation for the differences in their results: something shifts the mean of the distribution of willingness to pay between real and hypothetical questions. This is a particularly troubling result, because accurate measurement of values for environmental goods and services requires the hypothetical questions to at least tell us where the average willingness to pay lies. If the format of the question itself shifts the distribution, then any results from hypothetical techniques, like CV, are questionable at best. But the single-price design that CEHM employ in their experiments--everyone was offered a price of $10--doesn't let us distinguish a difference in the means of the distributions from a difference in their variances. To test for differences at all, CEHM were forced to implicitly assume that the variances were equal. But restricting the variances to be equal can significantly affect the estimates of the means of each distribution: differences in variances can be mistaken for differences in means.
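To see the identification problem concretely: if willingness to pay is distributed N(μ, σ²), a probit model predicts a yes-rate of Φ((μ − t)/σ) at bid t. With a single bid, the data reveal only that ratio, so a mean shift and a scale change are observationally equivalent. A minimal sketch (the numbers here are illustrative, not from the CEHM data):

```python
import math

def probit_yes_prob(mu, sigma, t):
    """P(yes) = Phi((mu - t)/sigma) when latent WTP ~ N(mu, sigma^2)."""
    z = (mu - t) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

t = 10.0  # the single $10 bid: the only price anyone is offered

# Two very different stories that fit the same observed yes-rate:
p_mean_shift = probit_yes_prob(16.0, 4.0, t)  # mean WTP of $16, sd 4
p_scale_diff = probit_yes_prob(13.0, 2.0, t)  # mean WTP of $13, sd 2

# Both give (mu - t)/sigma = 1.5, hence identical data at this one bid.
print(p_mean_shift, p_scale_diff)
```

With only one price point, no amount of econometric cleverness can tell these two stories apart; the likelihood is literally the same.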
Time for some hubris
John, Ju-Chin and I noticed the problem with CEHM's design and decided it was important enough to point out in writing. We got hold of the original CEHM data from their website to see if we could test the differences in variances. We borrowed a complicated and convoluted technique from the marketing literature for testing variance differences between distributions. We applied the technique and found significantly more variability in the hypothetical responses than in the real responses; as a result, the CEHM results could reflect differences in variances rather than differences in means. The JPE published our findings* in a comment in its February 1999 issue.
Despite the significant amount of work that went into learning how to test for the variance differences, and the significant portion of the comment dedicated to the tests, the main point of our comment was supposed to be much simpler. As we state in our conclusions:
With referendum data it is possible to identify the scale (variance) parameters between real and hypothetical referenda. The typical dichotomous choice approach to the contingent valuation of public goods varies the $t bids that are offered to respondents, and the percentage of yes responses falls as $t increases...As few as two different bids will suffice to identify the relevant scale parameters.
In other words, the CEHM experiments weren't designed in a way that lets us distinguish between the competing hypotheses. It should be noted that at no point in our comment do we claim that hypothetical referenda are incentive compatible. We do claim that the CEHM experiments, when properly adjusted for scale differences, don't really tell us much about the differences between hypothetical and real responses, but our primary intent was to point out that:
Further experiments that examine the incentive compatibility of the hypothetical referendum should employ multiple bid values so the scale parameters can be uniquely estimated.
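The fix really is that simple: with two distinct bids, the observed yes-rates give two equations in the two unknowns (μ, σ), so both the mean and the scale are identified. A hedged sketch of the algebra, using made-up bids and parameters rather than any actual study design:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi-inverse

def recover_mu_sigma(t1, p1, t2, p2):
    """Invert Phi((mu - t)/sigma) at two distinct bids to recover (mu, sigma).

    z_i = (mu - t_i)/sigma, so z1 - z2 = (t2 - t1)/sigma.
    """
    z1, z2 = nd.inv_cdf(p1), nd.inv_cdf(p2)
    sigma = (t2 - t1) / (z1 - z2)
    mu = t1 + sigma * z1
    return mu, sigma

# Illustrative truth: WTP ~ N(14, 3^2), bids of $10 and $15
mu_true, sigma_true = 14.0, 3.0
t1, t2 = 10.0, 15.0
p1 = nd.cdf((mu_true - t1) / sigma_true)  # yes-rate at the low bid
p2 = nd.cdf((mu_true - t2) / sigma_true)  # yes-rate at the high bid

mu_hat, sigma_hat = recover_mu_sigma(t1, p1, t2, p2)
print(mu_hat, sigma_hat)  # recovers (14.0, 3.0) up to floating-point error
```

Run separately on the real and hypothetical samples, this kind of inversion pins down each treatment's mean and scale, which is exactly what a single-bid design cannot do.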
Apparently our intent wasn't clear enough. To date, our innocent JPE comment has been cited 28 times (not a huge number, but not peanuts either). While most of the citations are in the context of new tests for differences between real and hypothetical responses, some have called into question our results.
Calling us out
In a 2006 Environmental and Resource Economics article, Glenn Harrison (the H in CEHM) writes in a lengthy footnote:
Haab et al. (1999) argue for allowing the residual variance of the statistical model to vary with the experimental treatment. They show that such heteroskedasticity corrections can lead the coefficients on the experimental treatment to become statistically insignificant, if one looks only at the coefficient of the treatment on the mean effect. This is true, but irrelevant for the determination of the marginal effect of the experimental treatment, which takes into account the joint effect of the experimental treatment variable on the mean response and on the residual variance. That marginal effect remains statistically significant in the original setting considered by Haab et al. (1999), the referendum experiments of CEHM.
Perhaps true, although I would argue that the coefficients of the willingness to pay function are really what is of interest (not the marginal effect of the treatment on the probability of a yes), but largely irrelevant to our bigger and originally intended point: the experiments don't allow us to identify the differences, and a simple fix exists--offer more than one bid. This simple fix has been implemented a number of times in the literature, and each time the results support the conclusions of CEHM (e.g., here and here). In our minds these studies put the matter to rest. Hypothetical bias likely exists in CV studies and something should be done to mitigate the bias (see the studies in the preceding parenthetical).
Now, in 2010, an article by Carlsson and Johansson-Stenman is forthcoming in the Journal of Environmental Economics and Management (arguably the top environmental economics journal). Its intent is to call into question, you guessed it, our 1999 comment. Sheesh.
The paper by Cummings et al. (henceforth CEHM) is an important study that compared two treatments: one hypothetical and one real referendum directed towards people living close to a contaminated area. ...However, in a comment, Haab et al. (henceforth HHW) dispute this conclusion and claim that “the results of the experiments by Cummings et al. do not reject the hypothesis of incentive compatibility of hypothetical referenda” (p. 186). They further claim that if one corrects for a difference in variance between the two treatments, then there is no significant difference between them anymore.
Yes, we said that.
In this note we explore in detail the importance of and problems with correcting for a possible difference in variance between data sets. The point raised by HHW is indeed potentially very important.
However, as we will show, the way they identify and correct for the relative scale factor is inappropriate, and it may indeed be difficult or even impossible to identify a difference in variance when the informational basis for such an estimation is weak, such as in the case with the CEHM data.
THAT WAS OUR POINT IN 1999! CEHM's experiments don't allow for a very simple test. Offer more than one bid, and none of this is necessary. To make matters worse, Carlsson and Johansson-Stenman point out yet another study questioning our comment:
...Stefani and Scarpa also question the conclusions by HHW, albeit from a somewhat different perspective; they propose an alternative Bayesian approach to deal with the identification problem.
In a working version of the Stefani and Scarpa paper that Ric Scarpa sent me in 2008 for comment (I can't seem to find a link to an online version), Stefani and Scarpa state:
HHW claimed that CEHM results relied on the hypothesis of homoskedasticity between the real and hypothetical treatment. Failing to account for possible differences of variability of responses across treatments would have resulted in inconsistent parameter inference.
HHW also stress that in order to test for the presence of hypothetical bias the referendum data employed by CEHM are not suitable since the scale parameters cannot be uniquely estimated. To overcome this problem they suggest employing multiple bids designs to examine the incentive compatibility issue [emphasis added].
In this context we propose a Bayesian estimator for the heteroskedastic probit, based on plausible priors, could mitigate the problems encountered by maximum likelihood estimators thereby providing an alternative tool for the analysis of this type of data [emphasis added].
That sounds like a really complicated solution to a simple problem: this type of data--that is, single-bid data--shouldn't exist. Offer more than one bid and all of this is avoided--we don't get a 14-page JPE comment and, quite possibly, I get depressed, forgo a move to Ohio State, and John and I never achieve our stellar reputation as bloggers.
Putting the matter to rest (hopefully)
In all seriousness, in 2006, Carlsson and Johansson-Stenman were kind enough to send us a draft of their note prior to submitting it for publication. Their note focuses on our procedure for testing for scale differences between hypothetical and real responses. In comments back to the authors, we made it abundantly clear that that procedure was not intended to be the main focus of our comment (although it obviously has become just that). Below I have printed the comments we sent back to Carlsson and Johansson-Stenman on April 17, 2006. If I haven't bored you to tears, and you are still interested in this debate, I encourage you to read the Carlsson and Johansson-Stenman article first and then take a look at our comments below (slightly edited for length and emphasis). Enjoy.
Comments on Carlsson and Johansson-Stenman (by Haab, Huang and Whitehead; April 17, 2006)
1) In the original comment we show that scale differences might lead to the wrong conclusion. IF there is NO hypothetical bias, we can identify the relative scale. Failing to do so may lead to scale differences being mistaken for hypothetical bias. If hypothetical bias exists, we state in the appendix that our procedure will result in inconsistent estimates of the relative scale [ed. note: an appendix to a comment?].
2) A number of papers cite our comment. They seem to be split on their interpretation. Some claim it is just part of the debate on whether hypothetical and real data differ (wrong interpretation); others focus on scale differences and that the hypothetical/real differences might be attributable to differences in scale (right interpretation). Bottom line: we were commenting on the poor quality of the CEHM data and not on whether hypothetical bias does or does not exist.
3) Carlsson and Johansson-Stenman show that if there is hypothetical bias, then the relative scale parameter cannot be identified. Our original comment does not contradict this and in fact supports it from a different angle. We start from the assumption that hypothetical bias does not exist and then show that scale differences might lead to the wrong conclusion. We make no general claims—or at least never intended to—about the case where hypothetical bias does exist. We were simply trying to point out that scale differences may be mistaken for hypothetical bias in some cases.
4) We tried to duplicate the Carlsson and Johansson-Stenman results using our procedure with scaled-down simulations (2,000 observations) and scaled-up scale differences. When no hypothetical bias exists, the non-scaled probit detects hypothetical bias but the properly scaled probit does not. This is what we showed in the original comment. Also, the grid search produces an estimate of the scale parameter that is very close to the ‘true’ value. When hypothetical bias does exist—i.e., the simulated means shift—the non-scaled probit produces biased estimates of all parameters. Also, the grid search seems to produce a biased estimate of the scale factor, as Carlsson and Johansson-Stenman predict. Hypothetical bias is still detected, but the magnitude is wrong.
5) If hypothetical bias does not exist, failing to account for scale differences might lead to a conclusion that there is hypothetical bias where there is none (this is what we said in the original comment). But if hypothetical bias does exist, we cannot identify the relative scale parameter, and the grid search results in an inconsistent estimate (what we alluded to in the original comment, but the Carlsson and Johansson-Stenman note makes very clear). We never claim to be able to identify hypothetical bias if scales are different -- just that one might falsely classify scale differences as hypothetical bias.
6) With respect to the CEHM data, with no variation in the bid value, we cannot uniquely identify both β and σ. Consequently we cannot compare the hypothetical and real data. In a sense, CEHM attributed all differences between hypothetical and real data to β, holding the σ ratio at 1, and our comment provided the opposite extreme, attributing all differences to the σ ratio while holding β=0. What we showed was the possibility that CEHM could have been wrong; they could also be right if the truth lay somewhere in between. This is what the Carlsson and Johansson-Stenman note shows, and they lean toward believing that there is a difference between the real and hypothetical data. It is not clear why Carlsson and Johansson-Stenman focus on the case of a single bid—in their case the bid is normalized to zero. The literature has moved away from this practice, and offering more than one bid readily solves the identification problems.
Econometrically, the issue is data pooling. In the simulation, we can assume the coefficients of the other explanatory variables are the same for both hypothetical and real data. In reality, we need to test the equality of all coefficients. Following what we outlined in our comment, we tested whether all coefficients are the same between hypothetical and real data, and that is the correct way to do it, although the identification issue remains when there is no variation in the bid value. Further analysis of the CEHM data will not resolve this issue. We need data with multiple bid values.
Multiple bid values allow us to focus on the real issue here: do scale differences AND hypothetical bias exist or are we confusing one for the other?
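The kind of simulation exercise described in point 4 above--a pooled probit with a grid search over the relative scale--can be sketched as follows. Everything here (sample sizes, bid values, parameter grids) is illustrative and not the actual HHW or Carlsson and Johansson-Stenman setup; the point is only that with multiple bids, a crude profile-likelihood search can recover a scale difference even when the means are identical.

```python
import math
import random
from statistics import NormalDist

nd = NormalDist()  # standard normal
random.seed(42)

# Illustrative truth: identical mean WTP in both treatments (no hypothetical
# bias), but hypothetical responses are twice as noisy (relative scale 2).
MU, SIGMA, LAM_TRUE = 12.0, 2.0, 2.0
BIDS = [8.0, 12.0, 16.0]  # multiple bids, so the scale is identified

def simulate(n, lam):
    """Yes/no responses: vote yes if latent WTP exceeds the offered bid."""
    out = []
    for i in range(n):
        t = BIDS[i % len(BIDS)]
        wtp = random.gauss(MU, SIGMA * lam)
        out.append((t, 1 if wtp > t else 0))
    return out

real = simulate(300, 1.0)
hyp = simulate(300, LAM_TRUE)

def loglik(mu, sigma, lam):
    """Pooled probit log-likelihood, scaling the hypothetical sample by lam."""
    ll = 0.0
    for data, scale in ((real, 1.0), (hyp, lam)):
        for t, y in data:
            p = nd.cdf((mu - t) / (sigma * scale))
            p = min(max(p, 1e-10), 1.0 - 1e-10)  # guard against log(0)
            ll += math.log(p if y else 1.0 - p)
    return ll

def profile_ll(lam):
    """Crudely maximize over (mu, sigma) on a grid for a candidate scale."""
    return max(loglik(11.0 + 0.2 * i, 1.0 + 0.3 * j, lam)
               for i in range(11) for j in range(11))

lam_grid = [1.0, 1.5, 2.0, 2.5, 3.0]
lam_hat = max(lam_grid, key=profile_ll)
print("estimated relative scale:", lam_hat)
```

With a single bid this search would be pointless: as shown earlier, the likelihood depends only on (μ − t)/σ, so every candidate scale can fit the data equally well once μ and σ adjust.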
I hope this puts this matter to rest.
But somehow I doubt it.
*Kerry Smith simultaneously published a related but different critique of the CEHM experimental procedure in the same issue of the JPE.