Desvousges, Mathews and Train (Land Economics, 2015) use the contingent valuation method (CVM) to conduct an adding-up test (i.e., does WTP_{A} + WTP_{B} = WTP_{A+B}?). They use the nonparametric Turnbull estimator and find that the data do not pass the adding-up test. This suggests that the CVM lacks internal validity.

In September 2016 I began writing a comment on this paper by first posting a series of blog posts questioning the validity of the underlying data and their implementation of the survey. The comment went through several rounds of review: it was submitted, reviewed, revised and rejected at Land Econ (due to concerns about the DMT reply); submitted, reviewed, revised and then withdrawn from Economics: E-Journal; and submitted, reviewed and accepted for publication at Ecological Economics. The comment goes further than the blog posts by showing that adding-up is actually supported in some tests using parametric WTP estimators, even though the test is flawed in implementation and another hypothesis test is more appropriate.

Desvousges, Mathews and Train (Ecological Economics, forthcoming) have now replied to my comment by describing 12 mistakes (12!) that I made. I agree that I made one of the mistakes on their list. I conducted an adding-up test by examining whether the confidence intervals for two willingness to pay estimates (the whole vs the sum of the parts) overlap. It is well-known that confidence intervals can overlap and yet the t-statistic for the test can indicate that the difference in means is statistically significant. The mistake that I made was not checking the t-statistic. This is an embarrassing mistake. The worst part is that I teach this to undergraduates in the business statistics course. I tell them not to make this mistake, and yet I've made it in a published journal article. I'm very embarrassed.
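To illustrate the overlap problem, here is a minimal sketch in Python. The two WTP estimates and their standard errors are made-up numbers, not values from DMT (2015) or Whitehead (2020), and the two estimates are assumed to be independent so that the variance of the difference is the sum of the variances.

```python
import math

def ci95(est, se):
    """Symmetric 95% confidence interval using the normal critical value."""
    half_width = 1.96 * se
    return (est - half_width, est + half_width)

def intervals_overlap(a, b):
    """True if the two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def z_for_difference(est1, se1, est2, se2):
    """Test statistic for est1 - est2, assuming independent estimates."""
    return (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical (estimate, standard error) pairs for illustration only.
whole = (10.0, 1.0)
sum_of_parts = (13.5, 1.0)

overlap = intervals_overlap(ci95(*whole), ci95(*sum_of_parts))
z = z_for_difference(*sum_of_parts, *whole)

print("intervals overlap:", overlap)
print("z statistic:", round(z, 2))
```

With these numbers the intervals are roughly (8.04, 11.96) and (11.54, 15.46), so they overlap, but the statistic for the difference exceeds 1.96. Checking overlap is therefore a more conservative test than testing the difference directly.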

There are a variety of reasons, though not excuses, for this mistake which I will describe in another blog post. But today, let me point out another mistake that I made that concerns me almost as much as the t-statistic: *I used the wrong confidence intervals*. In Whitehead (2020) I used the confidence intervals from the Delta Method (a first-order Taylor series expansion using the variance-covariance matrix), which are symmetric. It is well-known that the distribution of a ratio of parameters (such as WTP) is not necessarily symmetric. The asymmetry gets more severe when the parameter in the denominator is imprecisely estimated, as in Desvousges, Mathews and Train (Land Economics, 2015). Another common approach is to use Krinsky-Robb (KR) confidence intervals. These are based on a simulation from the variance-covariance matrix of the estimated parameters. In a forthcoming blog post I'll show that the KR confidence intervals are very wide, so wide that the WTP for the sum of the parts lies within the confidence interval for the WTP for the whole, supporting the conclusions of Whitehead (2020). I'm embarrassed that I made this mistake too.
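The difference between the two kinds of interval can be sketched with a small simulation. Everything in the example is an assumption made for illustration: the coefficient values, the variance-covariance matrix, the number of draws and the seed are invented, and WTP is taken to be the ratio -a/b from a linear-in-bid model. It is not a replication of any estimate in DMT (2015) or Whitehead (2020).

```python
import numpy as np

# Hypothetical estimates: constant a_hat and bid coefficient b_hat, with the
# bid coefficient only about 2.5 standard errors from zero (imprecise).
a_hat, b_hat = 1.0, -0.01
V = np.array([[0.04, -0.0002],
              [-0.0002, 0.000016]])  # assumed variance-covariance matrix

wtp_hat = -a_hat / b_hat  # point estimate of WTP

# Delta Method: first-order Taylor approximation of the variance of -a/b.
grad = np.array([-1.0 / b_hat, a_hat / b_hat ** 2])
se_delta = float(np.sqrt(grad @ V @ grad))
delta_lo = wtp_hat - 1.96 * se_delta
delta_hi = wtp_hat + 1.96 * se_delta

# Krinsky-Robb: draw coefficients from the estimated sampling distribution,
# compute WTP for each draw, and take the 2.5th and 97.5th percentiles.
rng = np.random.default_rng(0)
draws = rng.multivariate_normal([a_hat, b_hat], V, size=10_000)
wtp_draws = -draws[:, 0] / draws[:, 1]
kr_lo, kr_hi = np.percentile(wtp_draws, [2.5, 97.5])

print("Delta Method 95% CI:", (round(delta_lo, 1), round(delta_hi, 1)))
print("Krinsky-Robb 95% CI:", (round(float(kr_lo), 1), round(float(kr_hi), 1)))
```

The Delta Method interval is symmetric around the point estimate by construction. In this setup the simulated interval has a much longer upper tail, because draws of the bid coefficient close to zero produce very large ratios.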

Other than my big mistake (and the Delta Method confidence interval mistake), my biggest concern with the Desvousges, Mathews and Train "Reply to Whitehead" is that they do not take the problems with their own research very seriously. In contrast, when I've had papers that have received comments I've tried to learn from the comment and then tried to fix my paper (e.g., see Whitehead, Land Econ, 2004). Desvousges, Mathews and Train instead adopt the strategy that the best defense is a good offense. Their attitude seems to be that their data is no worse than any other CVM data set (in particular, they point to my own data from 15-30 years ago in footnote 3). I don't believe that this approach is the best way to advance economic science.

My comment on Desvousges, Mathews and Train (Land Economics, 2015) addresses three main issues: (1) the data are flawed/low quality, (2) implementation of the adding-up test in the survey is flawed and (3) additional statistical tests for adding-up do not support the DMT (2015) results. None of these issues are refuted by Desvousges, Mathews and Train (forthcoming). Instead, each of these issues has been confused by the Desvousges, Mathews and Train "Reply to Whitehead".

First, I provide a correction to my mistake described above. Second, here is a response to my 12 "mistakes":

(1) The log-linear models are not meaningless as claimed by DMT. The log-linear model with median WTP is a simple way of addressing negative WTP. The fact that the mean WTP from these models is infinite is not a functional form problem; it is a data problem. The median of the sum of WTP estimates provided by DMT (2020) lies within the 95% Krinsky-Robb confidence interval for the median WTP of the whole scenario. [more here]

(2) The linear-in-bid model that allows negative willingness to pay is not inappropriate. Negative WTP can arise from this functional form if the percentage of yes responses is less than 50% at the lowest bid or if the WTP estimate is statistically imprecise. The point estimate of mean WTP for this model provides positive WTP estimates in four out of the five scenarios. The negative WTP estimate is from the troublesome second scenario data (see (4)). The third scenario generates negative WTP values from the statistical distribution. Accounting for these in a statistical adding-up test supports my result that the sum of the WTP parts cannot be statistically distinguished from the whole scenario. [more here]

(3) Following the approach taken in the correction, the adding-up test passes when respondents with missing demographics are dropped and the more appropriate confidence intervals (KR) are used. [more here]

(4) The weighted data does not support the results in Desvousges, Mathews and Train (Land Economics, 2015) as DMT claim. The weighted data for the whole and second scenarios are "roller coaster" and "Nike swoosh" shaped instead of downward sloping as required by theory. This suggests that the weighted data reveals some irrationality amongst respondents. DMT's approach is to impose respondent rationality across the scenarios. They constrain the cost coefficients to be equal across scenarios in order to impose a downward sloping cost effect. This is inappropriate when it is done to hide statistically insignificant (roller coaster) and wrong-signed (Nike swoosh) slope coefficients. [more here]

(5) DMT notice that I conducted an adding-up test with the Kristrom nonparametric estimator in a 2016 blog post (here). They claim that I "inadvertently dropped observations" when conducting these calculations. Dropping these observations was not "inadvertent." In the blog post at issue I used a sample size of n=950, which is the same sample size that DMT (2015) used in their Table 5 (dropping observations with a missing age variable).

DMT (2020) report that the adding-up test fails with the Kristrom estimator and that I "failed to report relevant findings" because I did not include this in Whitehead (2020). This raises the question: how many additional tests should be conducted in a comment on a paper? In Whitehead (2020) I provided three parametric tests using some of the standard models in the literature. I then considered the robustness of these tests with (a) weighted data and (b) the complete case data set (n=934 after dropping those with missing age and income).

(6) Claims that the Chapman et al. (2009) data and a number of my own data sets (circa 1992 - 2011) are of the same low quality as the Desvousges, Mathews and Train (Land Economics, 2015) data are overstated. I showed in an Appendix in Whitehead (2020) that the Chapman et al. (2009) data are far superior in quality to the Desvousges, Mathews and Train (Land Economics, 2015) data. Using the length of the upper tail as a measure of quality, I find that my own data sets mostly range between the Chapman and DMT data (one of my data sets is a literal "off the chart" low quality outlier). Quality is an increasing function of sample size. [more here]

(7) Desvousges, Mathews and Train (2015) have not provided their internet survey for review. I asked twice. The first time Bill Desvousges had his assistant send me the Chapman et al. (2009) report containing their in-person surveys. The second time I asked he notified the Economics: E-Journal editor about my request. The editor told me that he thought I had everything I needed to write my replication paper and not to email Bill Desvousges again (I won't). Claims that their survey conveys information about substitution effects to survey respondents are simply assertions. It would be forthright to provide the survey for review.

(8) DMT (2020) are correct in pointing out that "implicit claim" may be a poor word choice. In DMT (2015) they have an empirical finding that there are no income effects. But, in mistake (9) they acknowledge that there is a statistically significant income coefficient when they use the weighted data. They have not explained why they chose to impose this "external" income constraint instead of incorporating income effects "internally" in the survey scenarios. Internal/external may be a better word choice than implicit/explicit.

(9) My statistically significant income coefficient was found using the models with weighted data. Desvousges, Mathews and Train (2020) state that they re-ran their simulations with the weighted income coefficient and found similar results. But, if they re-ran their simulations with the weighted income coefficient they should have done the test with the weighted WTP models, which lack validity (see (4) above). The "external" income test cannot be conducted in a model with consistent assumptions made about the data unless one constrains the cost coefficients to be equal (which is done to hide statistically insignificant and wrong-signed cost coefficients).

In Whitehead (2020) I also doubt that income is the correct budget constraint. I suspect that survey respondents have some environmental contribution budget in mind when answering CVM questions. In footnote 4 DMT state that this is a violation of microeconomic theory. I assume that they are referring to neoclassical microeconomic theory and ignoring behavioral economics. Even then, an environmental contribution budget is consistent with two-stage budgeting, where a household first allocates income to different budget categories and then maximizes subutility functions subject to the budget constraint for each category (Deaton and Muellbauer 1980 -- this theory led to the development of the Almost Ideal Demand System econometric model).

(10) My proposed hypothesis, based on my reading of Desvousges, Mathews and Train (Land Economics, 2015) and the lack of the survey instrument (see (7)), is a one-tailed scope test. It is not a one-tailed adding-up test. Note also that any information provided in a CVM survey about "substitute" environmental goods can be interpreted by respondents as information about complements (see Whitehead and Blomquist, WRR, 1991).

(11) Arrow et al. (1994) regret using the term adequate (in reference to the size of scope effects) in the NOAA Panel report. Instead they suggested the appropriate word is plausible scope effects. I pointed this out in Whitehead (Ecol. Econ 2016) and proposed scope elasticity as a measure. Scope elasticity is a more useful measure of plausibility than the adding-up test is for adequacy given difficulties in conducting an adding-up test.

(12) My Turnbull standard error estimates differ from DMT's (2015) standard errors. I applied the formulas in Haab and McConnell (2002) with pooled (smoothed) data. DMT (2020) report that they used the raw data to construct confidence intervals with the smoothed data WTP estimate. My estimates of the standard errors are larger than DMT's (2020). But, it seems like standard errors with the raw data (not smoothed) should be larger than standard errors from the smoothed data. DMT (2020) do not provide much information on this estimation so it is difficult to say more.
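For readers unfamiliar with the estimator, here is a rough sketch of a Turnbull lower-bound calculation with pooling. It is an illustration under stated assumptions, not a reproduction of my calculations or of DMT's: the bid amounts and response counts are made up, cells are assumed to be independent, and when two cells are pooled the sketch keeps the lower bid (an assumed convention).

```python
import math

def pool_adjacent_violators(bids, no_counts, totals):
    """Pool cells until the share of 'no' responses is non-decreasing in the bid."""
    cells = []  # each cell is [bid, number of 'no' responses, number of responses]
    for bid, n_no, n_all in zip(bids, no_counts, totals):
        cells.append([bid, n_no, n_all])
        # Merge backwards while the newest cell has a lower 'no' share than the one before.
        while len(cells) > 1 and cells[-2][1] / cells[-2][2] > cells[-1][1] / cells[-1][2]:
            _, n_no_hi, n_all_hi = cells.pop()
            cells[-1][1] += n_no_hi   # assumed convention: the pooled cell keeps the lower bid
            cells[-1][2] += n_all_hi
    return cells

def turnbull_lower_bound(bids, no_counts, totals):
    """Lower-bound mean WTP and its standard error from the pooled cells."""
    cells = pool_adjacent_violators(bids, no_counts, totals)
    t = [0.0] + [c[0] for c in cells]                  # t[0] = 0
    F = [0.0] + [c[1] / c[2] for c in cells] + [1.0]   # F[0] = 0, last element = 1
    mean = sum(t[j] * (F[j + 1] - F[j]) for j in range(len(t)))
    var = sum((t[j] - t[j - 1]) ** 2 * F[j] * (1 - F[j]) / cells[j - 1][2]
              for j in range(1, len(t)))
    return mean, math.sqrt(var)

# Hypothetical data: the 'no' share falls from 0.45 to 0.40 between bids 25 and 50.
bids = [10, 25, 50, 100]
no_counts = [20, 45, 40, 80]
totals = [100, 100, 100, 100]

pooled = pool_adjacent_violators(bids, no_counts, totals)
mean_wtp, se_wtp = turnbull_lower_bound(bids, no_counts, totals)
print(pooled)
print(round(mean_wtp, 3), round(se_wtp, 3))
```

In this sketch pooling raises the number of responses in the merged cell, which shrinks that cell's contribution to the variance. That is the intuition behind expecting smaller standard errors from smoothed data than from raw data.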

Comments welcome!