Yoram Bauman:

This is the fifth such review I’ve been involved in and it is almost certainly the last review I’ll be doing, for the simple reason that the vast majority of textbooks now have excellent content on climate change! (If desired you can skip directly to the report card, or read on for some context and big-picture thoughts.)

The state of affairs today is very different from that of 10 years ago—my previous reviews were in 2010, 2012, 2014, and 2017—much less 20 years ago, when I had an astonishing and hilarious email exchange with University of Houston professors Roy Ruffin and Paul Gregory about the wacky climate-skeptic claims (“no matter how much contrary evidence is presented, it just doesn’t matter”) in their now-defunct textbook.

In past years I have given out a Ruffin and Gregory Award for the Worst Treatment of Climate Change in an Economics Textbook, and I am pleased to say that

no book merits that award this year.

This is good news ... the economics profession won't be participating (as much) in the training of undergraduates in climate skepticism.

DMT (2020) draw attention to my treatment of the weighted WTP estimates. The regression model for the second scenario has a negative sign on the constant and a positive sign on the slope. When I "mechanically" calculate WTP for the second scenario, it is a positive number that adds to the sum of the WTP parts. This is in contrast to the unweighted data, for which WTP is negative. Inclusion of the data from this scenario biases the adding-up tests in favor of the conclusion that the WTP data do not pass the adding-up test.

The motivation for my consideration of the weighted data was DMT's (2015) claim that they found similar results with the weighted data. My analysis uncovered validity problems with two of the five scenarios which, when included in an adding-up test, led to a failure to reject adding-up. At this point in the conversation it is instructive to examine the weighted data visually to see if they even pass the "laugh" test. In my opinion, they don't.

Below are the weighted votes and the Turnbull for the whole scenario (note that the weights are scaled to equal the sub-sample sizes). The dots and dotted lines represent the raw data. Instead of a downward slope, these data are "roller-coaster" shaped (two scary hills with a smooth ride home). The linear probability model (with weighted data) has a constant equal to 0.54 (t=9.73) and a slope equal to -0.00017 (t=-0.69). This suggests to me that the whole scenario data, once weighted, lack validity. Even so, the solid Turnbull line illustrates how a researcher can obtain a WTP estimate from data that do not conform to rational choice theory. The Turnbull smooths the data over the invalid stretches of the bid curve (the "non-monotonicities" in CVM jargon) and the WTP estimate is the area of the rectangles. In this case WTP = $191, which is very close to the unweighted Turnbull estimate. But a researcher should consider this estimate questionable since the underlying data do not conform to theory. As a reminder, the WTP for the whole scenario is key to the adding-up test as it is compared to the sum of the parts. The WTP estimate from the linear logit model is $239 with Delta Method [-252, 731] and Krinsky-Robb [-8938, 9615] confidence intervals. Given the statistical uncertainty of the WTP estimate, it is impossible to conduct any sort of hypothesis test with these data.
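The smoothing step can be sketched in a few lines of Python. This is a generic pooled-adjacent-violators version of the Turnbull lower bound, not the exact routine behind the estimates above, and the numbers in the test are made up for illustration:

```python
def turnbull_lower_bound(bids, yes, n):
    """Turnbull lower-bound WTP with pooling over non-monotonic cells.

    bids: bid amounts sorted ascending; yes: yes-vote counts per bid;
    n: respondents per bid. Illustrative data only, not DMT (2015)."""
    b = [float(x) for x in bids]
    y = [float(x) for x in yes]
    m = [float(x) for x in n]
    s = [yi / mi for yi, mi in zip(y, m)]   # share voting yes at each bid
    # pool adjacent violators: the survival curve must be non-increasing
    i = 0
    while i < len(s) - 1:
        if s[i + 1] > s[i]:                 # non-monotonicity found
            y[i] += y.pop(i + 1)
            m[i] += m.pop(i + 1)
            b.pop(i + 1)                    # pooled cell keeps the lower bid
            s.pop(i + 1)
            s[i] = y[i] / m[i]
            i = 0                           # rescan from the start
        else:
            i += 1
    # lower bound: mass falling between t_j and t_{j+1} is valued at t_j,
    # and mass above the highest bid is valued at that bid
    total = 0.0
    for j in range(len(s)):
        s_next = s[j + 1] if j + 1 < len(s) else 0.0
        total += b[j] * (s[j] - s_next)
    return total
```

The WTP is the "area of the rectangles": each rectangle's height is the drop in the (pooled) survival curve and its width is the lower bid of the interval, which is what makes the estimate a lower bound.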

Below are the weighted votes and the (pooled) Turnbull for the second scenario. The dots and dotted lines represent the raw data. Instead of a downward slope, these data are "Nike swoosh" shaped. The linear probability model (with weighted data) has a constant equal to 0.13 (t=2.46) and a slope equal to 0.00107 (t=4.19). This suggests to me that the second scenario data, once weighted, lack validity. Again, the Turnbull estimator masks the weakness of the underlying data. In this case, the Turnbull is essentially a single rectangle. With pooling, the probability of a vote in favor is 28.06% at the lower bid amounts and 27.56% at the higher bids. The Turnbull WTP estimate is $112, which appears to be a reasonable number, hiding the problems with the underlying data.

DMT reestimated the full data model with the cost coefficients constrained to be equal. In a utility difference model the cost coefficient is the estimate of the marginal utility of income. There is no reason for the marginal utility of income to vary across treatments unless the clean-up scenarios and income are substitutes or complements. This theoretical understanding does not explain why the weighted models for the whole and second scenarios are not internally valid (i.e., the cost coefficient is not negative and statistically different from zero). The model that DMT refer to passes a statistical test, i.e., the model that constrains the cost coefficients to be equal is not statistically worse than an unconstrained model, but it should be considered inappropriate due to the lack of validity in the weighted whole and second scenario data sets. Use of the model with a constrained cost coefficient amounts to hiding a poor result. The reason the weighted model with the full data set takes the correct sign is that the scenarios with correct signs outweigh the scenarios with incorrect or statistically insignificant signs. The reader should attach little import to DMT's (2015) claim that their result is robust to the use of sample weights.

When dichotomous choice CVM data are of low quality, the measure of central tendency is sensitive to assumptions. As I showed in a paper presented earlier this year (Landry and Whitehead 2020), with the highest quality data the choice of WTP estimator makes no difference. The Turnbull, Kristrom, linear logit (under both zero WTP assumptions), and linear probability models all produce the same estimate.

As data quality falls, however, the choice of WTP estimate can matter a great deal. In this situation, so as to avoid sponsor and other biases, it is important for the CVM researcher to present the full range of WTP estimates and avoid the impression that results have been cherry picked. This range of WTP provides a more complete depiction of analyst uncertainty and allows for sensitivity and other analyses.

I have grown intensely suspicious whenever I see hypothesis tests conducted with only the Turnbull WTP estimate. First, it is a lower bound WTP estimate, so potential differences are minimized. Second, its standard errors are smaller (relative to the mean) than those of parametric WTP estimates. This second observation is due to the way the standard errors are calculated and to the fact that the data are smoothed when there are non-monotonicities. As Haab and McConnell (1997, p. 253) explained (emphasis added): "We demonstrate that the Turnbull model ... provides a straightforward alternative to parametric models, **so long as one simply wants to estimate mean willingness to pay**." When hypothesis tests are being conducted, a range of WTP estimates should be used to determine if the results are robust to the estimation method.

So, is it reasonable to include the linear-in-bid parametric model in this collection of WTP estimates? Hanemann (1984, 1989) showed that in a linear utility model, U = a(Q) + bY where Q is a good and Y is income, the mean (and median) willingness to pay is WTP = -a*/b, where a* is the change in utility from changes in Q and b is the marginal utility of income. One benefit of this estimate is that it is insensitive to fat tails. However, this estimate allows for negative WTP values unless the probability of a yes response to a dichotomous choice question is 100% when the bid amount is zero. Negative WTP values can enter into the analysis in two ways. First, the WTP estimate itself can be negative. This will occur when the probability of a yes response at the lowest bid amount is less than 50% (it is this possibility that, I think, motivated Haab and McConnell). The second possibility is that the empirical distribution of WTP can include negative values. This is of little consequence to the analysis unless the confidence interval includes zero. Both circumstances arise with the DMT (2015) data.
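To make the two situations concrete, here is a minimal sketch of the linear logit welfare calculation and the implied share of negative WTP under the logistic distribution. The coefficient values in the test are illustrative, not estimates from DMT (2015):

```python
import math

def linear_logit_wtp(a, b):
    """Mean (= median) WTP from Hanemann's linear model: WTP = -a/b,
    where a is the constant and b the (negative) bid coefficient."""
    return -a / b

def share_negative_wtp(a):
    """Implied Pr(WTP <= 0) under the logistic distribution: one minus
    the predicted probability of a yes vote at a zero bid."""
    return 1.0 - 1.0 / (1.0 + math.exp(-a))
```

For example, with a = 0.5 and b = -0.005 the point estimate is $100, yet roughly 38% of the implied WTP distribution lies below zero, which is how a confidence interval can come to include zero even when the point estimate looks sensible.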

DMT (2020) dismiss outright the possibility of negative WTP. Their dismissal is consistent with Haab and McConnell's argument that since public goods are freely disposable, negative WTP is only an empirical artifact of a distributional assumption. But, with government policy free disposal is not always possible. In the case of a clean up of natural resource damages, the clean up could be considered a wasteful intrusion into a private business decision. Bohara, Kerkvliet and Berrens (2001) discuss how and why negative WTP values might arise, along with empirical examples. Considering this, I would not be surprised if some of the respondents to CVM scenarios demanded compensation for environmental cleanup.

There have been a number of suggestions about how to handle negative willingness to pay. Many of these involve obtaining more data with follow-up questions (Landry and Whitehead). Unfortunately, the DMT (2015) survey data does not have any of this supplemental information. In that case, in my opinion, the possibility of negative WTP cannot be ruled out. Inclusion of the linear model allowing for negative WTP, as long as it is presented along with other estimates, should not be dismissed outright.

DMT (2020) state: "This means that adding-up passed in his calculations on linear models not because of the data but because of his implausible additional assumption that many people have a negative WTP for the environmental programs." It is not true that the linear model finds that "many" people have negative willingness to pay in each of the scenarios. According to the Krinsky-Robb WTP simulation, the percentage of negative WTP values for the whole, first, second, third, and fourth scenarios in the DMT (2020) data are 2%, 0.01%, 77%, 25% and 0.83%. The WTP from the second scenario is negative (situation 1 above). The WTP from the third scenario has a Delta method confidence interval that includes zero (situation 2 above).

If the negative mean WTP from the second scenario is set equal to zero then the difference in WTP for the whole and the sum of its parts is statistically significant at the p=0.088 level with the Delta Method confidence intervals. The Krinsky-Robb confidence interval is [68, 788] which includes the sum of the WTP for the parts with WTP from the second scenario set equal to zero ($467) indicating that the adding-up test is supported. It is still my contention that the adding-up passed in the (untruncated) linear model not because of the data.

My conclusion is that the negative WTP values do not have an important effect on the adding-up tests. Dismissing these tests because negative WTP values are implausible ignores the literature and the empirical evidence.

References

Bohara, Alok K., Joe Kerkvliet, and Robert P. Berrens. "Addressing negative willingness to pay in dichotomous choice contingent valuation." Environmental and Resource Economics 20, no. 3 (2001): 173-195.

Haab, Timothy C., and Kenneth E. McConnell. "Referendum models and negative willingness to pay: alternative solutions." Journal of Environmental Economics and Management 32, no. 2 (1997): 251-270.

Hanemann, W. Michael. "Welfare evaluations in contingent valuation experiments with discrete responses." American Journal of Agricultural Economics 66, no. 3 (1984): 332-341.

Hanemann, W. Michael. "Welfare evaluations in contingent valuation experiments with discrete response data: reply." American Journal of Agricultural Economics 71, no. 4 (1989): 1057-1061.

Landry, Craig, and John Whitehead. "Estimating Willingness to Pay with Referendum Follow-up Multiple-Bounded Payment Cards." Paper presented at the 2020 W-4133 meetings, Athens, GA, February 2020.

When dichotomous choice CVM data has a negative WTP problem, one of the standard corrections is to estimate a log-linear model and present the median WTP. With many estimated log-linear models the mean WTP is undefined. This is because the log-linear model flattens the estimated survival curve and, in contrast to a linear model, the probability of a no response does not approach zero at any reasonable bid amount. The median WTP estimate from the log-linear model tends to be a useful supplement to the welfare measures available from the linear model.
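A short sketch of the log-linear median calculation, with illustrative coefficients (with a logit in ln(bid), the yes probability equals 0.5 where a + b*ln(bid) = 0):

```python
import math

def loglinear_median_wtp(a, b):
    """Median WTP from a logit with the bid entering in logs: the median
    solves a + b*ln(bid) = 0, so median = exp(-a/b). For the implied
    log-logistic WTP distribution the mean is finite only when b < -1,
    which is why mean WTP from these models is often undefined."""
    return math.exp(-a / b)
```

This makes the flattening point concrete: a shallow slope (b close to zero from below) both inflates the median and pushes the mean toward infinity, while the median remains well-defined.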

Desvousges, Mathews and Train (2020) rightly argue that (1) the sum of medians is not equal to the median of the sums and (2) the mean WTP estimates for each of their five scenarios are basically infinity (or in the millions of dollars when the data are estimated with a log-linear probit). Given this empirical fact, there seems to be no median estimate available for the sum of the four individual WTP amounts (that could be compared to the median of the whole). However, DMT (2020) are able to estimate the "median of the sum of WTPs through simulation" to be $4904. They then explain that, since $4904 is 24 times the median of the whole scenario, $201, this is "clearly a violation of adding up". Given that the confidence interval from the Delta Method, [-47, 449], does not include $4904, we could reject the notion that the willingness to pay estimates pass this version of the adding-up test.

Previously, however, I've argued that these standard errors are likely the wrong ones to use since the cost parameter is measured without much precision. In this case the Krinsky-Robb confidence intervals are more appropriate. The Krinsky-Robb confidence interval for the median WTP estimate for the whole scenario is [54, 9558]. Since the median of the sum of the WTP estimates from the four adding-up scenarios, estimated by DMT (2020) to be $4904, lies within the 95% Krinsky-Robb confidence interval, we fail to reject the adding-up hypothesis at the 95% confidence level. In contrast, the 95% Krinsky-Robb confidence interval estimated with the whole scenario data from Chapman et al. (2009), which I've argued is higher quality data, is relatively tight around the median of $167: [134, 217].

The only other adding-up test that can be conducted with median WTP estimates is to compare the median for the whole with the sum of the medians for the four parts. In this test I found that the median WTP estimates passed the adding up test using the confidence intervals from the Delta Method (Whitehead 2020). It still seems to me that this test is a useful supplement when one is inclined to consider the robustness of the adding-up test conducted with only the Turnbull estimator (as in DMT 2015). Otherwise, we're treating mean WTP estimates in the millions as if they are meaningful.

To answer the question in the title of this post, the log-linear model is not meaningless. In fact, it is a better model statistically than the linear model for the first and third WTP scenarios in DMT (2015). Information gleaned from the log-linear model provides insights into the quality of the DMT (2015) data.

While I argue with DMT over the minutiae of these tests and different estimators, the reader shouldn't lose sight of how silly the debate over DMT (2015) has become. The bottom line is that the DMT (2015) data are of low quality and do not rise to the threshold needed to support an adding-up test, which requires estimates of willingness to pay as ratios of coefficients.

In Whitehead (2020) I describe the problems in the DMT (2015) data. It is full of non-monotonicities, flat portions of bid curves and fat tails. A non-monotonicity is when the percentage of respondents in favor of a policy increases when the cost increases. In other words, for a pair of cost amounts it appears that respondents are irrational when responding to the survey. This problem could be due to a number of things besides irrationality. First, respondents may not be paying close attention to the cost amounts. Second, the sample sizes may be simply too small to detect a difference in the correct direction. Whatever the cause, non-monotonicities increase the standard errors of the slope coefficient in a parametric model.

Flat portions of the bid curve exist when the bid curve may be downward sloping but the slope is not statistically different from zero. This could be caused by small differences in cost amounts and/or sample sizes that are too small to detect a statistically significant difference. For example, there may be little perceived difference between cost amounts of $5 and $10 compared to $5 and $50. And, even if the percentage of responses in favor of a policy is economically different between two cost amounts, this difference may not be statistically different due to small sample sizes.

Fat tails may exist when the percentage of respondents who are in favor of a policy is high at the highest cost amount. However, this is only a necessary condition. A sufficient condition for a fat tail is when the percentage of respondents who are in favor of a policy is high at two or more of the highest cost amounts. In this case, the fat tail will cause a parametric model to predict a very high cost amount that drives the probability that respondents are in favor of a policy to (near) zero. A fat tail will bias a willingness to pay estimate upwards because much of the WTP estimate is derived from the portion of the bid curve when the cost amount is higher than the cost amount in the survey.

DMT (2020) state that these problems also occur in Chapman et al. (2009) and a number of my own CVM data sets. They are correct. But, DMT (2020) are confusing the existence of the problem, in the case of non-monotonicity and flat portions, with the magnitude of the problem. And, they are assuming that if the necessary condition for fat tails exists then the sufficient condition also exists. Many, if not most, CVM data sets will exhibit non-monotonicities and flat portions of the bid curve. But, these issues are not necessarily an empirical problem. The extent of the three problems in DMT (2015) is severe -- so severe that it makes their attempt to conduct an adding up test (or any test) near impossible.

To prove this to myself I estimated the logit model, WTP value and 95% Krinsky-Robb confidence intervals for 20 data sets. Five of the data sets are from DMT (2015), 2 are from Chapman et al. (2009) and 13 are from some of my papers published between 1992 and 2009 (DMT (2020) mention 15 data sets but two of the studies use the same data as in another paper). The average sample size for these 20 data sets is 336 and the average number of cost amounts is 5.45. The average sample size per cost amount is 64, which is typically sufficient to avoid data quality problems (a good rule of thumb is that the number of data points for each cost amount should be n > 40 in the most poorly funded study).

These averages obscure differences across study authors. The average sample size for the DMT (2015) data sets is 196. With 6 cost amounts the average sample size per cost amount is 33. The Chapman et al. (2009) study is the best funded and the two sample sizes are 1093 and 544. With 6 cost amounts the sample sizes per cost amount are 182 and 91. The Whitehead studies have an average sample size of 317 and with an average of 5 cost amounts, the sample size per cost amount is 65 (the variance of these means is large). Already, differences across these three groups of studies emerge.

There are a number of dimensions over which to compare the logit models in these studies. My preferred measure is the ratio of the upper limit of the 95% Krinsky-Robb confidence interval for WTP to the median WTP estimate. This ratio will be larger the more extensive the three empirical problems mentioned above are. As these problems worsen, hypothesis testing with the WTP estimates (again, a function of the ratio of coefficients) becomes less feasible. It is very difficult to find differences in WTP estimates when the confidence intervals are very wide. To suggest that this measure has some validity, the correlation between the ratio and the p-value on the slope coefficient is r = 0.96.
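The quality measure can be sketched as follows; the draws in the test are synthetic, not from any of the 20 data sets:

```python
import numpy as np

def quality_ratio(wtp_draws, alpha=0.05):
    """Ratio of the upper Krinsky-Robb percentile limit to the median of
    the simulated WTP draws; larger values flag data too noisy for
    hypothesis tests built on ratios of coefficients."""
    draws = np.asarray(wtp_draws, float)
    upper = np.quantile(draws, 1.0 - alpha / 2.0)
    return float(upper / np.median(draws))
```

A ratio near 1 means a tight interval around the median; a ratio of 5 or more, as in the worst DMT data set, means the interval is too wide to detect differences between WTP estimates.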

The results of this analysis are shown below. The ratio of the upper limit of the confidence interval to the median is sorted from lowest to highest. The DMT (2015) values are displayed as orange squares, the Chapman et al. (2009) values are displayed as green diamonds and the Whitehead results are displayed as blue circles and one blue triangle. The blue triangle is literally "off the chart" so I have divided the ratio by 2. This observation, one of three data sets from Whitehead and Cherry (2007), does not have a statistically significant slope coefficient.

Considering the DMT data, observation 19, with a ratio of 4.82 (i.e., the upper limit of the K-R confidence interval is about 5 times greater than the median WTP estimate), is the worst data set. Observation 8, the best DMT data set, has their largest sample size of n=293. The Chapman et al. (2009) data sets are two of the three best in terms of quality. The Whitehead data sets range from good to bad in terms of quality. Overall, four of the five DMT data sets are in the lower quality half of the sample (defined by DMT*).

Of course, data quality should also be assessed by the purpose of the study. About half of the Whitehead studies received external funding. The primary purpose of these studies was to develop a benefit estimate. The other studies were funded internally with a primary purpose of testing low stakes hypotheses. In hindsight, these internally funded studies were poorly designed with sample sizes per bid amount too small and/or poorly chosen bid amounts. With the mail surveys the number of bid amounts was chosen with optimistic response rates in mind. With the A-P sounds study a bid amount lower than $100 should have been included. Many of the bid amounts are too close together to obtain much useful information.

In contrast, considering the history of the CVM debate and the study's funding source (Maas and Svorenčík 2017), the likely primary purpose of the DMT (2015) study is to discredit the contingent valuation method in the context of natural resource damage assessment. In that context, the study is very high stakes and, therefore, its problems should receive considerable attention. The DMT (2015) study suffers from some of the same problems that my older data suffers from. The primary problem with the DMT (2015) study is that the sample sizes are too low. It is not clear why the authors chose to pursue valuation of 5 samples instead of 3 to conduct their adding up test (DMT (2012) describe a 3 sample adding up test with the Chapman et al. (2009) study). Three samples may have generated confidence intervals tight enough to conduct a credible test.

In the title of this post I ask "Are the DMT data problems typical in other CVM studies?" The question should really be "Are the DMT data problems typical of Whitehead's CVM data problems in a different era?" The survey mode for my older studies was either mail or telephone. Both survey modes were common back in the old days but they have lost favor relative to internet surveys. The reasons are numerous but one is that internet surveys are much more cost-effective and the uncertainty about a response rate is non-existent. Another reason is that internet survey programming is much more effective (with visual aids, piping, ease of randomization, etc.). Many of the problems with my old data were due to small sample sizes. This was a result of either poor study design (in hindsight, many CVM studies with small samples should have reduced their number of bid amounts) or unexpectedly low mail response rates.

It is not clear why DMT (2020) chose to compare their data problems to those that I experienced 15-30 years ago, unless, in a fit of pique at my comment on their paper, they decided it would be a good idea to accuse me of hypocrisy. I've convinced myself that my data compares favorably to the DMT (2015) data, especially considering the goals of the research. My goals were more modest than testing whether the CVM method passes an adding-up test, for which a WTP estimate (the ratio of two regression coefficients) is required (as opposed to considering the sign, or the sign and significance, of a regression coefficient).

*****

*Note that there are more Whitehead data sets than are called out by DMT. I haven't had time to include all of these in this analysis. But, my guess is that the resulting circles would be no worse than those displayed in the picture below.

Reference

Maas, Harro, and Andrej Svorenčík. "“Fraught with controversy”: organizing expertise against contingent valuation." *History of Political Economy* 49, no. 2 (2017): 315-345.

I recently read an article in the journal Economics and Philosophy, written by Lisa Herzog, which has nothing whatsoever to do with environmental economics but nonetheless I think has interesting implications for it and for Pigouvian pricing in particular.

In case you are unfamiliar with it, the journal Economics and Philosophy is a scholarly journal publishing articles on topics related to (wait for it) economics and philosophy. And, often, the philosophy of economics. This is an area that I have become interested in recently, influenced no doubt by my years as an undergraduate philosophy major.

In the article, Herzog addresses and basically dismantles Hayek's model of the price mechanism in markets as being an efficient way of processing and distilling information necessary for buyers and sellers. Hayek's idea is basically that the world is a complicated place, and markets have lots of complicated information about costs, benefits, demands, etc. But, once markets determine a price for something, that price contains all of the information that buyers and sellers need to make their optimal decision. Herzog quotes Hayek: markets process "dispersed bits of incomplete and frequently contradictory knowledge which all the separate individuals possess."

Herzog's argument is that Hayek is ignoring important pieces of knowledge that buyers and sellers need to make morally informed choices and that are missing from the price signal. Without this information, or "epistemic infrastructures," market participants do not act as morally responsible agents: "when acting in such markets, it often seems fair to say that they do not know what they are doing." The motivating example is buying clothes made in sweatshops overseas. The price mechanism contains lots of information about production and transportation costs, etc., but it is missing important moral information, like the conditions of the workers and their outside opportunities. Without this information buyers cannot make morally complete (and therefore optimal) decisions.

What does this have to do with environmental economics? Of course, when there are negative externalities (like pollution), the unregulated market price will not reflect those externalities and thus will not provide buyers and sellers the necessary information to make optimal choices. This is not new; this is Pigou, and even the most die-hard Hayekian would admit that market failures like externalities need to be incorporated somehow into the price signal to achieve efficiency. (Aside: maybe it's not actually Pigou, according to this working paper by my colleague Spencer Banzhaf.)

But I think there is a deeper and less obvious point than that. Suppose that there is a Pigouvian price on pollution that accounts for its negative externalities. Does this price still contain the necessary "epistemic infrastructure" for all buyers and sellers to make morally responsible decisions? If the Pigouvian price is "right" then it accounts for the spillover costs, but it's not obvious that these costs are all that is morally necessary to make informed choices. Just like with sweatshops, there potentially are moral considerations above and beyond any cost considerations (even spillover costs) that need to be incorporated. It is not clear that a Pigouvian price would incorporate any of this.

Anyways, some Pigouvian price on pollution is still no doubt better than no Pigouvian price on pollution, but there is potentially a strong argument to be made here based on Herzog's article that any Pigouvian price can't get us enough information to make morally optimal choices.

As described in the introduction of my (draft) "Reply to 'Reply to Whitehead'", I suspect that I have used the incorrect confidence intervals when analyzing the Desvousges, Mathews and Train (2015) data. Park, Loomis and Creel (1991) introduced the Krinsky-Robb approach for estimating confidence intervals for willingness to pay estimates from dichotomous choice contingent valuation models. Cameron (1991) introduced the Delta Method approach. As indicated by their Google Scholar citations, 461 and 229 respectively, they have both been used extensively in the applied CVM literature. Hole (2007) compares the two approaches (along with Fieller and bootstrap approaches) and finds little difference in the approaches for well-behaved simulated data. However, Hole (2007) points out that the Delta Method requires that the willingness to pay be normally distributed for the confidence interval to be accurate. He states that "... it is likely that WTP is approximately normally distributed when the model is estimated using a large sample and the estimate of the coefficient for the cost attribute is sufficiently precise." (p. 830) In Whitehead (2020) I used the Delta Method confidence intervals in my statistical tests. This is very likely an inappropriate approach due to the imprecision of the estimate of the parameter on the cost amount.
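For reference, the Delta Method interval comes from a first-order Taylor expansion of the ratio WTP = -a/b. A minimal sketch, with illustrative coefficient values (not estimates from any of the studies discussed):

```python
import math

def delta_method_ci(a, b, var_a, var_b, cov_ab, z=1.96):
    """Delta Method confidence interval for WTP = -a/b, where a is the
    constant and b the cost coefficient. First-order expansion:
      Var(WTP) ~ Var(a)/b^2 + a^2*Var(b)/b^4 - 2*a*Cov(a,b)/b^3
    The interval is symmetric by construction, which is exactly why it
    can mislead when b is estimated imprecisely."""
    wtp = -a / b
    var = var_a / b**2 + a**2 * var_b / b**4 - 2.0 * a * cov_ab / b**3
    se = math.sqrt(var)
    return wtp - z * se, wtp + z * se
```

Note that nothing in the formula reflects the skewness of a ratio of estimates, so the symmetric interval understates the upper tail when the cost coefficient has a low t-statistic.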

When working on Whitehead (2020) I used NLogit (www.limdep.com) software to estimate the confidence intervals. NLogit allows for both the Delta Method and Krinsky-Robb approaches to be used. But the Krinsky-Robb confidence intervals may require the assumption of normality. Hole (2007): "The [Krinsky-Robb] confidence interval could also be derived by using the draws to calculate the variance of WTP ..., but this approach, like the delta method confidence interval, hinges on the assumption that WTP is symmetrically distributed." (p. 831) Almost all of the Krinsky-Robb confidence intervals estimated by NLogit "blew up" when using the DMT (2015) data, in other words the upper and lower limits were in the 10s and 100s of thousands (positive and negative). This made little sense to me at the time but now my guess is that when the WTP normality assumption is violated the NLogit software cannot handle the estimation. Typically, Delta Method and Krinsky-Robb confidence intervals are not very different when estimated in NLogit (as shown below).

Following my reading of Desvousges, Mathews and Train (forthcoming) I thought through the above (obviously, I should have thought it through before) and estimated WTP with Krinsky-Robb intervals in SAS (my program is available upon request). My Krinsky-Robb intervals are akin to what Hole (2007) calls the Monte Carlo percentile approach. I take one million draws from the variance-covariance matrix and trim the α/2 highest and lowest WTP values, where α=0.05. Hole's (2007) Krinsky-Robb intervals are based on a resampling approach, but he finds little difference between the resampling and Monte Carlo Krinsky-Robb intervals.
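The Monte Carlo percentile approach can be sketched in Python. This is a sketch of the general technique, not my SAS program, and the coefficients and covariance matrix in the test are illustrative rather than DMT (2015) estimates:

```python
import numpy as np

def krinsky_robb_ci(coefs, vcov, n_draws=1_000_000, alpha=0.05, seed=1):
    """Monte Carlo percentile Krinsky-Robb interval for WTP = -a/b from
    a linear-in-bid logit.

    coefs: (constant a, cost coefficient b); vcov: their 2x2
    variance-covariance matrix."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(np.asarray(coefs, float),
                                    np.asarray(vcov, float), size=n_draws)
    wtp = -draws[:, 0] / draws[:, 1]
    # trim the alpha/2 highest and lowest simulated WTP values
    lo, hi = np.quantile(wtp, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)
```

When the cost coefficient is imprecise, draws of b near zero generate enormous |WTP| values, which is exactly how the intervals "blow up" with low-quality data; the percentiles simply report that asymmetry instead of forcing a symmetric interval.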

For this analysis I am only using the whole scenario from DMT (2015) since this is sufficient to show that WTP for the whole cannot be statistically distinguished from WTP for the sum of the parts with the Krinsky-Robb Monte Carlo percentile intervals. The logit models are presented below for the full sample (n=172), the sample with observations with missing demographics deleted (n=163) and the Chapman et al. (2009) data. In each model the constant and the coefficient on the cost amount are statistically different from zero. But the precision of the cost coefficients with the DMT (2015) data is low relative to other CVM studies. Combined with the small samples, this means the Desvousges, Mathews and Train WTP estimates may not be normally distributed. The Chapman et al. (2009) study, on the other hand, has a large sample size and a precisely estimated coefficient on the cost amount.

The WTP estimates (restricting WTP to be positive) and confidence intervals are presented below. The Delta Method confidence intervals are estimated in NLogit and the Krinsky-Robb percentile intervals are estimated in SAS. The appropriateness of the Delta Method with the DMT (2015) data is questionable. First, the Krinsky-Robb lower bound on the DMT (2015) full sample (n=172) estimate is less than 50% of the Delta Method lower bound. Second, the Krinsky-Robb upper bound is 269% larger than the Delta Method upper bound. The imprecision of the coefficient on the cost amount is driving the asymmetry. The cost coefficient in the smaller sample (n=163) is estimated even more imprecisely; there the Krinsky-Robb confidence interval includes zero.
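For comparison, the symmetric Delta Method interval for the same WTP ratio can be sketched as follows. This is a generic first-order Taylor expansion with hypothetical coefficient values, not NLogit's output:

```python
import numpy as np

def delta_method_ci(beta, vcov, alpha=0.05):
    """Symmetric Delta Method interval for WTP = -constant/cost.

    First-order Taylor expansion: the gradient of -a/b with respect
    to (a, b) is (-1/b, a/b**2), so Var(WTP) ~ grad' V grad.
    """
    a, b = beta
    wtp = -a / b
    grad = np.array([-1.0 / b, a / b**2])
    se = float(np.sqrt(grad @ vcov @ grad))
    z = 1.959963984540054  # 97.5th percentile of the standard normal
    return wtp - z * se, wtp + z * se

# Hypothetical logit estimates: constant 1.2, cost coefficient -0.004,
# giving a WTP point estimate of -1.2/-0.004 = 300
lo, hi = delta_method_ci(np.array([1.2, -0.004]),
                         np.array([[4.0e-2, -5.0e-5],
                                   [-5.0e-5, 1.0e-6]]))
print(lo, hi)
```

The interval is symmetric around the point estimate by construction, which is exactly what the ratio's true sampling distribution need not be when the cost coefficient is imprecise.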

These results should be considered in contrast to the WTP estimate from the Chapman et al. (2009) data. The Delta Method and Krinsky-Robb intervals are very close. The symmetric Krinsky-Robb confidence interval estimated in NLogit is [236.90, 320.37], which is also very close to the Delta Method. One benchmark for symmetric confidence intervals in CVM studies, therefore, is a sample size greater than 1000 and a t-statistic on the cost coefficient of -9.5. Of course, sensitivity around these benchmarks should be assessed since not many CVM studies have these characteristics. (Note: I'll do some of this sort of work when I go back to my past and analyze some of the CVM data from the old days that Desvousges, Mathews and Train (forthcoming) assert is just as bad as their own data.)

The point estimate of the sum of the WTP parts for the full sample is $1114.36. The WTP for the sum of the parts is within the Krinsky-Robb interval for the whole, suggesting that we cannot reject the hypothesis that WTP for the whole is equal to WTP for the sum of the parts at the p<0.05 level. The 90% interval is [264.84, 1314.14], which also contains the sum of the parts, so the hypothesis is not rejected at the p<0.10 level either. The point estimate of the sum of the WTP parts for the trimmed sample is $1079.73. Again, the WTP for the sum of the parts is within the Krinsky-Robb 95% interval for the whole. Note that the WTP for the whole is not different from zero with this sample, so any statistical inference makes less sense than if WTP were different from zero. These results are consistent with my conclusion in Whitehead (2020), reached there in error, that the data in Desvousges, Mathews and Train (2015) are not sufficient to conclude that contingent valuation does not pass the adding-up test.

References

Cameron, Trudy Ann. "Interval estimates of non-market resource values from referendum contingent valuation surveys." Land Economics 67, no. 4 (1991): 413-421.

Hole, Arne Risa. "A comparison of approaches to estimating confidence intervals for willingness to pay measures." Health Economics 16, no. 8 (2007): 827-840.

Park, Timothy, John B. Loomis, and Michael Creel. "Confidence intervals for evaluating benefits estimates from dichotomous choice contingent valuation studies." Land Economics 67, no. 1 (1991): 64-73.

Desvousges, Mathews and Train (Land Economics, 2015) use the contingent valuation method (CVM) to conduct an adding-up test (i.e., does WTP_{A} + WTP_{B} = WTP_{A+B}?). They use the nonparametric Turnbull estimator and find that the data do not pass the adding-up test. This suggests that the CVM lacks internal validity.

In September 2016 I began writing a comment on this paper, first posting a series of blog posts questioning the validity of the underlying data and their implementation of the survey. The comment went through several rounds of review: it was submitted, reviewed, revised and rejected at Land Economics (due to concerns about the DMT reply); submitted, reviewed, revised and then withdrawn from Economics E-Journal; and submitted, reviewed and accepted for publication at Ecological Economics. The comment goes further than the blog posts by showing that the adding-up test (though flawed in implementation, and though another hypothesis test would be more appropriate) is actually supported in some tests using parametric WTP estimators.

Desvousges, Mathews and Train (Ecological Economics, forthcoming) have now replied to my comment by describing 12 mistakes (12!) that I made. I agree that I made one of the mistakes on their list. I conducted an adding-up test by examining whether the confidence intervals for two willingness to pay estimates (the whole vs. the sum of the parts) overlap. It is well-known that confidence intervals can overlap and yet the t-statistic for the test will indicate that the difference in means is statistically significant. The mistake that I made was not checking the t-statistic. This is an embarrassing mistake. The worst part is that I teach this to undergraduates in the business statistics course. I tell them not to make this mistake and I've made it in a published journal article. I'm very embarrassed.
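A toy numeric example (hypothetical numbers) shows why the overlap check is not a substitute for the t-test: the standard error of a difference is smaller than the sum of the individual standard errors, so two 95% intervals can overlap while the difference is still significant at 5%.

```python
import math

# Two hypothetical independent estimates with overlapping 95% intervals
m1, se1 = 100.0, 10.0   # 95% CI: [80.4, 119.6]
m2, se2 = 130.0, 10.0   # 95% CI: [110.4, 149.6] -- overlaps the first

# The t-statistic uses the SE of the difference, sqrt(se1^2 + se2^2),
# which is about 14.14 here, not 20
se_diff = math.sqrt(se1**2 + se2**2)
t = (m2 - m1) / se_diff   # ~2.12 > 1.96: significant at the 5% level
print(round(t, 2))
```

The intervals share the range [110.4, 119.6], yet the t-test rejects equality of the means.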

There are a variety of reasons, though not excuses, for this mistake which I will describe in another blog post. But today, let me point out another mistake that I made that concerns me almost as much as the t-statistic: *I used the wrong confidence intervals*. In Whitehead (2020) I used the confidence intervals from the Delta Method (a first-order Taylor series expansion from the variance-covariance matrix), which are symmetric. It is well-known that the distribution of a ratio of parameters (such as WTP) is not necessarily symmetric. The asymmetry gets more severe when the parameter in the denominator is imprecisely estimated, as in Desvousges, Mathews and Train (Land Economics, 2015). Another common approach is the Krinsky-Robb (KR) confidence interval, which is based on a simulation from the variance-covariance matrix of the estimated parameters. In a forthcoming blog post I'll show that the KR confidence intervals are very wide. So wide that the WTP for the sum of the parts lies within the confidence interval for the WTP for the whole, supporting the conclusions of Whitehead (2020). I'm embarrassed that I made this mistake too.

My biggest concern with the Desvousges, Mathews and Train "Reply to Whitehead", other than my own mistakes (the t-statistic and the Delta Method confidence intervals), is that they do not take the problems with their own research very seriously. In contrast, when I've had papers that have received comments I've tried to learn from the comment and then tried to fix my paper (e.g., see Whitehead, Land Econ, 2004). Desvousges, Mathews and Train instead adopt the strategy that the best defense is a good offense. Their attitude seems to be that their data are no worse than any other CVM data set (in particular, they point to my own data from 15-30 years ago in footnote 3). I don't believe that this approach is the best way to advance economic science.

My comment on Desvousges, Mathews and Train (Land Economics, 2015) addresses three main issues: (1) the data are flawed/low quality, (2) the implementation of the adding-up test in the survey is flawed, and (3) additional statistical tests for adding-up do not support the DMT (2015) results. None of these issues is refuted by Desvousges, Mathews and Train (forthcoming). Instead, each of these issues has been confused by the Desvousges, Mathews and Train "Reply to Whitehead".

First, I provide a correction to my mistake described above. Second, here are responses to the 12 "mistakes":

(1) The log-linear models are not meaningless as claimed by DMT. The log-linear model and median WTP is a simple way of addressing negative WTP. The fact that the mean WTP from these models is infinite is not a functional form problem; it is a data problem. The median of the sum of WTP estimates provided by DMT (2020) lies within the 95% Krinsky-Robb confidence interval for the median WTP of the whole scenario. [more here]

(2) The linear-in-bid model that allows negative willingness to pay is not inappropriate. Negative WTP can arise from this functional form if the percentage of yes responses is less than 50% at the lowest bid or if the WTP estimate is statistically imprecise. The point estimate of mean WTP for this model is positive in four out of the five scenarios. The negative WTP estimate is from the troublesome second scenario data (see (4)). The third scenario generates negative WTP values from the statistical distribution. Accounting for these in a statistical adding-up test supports my result that the sum of the WTP parts cannot be statistically distinguished from the whole scenario. [more here]
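A minimal illustration of the first case, assuming a standard linear-in-bid logit with hypothetical coefficients: when the constant is negative, fewer than half of respondents say yes even at the lowest (here, zero) cost, and the implied median WTP = -constant/cost is negative.

```python
import math

def p_yes(bid, a, b):
    """Linear-in-bid logit: P(yes) = 1 / (1 + exp(-(a + b*bid)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * bid)))

# Hypothetical coefficients with a negative constant
a, b = -0.3, -0.004
print(p_yes(0.0, a, b))   # below 0.5 at a zero bid
print(-a / b)             # median WTP is negative (-75)
```

This is a property of the data and functional form together, not a flaw in allowing negative WTP.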

(3) Following the approach taken in the correction, the adding-up test passes when respondents with missing demographics are dropped and the more appropriate (Krinsky-Robb) confidence intervals are used. [more here]

(4) The weighted data do not support the results in Desvousges, Mathews and Train (Land Economics, 2015) as DMT claim. The weighted data for the whole and second scenarios are "roller coaster" and "Nike swoosh" shaped instead of downward sloping as required by theory. This suggests that the weighted data reveal some irrationality amongst respondents. DMT's approach is to impose respondent rationality across the scenarios: they constrain the cost coefficients to be equal across scenarios in order to impose a downward sloping cost effect. This is inappropriate when it is done to hide statistically insignificant (roller coaster) and wrong-signed (Nike swoosh) slope coefficients. [more here]

(5) DMT notice that I conducted an adding-up test with the Kristrom nonparametric estimator in a 2016 blog post (here). They claim that I "inadvertently dropped observations" when conducting these calculations. Dropping these observations was not inadvertent. In the blog post at issue I used a sample size of n=950, which is the same sample size that DMT (2015) used in their Table 5 (dropping observations with a missing age variable).

DMT (2020) report that the adding-up test fails with the Kristrom estimator and that I "failed to report relevant findings" because I did not include this in Whitehead (2020). This raises the question: how many additional tests should be conducted in a comment on a paper? In Whitehead (2020) I provided three parametric tests using some of the standard models in the literature. I then considered the robustness of these tests with (a) the weighted data and (b) the complete-case data set (n=934 after dropping those with missing age and income).

(6) Claims that the Chapman et al. (2009) data and a number of my own data sets (circa 1992-2011) are of the same low quality as the Desvousges, Mathews and Train (Land Economics, 2015) data are overstated. I showed in an appendix in Whitehead (2020) that the Chapman et al. (2009) data are far superior in quality to the Desvousges, Mathews and Train (Land Economics, 2015) data. Using the length of the upper tail as a measure of quality, I find that my own data mostly fall between the Chapman and DMT data (one of my data sets is a literal "off the chart" low-quality outlier). Quality is an increasing function of sample size. [more here]

(7) Desvousges, Mathews and Train (2015) have not provided their internet survey for review. I asked twice. The first time, Bill Desvousges had his assistant send me the Chapman et al. (2009) report containing their in-person surveys. The second time I asked, he notified the Economics E-Journal editor about my request. The editor told me that he thought I had everything I needed to write my replication paper and not to email Bill Desvousges again (I won't). Claims that their survey conveys information about substitution effects to survey respondents are simply assertions. It would be forthright to provide the survey for review.

(8) DMT (2020) are correct in pointing out that "implicit claim" may be poor word choice. In DMT (2015) they have an empirical finding that there are no income effects. But in mistake (9) they acknowledge that there is a statistically significant income coefficient when they use the weighted data. They have not explained why they chose to impose this "external" income constraint instead of incorporating income effects "internally" in the survey scenarios. Internal/external may be better word choice than implicit/explicit.

(9) My statistically significant income coefficient was found using the models with weighted data. Desvousges, Mathews and Train (2020) state that they re-ran their simulations with the weighted income coefficient and found similar results. But if they re-ran their simulations with the weighted income coefficient, they should have conducted the test with the weighted WTP models, which lack validity (see (4) above). The "external" income test cannot be conducted in a model with consistent assumptions about the data unless one constrains the cost coefficients to be equal (which is done to hide statistically insignificant and wrong-signed cost coefficients).

In Whitehead (2020) I also doubt that income is the correct budget constraint. I suspect that survey respondents have some environmental contribution budget in mind when answering CVM questions. In footnote 4 DMT state that this is a violation of microeconomic theory. I assume that they are referring to neoclassical microeconomic theory and ignoring behavioral economics. Even then, an environmental contribution budget is consistent with two-stage budgeting, where a household first allocates income to different budget categories and then maximizes subutility functions subject to the category budget constraints (Deaton and Muellbauer 1980; this theory led to the development of the Almost Ideal Demand System econometric model).
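The two-stage budgeting idea can be written compactly. This is a textbook statement in my own notation, not DMT's or Deaton and Muellbauer's exact formulation:

```latex
% Stage 1: allocate income m across budget categories g = 1, ..., G
\max_{m_1, \ldots, m_G} \; V(m_1, \ldots, m_G; p)
  \quad \text{s.t.} \quad \sum_{g=1}^{G} m_g = m

% Stage 2: spend each category budget m_g (e.g., an environmental
% contribution budget) by maximizing the category subutility function
\max_{x_g} \; u_g(x_g)
  \quad \text{s.t.} \quad p_g' x_g = m_g
```

Under this structure a respondent who answers a CVM question against an environmental contribution budget m_g, rather than total income m, is behaving consistently with neoclassical theory.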

(10) My proposed hypothesis, based on my reading of Desvousges, Mathews and Train (Land Economics, 2015) and the lack of the survey instrument (see (7)), is a one-tailed scope test. It is not a one-tailed adding-up test. Note also that any information provided in a CVM survey about "substitute" environmental goods can be interpreted by respondents as complements (see Whitehead and Blomquist, WRR, 1991).

(11) Arrow et al. (1994) regret using the term "adequate" (in reference to the size of scope effects) in the NOAA Panel report; instead they suggest that the appropriate word is "plausible" scope effects. I pointed this out in Whitehead (Ecol. Econ., 2016) and proposed scope elasticity as a measure. Scope elasticity is a more useful measure of plausibility than the adding-up test is for adequacy, given the difficulties in conducting an adding-up test.

(12) My Turnbull standard error estimates differ from DMT's (2015) standard errors. I applied the formulas in Haab and McConnell (2002) with pooled (smoothed) data. DMT (2020) report that they used the raw data to construct confidence intervals with the smoothed-data WTP estimate. My estimates of the standard errors are larger than DMT's (2020). But it seems like standard errors with the raw (not smoothed) data should be larger than standard errors from the smoothed data. DMT (2020) do not provide much information on this estimation, so it is difficult to say more.
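For readers unfamiliar with the estimator, here is a sketch of the Turnbull lower-bound mean WTP with pooling, as I understand the textbook treatment in Haab and McConnell; the function name and the example data are hypothetical, and this is not necessarily DMT's exact implementation:

```python
import numpy as np

def turnbull_lower_bound(bids, n_no, n_total):
    """Turnbull lower-bound mean WTP from dichotomous-choice data.

    bids: sorted bid amounts; n_no / n_total: 'no' counts and sample
    sizes at each bid. The 'no' shares F_j are monotonized by pooling
    adjacent bids whenever F_j decreases (smoothing), then the lower
    bound values the mass between F_j and F_{j+1} at the lower bid,
    with mass below the lowest bid valued at zero.
    """
    t, no, tot = list(bids), list(n_no), list(n_total)
    # Pool adjacent violators to enforce a nondecreasing 'no' share;
    # the pooled cell keeps the lower bid (the conservative choice)
    j = 0
    while j < len(t) - 1:
        if no[j] / tot[j] > no[j + 1] / tot[j + 1]:
            no[j] += no.pop(j + 1)
            tot[j] += tot.pop(j + 1)
            t.pop(j + 1)
            j = 0
        else:
            j += 1
    F = np.array([n / m for n, m in zip(no, tot)])
    mass = np.diff(np.concatenate([[0.0], F, [1.0]]))
    points = np.concatenate([[0.0], t])
    return float(points @ mass)

# Hypothetical data: 20% say no at $10, 60% say no at $50
print(turnbull_lower_bound([10, 50], [20, 60], [100, 100]))
```

With the hypothetical data above, mass 0.4 sits at $10 and mass 0.4 at $50, giving a lower-bound mean of $24. The smoothing step is why standard errors computed from pooled cells can differ from those computed from the raw bid-by-bid data.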

Comments welcome!

I first wrote about this in September 2016. I then submitted the comment to Land Economics. The editor sent me the results of an internal review and I revised the comment accordingly. Then he sent it out for external review and it received a favorable review in February 2017. But the referee took issue with the reply to my comment. Apparently, the referee suggested that s/he would write a comment on the reply if it was published. The editor decided to reject the comment/reply because, he said, it is Land Economics' policy to only publish comments AND replies. That policy seems strange to me, as I think it creates an incentive to write an objectionable reply to comments at Land Economics.

Next I sent it to the Economics: The Open Access, Open Assessment E-Journal replication section in 2017. Again, I revised the comment before it was sent out for review. Once it was sent out for review I received three supportive reviews (based on my experience, I'm fairly sure they all supported a revision, judging by their tone and suggestions). I revised and resubmitted the paper several times from December 2018 through 2019 in response to referees and a few rounds of editor comments. Finally, I felt that the editor, who was "unconvinced by [my] argument as [I] were presenting it," was pushing me in directions that I didn't want to take the comment, and I withdrew it from review in November 2019.

I next submitted the paper to Ecological Economics in December 2019. I received three reviews, each of which was thorough and supportive of publication (again, if my experience reading reviews is correct). I revised the paper and it was accepted for publication. Then the editors sent it to Desvousges et al. for their reply. I have not yet read the reply. I imagine I'll have more to say when I have read it.

This has been a very frustrating process. It is difficult to get a comment published, especially after it was rejected at the journal where the original article was published. But, I'm glad that the paper is published since I don't think the Desvousges et al. data supports their conclusions.

The link to the paper is https://authors.elsevier.com/a/1bTdB3Hb~0IaFx.
