Official youngest (and only) son of Env-Econ is a junior in college. He is currently taking his first econometrics class. I got the text below on Friday:
Knowing full well this is likely to get me in trouble, here's my explanation of heteroskedasticity:
Mark Thoma:
There's a version of this in econometrics, i.e. you know the model is correct, you are just having trouble finding evidence for it. It goes as follows. You are testing a theory you came up with, but the data are uncooperative and say you are wrong. But instead of accepting that, you tell yourself "My theory is right, I just haven't found the right econometric specification yet. I need to add variables, remove variables, take a log, add an interaction, square a term, do a different correction for misspecification, try a different sample period, etc., etc., etc." Then, after finally digging out that one specification of the econometric model that confirms your hypothesis, you declare victory, write it up, and send it off (somehow never mentioning the intense specification mining that produced the result).
Too much econometric work proceeds along these lines. Not quite this blatantly, but that is, in effect, what happens in too many cases. I think it is often best to think of econometric results as the best case the researcher could make for a particular theory rather than a true test of the model.
via economistsview.typepad.com
... cast the first stone.
I apologize in advance for the rant ahead. The rant is motivated by daily interactions with my colleagues, most of whom have PhDs in statistics, computer science and mathematics. This gives me a different and less-economisty view of model building and validation in general, so bear with me.
When you sit down and read your average economics paper you generally know what's coming in terms of statistics. You're going to see a regression (often a variation of least squares), there's going to be some fancy talk about the error term and the author will likely base most of the conclusions on what the coefficients look like. This is actually true of many social science papers and it has been the primary means of social science modeling for quite a while.
However, when I see some of the cool new tools coming out of statistics or computer science I ask myself "why can't economists use these?" Those who follow finance relatively closely are aware of neural nets and some boosting techniques for time series forecasting but as a general rule many economists seem to turn a blind eye to these cutting edge techniques. I'm sure I'll get some comments mentioning Hal Varian's paper on machine learning and economics or Matthew Jackson's work with networks, but these seem to be exceptions rather than a trend. I've come up with a list of reasons why this may be, along with some comments:
1) Economists are obsessed with error terms. Anecdotally true - I've heard that it's hard to get a paper accepted to a decent journal that doesn't report standard errors for its coefficients. With many machine learning models, you're generally concerned with some sort of cross-validation error rather than coefficient standard errors.
2) There isn't much interaction between economists and statisticians/mathematicians/computer scientists. This also is (sadly) a factor. I've seen a decent amount of interaction between econometricians and statisticians, but it's generally only frequentist statisticians. Not the new-fangled statisticians that are interested in some interesting Bayesian classifiers or machine learning techniques. Additionally, I rarely see economists publishing with authors from any of these three fields. There's a lot we could learn from them (and them from us!) so I see no reason why this should be binding.
3) It's harder to explain results from models more complicated than regression techniques. I can see this being an issue, particularly when some of these new techniques are first introduced and economists are not used to them. I can imagine the early seminars for these papers with no standard errors will lead to a lot of grumbling in the audience. However, I've seen economists explain some fairly dense material in a user-friendly way. It will take some practice, but I see nothing wrong with economists getting better at explaining models - it will add to our audience!
4) Economists don't have access to large enough data sets to use these techniques. I'm not sure about this one. After spending time in the tech field it seems like everyone has "big data," but I realize for many important topics, that's just not true. Some of it is due to not having the data "all in one place," but that is what research assistants are for! Particularly for revealed preference data, it seems like there should be some low-hanging fruit in terms of larger data sets where these new modeling methods could be used.
5) The math is too hard. No! Well, I guess it's different math, but that's nothing to be afraid of. For many methods such as trees and model boosting and bagging, there's not a lot of math involved that would be too different from existing econometrics.
6) Economists aren't always trying to predict something. Yes and no - sometimes we're trying to describe a system and sometimes we're trying to predict what the system will do. Decision trees are a perfect example of a method that lends itself relatively well to descriptive studies. However, I can understand why neural nets might not be as useful. It's a fair scientific question: what are our models capable of and what should they be used for? I don't think changing economics to a discipline that produces only predictive studies would be a good progression (and then we'd get even more comparisons to weathermen!) but I do think that we should take the time to evaluate whether existing models or new models can be used for the purposes we intend.
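To make the contrast in point 1 concrete, here is a minimal sketch of the two habits of model evaluation: fitting once and reading coefficients versus comparing models by cross-validated prediction error. The data are synthetic and the use of scikit-learn here is my own illustration, not anything from the posts above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)

# "Econometric" habit: fit once, read the coefficients.
ols = LinearRegression().fit(X, y)
print("OLS coefficients:", ols.coef_.round(2))

# "Machine learning" habit: compare models by out-of-sample fit.
for model in (LinearRegression(),
              DecisionTreeRegressor(max_depth=4, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, "mean CV R^2:", scores.mean().round(2))
```

Neither habit is wrong; they answer different questions (what does each covariate do versus which model predicts better), which is exactly the tension in points 1 and 6.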
I'd love to hear comments - I would be happy to be proven wrong about this trend.
"Type I" and "Type II" errors, names first given by Jerzy Neyman and Egon Pearson to describe rejecting a null hypothesis when it's true and accepting one when it's not, are too vague for stat newcomers (and in general). This is better. [via]
via flowingdata.com
To this day I must think hard to figure out Type I and II errors.
Hat tip: Jayjit Roy
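For anyone who, like me, has to think hard about which error is which: a Type I error is a false positive (rejecting a true null) and a Type II error is a false negative (failing to reject a false null). A quick simulation with a one-sample t-test at the 5% level (my own illustration, assuming numpy and scipy are available) makes the definitions concrete:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reps, n, alpha = 2000, 30, 0.05

# Null is TRUE (mean = 0): any rejection here is a Type I error.
type1 = np.mean([stats.ttest_1samp(rng.normal(0, 1, n), 0).pvalue < alpha
                 for _ in range(reps)])

# Null is FALSE (mean = 0.5): any failure to reject is a Type II error.
type2 = np.mean([stats.ttest_1samp(rng.normal(0.5, 1, n), 0).pvalue >= alpha
                 for _ in range(reps)])

print(f"Type I rate: {type1:.3f} (should be near alpha = {alpha})")
print(f"Type II rate: {type2:.3f} (power = {1 - type2:.3f})")
```

The Type I rate is pinned down by the chosen significance level; the Type II rate depends on the true effect size and the sample size.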
Last April I posted some results supplementing a recently published paper comparing approaches to handle panel data in Limdep. A referee asked for clustered standard errors, which Limdep doesn't do on top of a random effects panel Poisson estimator. Bill Greene provided some explanation for why on the Limdep listserv.
Eric Duquette (who, I seem to recall, won our NCAA tournament one year) left some good comments and via email offered to estimate some comparison models with Stata (thanks Eric!). His results are below (note that I've deleted 12 coefficients that are not statistically significant in any of the models). The random effects Poisson results are in column (3) and the random effects Poisson with clustered standard errors results are in column (4). The biggest difference between the two is the standard error on the MISSICK coefficient, which is 72% higher in column (4). The other standard errors are, on average, 27% lower in column (4) compared to column (3).
The more troubling thing is the difference in the Limdep and Stata coefficient estimates. The consumer surplus for each seafood meal is about $27 (1/.0372) with Limdep and $1.72 (1/.580) with Stata. This would seem to have policy implications. Any ideas on why the coefficient estimates differ so much?
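For readers unfamiliar with the welfare calculation: in these semilog (count-data travel-cost style) demand models, consumer surplus per unit is computed as one over the absolute value of the price coefficient, which is where the figures above come from. A quick check of the arithmetic:

```python
# Per-meal consumer surplus implied by each price coefficient,
# computed as 1/|beta| per the semilog demand interpretation.
limdep_beta, stata_beta = 0.0372, 0.580
for name, b in [("Limdep", limdep_beta), ("Stata", stata_beta)]:
    print(f"{name}: CS per meal = ${1 / b:.2f}")
```

The order-of-magnitude gap in the coefficients translates directly into an order-of-magnitude gap in the welfare estimate, which is why the discrepancy matters for policy.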
Update: I deleted "Poisson" from the title. Eric reports that the results below are from the continuous dependent variable regression. The RE Poisson results are the same in both Limdep and Stata (whew!). Also, it appears that Stata does not have an option to cluster standard errors in a RE Poisson, so referees who suggest that are mistaken. I've edited the post to reflect this (underlines are added and strikethroughs are cut).
US scientists found that even small changes in temperature or rainfall correlated with a rise in assaults, rapes and murders, as well as group conflicts and war.
The team says with the current projected levels of climate change, the world is likely to become a more violent place.
The study is published in Science.
Marshall Burke, from the University of California, Berkeley, said: "This is a relationship we observe across time and across all major continents around the world. The relationship we find between these climate variables and conflict outcomes are often very large."
The researchers looked at 60 studies from around the world, with data spanning hundreds of years.
They report a "substantial" correlation between climate and conflict.
Their examples include an increase in domestic violence in India during recent droughts, and a spike in assaults, rapes and murders during heatwaves in the US.
The report also suggests rising temperatures correlated with larger conflicts, including ethnic clashes in Europe and civil wars in Africa.
Mr Burke said: "We want to be careful, you don't want to attribute any single event to climate in particular, but there are some really interesting results."
The researchers say they are now trying to understand why this causal relationship exists.
"The literature offers a couple of different hints," explained Mr Burke.
"One of the main mechanisms that seems to be at play is changes in economic conditions. We know that climate affects economic conditions around the world, particularly agrarian parts of the world.
via www.bbc.co.uk
Looks like the confusion between correlation and causation is a case of bad reporting. From the Science article itself (page 2):
Reliably measuring an effect of climatic conditions on human conflict is complicated by the inherent complexity of social systems. In particular, a central concern is whether statistical relationships can be interpreted causally or if they are confounded by omitted variables. To address this concern, we restrict our attention to studies with research designs that are a scientific experiment or that approximate one...
Bill Greene, author of Econometric Analysis and developer of NLOGIT, describes the differences between robust, cluster and panel estimators on the Limdep listserv (reprinted with permission):
(1) "robust covariance matrix." Computed using inv(-H) * G'G * inv(H) where H is the second derivatives matrix and G is the matrix whose each row is the first derivatives of logL. Rarely clear what this matrix is robust to. In is favor, it rarely matter much in practice. If the theory of the model is correct, this matrix estimates the same thing as inv(-H), so it is harmless. Important point, this matrix treats the data as if it were a cross section. It is making no use of any panel aspects of the sample, or correlation across observations.(2) "Cluster correction" Computed using inv(-H) * Gc ' Gc * inv(-H) where H is as before, Gc has rows that are each equal to the within group sums of the first derivatives. This is implicitly attempting to account for correlations across observations of the score vectors. Resembles corrections such as Newey West. Rarely clear what the source of the correlation is. Generally, if the data are actually a panel and the panel aspect of the data is ignored by the estimator, this cluster estimator will pick up something. In the same situation, (2) is usually larger than (1), often much larger. In a panel data situation, fitting a probit model or a Poisson model by pooling the data instead of using a panel data estimator, I have seen standard errors rise by a factor of 2 or 4.
(3) "Panel estimator," Computed using inv(-H) or inv(G'G) where the hessians or first derivatives are appropriately computed using the likelihood for the panel data model. (3) often resembles (2), but frequently not because the estimator in (3) is the FIML panel estimator and the one in (2) is typically a "pooled" estimator that does not account for the panel nature of the data. ...
This was in response to a question I posed while trying to satisfy a referee who wanted us to use clustered standard errors on top of a random effects panel estimator. Dr. Greene's argument is that there is no need to cluster when the random effects panel estimator is used. This is reflected in NLOGIT, where a cluster correction is not an option with the panel estimators. Apparently, and note that I don't use it so I don't know, Stata allows clustering on top of the panel estimator. It seems like overkill to me but I'm no econometrician (in case you are wondering, Tim is an [applied?] econometrician, I asked him about it, and he agreed that the random effects model was adequately addressing the issue).
To illustrate the differences between clustered and random effects here is a supplemental table developed for our, now published, EARE paper.
In the pooled model where we treat each observation as a different consumer, 21 out of 24 coefficient estimates are statistically significant. In the clustered model that accounts for correlation among respondents but not the panel nature of the data, only four coefficients are statistically significant. In the random effects panel model 15 out of 24 coefficients are statistically significant. In the random parameters model with only a constant random parameter, which I'm told is equivalent to the random effects model, 17 coefficients are statistically significant. I don't know what the results would be if we clustered the random effects panel model since I don't have Stata, but would be happy to share the data if anyone wants to try it out.
Consumer surplus per meal is economically different in the clustered and random effects models, $39 and $27, respectively. The clustered model finds no evidence of hypothetical bias (i.e., the coefficient on the stated preference data [SP] is not statistically significant) while there is evidence of hypothetical bias in the random effects model. Looking at the raw data, I'm suspicious of that non-result.
I'll let readers judge if our standard errors are too small. Any comments that would help me understand this issue would be great, since it seems like journal referees who use Stata hate the idea of statistically significant coefficient estimates. :)
An example for your econometrics course (source: Chronicle of Higher Education):
If one would like to create a spreadsheet for one's use, here are links to the president and faculty salary data:
My publisher has notified me that I can purchase hard copies of my Climatopolis book for $2.26 each. This isn't good news in terms of my expected future royalties but demand curves do slope down. I am purchasing 200 copies and giving them away for free to my UCLA students. So, this should provide you with a lower bound on how much I care about my students! The only good news here is that Climatopolis continues to be talked about in unusual places such as the National Review and random Chinese blogs.
On the broad topic of climate change adaptation, I plan to do two things:
1. I will be writing a historical migration paper with Leah Boustan and Paul Rhode on how U.S. migrants responded to past disasters.
2. I plan to write a short overview paper listing the open microeconomic research agenda (both in terms of reduced form slop and fancier structural work) on how to pin down the costs of climate adaptation.
A few comments: