As evidence that I haven't totally kept my nose to the grindstone during my sabbatical, consider the following results on how topcoding the dependent variable affects regression estimates (click the image on the right).
The dependent variable is the number of days spent at the beach by North Carolina residents (DAYS), including zeros (n=1086). The data come from the most recent National Survey on Recreation and the Environment.
Excluding zeros, the mean number of days spent at the beach among CAMA county residents (DAYS) is 29 while the median is 7 (n=91).
As is typical with recreation data, some folks say they go to the recreation site every day. Weird. To handle the weird people, I topcoded the DAYS variable at the 90th percentile. In other words, I recoded 180, 300, 300, 365 and 365 days to 50 days. After topcoding DAYS at the 90th percentile, the mean number of days at the beach (DAYS2) is 15 and the median is still 7 for CAMA county residents.
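For anyone who wants to see the recode spelled out, here is a minimal Python sketch of topcoding. The `topcode` helper and the toy `days` list are my own illustration (hypothetical values, not the survey data); only the five large values and the cap of 50 come from the numbers above. The last few lines are a back-of-the-envelope check that uses the rounded mean of 29, so it is approximate.

```python
from statistics import mean, median

def topcode(values, cap):
    """Recode every value above `cap` down to `cap`; leave the rest unchanged."""
    return [min(v, cap) for v in values]

# Hypothetical toy sample (NOT the survey data): ten modest trip counts plus
# the five large values that were recoded in the post.
days = [1, 2, 3, 5, 7, 7, 10, 14, 20, 30, 180, 300, 300, 365, 365]
days2 = topcode(days, 50)

# The mean drops sharply; the median does not move.
print("before:", round(mean(days), 1), median(days))
print("after: ", round(mean(days2), 1), median(days2))

# Rough consistency check on the reported means, using the rounded mean of 29.
n = 91
recoded = [180, 300, 300, 365, 365]
total_before = 29 * n                                   # approximate sum of DAYS
total_after = total_before - sum(recoded) + 50 * len(recoded)
implied_mean_after = total_after / n
print("implied DAYS2 mean:", round(implied_mean_after, 1))
```

The implied mean after the recode comes out a little above 15, which lines up with the reported DAYS2 mean, and the unchanged median is what you would expect since only the top tail is touched.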
Analyzing everyone in NC (n=1086), those in close proximity to the beach (CAMA=1) spend more days there (duh, but that's an important economic result! i.e., price matters). Income doesn't matter (although it does predict whether someone participates in beach recreation). Race, education, sex and age also help explain the number of days. Topcoding can substantially change both the magnitudes and the statistical significance of the coefficient estimates. In other words, making outliers more like regular people (instead of discarding them) reduces the influence of outliers.
I like the topcoded DAYS2 model best, but that may be just me and my distaste for five weird people and the large impact they have on regression results. Typically, I topcode the heck out of my recreation data.
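To get a feel for how much one extreme response can move a coefficient, here is a small sketch. It is not the negative binomial model from the table: it swaps in a plain least-squares slope on a 0/1 proximity dummy, which is just the difference in group means. The data and the names `ols_slope`, `cama`, and `days` are made up for illustration.

```python
def ols_slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: five inland respondents (cama=0) and five coastal
# respondents (cama=1), one of whom reports going every day of the year.
cama = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
days = [0, 1, 2, 0, 3, 5, 7, 10, 4, 365]

# Topcode at 50, the same cap used in the post.
days2 = [min(d, 50) for d in days]

slope_raw = ols_slope(cama, days)
slope_topcoded = ols_slope(cama, days2)

print("slope with raw DAYS:      ", round(slope_raw, 1))
print("slope with topcoded DAYS2:", round(slope_topcoded, 1))
```

In this toy example a single respondent accounts for most of the raw slope. Capping that one value shrinks the estimate a lot while keeping the observation in the sample, which is the point of topcoding rather than dropping.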
Note: Check out the cool table feature that collects results from up to 3 models in LIMDEP. The only problem is that it is not displaying my title (Negative Binomial Model of ...) in the first row as advertised on page R8-19 of the manual.
Download: Ncbeach.lpj | Ncbeach.lim