Wednesday, October 20, 2010

Safety in Numbers in Jacobsen's paper

EDITED on October 21, 2010

An interesting discussion on "Safety in Numbers" (SIN) at the Washcycle forced me to think more clearly about the Jacobsen paper supporting the hypothesis.  Long story short, by my assessment, the regressions in Jacobsen's paper are consistent with SIN but fall short of convincing evidence.  That is, given the form of his model, one could easily estimate a slope parameter less than 1 without more cyclists leading to a decreased risk of accidents/injuries.  

Let's consider Jacobsen's model of California bicyclists where he regresses ...

log ACC - log POP = a + b ( log CYC - log COM )

where ACC is the number of accidents/injuries, POP is total population, CYC is the number of cyclists, and COM is the number of commuters.  Jacobsen claims that an estimated b parameter (in the equation above) of less than 1 means that as the proportion of cyclists in a city increases, proxied by the proportion of cycling commuters among all commuters, the risk of an accident/injury decreases.  A simple model of the underlying data can be described in the following manner:

log POPi ~ N ( u , s )

log CYCi = a1 + b1 log POPi + e1
log ACCi = a2 + b2 log CYCi + e2
log COMi = a3 + b3 log POPi + e3

For simplicity, let's assume that everything is normally distributed -- i.e., the variables themselves are lognormal -- and that the error terms are independent.  Suppose we are in a world where everything is proportional to population, such that b1 = b2 = b3 = 1.  Applied to the California regression in the Jacobsen paper, and dropping the city subscripts, we get ...

Left Hand Side ...
log ACC - log POP = a2 + a1 + e1 + e2

Right Hand Side ...
log CYC - log COM = a1 - a3 + e1 - e3
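To spell out the algebra behind the left-hand side (the right-hand side works the same way, subtracting the COM equation from the CYC equation), substituting the model equations with b1 = b2 = 1:

```latex
\begin{aligned}
\log ACC - \log POP &= a_2 + b_2 \log CYC + e_2 - \log POP \\
                    &= a_2 + (a_1 + b_1 \log POP + e_1) + e_2 - \log POP \\
                    &= a_1 + a_2 + e_1 + e_2 .
\end{aligned}
```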

Now suppose we follow Jacobsen's paper and do a regression of ...

left-hand side = A + B right-hand side

... and consider the effects of the variability of the error terms on an estimate of B.  The accident error term, e2, functions just like the error term of a classic regression.  Consequently, increasing or decreasing its variability should do little to the estimated regression coefficient B.

This leaves us with e1 and e3.  Changing the relative variability of e1 and e3 will affect the estimate of B.  Because e1 appears on both sides of the regression, increasing its relative variability pushes estimates of B toward 1.  Increasing the relative variability of e3 gives us something synonymous with a classic errors-in-variables problem, where the estimate of B is biased toward zero.  This is straightforward to simulate, and I have done so with an EXCEL spreadsheet.  Unfortunately, I haven't had an opportunity to play with the Google Docs spreadsheet software to upload it and make it directly available.  However, I am more than happy to share it with anyone who contacts me.  In the case where the variances -- the percentage variability, since we're dealing with logarithms -- are equal, I observe estimates of B of approximately 0.5. 
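For readers without the spreadsheet, here is a minimal sketch of the same simulation in Python/NumPy.  The intercepts and error variance follow the footnote below; the mean and variance of log population are arbitrary choices of mine, since (as noted) they shouldn't matter for the slope.  Under this setup the population slope is Cov(e1 + e2, e1 - e3) / Var(e1 - e3) = Var(e1) / (Var(e1) + Var(e3)), which is 0.5 when the variances are equal -- even though there is no true SIN effect in the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                     # number of simulated "cities"

a1 = a2 = np.log(0.1)          # intercepts, from the footnote
a3 = np.log(0.6)
s = 0.1                        # sd of each error term (variance 0.01)

# Model with b1 = b2 = b3 = 1: everything proportional to population.
log_pop = rng.normal(10.0, 1.0, n)             # arbitrary u and s
log_cyc = a1 + log_pop + rng.normal(0, s, n)   # e1
log_acc = a2 + log_cyc + rng.normal(0, s, n)   # e2
log_com = a3 + log_pop + rng.normal(0, s, n)   # e3

lhs = log_acc - log_pop        # log accident risk per capita
rhs = log_cyc - log_com        # log cycling-share proxy

# OLS slope of lhs on rhs
B = np.cov(rhs, lhs)[0, 1] / np.var(rhs, ddof=1)
print(B)                       # close to 0.5, not 1, despite no SIN effect
```

Rerunning with the variance of e3 shrunk toward zero pushes B back toward 1, matching the errors-in-variables story above.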

I've put some effort into working this out analytically and allowing for more complex relationships, but at this rate the boy will be mashing a 52/11 chainring/cog combination by the time I get something sensible.  Consequently, I produced the example above to demonstrate that we should have some skepticism regarding the estimates.  Just to be clear, there are other ways one could (reasonably) produce a biased estimate signaling SIN.  But the given example is analytically straightforward, easy to simulate, and, in my opinion, quite plausible in the real world. 

Some general comments ...

I want to thank Jim Titus and Jonathan Krall for continuing to object.  It made me use some brain matter that has been getting dusty.  Interestingly, as I worked through the issue, I became less convinced by John Forester's argument that the regression is inherently tainted by the relationship between population and cyclists.  A paper on pedestrian SIN in Oakland addressed the regression issue too -- they cited Brindle (1994) as presenting the argument that randomly generated data with no relationship could produce an apparent SIN effect -- and found that, when properly weighted, the "faulty" regression yielded consistent parameter estimates on the simulated data.  I'm convinced that if done carefully, one could estimate a ratio model and get appropriate results, although I think there are better ways to get at SIN. 

With that in mind, I've downloaded the California data in the Jacobsen paper.  We -- Robin Fisher and I -- are still wrapping up our re-analysis of the Wachtel and Lewiston paper on intersections.  But my over-the-weekend thoughts suggest that this should be straightforward. 

Special thanks to ...

Robin Fisher.  Some of the work here is the result of his insights.

Footnote ...

The predictor's numerator in Jacobsen's model is actually the number of commuting cyclists.  But as one can see, adding another equation that conditions cycling commuters on all commuters would make the model more complex without adding any insights.

If one is wondering, when I simulated my model ...

a1 = a2 = log(0.1)
a3 = log(0.6)

all of the error terms were drawn from N( 0 , 0.01 )