Wednesday, October 20, 2010

Safety in Numbers in Jacobsen's paper

EDITED on October 21, 2010

An interesting discussion on "Safety in Numbers" (SIN) at the Washcycle forced me to think more clearly about the Jacobsen paper supporting the hypothesis.  Long story short, by my assessment, the regressions in Jacobsen's paper are consistent with SIN but fall short of convincing evidence.  That is, given the form of his model, one could easily estimate a slope parameter less than 1 without more cyclists leading to decreased risk of accidents/injuries.

Let's consider Jacobsen's model of California bicyclists where he regresses ...

log ACC - log POP = a + b ( log CYC - log COM )

where ACC is the number of accidents/injuries, POP is total population, CYC is the number of cyclists, and COM is the number of commuters.  Jacobsen claims that an estimated b parameter below 1 means that as the proportion of cyclists in a city increases, proxied by the proportion of cycling commuters among all commuters, the risk of an accident/injury decreases.  A simple model of the underlying data can be described in the following manner:

log POPi ~ N ( u , s )

log CYCi = a1 + b1 log POPi + e1
log ACCi = a2 + b2 log CYCi + e2
log COMi = a3 + b3 log POPi + e3

For simplicity, let's assume that everything is normally distributed -- i.e., the untransformed variables are lognormal -- and independent.  Suppose we are in a world where everything is proportional to population such that b1 = b2 = b3 = 1.  Applied to the California regression in the Jacobsen paper and dropping subscripts for cities we get ...

Left Hand Side ...
log ACC - log POP = a2 + a1 + e1 + e2

Right Hand Side ...
log CYC - log COM = a1 - a3 + e1 - e3

Now suppose we follow Jacobsen's paper and do a regression of ...

left-hand side = A + B right-hand side

... and consider the effects of the variability of the error terms on an estimate of B.  The accident error term, e2, functions just like the error term of a classic regression.  Consequently, increasing/decreasing the variability of that term should do little to the estimated regression coefficient B.

This leaves us with e1 and e3.  Increasing the relative variability of e1 and e3 will affect the estimate of B.  Increasing the relative variability of e1 increases the efficiency of the regression: since the term appears on both sides, we will get estimates of B close to 1.  Increasing the relative variability of e3 gives us something akin to a classic errors-in-variables problem, where the estimate of B is biased towards zero.  This is straightforward to simulate, and I have done so with an EXCEL spreadsheet.  Unfortunately, I haven't had an opportunity to play with the Google docs spreadsheet software to upload it and make it directly available.  However, I am more than happy to share it with anyone who contacts me.  In the case where the variances -- the percentage variability since we're dealing with logarithms -- are equal, I observe estimates of B of approximately 0.5.

I've put some effort into working this out analytically and allowing more complex relationships, but at this rate the boy will be mashing a 52/11 chainring/cog combination by the time I get something sensible.  Consequently, I produced the example above to demonstrate that we should have some skepticism regarding the estimates.  Just to be clear, there are other ways one could (reasonably) produce a biased estimate signalling SIN.  But the given example is analytically straightforward, easy to simulate, and, in my opinion, quite plausible in the real world.
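For the simple example only (b1 = b2 = b3 = 1, independent errors), here is a sketch of why the equal-variance case lands near 0.5.  The symbols sigma_1^2 and sigma_3^2 are shorthand introduced here for the variances of e1 and e3; they are not Jacobsen's notation.  The constants a1, a2, a3 drop out of the covariance, and e2 is independent of the right-hand side, so in large samples the least-squares slope should tend to:

```latex
\begin{align*}
B \;\to\; \frac{\operatorname{Cov}(e_1 + e_2,\; e_1 - e_3)}{\operatorname{Var}(e_1 - e_3)}
  \;=\; \frac{\sigma_1^2}{\sigma_1^2 + \sigma_3^2},
\qquad
\sigma_1^2 = \sigma_3^2 \;\Rightarrow\; B \to \tfrac{1}{2}.
\end{align*}
```

Read this way, B sits near 1 when e1 dominates and shrinks towards 0 as e3 dominates, even though nothing in the setup makes cycling safer as the number of cyclists grows.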

Some general comments ...

I want to thank Jim Titus and Jonathan Krall for continuing to object.  It made me use some brain matter that has been getting dusty.  Interestingly, as I worked through the issue, I became less convinced by John Forester's argument that the regression is (inherently) tainted by the relationship between population and cyclists.  A paper on pedestrian SIN in Oakland addressed the regression issue too -- they cited Brindle (1994) as presenting the argument that randomly generated data with no relationship could produce an apparent SIN effect -- and found that when properly weighted, the "faulty" regression yielded consistent parameter estimates of the simulated data.  I'm convinced that if done carefully, one could estimate a ratio model and get appropriate results, although I think that there are better ways to get at SIN.

With that in mind, I've downloaded the California data in the Jacobsen paper.  We -- Robin Fisher and I -- are still wrapping up re-analyzing the Wachtel and Lewiston paper on intersections.  But my over-the-weekend thoughts suggest that this should be straightforward.

Special thanks to ...

Robin Fisher.  Some of the work here is the result of his insights.

Footnote ...

The predictor's numerator in Jacobsen's model is actually the number of commuter cyclists.  But as one can see, adding another equation that relates cycling commuters to all commuters would make the model more complex without adding any insights.

If one is wondering, when I simulated my model ...

a1 = a2 = log(0.1)
a3 = log(0.6)

all of the error terms were ~ N( 0 , 0.01 )
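For anyone who would rather not wait for the spreadsheet, below is a minimal Python sketch of the same kind of simulation.  It is not the EXCEL file itself.  The assumptions are mine: 0.01 is treated as a variance (standard deviation 0.1), the mean and spread of log POP (11.0 and 1.0) and the number of simulated cities are made-up values, and the function name simulate_slope is hypothetical.

```python
import math
import random


def simulate_slope(n_cities=10_000, sd_e1=0.1, sd_e2=0.1, sd_e3=0.1, seed=0):
    """Simulate the simple model with b1 = b2 = b3 = 1 and return the
    least-squares slope B from regressing
    (log ACC - log POP) on (log CYC - log COM)."""
    rng = random.Random(seed)
    a1 = a2 = math.log(0.1)   # intercepts from the footnote
    a3 = math.log(0.6)

    xs, ys = [], []
    for _ in range(n_cities):
        # Assumed distribution of log population (not given in the post).
        log_pop = rng.gauss(11.0, 1.0)
        log_cyc = a1 + log_pop + rng.gauss(0.0, sd_e1)
        log_acc = a2 + log_cyc + rng.gauss(0.0, sd_e2)
        log_com = a3 + log_pop + rng.gauss(0.0, sd_e3)
        ys.append(log_acc - log_pop)   # left-hand side
        xs.append(log_cyc - log_com)   # right-hand side

    # Ordinary least-squares slope: sample covariance over sample variance.
    mean_x = sum(xs) / n_cities
    mean_y = sum(ys) / n_cities
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Equal error variances, then e3 much smaller / much larger than e1.
    for sd_e3 in (0.1, 0.01, 0.3):
        print(sd_e3, simulate_slope(sd_e3=sd_e3))
```

Varying sd_e3 relative to sd_e1 is the interesting knob: a small sd_e3 should push the slope towards 1, and a large one should pull it towards 0, with no safety effect built into the simulated data.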


  1. You analyzed Jacobsen's statement of theory using four variables: number of accidents, population, number of cyclists, number of commuters. Jacobsen presents graphs which he claims demonstrate his theory, and all of these graphs are based on three variables forming two ratios. Many people have been persuaded by the prettiness of these declining curves. But there are two problems. Any set of data from three variables will produce such pretty declining curves when presented in Jacobsen's manner. The curves are not graphs of Jacobsen's theory at all, for the four variables in the theory cannot be condensed into only three. Therefore, the graphs are not evidence that supports Jacobsen's theory; indeed, they are evidence only of mathematical ineptitude.

  2. Hi John,

    I'm terribly sorry for the delay in response. I expected a notice in my inbox when someone posted.

    Jacobsen does *two* things.

    First, he does regressions in the manner I describe. He then claims that one can simply interpret those coefficients as evidence of Safety in Numbers if the estimated slope parameter is significantly less than one.

    Second, to demonstrate the effects, he converts both sides of the regression to the linear scale, divides both sides by the regressor, and then plots the data. I believe the line that he graphs is from the regression. So certainly the data points are subject to the natural bias you describe. Jacobsen's documentation has gaps; but my interpretation is that the line graphed on the plots is from the regression where the slope parameter is the estimated coefficient minus one. So the line could be unbiased assuming he does the variance correction for the log to linear transformation.

    Mind you, there is a disconnect between the population groups when he converts the regressions to graphs. He uses population on the left hand side and commuters on the right hand side. But when he creates his "accident rate", he divides both sides by proportion of commuters that cycled to work. So there is a potentially important difference there as well.

    More generally, whether the regressions are biased depends on, among other things, the connection between total population and the true number of cyclists. Some reasonable specifications would lead to unbiased estimates in the absence of other effects.

    Anyway, I think we both agree that Jacobsen's paper falls far short of a SIN proof.

  3. Whoops ... my reply was interrupted by the babies. But I think that the general message is consistent.