Massachusetts Cities and Coronavirus

Last week, I put together a few graphs of Massachusetts COVID-19 county cases and correlates. Unfortunately, working at the county level didn’t create many data points, so there wasn’t much insight to be gained. I ended the blog post with a wishy-washy pledge to maybe try to compile a city-level data set.

Well, I didn’t do it. But the Massachusetts Department of Public Health did! Starting April 15, the DPH began tracking coronavirus cases in Massachusetts’s 351 towns and cities. I took those case counts and paired them with demographic and economic data from the 2018 American Community Survey to see if any patterns emerged.

Due to some quirks of the Census, it would have been really tedious to get the data I wanted for every town and city. I ended up taking a shortcut of sorts: using only the cities and towns that are their own Census Designated Place. (Don’t ask.) After all was said and done, I was left with 54 cities and towns that I could pair with the survey data.

This data set is biased toward larger areas, which probably also means younger and more diverse areas. And, of course, all cities and towns in the set are from Massachusetts, which probably introduces other biases.

Anyway, here’s what I’ve found so far:

(Note: I’m using a base-10 log scale to condense case counts and, in the first graph below, population. Boston, for example, has over 4,000 cases [log10(4,000) ≈ 3.602], while Springfield has about 600 [log10(600) ≈ 2.778].)
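For anyone who wants to check the arithmetic, here is a small illustrative sketch. It assumes base-10 logs and uses the rounded counts from the note, not the exact DPH figures.

```python
import math

# Rounded case counts quoted in the note above (not exact DPH figures).
approx_cases = {"Boston": 4000, "Springfield": 600}

# Base-10 log of each count, rounded to three decimals as in the text.
log_cases = {city: round(math.log10(n), 3) for city, n in approx_cases.items()}

for city, value in log_cases.items():
    print(f"{city}: {value}")
```

A gap of roughly 0.8 on this scale corresponds to Boston having between six and seven times as many cases as Springfield.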

Log cases are best predicted by log population. This is pretty much common sense: more people means more potential carriers of the disease and, usually, a denser population. A regression summary is in the caption below the graph.

log(cases) = 1.3472 * log(population) - 4.0471
R^2 = .7526
p-value < 2.2e-16
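To make the fitted line a bit more concrete, here is a rough sketch that turns it back into case counts. It assumes the regression used base-10 logs (as in the note above), and the 100,000-person town is a hypothetical input, not one of the 54 municipalities.

```python
import math

# Coefficients from the regression summary above.
SLOPE = 1.3472
INTERCEPT = -4.0471

def predicted_cases(population):
    """Predicted case count, assuming the fit used base-10 logs."""
    log_cases = SLOPE * math.log10(population) + INTERCEPT
    return 10 ** log_cases

# Hypothetical town of 100,000 people (not a real data point).
example = predicted_cases(100_000)

# Because the slope is above 1, doubling population more than doubles cases.
# This factor does not depend on which log base was used.
doubling_factor = 2 ** SLOPE

print(round(example), round(doubling_factor, 2))
```

The doubling factor is the more robust takeaway: under this fit, a town twice as large is predicted to have roughly two and a half times as many cases.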

After population, race is the next-best indicator of case counts. Based on media reports and the word of local officials, I’d expected to find a relationship between the percentage of black or Hispanic residents and log cases. But that didn’t really show up. Instead, the proportion of residents who are non-Hispanic white has the best linear relationship to log cases, and it’s the only one with a negative slope. The regression summary statistics in the caption are only for the relationship between white share and log cases.

log(cases) = -1.8623 * white + 3.5474
R^2 = .545
p-value: 1.887e-10
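One way to read that slope is as a multiplier on predicted cases. The sketch below assumes "white" is a proportion between 0 and 1 and that the logs are base-10; neither is stated outright in the summary, so treat the result as a rough interpretation.

```python
# Coefficients from the regression summary above.
SLOPE = -1.8623
INTERCEPT = 3.5474

def predicted_cases_by_white_share(white_share):
    """Predicted cases; white_share is assumed to be a 0-1 proportion."""
    return 10 ** (SLOPE * white_share + INTERCEPT)

# Multiplier on predicted cases for a 10-percentage-point rise in white share.
ten_point_multiplier = 10 ** (SLOPE * 0.10)

print(round(ten_point_multiplier, 2))
```

Under those assumptions, each 10-point increase in white share goes with about a one-third drop in predicted cases. This model ignores population entirely, so the multiplier is more meaningful than any single predicted count.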

This does actually fit the narrative pretty nicely if we lump all non-white ethnic groups together mentally. But as I noted in the last post, race is collinear with so many other variables that it’s hard to know what we’re seeing.

There is, however, at least some indication that what we’re observing might really be about race. Other variables we would imagine to correlate with race and population have much weaker relationships to case counts. (“Public transport” is the percentage of people who take public transport to work, and “Poverty line” is the percentage of people below the poverty line.) I grabbed a bunch of variables like these, but so far, none of them are very helpful in explaining what’s going on.

Combining log population and the proportion of residents who are white gives us an adjusted R^2 of .8209, which is nice. But when I tried to use that model to predict case counts for four municipalities that weren’t among the 54 I’d defaulted into working with, it only did okay.
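For context on what "adjusted" means here, the sketch below shows the standard adjustment formula and inverts it to estimate the plain R^2. It assumes all 54 municipalities and exactly two predictors went into the fit, which the post implies but does not state.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2: penalizes plain R^2 for each added predictor."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def plain_r2(adj_r2, n_obs, n_predictors):
    """Invert the adjustment to recover plain R^2."""
    return 1 - (1 - adj_r2) * (n_obs - n_predictors - 1) / (n_obs - 1)

# Assumption: all 54 municipalities and 2 predictors went into the fit.
implied_r2 = plain_r2(0.8209, n_obs=54, n_predictors=2)
print(round(implied_r2, 4))
```

With 54 observations and two predictors the penalty is small, so the adjusted and plain values sit within about a hundredth of each other.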

The problem, I think, is that the selection of cities in the data set I’m working with is biased toward larger areas. It’s also possible (in fact, likely) that there’s more to the story and that an element of randomness is at play, too. Aside from age, it’s been hard for professionals to isolate significant risk factors.