Massachusetts Cities and Coronavirus

Last week, I put together a few graphs of Massachusetts covid-19 county cases and correlates. Unfortunately, working at the county level didn’t create many data points, so there wasn’t much insight to be gained. I ended the blog post with a wishy-washy pledge to maybe try to compile a city-level data set.

Well, I didn’t do it. But the Massachusetts Department of Public Health did! Starting April 15, the DPH began tracking coronavirus cases in Massachusetts’s 351 towns and cities. I took those case counts and paired them with demographic and economic data from the 2018 American Community Survey to see if any patterns emerged.

Due to some quirks of the Census, it would have been really tedious to get the data I wanted for every town and city. I ended up opting for a shortcut of sorts — using cities and towns that were their own Census Designated Place. (Don’t ask.) After all was said and done, I was left with 54 cities and towns that I could pair with data from the 2018 American Community Survey.

This data set is biased toward larger areas, which probably means younger and more diverse areas, too. And, of course, all cities and towns in the set are from Massachusetts, which probably introduces other biases.

Anyway, here’s what I’ve found so far:

(Note: I’m using a base-10 log scale to condense case counts and, in the case below, population. Boston, for example, has over 4,000 cases [log10(4,000) = 3.602], while Springfield has about 600 [log10(600) = 2.778].)

Log cases are best predicted by log population. This is pretty much common sense: more people means more vectors for disease, and usually denser population. A regression summary is in the caption below the graph.

log(cases) = 1.3472 * log(population) - 4.0471
R^2 = .7526
p-value < 2.2e-16
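
To get a feel for what that fitted line implies, here is a minimal sketch that plugs a population into the equation and converts the result back to a case count. It assumes the logs are base 10 (as in the note above); the population values and the function name are made up for illustration, not taken from the data set.

```python
import math

# Coefficients copied from the regression summary above.
SLOPE = 1.3472
INTERCEPT = -4.0471


def predicted_cases(population: float) -> float:
    """Evaluate the fitted line in log10 space, then back-transform to cases."""
    log_cases = SLOPE * math.log10(population) + INTERCEPT
    return 10 ** log_cases


# Hypothetical population sizes, chosen only to show the shape of the curve.
for pop in (10_000, 100_000):
    print(pop, round(predicted_cases(pop)))
```

Because the slope is above 1, the model predicts that a town with ten times the population has roughly 22 times the cases (10^1.3472), not just ten times.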

After population, race is the next-best indicator of case counts. I’d expected — based on media reports and the word of local officials — to find a relationship between the percentage of black or Hispanic residents and log cases. But that didn’t really show up. Instead, the proportion of residents that are non-Hispanic whites has the best linear relationship to log cases — and the only one with a negative slope. The regression summary statistics in the caption are only for the white-log case relationship.

log(cases) = -1.8623 * white + 3.5474
R^2 = .545
p-value: 1.887e-10

This does actually fit the narrative pretty nicely if we lump all non-white ethnic groups together mentally. But as I noted in the last post, race is collinear with so many other variables that it’s hard to know what we’re seeing.

There is, however, at least some indication that what we’re observing might really be about race. Other variables we would expect to correlate with race and population have much weaker relationships to case counts. (“Public transport” is the percentage of people who take public transport to work, and “Poverty line” is the percentage of people below the poverty line.) I grabbed a bunch of variables like these, but so far, none of them are very helpful in explaining what’s going on.

Combining log population and the proportion of residents who are white gives us an adjusted R^2 of .8209, which is nice. But when I tried to use that model to predict case counts for four municipalities that weren’t among the 54 I’d defaulted into working with, it only did okay.
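
For anyone wondering why I quote adjusted R^2 here rather than plain R^2: adding a predictor can never lower plain R^2, so the adjusted version applies a penalty for each extra variable. The sketch below is just the standard textbook formula, not output from my model. The example call assumes the population-only regression used all 54 municipalities, which is my assumption for illustration.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a model with p predictors (plus intercept) fit on n rows."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Population-only model: R^2 = .7526, one predictor, assumed n = 54.
print(round(adjusted_r2(0.7526, n=54, p=1), 4))
```

With only one or two predictors and 54 rows the penalty is small, so the jump to .8209 mostly reflects real extra fit within the sample, not just the extra variable.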

The problem, I think, is that the selection of cities in the data set I’m working with is biased toward larger areas. It’s also possible (in fact, likely) that there’s more to the story and that an element of randomness is at play, too. Aside from age, it’s been hard for professionals to isolate significant risk factors.

A few MA covid graphs

This is a low-stakes post.

Massachusetts has been releasing county-level coronavirus case counts, which I paired with data from the US Census to look for patterns. I actually didn’t end up finding anything particularly interesting, but some of the graphs are nice, so I thought I’d share.

On case growth

A few days ago, it looked like the growth of covid cases in Massachusetts might be flattening. But as of yesterday, it seems like that’s not quite the case across the board. Here are the total case counts per 1,000 residents of each county since March 15:

Dukes and Nantucket counties omitted.

And here are cases per 1,000 residents on April 7, with the geometric growth rate of cases over the last week indicated by color:

Growth rate calculated as (x_1/x_0)^(1/7)-1
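
In case the formula is opaque, here is a small sketch of it. x_0 is the case count at the start of the week and x_1 the count at the end; the function name and the example numbers are mine, not county data.

```python
def daily_growth_rate(x0: float, x1: float, days: int = 7) -> float:
    """Geometric (compound) average daily growth between two counts."""
    if x0 <= 0 or days <= 0:
        raise ValueError("x0 and days must be positive")
    return (x1 / x0) ** (1 / days) - 1


# Hypothetical example: a count that doubles from 100 to 200 in a week.
print(f"{daily_growth_rate(100, 200):.1%}")
```

As a reference point, a case count that doubles in a week works out to about 10.4% growth per day.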

Berkshire, Barnstable, and Franklin counties have the lowest case growth rates, ranging from 6.2% to 7.3% on average per day over the last week. These counties have some common characteristics:

  • They’re geographically remote;
  • They are the only MA counties to have experienced population decline over the last decade;
  • They have the highest shares of non-Hispanic white residents and the fewest foreign residents;
  • They have the greatest proportion of residents over 65 (at least 22% in each case!);
  • Franklin and Berkshire counties have the lowest population densities, at 102 and 141 people per square mile, respectively.

To me, the above is consistent with the idea that economic activity is a vector for the spread of coronavirus (not literally, but it brings people into contact, which leads to person-to-person transmission).

Plymouth, Hampden, and Bristol are the counties with the fastest-growing case counts, each of them averaging an increase of over 11% daily over the last week. These counties don’t have much in common, so I’m having trouble putting together a potential unifying narrative.

Race and population density as correlates

It’s starting to look like black Americans might be more susceptible to coronavirus than other racial and ethnic groups. At first glance, that appears to show up in county-level data. But upon closer examination, that doesn’t appear to be the case — first because population density and the percentage of black residents are collinear, and population per square mile has a higher correlation coefficient; and second because Suffolk county (Boston) is influencing the linear relationship in both cases. Adjusted R-squared drops heavily if we exclude Suffolk county from the data set. (Race and population density were the best predictors I found of cases per 1,000 residents.)
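
To show how much a single point like Suffolk county can drive a linear relationship, here is a toy sketch that computes a Pearson correlation with and without one extreme observation. The numbers are synthetic and chosen to exaggerate the effect; they are not the county data, and the variable names are only labels.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic "density" and "cases per 1,000" values with no real trend...
density = [1, 2, 3, 4, 5]
cases = [2, 1, 3, 1, 2]
# ...plus one extreme point standing in for a dense outlier county.
density_all = density + [20]
cases_all = cases + [20]

print("without outlier:", round(pearson_r(density, cases), 3))
print("with outlier:   ", round(pearson_r(density_all, cases_all), 3))
```

In this toy set the five ordinary points are uncorrelated, yet adding the one extreme point produces a correlation above 0.9. That is the kind of effect to rule out by refitting without Suffolk.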

This isn’t to say race and its many correlates aren’t good predictors. I think it speaks more to the (severe) limits of the data set I’m working with. If I have time, I may try to build a city-level data set. If anyone knows of one (or something better), link me!