library(janitor)
library(readxl)
library(sf)
library(stargazer)
library(tidyverse)
COVID-19 Deaths in North Carolina: Exploratory Data Analysis
In this project, I am going to be conducting an exploratory data analysis working with COVID-19 data from Johns Hopkins University. The data can be found on GitHub with some discussion and context on the JHU Coronavirus Resource Center website. I will be plotting that data geographically and bringing in other data sources to explore the relationship between COVID death rates and other demographic, socioeconomic, and health factors. This type of analysis could go on indefinitely until I run out of variables, so I will limit myself to just a few. Note that the purpose of this post is to explore the data and a few relationships and trends, and to demonstrate with R code how to summarize, visualize, and quantify these patterns. I am not attempting to build a predictive model or train an algorithm on this data.
I’ll go ahead and load in the packages I will be using.
Deaths By County
Deaths Data Set
First, read the data on death counts.
<- read_csv("time_series_covid19_deaths_US.csv", show_col_types = FALSE) deaths
I need to convert this very wide dataset, which has a column for each day, into a long dataset with a date
variable and a deaths
variable. I know that I will only be working with North Carolina data, so I can filter down to that. I also convert date
into an actual date type and trim off some extraneous columns, as well as cleaning up variable names. I also need to make FIPS into a character type.
<- deaths |>
deaths_long filter(Province_State == "North Carolina") |>
pivot_longer(
cols = matches(r"(\d{1,2}/\d{1,2}/\d{2})"),
names_to = "date",
values_to = "deaths"
|>
) mutate(
date = as_date(date, format = "%m/%d/%y"),
FIPS = as.character(FIPS)
|>
) clean_names() |>
rename(
county = admin2,
state = province_state
|>
) select(fips, county, state, population, date, deaths)
To get a bit better sense of what this data set looks like, I first zoom in on my hometown’s Henderson County, NC.
<- deaths_long |>
henderson_county filter(county == "Henderson")
I can make a quick plot of deaths over time.
ggplot(henderson_county, aes(x = date, y = deaths)) +
geom_line() +
labs(x = "Date", y = "Deaths")
This shows us that the deaths
variable is the cumulative total deaths over time, not new deaths reported at each data point.
County Borders
I got the GeoJSON file with the state and county borders from NC One Map.
<- read_sf("nc_borders.geojson") |>
borders clean_names()
Just a quick look at the map that provided:
ggplot(borders) + geom_sf() + theme_bw()
In the COVID deaths data frame, FIPS codes are listed with state concatenated with county, but the borders
data frame only lists the county. In order to be able to join the two together, I add the state FIPS code onto the borders
data frame:
<- borders |>
borders mutate(fips = paste0("37", fips))
Next, I want to get the overall total deaths for each county. I take the most recent date of data, because the death numbers reported are cumulative. I then calculate the number of deaths per thousand residents.
<- deaths_long |>
total_deaths slice_max(date) |>
mutate(deaths_per_thousand = (deaths / population) * 1000)
Joining the deaths data frame with the borders data frame:
<- borders |>
joined_deaths_and_borders inner_join(
total_deaths,by = "fips"
)
Results
Finally, I can plot the deaths per thousand over the map of NC counties.
ggplot(joined_deaths_and_borders, aes(fill = deaths_per_thousand)) +
geom_sf() +
scale_fill_distiller(palette = "YlOrRd", direction = 1) +
labs(fill = "Deaths per\nthousand residents") +
theme_bw()
Which counties had the most deaths per thousand residents?
|>
total_deaths slice_max(deaths_per_thousand, n = 10) |>
select(county, deaths_per_thousand)
# A tibble: 10 × 2
county deaths_per_thousand
<chr> <dbl>
1 Rutherford 6.46
2 Surry 5.86
3 Montgomery 5.45
4 Columbus 5.30
5 Jones 5.20
6 Northampton 4.98
7 Washington 4.92
8 Cherokee 4.79
9 Chowan 4.73
10 Richmond 4.73
Which counties had the fewest?
|>
total_deaths slice_min(deaths_per_thousand, n = 10) |>
select(county, deaths_per_thousand)
# A tibble: 10 × 2
county deaths_per_thousand
<chr> <dbl>
1 Dare 0.892
2 Camden 0.920
3 Orange 1.06
4 Wake 1.18
5 Durham 1.23
6 Franklin 1.38
7 Watauga 1.39
8 Currituck 1.44
9 Pitt 1.59
10 Mecklenburg 1.70
Median Income
I got data on unemployment and household income from the USDA Economic Research Service. I downloaded it in .xlsx
format because the csv somehow didn’t contain the same data. Even though I have the data for unemployment available, I am not going to consider that in this exploratory analysis. COVID lockdowns and the economic fallout of the pandemic had very obvious impacts on unemployment, but the relationship is more complicated than what I think I can accomplish here with the data that I have. So I will just be looking at median income by county for 2021. I will filter out the statewide total.
<- read_excel("UnemploymentReport.xlsx", skip = 2) |>
income clean_names() |>
select(fips, name, median_household_income_2021) |>
rename(median_income = median_household_income_2021) |>
filter(name != "North Carolina")
I can first get some summary statistics to get a sense of the distribution of the data..
summary(income$median_income)
Min. 1st Qu. Median Mean 3rd Qu. Max.
38613 45556 52680 55008 60517 91558
For the histogram, I include a reference line for the average median income.
ggplot(income, aes(x = median_income)) +
geom_histogram(binwidth = 2500, color = "black") +
geom_vline(aes(xintercept = mean(median_income), color = "mean")) +
scale_color_manual(name = "Reference", values = c(mean = "red")) +
labs(x = "Median Income", y = "Count")
Next, I can plot median income by county.
full_join(borders, income, by = "fips") |>
ggplot(aes(fill = median_income)) +
geom_sf() +
labs(fill = "Median Income") +
theme_bw()
Which counties have the highest median income?
|>
income slice_max(median_income, n = 10) |>
select(name, median_income)
# A tibble: 10 × 2
name median_income
<chr> <dbl>
1 Wake County, NC 91558
2 Union County, NC 87553
3 Chatham County, NC 82764
4 Currituck County, NC 82759
5 Orange County, NC 79814
6 Camden County, NC 79162
7 Cabarrus County, NC 79148
8 Mecklenburg County, NC 75138
9 Lincoln County, NC 73319
10 Durham County, NC 71436
And the lowest?
|>
income slice_min(median_income, n = 10) |>
select(name, median_income)
# A tibble: 10 × 2
name median_income
<chr> <dbl>
1 Robeson County, NC 38613
2 Halifax County, NC 38944
3 Tyrrell County, NC 39970
4 Hertford County, NC 40461
5 Northampton County, NC 40524
6 Anson County, NC 40773
7 Edgecombe County, NC 41157
8 Columbus County, NC 41206
9 Bertie County, NC 41280
10 Richmond County, NC 42158
I am curious whether there is a relationship between death counts and median income. To examine that, I will need to join the data sets.
<- inner_join(
deaths_and_income
total_deaths,
income,by = "fips"
)
I can do a plot of income versus death rate:
ggplot(deaths_and_income, aes(x = median_income, y = deaths_per_thousand)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Median Income", y = "Deaths per Thousand Residents")
There is a pretty clear negative correlation between median income and coronavirus death rate. We can quantify this using a simple regression model:
<- lm(
model_death_income ~ median_income,
deaths_per_thousand data = deaths_and_income
)stargazer(model_death_income, type = "html")
Dependent variable: | |
deaths_per_thousand | |
median_income | -0.0001*** |
(0.00001) | |
Constant | 6.945*** |
(0.443) | |
Observations | 100 |
R2 | 0.413 |
Adjusted R2 | 0.407 |
Residual Std. Error | 0.916 (df = 98) |
F Statistic | 69.060*** (df = 1; 98) |
Note: | p<0.1; p<0.05; p<0.01 |
The results are a bit hard to make sense of because the unit of median_income
is dollar, and so the coefficient on that term tells us the predicted change in deaths_per_thousand
per one additional dollar of income. That is a very small scale. Just to make it a little easier to parse, I will adjust the model to regress on median income in tens of thousands of dollars. That way, the coefficient will represent predicted change in death rate per additional $10,000 dollars.
<- lm(
model_death_income_adjusted ~ I(median_income / 10000),
deaths_per_thousand data = deaths_and_income
)
stargazer(model_death_income_adjusted, type = "html")
Dependent variable: | |
deaths_per_thousand | |
I(median_income/10000) | -0.654*** |
(0.079) | |
Constant | 6.945*** |
(0.443) | |
Observations | 100 |
R2 | 0.413 |
Adjusted R2 | 0.407 |
Residual Std. Error | 0.916 (df = 98) |
F Statistic | 69.060*** (df = 1; 98) |
Note: | p<0.1; p<0.05; p<0.01 |
We can interpret this as saying that for a change in median income of $10,000, the average death rate per thousand residents decreases by 0.65. Given that the entire range of the death rate variable covers just 0.9 to 6.5, that’s a substantial association. This is obviously a very simple model that doesn’t control for other variables, but it is an interesting first pass at that correlation.
Conclusion
This has been a really interesting and fun exercise. In general, the relationships that I examined were not particularly surprising, but I was shocked to see that there was no significant relationship between physician concentration and COVID death rate. It’s worth thinking about why that might be the case and exploring the relationship between primary care and severe acute health outcomes. In the future, I may revisit these data sets to build a machine learning model to predict COVID death rate from these and other covariates. Thanks for reading!