knitr::opts_chunk$set(message = FALSE, warning = FALSE)
The package renv
is installed to the project to maintain
a reproducible workflow in RStudio. renv
allows for the
creation of a local environment with a private library to manage the
project’s R dependencies.
# install.packages("renv")
# Initialize new local environment with private library
# renv::init()
The tidyverse
is an opinionated collection of R packages
designed for data science. All of its packages share an underlying
design philosophy, grammar, and data structures.
Each one has tools created specifically for different points
throughout the process. Utilizing tidyverse
, specific steps
are followed in undertaking a data science project.
# install.packages("tidyverse")
library(tidyverse)
Many functions used throughout the programming of the project to
manipulate data are provided by the dplyr
package. It is
called often to extract, modify and combine tibbles based on specified
conditions.
The package odbc
is installed to setup connections to
the databases in RStudio via Microsoft ODBC (Open Database Connectivity)
drivers. It serves as a backend which the DBI
package uses
to define a frontend communication interface in RStudio.
# install.packages("odbc")
library(odbc)
# List available ODBC User data sources on system
# odbcListDataSources()
odbc
requires the installation of MySQL Connector/ODBC.
ODBC User data sources are added through Microsoft ODBC Data Source
Administrator using MySQL ODBC 8.0 ANSI Driver. Connection parameters
are specified in MySQL Connector/ODBC.
Connections are created from User DSNs (Data Source Name) by specifying its name.
# Create connection to simplemaps counties database
con_us_counties <- dbConnect(odbc(), "simplemaps")
# Create connection to 2021spfdb COVID-19 deaths database
con_covid19_deaths <- dbConnect(odbc(), "2021spfdb")
Tables fetched from the submitted SQL queries are imported as
tibbles. Tibbles are a table format provided by the tibble
package. They inherit R’s traditional data.frame class, but have
improved behaviors.
# Create SQL query string
query_us_counties <- "SELECT * from uscounties"
# Fetch queried table from as tidyr tibble
us_counties <- dbGetQuery(con_us_counties,query_us_counties) %>%
as_tibble()
us_counties
The map_id
column in the county_data table contain FIPS
(Federal Information Processing Standards) codes of each corresponding
county. It is renamed as county_fips to match that of the uscounties
table.
Rows are retrieved where the date value is 2022-05-13
to
maintain accuracy. This is because the data in the simplemaps United
States Counties Database was updated as of May 13, 2022.
# Create SQL query string
query_covid19_deaths <- "SELECT map_id as county_fips, cases, deaths FROM county_data WHERE date='2022-05-13'"
# Fetch queried table as tidyr tibble
covid19_deaths <- dbGetQuery(con_covid19_deaths, query_covid19_deaths) %>%
as_tibble()
covid19_deaths
dbDisconnect(con_us_counties)
dbDisconnect(con_covid19_deaths)
rm(con_us_counties, con_covid19_deaths, query_us_counties, query_covid19_deaths)
We also utilize the the national-level median household income from the U.S. Census Bureau as a means of categorizing a county’s own median household income in comparison to that of the whole country.
library(readxl)
us_states_income <- read_xlsx(
"us_census_median_household_income.xlsx",
col_names = c("state_name","median_income_state"),
range = "A10:B60"
)
us_states_income
The simplemaps United States Counties Database contains seventy-eight fields for each county. However, a number of these are considered irrelevant and are not included.
us_counties_income <- us_counties %>%
select(
county_fips,
county,
state_id,
state_name,
population,
density,
income_household_median
)
us_counties_income
The simplemaps United States Counties Database includes data from the US territories - American Samoa, Guam, Northern Mariana Islands, Puerto Rico, and Virgin Islands.
This is excluded because the county_data
table contains
data excluding that of US territories. Additionally, the data for US
territories in the uscounties
table is filled with missing
values.
# Create list of state_id values of US territories
territory_ids <- c("AS","GU","MP","PR","VI")
# Create tibble of US territories
us_counties_income %>%
filter(state_id %in% territory_ids)
# Create %notin% function
`%notin%` <- Negate(`%in%`)
# Create tibble of US states
us_counties_income <- us_counties_income %>%
filter(state_id %notin% territory_ids)
us_counties_income
rm(territory_ids)
Handling missing values is mandatory in any data science project. The
tidyr
package offers different approaches to this. However,
through inspecting the tibbles, neither of them contain any NA
values.
# Display tibble of missing values in counties_states
us_counties_income %>%
filter(if_any(everything(), is.na))
# Display tibble of missing values in covid_deaths
covid19_deaths %>%
filter(if_any(everything(), is.na))
After tidying the tibbles, both will be combined on the
county_fips
column. This is executed through
dplyr
’s inner_join
function which retains only
rows with matches between both.
# Merge us_counties and covid19_deaths
us_counties_income_deaths <-
inner_join(us_counties_income, covid19_deaths, by = "county_fips") %>%
arrange(county_fips) # Arrange by county_fips column
us_counties_income_deaths
rm(us_counties, covid19_deaths)
# Calculate the death rate per 100,000 population
us_counties_income_deaths <- us_counties_income_deaths %>%
mutate(death_rate = round(((deaths / population) * 100000), digits = 2)) %>%
# Move the 'death_rate' column next to 'deaths'
relocate(death_rate, .after = deaths)
us_counties_income_deaths
# Retrieve the national median household income in the 'us_states_income' dataframe
us_income_household_median <- us_states_income$median_income_state[1]
# Calculate the total population across all counties in the 'us_counties_income_deaths' dataframe
us_population <- us_counties_income_deaths %>%
pull(population) %>% sum()
# Calculate the total number of COVID-19 deaths across all counties in the 'us_counties_income_deaths' dataframe
us_covid19_deaths <- us_counties_income_deaths %>%
pull(deaths) %>% sum()
# Calculate the COVID-19 death rate per 100,000 population for the entire dataset
us_covid19_death_rate <- (us_covid19_deaths / us_population) * 100000
rm(us_states_income)
# Add a new column 'category' with initial NA values to the 'us_counties_income_deaths' dataframe
us_counties_income_deaths <- us_counties_income_deaths %>%
add_column(category = NA)
# Loop through each row in the 'us_counties_income_deaths' dataframe
for (i in 1:nrow(us_counties_income_deaths)) {
# Check if the median household income is less than the overall median income
if (us_counties_income_deaths$income_household_median[i] < us_income_household_median) {
# Check if the death rate is less than the overall COVID-19 death rate
if (us_counties_income_deaths$death_rate[i] < us_covid19_death_rate) {
# Assign the category "Low Income, Low Death Rate" if both conditions are met
us_counties_income_deaths$category[i] <- "Low Income, Low Death Rate"
}
else {
# Assign the category "Low Income, High Death Rate" if the income condition is met but not the death rate condition
us_counties_income_deaths$category[i] <- "Low Income, High Death Rate"
}
}
else {
# Check if the death rate is less than the overall COVID-19 death rate
if (us_counties_income_deaths$death_rate[i] < us_covid19_death_rate) {
# Assign the category "High Income, Low Death Rate" if the income condition is not met but the death rate condition is met
us_counties_income_deaths$category[i] <- "High Income, Low Death Rate"
}
else {
# Assign the category "High Income, High Death Rate" if neither condition is met
us_counties_income_deaths$category[i] <- "High Income, High Death Rate"
}
}
}
rm(i, us_income_household_median, us_covid19_death_rate)
# install.packages("plotly")
library(plotly)
# Create a scatter plot using Plotly
us_counties_scatter <- plot_ly(
us_counties_income_deaths,
x = ~income_household_median / 1000,
y = ~death_rate,
type = 'scatter',
mode = 'markers',
text = paste(us_counties_income_deaths$county, us_counties_income_deaths$state_id, sep = ", ")
)
# Customize the layout of the scatter plot
us_counties_scatter <- us_counties_scatter %>%
layout(
# title = "Median Household Income vs COVID-19 Death Rate in U.S. Counties (05-13-2022)",
xaxis = list(title = 'Median Household Income ($K)'), # X-axis label
yaxis = list(title = 'COVID-19 Mortality Rate (per 100K)') # Y-axis label
)
us_counties_scatter
# Perform a correlation test between 'income_household_median' and 'death_rate'
cor.test(
us_counties_income_deaths$income_household_median,
us_counties_income_deaths$death_rate
)
##
## Pearson's product-moment correlation
##
## data: us_counties_income_deaths$income_household_median and us_counties_income_deaths$death_rate
## t = -31.099, df = 3132, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5120385 -0.4585188
## sample estimates:
## cor
## -0.4857337
# Create a box plot for 'income_household_median'
plot_ly(
y = us_counties_income_deaths$income_household_median, # Data for the Y-axis
text = paste(us_counties_income_deaths$county, us_counties_income_deaths$state_id, sep = ", "),
type = "box", # Box plot type
name = "Median Household Income", # Trace name for the legend
)
# Create a box plot for 'death_rate'
plot_ly(
y = us_counties_income_deaths$death_rate, # Data for the Y-axis
text = paste(us_counties_income_deaths$county, us_counties_income_deaths$state_id, sep = ", "),
type = "box", # Box plot type
name = "COVID-19 Death Rate", # Trace name for the legend
)
# Compute the IQR of the household median income
q1_income <- quantile(us_counties_income_deaths$income_household_median, 0.25) # 1st Quartile
q3_income <- quantile(us_counties_income_deaths$income_household_median, 0.75) # 3rd Quartile
iqr_income <- q3_income - q1_income # Interquartile Range (IQR)
# Compute the IQR of the death rate
q1_death <- quantile(us_counties_income_deaths$death_rate, 0.25) # 1st Quartile
q3_death <- quantile(us_counties_income_deaths$death_rate, 0.75) # 3rd Quartile
iqr_death <- q3_death - q1_death # Interquartile Range (IQR)
# Identify potential outliers based on the IQR method
us_counties_outliers_death <- us_counties_income_deaths %>%
filter(
death_rate < q1_death - 1.5 * iqr_death | # Lower bound for death rate outliers
death_rate > q3_death + 1.5 * iqr_death | # Upper bound for death rate outliers
income_household_median < q1_income - 1.5 * iqr_income | # Lower bound for income outliers
income_household_median > q3_income + 1.5 * iqr_income # Upper bound for income outliers
)
us_counties_outliers_death
rm(q1_death, q3_death, iqr_death)
# Filter the 'us_counties_death_income' dataframe to select specific counties which are significant outliers
us_counties_outliers_significant <- us_counties_income_deaths %>%
filter(county_fips %in% c(36061, 48301, 48311))
us_counties_outliers_significant
# Create a scatter plot for all counties in blue
us_counties_scatter <- plot_ly(
data = us_counties_income_deaths,
x = ~income_household_median / 1000,
y = ~death_rate,
type = 'scatter',
mode = 'markers',
name = "County", # Legend label for counties
showlegend = FALSE # Hide the legend for this trace
)
# Add red markers for significant outliers
us_counties_scatter <- us_counties_scatter %>%
add_markers(
data = us_counties_outliers_significant,
marker = list(color = "red"), # Mark outliers in red
name = "Outlier" # Legend label for outliers
)
# Customize the layout of the scatter plot
us_counties_scatter <- us_counties_scatter %>%
layout(
# title = "Median Household Income vs COVID-19 Death Rate in U.S. Counties (05-13-2022)",
xaxis = list(title = 'Median Household Income ($K)'), # X-axis label
yaxis = list(title = 'COVID-19 Mortality Rate (per 100K)') # Y-axis label
)
us_counties_scatter
# install.packages("rjson")
library(rjson)
# Load GeoJSON data from a URL
us_counties_geojson <- fromJSON(file = "https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json")
# Configure settings for the choropleth map
us_counties_choropleth_config <- list(
scope = 'usa', # Set the map scope to USA
projection = list(type = 'albers usa'), # Set the map projection type
showlakes = TRUE, # Show lakes on the map
lakecolor = toRGB('white') # Set the color for lakes
)
library(RColorBrewer)
# Convert the 'category' column to a factor
us_counties_income_deaths <- us_counties_income_deaths %>%
mutate(category = factor(category)) %>%
# Create a new column 'category_numeric' and store the numeric representation of 'category'
mutate(category_numeric = as.numeric(category))
us_counties_income_deaths
# Generate a color palette using Brewer palette 'Set1'
foo <- brewer.pal(n = 6, name = "Set1")[c(6,3,1,2)]
# Assign names to the colors based on the levels of the 'category' column in 'us_counties_income_deaths'
names(foo) = levels(us_counties_income_deaths$category)
# Define a function to calculate breaks for the color scale
Z_Breaks = function(n) {
CUTS = seq(0, 1, length.out = n + 1)
rep(CUTS, ifelse(CUTS %in% 0:1, 1, 2))
}
# Create a data frame for the color scale
colorScale <- data.frame(
z = Z_Breaks(4), # Breaks for the color scale
col = rep(foo, each = 2), # Repeating the colors
stringsAsFactors = FALSE # Ensure columns are not treated as factors
)
# Create a choropleth map
us_counties_choropleth <- plot_ly(
data = us_counties_income_deaths, # Data source
type = "choropleth", # Choropleth map type
geojson = us_counties_geojson, # GeoJSON data for US counties
locations = ~county_fips, # Locations to map
z = ~category_numeric, # Color scale values
colorscale = colorScale, # Custom color scale
colorbar = list(
title = list(text = "Category", font = list(size = 16)), # Colorbar title
tickvals = c(1.37, 2.13, 2.88, 3.63), # Tick values
ticktext = names(foo), # Tick text labels
orientation = "h", # Horizontal colorbar
y = -0.2, # Y-position of the colorbar
reversescale = TRUE # Reverse the color scale
),
marker = list(line = list(width = 0)) # Set marker line width
)
# Set a title for the choropleth map
# us_counties_choropleth <- us_counties_choropleth %>%
# layout(title = "U.S. Counties by Median Household Income and COVID-19 Death Rates (05-13-2022)")
# Apply the geo configuration to the choropleth map
us_counties_choropleth <- us_counties_choropleth %>%
layout(geo = us_counties_choropleth_config)
us_counties_choropleth
rm(us_counties_geojson, us_counties_choropleth_config)
rm(foo, colorScale)