Government, Maps

Working with the new 2020 Census data

Like a lot of data nerds I was excited to check out the new US Census apportionment data. It only comes once per decade after all. I thought it would be a good chance to work on something I really enjoy: plotting maps.

In this post we’ll take a simple statistic, state population growth from 2010 to 2020, and focus most of our energy on producing the plot. The full code will be available at the end of the post.

Data sources:

  1. 2010 Census
  2. 2020 Census
  3. Shapefile

1. Prepare the data.

Start with the imports. To produce this plot we’ll use GeoPandas, which is a package that (not surprisingly) looks and feels a lot like pandas. It keeps our data in a nice GeoDataFrame object on which we can eventually call .plot().

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
from matplotlib import cm

Since we’re plotting population change we’ll want the POPULATION2010 and POPULATION2020 columns in the same dataframe. Read both datasets and concatenate the dataframes together. Because axis=1 the dataframes are placed side-by-side rather than stacked one below the other. Fortunately our datasets have the same structure with state names in the same order, so we don’t have to worry about matching up rows.

df2010 = pd.read_csv("data/2010.csv")
df2020 = pd.read_csv("data/2020.csv")

df = pd.concat([df2010, df2020], axis=1)
df.set_index("STATE2010", inplace=True)

Create a column containing population change, then rate of change. This is the statistic we’ll use to color states on the map.

df.loc[:, "CHANGE"] = df["POPULATION2020"] - df["POPULATION2010"]
df.loc[:, "CHANGE_RATE"] = df["CHANGE"] / df["POPULATION2010"]

At this point df.head(3) looks like this:

           POPULATION2010 STATE2020  POPULATION2020  CHANGE  CHANGE_RATE
STATE2010                                                               
Alabama           4802982   Alabama         5030053  227071     0.047277
Alaska             721523    Alaska          736081   14558     0.020177
Arizona           6412700   Arizona         7158923  746223     0.116366

Since we’re plotting Alaska and Hawaii on separate inset axes we can’t use the built-in cmap parameter. We’ll have to write our own function to generate colors. State colors will be placed in a COLOR column that can be passed into plot(). Higher population growth is represented by darker greens and if a population shrank from 2010 to 2020, the function returns a shade of red.

def get_color(rate, max_rate):
    if rate >= 0:
        greens_cmap = cm.get_cmap("Greens", 100)
        return greens_cmap(rate / max_rate)
    else:
        reds_cmap = cm.get_cmap("Reds", 100)
        return reds_cmap(abs(rate) / max_rate)


df.loc[:, "COLOR"] = df["CHANGE_RATE"].apply(get_color, max_rate=df["CHANGE_RATE"].max())

Notice we haven’t yet created a GeoDataFrame. Our shapefile uses the EPSG:4326 coordinate system, i.e. normal latitude/longitude pairs. Although it isn’t necessary here, I’m in the habit of reprojecting my CRS to EPSG:3857. Remember to filter out any shapefile elements that aren’t in df (Puerto Rico and D.C.).

shapefile_path = "shapefile/cb_2018_us_state_20m.shp"
gdf = gpd.read_file(shapefile_path, epsg=4326)
gdf = gdf.to_crs(epsg=3857)
gdf = gdf[gdf["NAME"].isin(df["STATE2020"])

Next we need to merge the pandas dataframe, which contains color information, with the GeoDataFrame. To accomplish this we’ll first rename a column so that GeoPandas knows which rows to match.

gdf.rename(columns={"NAME": "STATE2020"}, inplace=True)
gdf = gdf.merge(df, on="STATE2020")

2. Plot the data on a map.

Now for the interesting part: calling gdf.plot(). On this axis we need to set x- and y-limits and limit our map to the continental United States. Alaska and Hawaii often present a problem when plotting a US map. The extreme west (east!) Aleutian Islands actually extend into the eastern hemisphere. So if we didn’t limit the window, GeoPandas would plot a zoomed-out figure containing all 360° of longitude. But I don’t want to pick on them too much. We’ll add Alaska and Hawaii inset axes next.

fig, ax = plt.subplots(figsize=(12, 7.5))
fig.subplots_adjust(left=0.0, right=1.0, bottom=0.0, top=0.925)

border_thickness = 0.4
border_color = "#0F0F0F"

us_map = gdf.plot(ax=ax, color=gdf["COLOR"], categorical=True,
                  edgecolor=border_color, linewidth=border_thickness)

us_map.set_xlim(-14.1e6, -7.24e6)
us_map.set_ylim(2.5e6, 6.47e6)

After a bit of tedious work the inset axes fall into place. This procedure is very similar to plotting the continental US above. We simply create two slices of gdf restricted to Alaska and Hawaii respectively, then again call plot().

alaska_inset = zoomed_inset_axes(ax, zoom=0.21, bbox_to_anchor=(0.25, 0.29), bbox_transform=plt.gcf().transFigure)
alaska_inset.set_xlim(-20.0e6, -14.36e6)
alaska_inset.set_ylim(6.53e6, 11.75e6)
gdf_alaska = gdf[gdf["STATE2020"] == "Alaska"]
gdf_alaska.plot(ax=alaska_inset, color=gdf["COLOR"], categorical=True,
                edgecolor=border_color, linewidth=border_thickness)

hawaii_inset = zoomed_inset_axes(ax, zoom=1.05, bbox_to_anchor=(0.36, 0.22), bbox_transform=plt.gcf().transFigure)
hawaii_inset.set_xlim(-17.87e6, -17.20e6)
hawaii_inset.set_ylim(2.09e6, 2.57e6)
gdf_hawaii = gdf[gdf["STATE2020"] == "Hawaii"]
gdf_hawaii.plot(ax=hawaii_inset, color=gdf["COLOR"], categorical=True,
                edgecolor=border_color, linewidth=border_thickness)

Finally add a title and attribution text and turn off any axis labels that would clutter the map.

us_map.set_title("State Population Change  |  2010-2020", {"fontname": "Ubuntu Condensed", "fontsize": "26"})
us_map.annotate(text="Data:  US Census Bureau", xy=(-7.35e6, 2.54e6), size=12, family="Ubuntu Condensed",
                horizontalalignment="right", verticalalignment="bottom")

us_map.set_axis_off()
alaska_inset.set_axis_off()
hawaii_inset.set_axis_off()
plt.savefig("population_2010-2020.png", facecolor="#FEFEFE", dpi=300)

I usually suggest saving maps in vectorized SVG format rather than PNG or JPG whenever possible. This avoids blurriness when zooming in. If you do output a raster image consider increasing dpi from the default of 100

The output:


TODO:

  • Add a legend so the audience know what these colors mean!
  • Label smaller states with annotate() or arrow().
  • Specify a few extreme values to help communicate scale.

Download the data.

Full code:

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
from matplotlib import cm


def get_color(rate, max_rate):
    if rate >= 0:
        greens_cmap = cm.get_cmap("Greens", 100)
        return greens_cmap(rate / max_rate)
    else:
        reds_cmap = cm.get_cmap("Reds", 100)
        return reds_cmap(abs(rate) / max_rate)


df2010 = pd.read_csv("data/2010.csv")
df2020 = pd.read_csv("data/2020.csv")

df = pd.concat([df2010, df2020], axis=1)
df.set_index("STATE2010", inplace=True)

df.loc[:, "CHANGE"] = df["POPULATION2020"] - df["POPULATION2010"]
df.loc[:, "CHANGE_RATE"] = df["CHANGE"] / df["POPULATION2010"]
df.loc[:, "COLOR"] = df["CHANGE_RATE"].apply(get_color, max_rate=df["CHANGE_RATE"].max())

shapefile_path = "shapefile/cb_2018_us_state_20m.shp"
gdf = gpd.read_file(shapefile_path, epsg=4326)
gdf = gdf.to_crs(epsg=3857)
gdf = gdf[gdf["NAME"].isin(df["STATE2020"])]
gdf.rename(columns={"NAME": "STATE2020"}, inplace=True)
gdf = gdf.merge(df, on="STATE2020")

fig, ax = plt.subplots(figsize=(12, 7.5))
fig.subplots_adjust(left=0.0, right=1.0, bottom=0.0, top=0.925)

border_thickness = 0.4
border_color = "#0F0F0F"

us_map = gdf.plot(ax=ax, color=gdf["COLOR"], categorical=True,
                  edgecolor=border_color, linewidth=border_thickness)

us_map.set_xlim(-14.1e6, -7.24e6)
us_map.set_ylim(2.5e6, 6.47e6)

alaska_inset = zoomed_inset_axes(ax, zoom=0.21, bbox_to_anchor=(0.25, 0.29), bbox_transform=plt.gcf().transFigure)
alaska_inset.set_xlim(-20.0e6, -14.36e6)
alaska_inset.set_ylim(6.53e6, 11.75e6)
gdf_alaska = gdf[gdf["STATE2020"] == "Alaska"]
gdf_alaska.plot(ax=alaska_inset, color=gdf["COLOR"], categorical=True,
                edgecolor=border_color, linewidth=border_thickness)

hawaii_inset = zoomed_inset_axes(ax, zoom=1.05, bbox_to_anchor=(0.36, 0.22), bbox_transform=plt.gcf().transFigure)
hawaii_inset.set_xlim(-17.87e6, -17.20e6)
hawaii_inset.set_ylim(2.09e6, 2.57e6)
gdf_hawaii = gdf[gdf["STATE2020"] == "Hawaii"]
gdf_hawaii.plot(ax=hawaii_inset, color=gdf["COLOR"], categorical=True,
                edgecolor=border_color, linewidth=border_thickness)

us_map.set_title("State Population Change  |  2010-2020", {"fontname": "Ubuntu Condensed", "fontsize": "26"})
us_map.annotate(text="Data:  US Census Bureau", xy=(-7.35e6, 2.54e6), size=12, family="Ubuntu Condensed",
                horizontalalignment="right", verticalalignment="bottom")

us_map.set_axis_off()
alaska_inset.set_axis_off()
hawaii_inset.set_axis_off()
plt.savefig("population_2010-2020.png", facecolor="#FEFEFE", dpi=300)