Who was in the room on July 4, 1776?
With Independence Day coming this weekend I thought I’d take a closer look at the men who signed the Declaration of Independence. How old were they? Where were they from? What did they do? What did their families look like? I found a great dataset at archives.gov to help answer those questions.
1. The signers’ ages.
Begin with the imports. As usual we’ll primarily use pandas to process the data. Seaborn and GeoPandas, both of which use Matplotlib on the backend, generate the plots. One module I haven’t previously used on the blog is GeoPy. It can interface with Google’s Maps API and convert city-state-country information into latitude-longitude pairs.
    import pandas as pd
    from geopy.geocoders import GoogleV3
    from numpy import timedelta64
    import matplotlib.pyplot as plt
    import seaborn as sns
    import geopandas as gpd
    from collections import Counter
Read the dataset into a dataframe and convert dates to Timestamp objects. A few of the dates are approximations so we can’t use pd.to_datetime on whole columns. We’ll write our own function and assume those dates fall in the middle of the year.
    def parse_circa_dates(text):
        # Dates marked "c." (circa) are approximate: assume July 1 of that year.
        if "c." in text:
            return pd.Timestamp(f'July 1 {text.strip("c.")}')
        else:
            return pd.Timestamp(text)

    df = pd.read_csv("declaration_signers.csv")
    # Plain column assignment lets the columns take on a datetime dtype.
    df["birth_date"] = df["birth_date"].apply(parse_circa_dates)
    df["death_date"] = df["death_date"].apply(parse_circa_dates)
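To see what the function does with each kind of input, here is a minimal sketch run on two made-up strings. The sample values, and the assumption that approximate dates look like "c. 1730", are mine for illustration; the real file may format them differently.

```python
import pandas as pd

def parse_circa_dates(text):
    # Same logic as the function above: "c." marks an approximate year.
    if "c." in text:
        return pd.Timestamp(f'July 1 {text.strip("c.")}')
    return pd.Timestamp(text)

# Made-up sample values, not rows from the real dataset.
samples = pd.Series(["January 17, 1706", "c. 1730"])
parsed = samples.apply(parse_circa_dates)
print(parsed)
```

The approximate entry should land on July 1, 1730, and the resulting Series should carry a datetime dtype rather than plain strings.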
Now that the birth and death columns are in datetime form, we can calculate everyone’s age at the time they signed the Declaration of Independence. I want the resulting histograms to have units of years on the x-axis, so each value is divided by a numpy timedelta64. We’ll use the official Old Glory colors according to the US State Department’s style guide.
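As a quick check of the unit conversion, here is a minimal sketch with a hypothetical birth date. It divides by a one-day timedelta64 and then by 365.2425 (the mean length of a Gregorian year), which is one reasonable way to get years. The birth date is made up.

```python
import pandas as pd
from numpy import timedelta64

signing_day = pd.Timestamp("July 4 1776")
birth = pd.Timestamp("July 4 1746")  # hypothetical birth date

# Timedelta divided by a one-day timedelta64 gives a plain float of days.
age_days = (signing_day - birth) / timedelta64(1, "D")

# Convert days to years using the mean Gregorian year length.
age_years = age_days / 365.2425
print(round(age_years, 2))
```

Thirty calendar years come out as very nearly 30.0, with a small difference because the span contains a specific number of leap days.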
    def plot_age_histogram(dframe, filename):
        sns.set(font="Ubuntu Condensed", font_scale=1.4)
        fig, ax = plt.subplots(1, 2, figsize=(15, 6))  # 1 row, 2 columns
        fig.subplots_adjust(left=0.04, right=0.986, bottom=0.105, top=0.944, wspace=0.12)

        # Left panel: birth years in 5-year bins
        year_hist = sns.histplot(x=dframe["birth_year"], bins=range(1710, 1755, 5),
                                 color="#b31942", alpha=1.0, ax=ax[0])
        x_ticks = range(1710, 1755, 5)
        year_hist.set_xticks(x_ticks)
        year_hist.set_xticklabels(x_ticks, size=13)
        y_ticks = range(0, 14, 2)
        year_hist.set_yticks(y_ticks)
        year_hist.set_yticklabels(y_ticks, size=13)
        year_hist.set_ylim(0, 12.5)
        year_hist.set_title("Birth Year", size=15)
        year_hist.set_xlabel("Year", size=14, labelpad=6)
        year_hist.set_ylabel("Count", size=14, labelpad=4)

        # Right panel: ages on July 4, 1776 in 5-year bins
        age_hist = sns.histplot(x=dframe["age_at_signing"], bins=range(25, 80, 5),
                                color="#0a3161", alpha=1.0, ax=ax[1])
        x_ticks = range(25, 80, 5)
        age_hist.set_xticks(x_ticks)
        age_hist.set_xticklabels(x_ticks, size=13)
        y_ticks = range(0, 14, 2)
        age_hist.set_yticks(y_ticks)
        age_hist.set_yticklabels(y_ticks, size=13)
        age_hist.set_ylim(0, 12.5)
        age_hist.set_title("Age on July 4, 1776", size=15)
        age_hist.set_xlabel("Age (Years)", size=14, labelpad=6)
        age_hist.set_ylabel("Count", size=14, labelpad=4)

        plt.savefig(filename)
        return

    df.loc[:, "birth_year"] = df["birth_date"].apply(lambda x: int(x.strftime("%Y")))
    # Newer pandas versions may reject the ambiguous "Y" unit, so divide by one
    # day and convert to years (365.2425 days in the mean Gregorian year).
    df.loc[:, "age_at_signing"] = df["birth_date"].apply(
        lambda x: (pd.Timestamp("July 4 1776") - x) / timedelta64(1, "D") / 365.2425)
    plot_age_histogram(df, "age_histogram_1x2.png")
The output: [figure: age_histogram_1x2.png, histograms of birth year and age on July 4, 1776]
2. The signers’ birth places.
Next I want to plot the signers’ birth places on a US map. To do that we’ll need to convert plain-English addresses into latitude-longitude data that GeoPandas can understand. This process is called “geocoding.” There are lots of APIs available to do this—some less expensive than others. Here I’ll interface with Google’s free-tier Maps API.
Note that if you want to recreate this plot you’ll have to register for the service and generate your own API key. Alternatively you can geocode using other free services like GeoNames or Nominatim. You can find examples on GeoPy’s GitHub.
The dataset contains separate city and country columns so I’ll concatenate those columns and place the resulting lookup_address into a new column. This is the information that will be sent to Google’s API. It’s a little tricky because some rows in the city column are empty, so first replace any NaN values with empty strings.
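To see why the fillna step matters, here is a small sketch on two hypothetical rows (the city and country values are made up, not taken from the dataset). Adding a string to NaN just produces NaN, so the address would be lost without it.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second one has no city on record.
places = pd.DataFrame({
    "birth_city": ["Boston", np.nan],
    "birth_country": ["USA", "Ireland"],
})

# Without fillna, NaN + " " + "Ireland" is itself NaN.
naive = places["birth_city"] + " " + places["birth_country"]

# Filling first keeps every row usable; strip() drops the leading space.
lookup = (places["birth_city"].fillna("") + " "
          + places["birth_country"].fillna("")).str.strip()
print(lookup.tolist())  # ['Boston USA', 'Ireland']
```

The extra str.strip() is my addition; the concatenation works without it, but it keeps stray spaces out of the lookup string.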
    def geocode(address):
        api_key = "XXXXXXXXXXXXXXXXX"
        geo = GoogleV3(api_key=api_key)
        loc = geo.geocode(address)
        lat, lon = loc.latitude, loc.longitude
        return (lat, lon)

    df.loc[:, "lookup_address"] = (df["birth_city"].fillna("").astype(str)
                                   + " " + df["birth_country"].fillna("").astype(str))
    df.loc[:, "lat_lon"] = df["lookup_address"].apply(geocode)
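One thing to keep in mind is that each call to a geocoding service costs time (and possibly quota), and a lookup can come back empty. A hedged sketch of one way to cope is to cache results per address and return None for misses. The lookup_service function and its data are stand-ins I made up so the example runs offline; in practice it would wrap the real geocoder call.

```python
from functools import lru_cache

# Stand-in for a real geocoding service (hypothetical data, runs offline).
FAKE_RESULTS = {"Boston USA": (42.36, -71.06)}
calls = []

def lookup_service(address):
    calls.append(address)  # record each simulated API request
    return FAKE_RESULTS.get(address)

@lru_cache(maxsize=None)
def cached_geocode(address):
    """Return (lat, lon) for an address, or None if nothing was found."""
    return lookup_service(address)

addresses = ["Boston USA", "Boston USA", "Nowhere"]
coords = [cached_geocode(a) for a in addresses]
print(coords)      # [(42.36, -71.06), (42.36, -71.06), None]
print(len(calls))  # 2
```

Repeated addresses hit the cache instead of the service, and the None results can be filtered out before plotting.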
With a lat_lon column in hand it’s time to plot these locations on a US map shapefile. There’s a lot going on here and it might help to read my previous post about GeoPandas. In short, a shapefile is a format for storing geographic data. We can plot a birthplace scatter plot on top of a US map shapefile and get a nice look at the data.
The main real-world caveat is that state (colony) boundaries looked somewhat different in 1776. Vermont didn’t yet exist independently, for example. We’ll also limit this plot to the eastern US, which omits the 8 signers (out of 56) who were born in Europe.
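The plotting function below relies on two small ideas: zip(*pairs) to split (lat, lon) tuples into separate sequences, and axis limits that hide anything outside the eastern US. Here is a minimal sketch with made-up coordinates. The limits are the ones used in the plot.

```python
# Made-up (lat, lon) pairs: two in the colonies, one in Europe.
pairs = [(42.36, -71.06), (39.95, -75.17), (51.51, -0.13)]

# zip(*pairs) transposes the list of tuples into one tuple per coordinate.
lat, lon = zip(*pairs)

# Axis limits used for the map (longitude and latitude ranges).
x_min, x_max = -83.8, -66.9
y_min, y_max = 31.5, 48

# Points outside the limits are still plotted, just not visible.
off_map = [(la, lo) for la, lo in pairs
           if not (x_min <= lo <= x_max and y_min <= la <= y_max)]
print(lat)      # (42.36, 39.95, 51.51)
print(lon)      # (-71.06, -75.17, -0.13)
print(off_map)  # [(51.51, -0.13)]
```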
    def plot_birth_places(dframe, filename):
        # Reproject to WGS84 (EPSG:4326) so the map matches the geocoded lat/lon
        gdf = gpd.read_file("shapefile/cb_2018_us_state_20m.shp").to_crs(epsg=4326)

        fig, ax = plt.subplots(figsize=(5.25, 7.5))
        fig.subplots_adjust(left=0.0, right=1.0, bottom=0.015, top=0.954)
        us_map = gdf.plot(ax=ax, color="#fdf2d9", edgecolor="black", linewidth=0.7)

        # Split the (lat, lon) tuples into separate sequences for plotting
        lat, lon = zip(*dframe["lat_lon"].tolist())
        us_map.scatter(lon, lat, color="#b31942", s=60, alpha=0.6)

        # Limit the view to the eastern US
        us_map.set_xlim(-83.8, -66.9)
        us_map.set_ylim(31.5, 48)
        us_map.set_title("D.O.I. Signers' Birth Places", size=16)
        us_map.annotate(text="Not Pictured:\nEngland (2)\nIreland (2)\nScotland (2)\nNorthern Ireland (1)\nWales (1)",
                        xy=(-73, 36), size=10)
        us_map.annotate(text="Source: https://www.archives.gov/founding-docs/signers-factsheet",
                        xy=(-67, 31.5), size=8, ha="right")
        us_map.set_axis_off()

        plt.savefig(filename, facecolor="#c2efff", dpi=300)
        return

    plot_birth_places(df, "signers_birth_places.png")
The output: [figure: signers_birth_places.png, map of the signers' birth places in the eastern US]
3. Discussion.
I want to shine some light on just a couple more corners of the data. I’ll resist the urge to create more histograms.
I wondered how many of the 56 signers didn’t live to see the end of the Revolutionary War in 1783. Remember there were 7 full years between the Declaration of Independence and England finally relenting.
    # September 3, 1783: the Treaty of Paris was signed
    died = df[df["death_date"] < pd.Timestamp("September 3, 1783")].shape[0]
    print(f"{died}/{df.shape[0]} signers died before the end of the Revolutionary War.\n")

    # 9/56 signers died before the end of the Revolutionary War.
The above code creates a separate dataframe by filtering out all the signers who were still alive, then checks how many rows remain. Another approach would be to create a new column of boolean values that indicate whether the signer died by the cutoff date and count how many True values exist.
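The boolean-column alternative might look something like this. It is a minimal sketch on a toy dataframe: the names and dates are made up, and died_before_end is a column name I chose for illustration.

```python
import pandas as pd

# Toy data with made-up names and death dates.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "death_date": pd.to_datetime(["1777-05-19", "1790-04-17", "1779-11-10"]),
})

cutoff = pd.Timestamp("September 3, 1783")

# True where the person died before the cutoff date.
toy["died_before_end"] = toy["death_date"] < cutoff

# True counts as 1 when summed, so the sum is the number of matches.
count = int(toy["died_before_end"].sum())
print(f"{count}/{len(toy)}")  # 2/3
```

The upside of keeping the column around is that it can be reused later for grouping or filtering without repeating the comparison.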
Next let’s take inventory of the signers’ occupations. Many of them had multiple occupations so what’s the best way to dump all those comma-separated strings into a flat list? Let’s change out of our official Pandas Ambassador™ uniforms for a moment.
    occupations = df["occupation"].str.split(",").tolist()
    all_occupations = [job.strip() for item in occupations for job in item]
After this list comprehension we have a regular Python list, not a pandas Series. Instead of using the value_counts Series method we can accomplish the same thing using collections.Counter from the standard library. Its most_common method sorts values in descending order.
    for item in Counter(all_occupations).most_common():
        print(f"{item[1]:>2} {item[0]}")
The output:
    25 Lawyer
    17 Merchant
    14 Plantation Owner
     4 Physician
     3 Scientist
     2 Land Speculator
     2 Minister
     2 Farmer
     1 Surveyer
     1 Printer
     1 Land owner
     1 Musician
     1 Military Officer
As you can see there was never a shortage of lawyers in politics.
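For comparison, the same tally can stay entirely inside pandas. This is only a sketch on made-up occupation strings, showing that explode plus value_counts should agree with the Counter approach.

```python
from collections import Counter

import pandas as pd

# Made-up comma-separated occupation strings.
jobs = pd.Series(["Lawyer, Merchant", "Lawyer", "Physician, Lawyer"])

# pandas route: split into lists, one row per job, trim spaces, count.
counts = jobs.str.split(",").explode().str.strip().value_counts()

# Counter route, as in the post: flatten the lists by hand.
flat = [job.strip() for item in jobs.str.split(",").tolist() for job in item]
tally = Counter(flat)

print(counts.to_dict() == dict(tally))  # True
```

Both routes give Lawyer 3, Merchant 1, Physician 1 here; which one reads better is mostly a matter of taste.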
Finally let’s check the number of kids, marriages, and the median signer lifespan. pandas makes it very easy to calculate these column-wise descriptive statistics.
    avg_kids = df["children"].mean()
    avg_marriages = df["marriages"].mean()
    # Newer pandas versions may reject the ambiguous "Y" unit, so divide by one
    # day and convert to years (365.2425 days in the mean Gregorian year).
    df.loc[:, "lifespan"] = (df["death_date"] - df["birth_date"]) / timedelta64(1, "D") / 365.2425
    median_lifetime = df["lifespan"].median()
    print(f"\nThe signers had an average of {avg_marriages:.2f} marriages and {avg_kids:.2f} kids.")
    print(f"\nThe median lifetime was {median_lifetime:.2f} years.")

    # The signers had an average of 1.27 marriages and 6.13 kids.
    # The median lifetime was 65.27 years.
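If you want several statistics at once, DataFrame.agg can compute them in one call. A minimal sketch on a toy dataframe with made-up numbers:

```python
import pandas as pd

# Made-up values, just to show the shape of the result.
toy = pd.DataFrame({"children": [2, 7, 0, 10], "marriages": [1, 2, 1, 1]})

# One row per statistic, one column per variable.
summary = toy.agg(["mean", "median"])
print(summary)
```

The result is itself a dataframe, so a single value can be pulled out with something like summary.loc["mean", "children"].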
I hope you learned something new about the signers. Happy 4th of July!
Source: www.archives.gov
Full code:
    import pandas as pd
    from geopy.geocoders import GoogleV3
    from numpy import timedelta64
    import matplotlib.pyplot as plt
    import seaborn as sns
    import geopandas as gpd
    from collections import Counter


    def parse_circa_dates(text):
        # Dates marked "c." (circa) are approximate: assume July 1 of that year.
        if "c." in text:
            return pd.Timestamp(f'July 1 {text.strip("c.")}')
        else:
            return pd.Timestamp(text)


    def plot_age_histogram(dframe, filename):
        sns.set(font="Ubuntu Condensed", font_scale=1.4)
        fig, ax = plt.subplots(1, 2, figsize=(15, 6))  # 1 row, 2 columns
        fig.subplots_adjust(left=0.053, right=0.978, bottom=0.13, top=0.935)

        # Left panel: birth years in 5-year bins
        year_hist = sns.histplot(x=dframe["birth_year"], bins=range(1710, 1755, 5),
                                 color="#b31942", alpha=1.0, ax=ax[0])
        x_ticks = range(1710, 1755, 5)
        year_hist.set_xticks(x_ticks)
        year_hist.set_xticklabels(x_ticks, size=13)
        y_ticks = range(0, 14, 2)
        year_hist.set_yticks(y_ticks)
        year_hist.set_yticklabels(y_ticks, size=13)
        year_hist.set_ylim(0, 12.5)
        year_hist.set_title("Birth Year", size=15)
        year_hist.set_xlabel("Year", size=14, labelpad=6)
        year_hist.set_ylabel("Count", size=14, labelpad=4)

        # Right panel: ages on July 4, 1776 in 5-year bins
        age_hist = sns.histplot(x=dframe["age_at_signing"], bins=range(25, 80, 5),
                                color="#0a3161", alpha=1.0, ax=ax[1])
        x_ticks = range(25, 80, 5)
        age_hist.set_xticks(x_ticks)
        age_hist.set_xticklabels(x_ticks, size=13)
        y_ticks = range(0, 14, 2)
        age_hist.set_yticks(y_ticks)
        age_hist.set_yticklabels(y_ticks, size=13)
        age_hist.set_ylim(0, 12.5)
        age_hist.set_title("Age on July 4, 1776", size=15)
        age_hist.set_xlabel("Age (Years)", size=14, labelpad=6)
        age_hist.set_ylabel("Count", size=14, labelpad=4)

        plt.savefig(filename)
        return


    def geocode(address):
        api_key = "XXXXXXXXXXXXXXXXXXXXXXXXXXXX"
        geo = GoogleV3(api_key=api_key)
        loc = geo.geocode(address)
        lat, lon = loc.latitude, loc.longitude
        return (lat, lon)


    def plot_birth_places(dframe, filename):
        # Reproject to WGS84 (EPSG:4326) so the map matches the geocoded lat/lon
        gdf = gpd.read_file("shapefile/cb_2018_us_state_20m.shp").to_crs(epsg=4326)

        fig, ax = plt.subplots(figsize=(5.25, 7.5))
        fig.subplots_adjust(left=0.0, right=1.0, bottom=0.015, top=0.954)
        us_map = gdf.plot(ax=ax, color="#fdf2d9", edgecolor="black", linewidth=0.7)

        # Split the (lat, lon) tuples into separate sequences for plotting
        lat, lon = zip(*dframe["lat_lon"].tolist())
        us_map.scatter(lon, lat, color="#b31942", s=60, alpha=0.6)

        # Limit the view to the eastern US
        us_map.set_xlim(-83.8, -66.9)
        us_map.set_ylim(31.5, 48)
        us_map.set_title("D.O.I. Signers' Birth Places", size=16)
        us_map.annotate(text="Not Pictured:\nEngland (2)\nIreland (2)\nScotland (2)\nNorthern Ireland (1)\nWales (1)",
                        xy=(-73, 36), size=10)
        us_map.annotate(text="Source: https://www.archives.gov/founding-docs/signers-factsheet",
                        xy=(-67, 31.5), size=8, ha="right")
        us_map.set_axis_off()

        plt.savefig(filename, facecolor="#c2efff", dpi=300)
        return


    df = pd.read_csv("declaration_signers.csv")
    # Plain column assignment lets the columns take on a datetime dtype.
    df["birth_date"] = df["birth_date"].apply(parse_circa_dates)
    df["death_date"] = df["death_date"].apply(parse_circa_dates)

    df.loc[:, "birth_year"] = df["birth_date"].apply(lambda x: int(x.strftime("%Y")))
    # Newer pandas versions may reject the ambiguous "Y" unit, so divide by one
    # day and convert to years (365.2425 days in the mean Gregorian year).
    df.loc[:, "age_at_signing"] = df["birth_date"].apply(
        lambda x: (pd.Timestamp("July 4 1776") - x) / timedelta64(1, "D") / 365.2425)
    plot_age_histogram(df, "age_histogram_1x2.png")

    df.loc[:, "lookup_address"] = (df["birth_city"].fillna("").astype(str)
                                   + " " + df["birth_country"].fillna("").astype(str))
    df.loc[:, "lat_lon"] = df["lookup_address"].apply(geocode)
    plot_birth_places(df, "signers_birth_places.png")

    died = df[df["death_date"] < pd.Timestamp("September 3, 1783")].shape[0]
    print(f"{died}/{df.shape[0]} signers died before the end of the Revolutionary War.\n")

    occupations = df["occupation"].str.split(",").tolist()
    all_occupations = [job.strip() for item in occupations for job in item]
    for item in Counter(all_occupations).most_common():
        print(f"{item[1]:>2} {item[0]}")

    avg_kids = df["children"].mean()
    avg_marriages = df["marriages"].mean()
    df.loc[:, "lifespan"] = (df["death_date"] - df["birth_date"]) / timedelta64(1, "D") / 365.2425
    median_lifetime = df["lifespan"].median()
    print(f"\nThe signers had an average of {avg_marriages:.2f} marriages and {avg_kids:.2f} kids.")
    print(f"\nThe median lifetime was {median_lifetime:.2f} years.")