Ryan Crouser is the best shot putter ever

As the curtain closes on the 2020 Tokyo Olympics, I thought I’d take a look at one of my favorite athletes: Ryan Crouser.

After breaking the 31-year-old world record earlier this year, Crouser traveled to Tokyo and easily won Olympic Gold. In fact, his first attempt would have been more than enough to beat the field. Few people, at this point, would disagree that he has separated himself from other great shot putters in history. Let’s plot the data and put Crouser’s career in historical context.


We’ll visualize Crouser’s career in two ways:

  1. Create a stacked bar chart that displays Crouser’s share of 22.00+ meter performances.
  2. Using histograms, compare Crouser to Randy Barnes, the previous world record holder.

1. Ryan Crouser’s share of 22.00+ meter throws

Fortunately the folks at Alltime Athletics have been tracking the event for many years. They provide a convenient list of all performances exceeding 21.00 meters. I’ll link the dataset at the end of this post.

First let’s read the CSV file with pandas and parse its date column. (The imports for every snippet are listed with the full code at the end of the post.)

df = pd.read_csv("shotput_alltime_8-5-2021.csv", parse_dates=["date"])

Since the x-axis will be divided into yearly units, we can simplify the work by treating dates as integers. Create a year column:

df.loc[:, "year"] = df["date"].dt.strftime("%Y").astype(int)
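
As a cross-check, the same integers come out of the `.dt.year` accessor, which skips the round trip through strings. Here is a minimal sketch on three rows copied from the tables later in this post. The in-memory CSV and the `toy` name are stand-ins for the real file, and the column layout is assumed to match it:

```python
import io
import pandas as pd

# Three rows in the mark/name/date layout (assumed to match the real CSV).
csv_text = """mark,name,date
23.37,Ryan Crouser,2021-06-18
23.12,Randy Barnes,1990-05-20
22.66,Randy Barnes,1989-01-20
"""
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# The approach used in this post: format the date as a string, then cast to int.
via_strftime = toy["date"].dt.strftime("%Y").astype(int)

# Alternative: read the year straight off the datetime accessor.
via_accessor = toy["date"].dt.year

print(via_strftime.tolist())
print(via_accessor.tolist())
```

Either version gives a column that works as an integer x-axis.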

Before narrowing the dataframe to Crouser’s performances we should look at all athletes together. Create a view restricted to 22.00+ performances and use the value_counts method to find the number of throws per year. Storing this information as a dictionary will make it easy to plot later.

df = df[df.mark >= 22.00]

yearly_all = df.year.value_counts().to_dict()

The yearly_all dictionary should look like this:

{1974: 1, 1975: 3, 1976: 2, 1978: 1, ...
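
One detail worth knowing: `value_counts` orders its result by count, largest first, so the raw dictionary will not necessarily run in year order. That makes no difference to the bar chart, but a `sort_index` call gives a chronological dictionary that is easier to inspect. A small sketch with made-up years, chosen to reproduce the first three entries shown above:

```python
import pandas as pd

# Made-up years, one entry per 22.00+ throw (illustration only).
years = pd.Series([1975, 1976, 1975, 1974, 1975, 1976], name="year")

by_count = years.value_counts().to_dict()               # most frequent year first
by_year = years.value_counts().sort_index().to_dict()   # chronological order

print(by_count)
print(by_year)
```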

Now simply restrict the view to rows that include “Ryan Crouser” and repeat the process. Create a second dictionary called yearly_crouser.

df = df[df.name == "Ryan Crouser"]

yearly_crouser = df.year.value_counts().to_dict()

With these two dictionaries we can easily plot a stacked bar chart and visualize Crouser’s contribution of 22.00+ meter throws. On the x-axis will be dictionary keys—years—and the y-axis will display dictionary values—number of throws per year.

We don’t need to worry about manually stacking the bars because Crouser’s throws are a subset of overall throws. Both bars can start at zero. I’ll use two of the five Olympic ring colors to stay on theme.
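
To see why the overlay works, note that the visible part of each all-athlete bar is simply all throws minus Crouser's. The sketch below uses made-up counts, draws an explicitly stacked version with `bottom=`, and confirms that each stack tops out at the all-athlete total. The `others` list and the Agg backend are choices made for this example and are not part of the original script:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up counts for illustration; Crouser's throws are a subset of all throws.
yearly_all = {2019: 30, 2020: 20, 2021: 34}
yearly_crouser = {2019: 8, 2020: 12, 2021: 15}

years = sorted(yearly_all)
crouser = [yearly_crouser.get(y, 0) for y in years]
others = [yearly_all[y] - c for y, c in zip(years, crouser)]

fig, ax = plt.subplots()
# Explicit stack: Crouser at the bottom, everyone else on top of him.
bottom_bars = ax.bar(years, crouser, color="#FBB22E", label="Crouser")
top_bars = ax.bar(years, others, bottom=crouser, color="#0286C3", label="Other")

# The top edge of each stack equals the all-athlete total for that year.
stack_tops = [bar.get_y() + bar.get_height() for bar in top_bars]
print(stack_tops)
```

Because the totals match, drawing the full bar first and Crouser's bar over it produces the same picture with less bookkeeping.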

A couple quick notes about the plot:

  • I use the built-in ggplot style but change its font using rcParams.
  • tick_params is a good way to adjust both axes’ tick labels with a single line.

plt.style.use("ggplot")
plt.rcParams.update({"font.family": "Ubuntu Condensed"})

fig, ax = plt.subplots(figsize=(12, 6.5))
fig.subplots_adjust(left=0.046, right=0.985, top=0.952, bottom=0.085)

ax.bar(yearly_all.keys(), yearly_all.values(), color="#0286C3", label="Other")
ax.bar(yearly_crouser.keys(), yearly_crouser.values(), color="#FBB22E", label="Crouser")

ax.set_xticks(range(1972, 2028, 4))
ax.set_xlim(1971, 2025)
ax.set_xlabel("Season (Indoor & Outdoor Combined)", size=12, labelpad=6)

ax.set_ylim(0, 36)
ax.set_ylabel("Count", size=12, labelpad=6)

ax.tick_params(axis="both", labelsize=12)

ax.set_title("IAAF Men's Shot Put  |  22.00m+", size=14)

plt.legend(loc="upper center", fontsize=11, facecolor="#FFF")

plt.show()

The output:

For readers who aren’t die-hard shot put fanatics, which I would estimate is nearly everyone reading this post, the sport fell off a metaphorical cliff in the early 1990s with advancements in drug testing. Only in the past few years has it finally surpassed that era. Crouser and several other great athletes, Joe Kovacs and Tom Walsh especially, deserve credit for pushing the sport forward.


2. Ryan Crouser and Randy Barnes

Now let’s create histograms of performances by Crouser and the previous world record holder, Randy Barnes, who threw 23.12 meters in 1990.

Let’s start fresh and create a new dataframe from the CSV file.

df = pd.read_csv("shotput_alltime_8-5-2021.csv")

We’ll use pandas.cut to bin the data. This function groups all values within a given range. For example, with the 10-centimeter bin size specified below, all distances from 22.10 to 22.19 will be grouped together and labeled 22.15. We can then count how many values are in each bin.

Put these labels, e.g. 22.15, into a new column called bin.

bins = arange(20.995, 23.495, 0.1)
labels = arange(21.05, 23.45, 0.1)
df.loc[:, "bin"] = pd.cut(df.mark, bins=bins, labels=labels)
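
`arange` with a fractional step can occasionally gain or lose an element to floating-point rounding, and `pd.cut` needs exactly one fewer label than bin edges. One way to sidestep that, offered here as an alternative sketch and not a correction, is to build both arrays from an integer bin count. The names `n_bins`, `edges`, and `centers` are made up for this example, and the five marks are picked to sit near bin boundaries:

```python
import numpy as np
import pandas as pd

# Build edges and labels from an integer count so their lengths are guaranteed.
n_bins = 24
edges = 20.995 + 0.1 * np.arange(n_bins + 1)               # 25 edges: 20.995 ... 23.395
centers = np.round(21.05 + 0.1 * np.arange(n_bins), 2)     # 24 labels: 21.05 ... 23.35

# Illustrative marks near bin boundaries.
marks = pd.Series([21.00, 21.09, 21.10, 22.15, 23.37])
binned = pd.cut(marks, bins=edges, labels=centers)

print(binned.astype(float).tolist())
```

Offsetting the edges by half a centimeter (20.995 and so on) keeps every two-decimal mark safely inside a bin, away from an edge.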

Before counting the bins we need to separate Crouser and Barnes from the rest of the athletes. Create two separate dataframe views.

df_crouser = df[df.name == "Ryan Crouser"]
df_barnes = df[df.name == "Randy Barnes"]

At this point df_crouser.head() and df_barnes.head() should look like this:

    mark          name        date    bin
0  23.37  Ryan Crouser  2021-06-18  23.35
1  23.30  Ryan Crouser  2021-05-08  23.35
5  23.01  Ryan Crouser  2021-05-22  23.05
6  22.92  Ryan Crouser  2021-06-18  22.95
9  22.91  Ryan Crouser  2020-07-18  22.95

     mark          name        date    bin
2   23.12  Randy Barnes  1990-05-20  23.15
3   23.10  Randy Barnes  1990-05-26  23.15
25  22.66  Randy Barnes  1989-01-20  22.65
66  22.42  Randy Barnes  1988-08-17  22.45
69  22.40  Randy Barnes  1996-07-13  22.45

You can see how each mark is designated a bin.

Now we’ll use DataFrame.groupby to collect all rows according to their bin labels. Set as_index=False; without it, bin would become the new dataframe’s index instead of a regular column.

df_crouser_hist = df_crouser.groupby("bin", as_index=False)["mark"].count()
df_barnes_hist = df_barnes.groupby("bin", as_index=False)["mark"].count()

Note that the count aggregate function could also be written this way:

df_crouser_hist = df_crouser.groupby("bin", as_index=False)["mark"].agg("count")

These are equivalent lines of code.
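
The equivalence, and the effect of `as_index`, can be checked on a toy frame. This is only a sketch: the four marks are made up, and the bin labels are attached as plain floats instead of coming from `pd.cut`:

```python
import pandas as pd

# Four made-up marks with their bin labels already attached as plain floats.
toy = pd.DataFrame({
    "mark": [21.03, 21.12, 21.15, 21.18],
    "bin": [21.05, 21.15, 21.15, 21.15],
})

with_count = toy.groupby("bin", as_index=False)["mark"].count()
with_agg = toy.groupby("bin", as_index=False)["mark"].agg("count")
indexed = toy.groupby("bin")["mark"].count()  # as_index left at its default

print(with_count)  # bin is an ordinary column
print(indexed)     # bin has become the index
```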

df_crouser_hist.head() is shown below. It means Crouser has had:

  • 1 throw between 21.00 and 21.09.
  • 4 throws between 21.10 and 21.19.
  • And so on.

     bin  mark
0  21.05     1
1  21.15     4
2  21.25     7
3  21.35     2
4  21.45     1

At this point we’re ready to plot the two histograms and compare.

A few notes about the code:

  • It uses the built-in seaborn style with a couple rcParam changes:
    • Font is changed to Ubuntu Condensed.
    • x-ticks are excluded by default in seaborn style, so they’re brought back.
  • The figure has a 2×1 subplot layout—2 rows, 1 column.
    • A vertical orientation makes it easy to compare the two distributions.
  • The two axes, ax[0] and ax[1], are renamed for readability.
  • matplotlib.ticker is used to format the x-axis tick labels to have 2 decimal places.
  • Vertical grid lines are thinned to a 0.1 linewidth so they all but disappear. I think this often looks cleaner on histograms.
  • It uses the athletes’ college colors—Texas and Texas A&M respectively.
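
The tick formatting from the notes above can be tried on its own before it is attached to an axis. A minimal sketch, assuming only that matplotlib is installed; the `labels` name is made up for this example:

```python
import matplotlib.ticker as ticker

# Same format string as the plotting code: two decimal places.
formatter = ticker.StrMethodFormatter("{x:.2f}")

# A formatter can be called directly with a tick value.
labels = [formatter(x) for x in (21.0, 21.5, 23.0)]
print(labels)
```
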
plt.style.use("seaborn-v0_8")  # this style was named "seaborn" before matplotlib 3.6
plt.rcParams.update({"font.family": "Ubuntu Condensed",
                     "xtick.major.size": 4.0})

fig, axs = plt.subplots(2, 1, figsize=(10, 10))
fig.subplots_adjust(hspace=0.18, left=0.047, right=0.983, bottom=0.075, top=0.968)

ax_crouser, ax_barnes = axs

ax_crouser.bar(df_crouser_hist.bin, df_crouser_hist.mark,
               width=0.075, color="#bf5700")
ax_crouser.set(yticks=range(8), ylim=(0, 7.2),
               xticks=arange(21.0, 24.0, 0.5), xlim=(20.9, 23.6))
ax_crouser.set_ylabel("Count", size=12, labelpad=6)
ax_crouser.set_title("Ryan Crouser  |  Distance", size=13)
ax_crouser.xaxis.set_major_formatter(ticker.StrMethodFormatter("{x:.2f}"))
ax_crouser.xaxis.grid(True, linewidth=0.1)  # positional flag; the b= keyword is gone in newer matplotlib
ax_crouser.tick_params(axis="both", labelsize=12)

ax_barnes.bar(df_barnes_hist.bin, df_barnes_hist.mark,
              width=0.075, color="#500000")
ax_barnes.set(yticks=range(8), ylim=(0, 7.2),
              xticks=arange(21.0, 24.0, 0.5), xlim=(20.9, 23.6))
ax_barnes.set_ylabel("Count", size=12, labelpad=6)
ax_barnes.set_xlabel("Distance (m)", size=12, labelpad=6)
ax_barnes.set_title("Randy Barnes  |  Distance", size=13)
ax_barnes.xaxis.set_major_formatter(ticker.StrMethodFormatter("{x:.2f}"))
ax_barnes.xaxis.grid(True, linewidth=0.1)  # positional flag; the b= keyword is gone in newer matplotlib
ax_barnes.tick_params(axis="both", labelsize=12)

source_message = "Through August 5, 2021. Outdoor & Indoor combined.\nData:  http://www.alltime-athletics.com."
ax_barnes.text(23.7, -1.15, source_message, ha="right", size=8)

plt.show()

The output:

You can see the peak of Barnes’ distribution is somewhere in the mid-21-meter range, while Crouser’s distribution is centered well above 22 meters.
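
A median per athlete is one way to put a number on "centered". The sketch below runs on just the ten rows printed earlier (each man's five best marks), so its output sits far above the true centers of the histograms. Pointing the same `groupby` at the full dataframe would give the real figures. `top_marks` and `medians` are names invented for this example:

```python
import pandas as pd

# The ten rows shown in the head() tables earlier: each man's five best marks.
top_marks = pd.DataFrame({
    "name": ["Ryan Crouser"] * 5 + ["Randy Barnes"] * 5,
    "mark": [23.37, 23.30, 23.01, 22.92, 22.91,
             23.12, 23.10, 22.66, 22.42, 22.40],
})

# Median mark per athlete for whatever rows are supplied.
medians = top_marks.groupby("name")["mark"].median()
print(medians)
```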

I don’t mean to take anything away from Randy Barnes. The man had an incredible career, including Olympic Gold in Atlanta, and he held the world record for over 30 years. I only want to show that Ryan Crouser stands head and shoulders above his peers—including the former world record holder.


Download the data.

Full code:

import pandas as pd
from numpy import arange
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker


df = pd.read_csv("shotput_alltime_8-5-2021.csv", parse_dates=["date"])

df.loc[:, "year"] = df["date"].dt.strftime("%Y").astype(int)

df = df[df.mark >= 22.00]

yearly_all = df.year.value_counts().to_dict()

df = df[df.name == "Ryan Crouser"]

yearly_crouser = df.year.value_counts().to_dict()

plt.style.use("ggplot")
plt.rcParams.update({"font.family": "Ubuntu Condensed"})

fig, ax = plt.subplots(figsize=(12, 6.5))
fig.subplots_adjust(left=0.046, right=0.985, top=0.952, bottom=0.085)

ax.bar(yearly_all.keys(), yearly_all.values(), color="#0286C3", label="Other")
ax.bar(yearly_crouser.keys(), yearly_crouser.values(), color="#FBB22E", label="Crouser")

ax.set_xticks(range(1972, 2028, 4))
ax.set_xlim(1971, 2025)
ax.set_xlabel("Season (Indoor & Outdoor Combined)", size=12, labelpad=6)

ax.set_ylim(0, 36)
ax.set_ylabel("Count", size=12, labelpad=6)

ax.tick_params(axis="both", labelsize=12)

ax.set_title("IAAF Men's Shot Put  |  22.00m+", size=14)

plt.legend(loc="upper center", fontsize=11, facecolor="#FFF")

plt.savefig("shotput_22m.png")

# ############# Histogram plots #############

df = pd.read_csv("shotput_alltime_8-5-2021.csv")

bins = arange(20.995, 23.495, 0.1)
labels = arange(21.05, 23.45, 0.1)
df.loc[:, "bin"] = pd.cut(df.mark, bins=bins, labels=labels)

df_crouser = df[df.name == "Ryan Crouser"]
df_barnes = df[df.name == "Randy Barnes"]

df_crouser_hist = df_crouser.groupby("bin", as_index=False)["mark"].count()
df_barnes_hist = df_barnes.groupby("bin", as_index=False)["mark"].count()

plt.style.use("seaborn-v0_8")  # this style was named "seaborn" before matplotlib 3.6
plt.rcParams.update({"font.family": "Ubuntu Condensed", "xtick.major.size": 4.0})

fig, axs = plt.subplots(2, 1, figsize=(10, 10))
fig.subplots_adjust(hspace=0.18, left=0.047, right=0.983, bottom=0.075, top=0.968)

ax_crouser, ax_barnes = axs

ax_crouser.bar(df_crouser_hist.bin, df_crouser_hist.mark, width=0.075, color="#bf5700")
ax_crouser.set(yticks=range(8), ylim=(0, 7.2), xticks=arange(21.0, 24.0, 0.5), xlim=(20.9, 23.6))
ax_crouser.set_ylabel("Count", size=12, labelpad=6)
ax_crouser.set_title("Ryan Crouser  |  Distance", size=13)
ax_crouser.xaxis.set_major_formatter(ticker.StrMethodFormatter("{x:.2f}"))
ax_crouser.xaxis.grid(True, linewidth=0.1)  # positional flag; the b= keyword is gone in newer matplotlib
ax_crouser.tick_params(axis="both", labelsize=12)

ax_barnes.bar(df_barnes_hist.bin, df_barnes_hist.mark, width=0.075, color="#500000")
ax_barnes.set(yticks=range(8), ylim=(0, 7.2), xticks=arange(21.0, 24.0, 0.5), xlim=(20.9, 23.6))
ax_barnes.set_ylabel("Count", size=12, labelpad=6)
ax_barnes.set_xlabel("Distance (m)", size=12, labelpad=6)
ax_barnes.set_title("Randy Barnes  |  Distance", size=13)
ax_barnes.xaxis.set_major_formatter(ticker.StrMethodFormatter("{x:.2f}"))
ax_barnes.xaxis.grid(True, linewidth=0.1)  # positional flag; the b= keyword is gone in newer matplotlib
ax_barnes.tick_params(axis="both", labelsize=12)

source_message = "Through August 5, 2021. Outdoor & Indoor combined.\nData:  http://www.alltime-athletics.com."
ax_barnes.text(23.7, -1.15, source_message, ha="right", size=8)

plt.savefig("crouser_barnes_hist.png")

Title image credit: Simpli Faster.