
Violin plots and the NFL Combine

Following the NFL Draft I thought it would be fun to look at data from the Scouting Combine. For those unfamiliar, the NFL Combine is an annual event where football’s brightest prospects are invited to show off their talents in front of scouts. Players run a gauntlet of tests that includes a 40-yard dash, vertical jump, bench press, and more, along with interviews where team managers and players can get to know each other.

Unfortunately in the past few years many players have begun declining to participate in certain events when they believe it can only hurt their draft stock. And hey, they have my full support, but it makes it difficult to compare today’s Combine results with those from 10 or 20 years ago. Fortunately, at some point during the weekend, players’ height and weight are still measured. That seems like a good place to start.


1. The background.

Often when we see one-dimensional data (repeated measurements of a single variable) it's plotted as a histogram: the data is grouped into evenly sized bins, and we count how many items land in each bin. The result looks like a bar graph.

We could plot our Combine data that way:

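For reference, here's a rough sketch of what that histogram could look like in code. It assumes the df2010 dataframe that gets loaded in the next section, and the bin count of 20 is arbitrary:

import pandas as pd
import matplotlib.pyplot as plt

df2010 = pd.read_csv("data/2010.csv")  # loaded properly in the next section

# Group the weights into 20 evenly sized bins and count the players in each
plt.hist(df2010["weight"], bins=20, edgecolor="white")
plt.title("NFL Combine 2010 | Player Weight")
plt.xlabel("Weight (lbs.)")
plt.ylabel("Players")
plt.show()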

But often violin plots provide more insight into a variable’s distribution. Violin plots are like a combination of a boxplot and a kernel density estimate (KDE). They display descriptive statistics—median and quartiles—and also visualize the probability density function.
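
To see the KDE half of that combination on its own, Seaborn can draw it directly. A minimal sketch, again borrowing the df2010 dataframe from the next section:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df2010 = pd.read_csv("data/2010.csv")

# A smoothed density estimate of the 2010 weights -- essentially one half
# of a violin laid on its side
sns.kdeplot(df2010["weight"])
plt.xlabel("Weight (lbs.)")
plt.show()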

Of course in this case we have the entire population of data, i.e. measurements of every player who attended the Combine, so strictly speaking there's nothing to estimate. But the density curve lets us picture the expected size of players at an infinitely large NFL Combine. In other words, what body types are NFL teams most interested in signing? Have their preferences changed over the past decade along with coaching strategy?


2. Prepare the data.

Start by creating 2010 and 2020 dataframes and adding a year column to both. We'll combine the dataframes in a moment.

import pandas as pd

df2010 = pd.read_csv("data/2010.csv")
df2020 = pd.read_csv("data/2020.csv")

# Tag each frame with its Combine year before stacking them
df2010.loc[:, "year"] = 2010
df2020.loc[:, "year"] = 2020

Do a quick check for any NaNs that may have sneaked into the data. We're in luck: every row includes height and weight values.

weight_nans = df2010["weight"].isna().sum() + df2020["weight"].isna().sum()
height_nans = df2010["height"].isna().sum() + df2020["height"].isna().sum()

print(weight_nans, height_nans)  # both 0

Before combining the dataframes I want to print basic descriptive statistics with the describe method. It's always a good idea to get a feel for the data this way; it can help you avoid errors once you're swimming in it. What range do the variables span? Are there any obvious outliers that require special treatment?

print(df2010["weight"].describe())

The output for one dataframe is below. Note that the 50% row is the median, i.e. equivalent to df2010["weight"].median().

count    326.000000
mean     242.849693
std       43.826428
min      149.000000
25%      209.000000
50%      236.000000
75%      270.750000
max      354.000000
Name: weight, dtype: float64
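
As a quick check, that 50% row matches the median method directly:

print(df2010["weight"].median())  # 236.0, the same as the 50% row above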

With that out of the way it’s time to concatenate the dataframes. Do it along axis=0 so they’ll be combined vertically rather than side-by-side.

df = pd.concat([df2010, df2020], axis=0)
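
A quick way to confirm the rows stacked vertically rather than merging side by side:

# The combined frame should have one row per player from both years
print(len(df) == len(df2010) + len(df2020))  # True
print(df.shape)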

Height in the dataset is in feet-inches format (e.g. "6-2"), which pandas reads in as strings, so we need to convert the height column to inches.

def parse_height(raw):
    feet, inches = raw.split("-")
    return int(feet) * 12 + int(inches)

df.loc[:, "height_inches"] = df["height"].apply(parse_height)

3. Plot the data.

Now it's time to create violin plots for the height and weight variables. For this we'll use Seaborn, which is a fantastic wrapper around Matplotlib. Although you give up some fine-grained control, Seaborn offers several specialized plots (like violin plots) and default styles that are much more presentable than pure Matplotlib.

There are usually several ways to set style options, but I like to take care of as many as possible right away with sns.set.

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="darkgrid", palette="colorblind", font="Ubuntu Condensed", font_scale=1.1)

vplot = sns.violinplot(x="year", y="weight", data=df)

vplot.set_yticks(range(100, 450, 50))
vplot.set_ylim(90, 410)

plt.title("NFL Combine | Player Weight", size=15)
plt.ylabel("Weight (lbs.)", size=13, labelpad=10)
plt.xlabel("Year", size=13)

fig = plt.gcf()
fig.set_size_inches(8, 8)
fig.subplots_adjust(left=0.097, right=0.978, bottom=0.073, top=0.958)

plt.savefig("nfl_combine_weight.png", facecolor="#FEFEFE")

My approach for the height plot is essentially identical. It can be found in the full code at the bottom of this page.

In the plots below, the median is the white dot near the center. The thick bar spans the interquartile range (from the first to the third quartile), and the thin line extends beyond it like a boxplot's whiskers. The curved outline, which is essentially a probability density function rotated 90°, represents the likelihood of a measurement falling in a given weight range. In other words, the wider the curve, the more likely the corresponding weight is to appear in the data.
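
As an aside, if you'd rather see those quartiles drawn as lines inside the violin than as a miniature boxplot, Seaborn's inner parameter controls this. A variant of the call used above:

# Replaces the inner mini boxplot with lines at the quartiles
sns.violinplot(x="year", y="weight", data=df, inner="quartile")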

I think it goes without saying why these plots are named for violins.


It’s difficult to spot any difference in height from 2010 to 2020. It’s tempting to see the lower 2020 median weight and imagine some effect, like a shift from run-first to pass-first offenses leading to smaller players.

But a quick two-sample Z-test throws cold water on that idea.

from statsmodels.stats.weightstats import ztest

# H0: the 2010 and 2020 mean weights are equal
# H1: the 2010 mean weight is larger
z_stat, p_val = ztest(df2010["weight"], df2020["weight"],
                      value=0, alternative="larger", ddof=1)

print(f"z = {z_stat:.4f}")
print(f"p = {p_val:.4f}")

# z = 0.6256
# p = 0.2658
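
For the curious, the statistic is just the difference in sample means divided by its standard error. A rough manual version, assuming the pooled-variance formula that statsmodels uses by default:

import numpy as np
from scipy import stats

w10, w20 = df2010["weight"], df2020["weight"]
n1, n2 = len(w10), len(w20)

# Pooled variance across both samples, then the standard error of the
# difference in means
pooled_var = ((n1 - 1) * w10.var(ddof=1) + (n2 - 1) * w20.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

z = (w10.mean() - w20.mean()) / se
p = stats.norm.sf(z)  # one-sided: P(Z > z), matching alternative="larger"

print(f"z = {z:.4f}, p = {p:.4f}")  # should match the ztest output above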

So we've found no significant difference in the mean weight of players invited to the NFL Combine over the last 10 years, and the height distributions look just as similar. Nothing wrong with a negative result.


Source: www.pro-football-reference.com

Download the data.

Full code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


def parse_height(raw):
    feet, inches = raw.split("-")
    return int(feet) * 12 + int(inches)


df2010 = pd.read_csv("data/2010.csv")
df2020 = pd.read_csv("data/2020.csv")

weight_nans = df2010["weight"].isna().sum() + df2020["weight"].isna().sum()
height_nans = df2010["height"].isna().sum() + df2020["height"].isna().sum()

print(df2010["weight"].describe())
print(df2020["weight"].describe())

print(df2010["height"].describe())
print(df2020["height"].describe())

df2010.loc[:, "year"] = 2010
df2020.loc[:, "year"] = 2020

df = pd.concat([df2010, df2020], axis=0)

df.loc[:, "height_inches"] = df["height"].apply(parse_height)

sns.set(style="darkgrid", palette="colorblind", font="Ubuntu Condensed", font_scale=1.1)

vplot = sns.violinplot(x="year", y="weight", data=df)

vplot.set_yticks(range(100, 450, 50))
vplot.set_ylim(90, 410)

plt.title("NFL Combine  |  Player Weight", size=15)
plt.ylabel("Weight  (lbs.)", size=13, labelpad=10)
plt.xlabel("Year", size=13)

fig = plt.gcf()
fig.set_size_inches(8, 8)
fig.subplots_adjust(left=0.097, right=0.978, bottom=0.073, top=0.958)

plt.savefig("nfl_combine_weight.png", facecolor="#FEFEFE")

# Start a new figure so the height plot isn't drawn on top of the weight plot
plt.figure()

vplot2 = sns.violinplot(x="year", y="height_inches", data=df)

vplot2.set_yticks(range(62, 84, 2))
vplot2.set_ylim(61.5, 82.5)

plt.title("NFL Combine  |  Player Height", size=15)
plt.ylabel("Height  (inches)", size=13, labelpad=16)
plt.xlabel("Year", size=13)

fig = plt.gcf()
fig.set_size_inches(8, 8)

plt.savefig("nfl_combine_height.png", facecolor="#FEFEFE")