How important are turnovers in the NFL?
The 2021-22 NFL season kicked off last night and I’m in a football mood. Let’s continue the streak of sports posts!
Football is often characterized as a game of inches. I’ve always taken that to mean that many seemingly small edges are actually crucial to the final score. And few advantages pack a punch like turnovers, which occur when one team takes the ball away from the other.
Exactly how much do turnovers matter? We can find out by implementing a simple linear regression. We’ll calculate:
- How many points each turnover is worth, on average.
- How often teams win when they earn a turnover advantage.
And of course we’ll plot the regression line.
1. Prepare the data.
Begin with the imports. We’ve worked with pandas and Matplotlib many times on the blog. Making a new appearance is SciPy, which we’ll use for the regression.
import pandas as pd
from scipy.stats import linregress
import matplotlib.pyplot as plt
The dataset can be downloaded from Kaggle. It contains team stats from every NFL game going back to 2002.
Get started by reading the dataset and converting its date column to datetime format.
df = pd.read_csv("nfl_team_stats_2002-2020.csv", parse_dates=["date"])
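If you want to see what parse_dates does before downloading anything, here is a minimal sketch on a made-up two-row CSV. The rows and the sample name are assumptions for illustration only; the real file has many more columns.

```python
import io

import pandas as pd

# Made-up stand-in for the Kaggle file; only the "date" column matters here.
csv_text = "date,score_home,score_away\n2019-09-05,10,3\n2019-09-08,27,30\n"

# parse_dates converts the listed columns to datetime while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(sample.dtypes["date"])
print(sample["date"].dt.year.tolist())
```

Having a real datetime column is what makes the date comparison later in this post possible.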
Since we’re interested in calculating how turnovers affect the final score, we’ll set up the regression like this:
- Independent variable: turnover margin
  - Away minus home.
  - Positive is good for the home team.
- Dependent variable: score margin
  - Home minus away.
  - Positive is good for the home team.
Notice we’ve arbitrarily decided to work from the perspective of the home team. This helps us avoid double-counting games!
Create two new columns to contain said variables.
df.loc[:, "turnover_margin"] = df["turnovers_away"] - df["turnovers_home"]
df.loc[:, "score_margin"] = df["score_home"] - df["score_away"]
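To make the sign conventions concrete, here is a small sketch on three made-up games. The numbers are invented; the column names match the ones used above.

```python
import pandas as pd

# Three hypothetical games, not taken from the real dataset
games = pd.DataFrame({
    "turnovers_home": [1, 3, 0],
    "turnovers_away": [2, 1, 0],
    "score_home": [27, 10, 20],
    "score_away": [20, 24, 17],
})

# Away minus home: positive means the home team committed fewer turnovers
games["turnover_margin"] = games["turnovers_away"] - games["turnovers_home"]
# Home minus away: positive means the home team won
games["score_margin"] = games["score_home"] - games["score_away"]

print(games[["turnover_margin", "score_margin"]])
```

In the first game the home team is +1 in turnovers and wins by 7. In the second it is -2 and loses by 14. Both margins point the same way, which is exactly the pattern the regression will measure.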
After that, restrict the dataframe to the most recent 10 years of games. Football fans know how much the sport has changed throughout its history. Limiting analysis to the most recent decade, while not perfect, should better represent the modern game while still providing plenty of data points.
df = df[df["date"] > pd.Timestamp("July 1 2011")]
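The same kind of date filter can be tried on a toy frame. This is only a sketch with made-up dates; sample, cutoff and recent are names I picked for the example.

```python
import pandas as pd

# Made-up game dates: one before the cutoff, two after
sample = pd.DataFrame({
    "date": pd.to_datetime(["2010-09-12", "2011-09-08", "2020-12-27"]),
})

cutoff = pd.Timestamp("2011-07-01")  # same moment as pd.Timestamp("July 1 2011")

# Boolean indexing keeps only rows whose date falls after the cutoff
recent = sample[sample["date"] > cutoff]

print(recent["date"].dt.year.tolist())
```

A July 1 cutoff sits in the offseason, so no season gets split in half.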
2. Linear regression.
Now it’s time to perform a simple linear regression. Doing regressions by hand is extremely tedious, but with Python it’s as easy as passing two iterables into scipy.stats.linregress(). This function returns five values but we’re interested in three:
slope
- The slope (rise over run) of the regression line.
- Coefficient β1 in the equation below.
intercept
- The y-intercept of the regression line.
- Constant β0 in the equation below.
r_value
- This is r, the correlation coefficient.
- We’ll use it to calculate R², the share of the variation in score margin that the regression line accounts for. The closer R² is to 1, the more tightly the data points fit the line.
The regression equation is y = β0 + β1·x, where x is turnover margin and y is score margin.
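Before running it on the NFL data, here is a minimal sketch of linregress on five made-up points. The x and y values are invented for illustration. The manual R² calculation is only there to show that squaring r matches the usual "one minus residual variation over total variation" definition for a simple regression.

```python
import numpy as np
from scipy.stats import linregress

# Five made-up (x, y) pairs that roughly follow a straight line
x = np.array([-2, -1, 0, 1, 2])
y = np.array([-8, -1, 2, 5, 12])

result = linregress(x, y)

# R² from the correlation coefficient
r_squared = result.rvalue ** 2

# R² from its definition: 1 - (residual sum of squares / total sum of squares)
predicted = result.slope * x + result.intercept
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared_manual = 1 - ss_res / ss_tot

print(f"slope={result.slope:.2f} intercept={result.intercept:.2f}")
print(f"R² from r: {r_squared:.4f}, R² by hand: {r_squared_manual:.4f}")
```

The result can be read by attribute, as in this sketch, or unpacked into five names, as in the rest of this post.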
Pass the appropriate dataframe columns into scipy.stats.linregress():
slope, intercept, r_value, p_value, std_err = linregress(df["turnover_margin"], df["score_margin"])
If you print the slope and intercept values, you’ll find:
slope = 4.53
intercept = 1.96
This means each turnover is worth about 4.5 points to the final score margin.
Be careful not to read this as a causal relationship. Turnovers almost certainly do change the final score, but claiming causation would overstate what a simple regression can show. Strictly speaking, we’re only observing a correlation between the two variables.
Also notice that intercept is clearly non-zero. That’s because we’re analyzing the data from the perspective of the home team. In the NFL, home field advantage is worth a couple points. In a game of two equally matched teams with no turnover advantage, you’d expect the home team to be favored by approximately 2 points.
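Plugging the fitted values into the regression equation gives an expected score margin for any turnover margin. Here is a short sketch. The function name expected_home_margin is my own for this example, and the constants are the rounded values printed above.

```python
# Rounded regression results from this post
SLOPE = 4.53
INTERCEPT = 1.96


def expected_home_margin(turnover_margin, slope=SLOPE, intercept=INTERCEPT):
    """Expected home-minus-away score margin for a given turnover margin."""
    return slope * turnover_margin + intercept


for margin in (-2, 0, 2):
    print(f"{margin:+d} turnovers: {expected_home_margin(margin):+.2f} points")
```

For example, a home team that finishes +2 in turnovers would be expected to win by about 4.53 × 2 + 1.96 = 11.02 points. That is an average across many games, not a prediction for any single one.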
3. Plot the data.
Next we’ll plot all data points along with the regression line. Start by creating a pair of lists to hold the regression line data. A straight line only really needs two ordered pairs. We can use min() and max() to easily match a line to the scatter plot.
x_regression = [min(df["turnover_margin"]), max(df["turnover_margin"])]
y_regression = [n * slope + intercept for n in x_regression]
The Matplotlib code is fairly straightforward. We create a single Axes and draw both the scatter plot and the regression line on it.
I like to use a square window for regressions: figsize=(8, 8). And whenever possible, locate (0, 0) at the center of the window. In other words, since the x-axis extends to +7, it should extend to -7 as well.
A few more notes about the figure:
- The regression equation is displayed by a legend.
- R² is placed manually as text with a bbox to aid visibility. It’s calculated by squaring the correlation coefficient.
- You can pass pandas dataframe columns directly into Matplotlib axes. There’s no need to convert data types.
- I like to add alpha (transparency) to scatter plots when there are many overlapping points.
- I use my custom wollen_dark style. It will be linked at the bottom of this post.
plt.style.use("wollen_dark.mplstyle")

fig, ax = plt.subplots(figsize=(8, 8))

# Every game as a semi-transparent point so overlaps remain visible
ax.scatter(df["turnover_margin"], df["score_margin"], color="#FFF", s=30, alpha=0.5)

# Regression line, labeled with its equation for the legend
ax.plot(x_regression, y_regression, color="#D50A0A", linewidth=3.0,
        label=f"y={slope:.2f}*x{intercept:+.2f}")

# Symmetric limits keep (0, 0) at the center of the window
ax.set(xticks=range(-7, 8), xlim=(-7.25, 7.25), xlabel="Turnover Margin",
       yticks=range(-60, 70, 10), ylim=(-62, 62), ylabel="Score Margin")

ax.set_title("NFL Turnovers & Final Score | 2011–2020")

# R² in a rounded box so it stays readable over the points
ax.text(4.5, -36, f"R² = {r_value**2:.4f}",
        {"fontname": "Ubuntu Condensed", "fontsize": 13, "color": "#000"},
        bbox={"boxstyle": "round", "facecolor": "#FFF", "linewidth": 0.25, "alpha": 0.9, "pad": 0.25})

plt.legend(loc="upper left")

plt.show()
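The output is a scatter plot of score margin against turnover margin, with the regression line drawn through it.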
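The figure code depends on the wollen_dark style file and the Ubuntu Condensed font. If you don’t have either, here is a stripped-down sketch that uses Matplotlib’s default style on made-up data. The random numbers, the Agg backend and the output filename are all assumptions for this example. It also shows one way to keep (0, 0) centered: derive the axis limits from the largest absolute value.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import linregress

# Made-up data loosely shaped like the real thing
rng = np.random.default_rng(0)
x = rng.integers(-4, 5, size=200)              # fake turnover margins from -4 to +4
y = 4.5 * x + 2 + rng.normal(0, 12, size=200)  # fake score margins with noise

fit = linregress(x, y)

# Symmetric limits so the origin sits in the middle of the window
x_limit = int(np.abs(x).max())
y_limit = float(np.abs(y).max())

# Two points are enough to draw the regression line
x_line = [-x_limit, x_limit]
y_line = [fit.slope * n + fit.intercept for n in x_line]

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(x, y, s=30, alpha=0.5)
ax.plot(x_line, y_line, color="#D50A0A", linewidth=3.0,
        label=f"y={fit.slope:.2f}*x{fit.intercept:+.2f}")
ax.set(xlim=(-x_limit - 0.25, x_limit + 0.25), xlabel="Turnover Margin",
       ylim=(-y_limit - 2, y_limit + 2), ylabel="Score Margin")
ax.legend(loc="upper left")

fig.savefig("regression_sketch.png")  # hypothetical filename
```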
4. Analysis.
With the regression done, let’s check how often winning the turnover battle leads to victory.
First, filter out all games where turnover margin is zero, then check how often either of the following conditions is true:
- Positive turnover margin and positive score margin.
  - Indicating the home team won both the turnover battle and the game.
- Negative turnover margin and negative score margin.
  - Indicating the away team won both the turnover battle and the game.
Create filtered copies of the dataframe and check their size using the shape attribute. Dividing the combined size of the two matching subsets by the total number of remaining games gives the percentage.
# Drop games where neither team won the turnover battle
df = df[df["turnover_margin"] != 0]

# Home team won both the turnover battle and the game
df2 = df[(df["turnover_margin"] > 0) & (df["score_margin"] > 0)]
# Away team won both the turnover battle and the game
df3 = df[(df["turnover_margin"] < 0) & (df["score_margin"] < 0)]

result_follows_turnovers_percent = (df2.shape[0] + df3.shape[0]) / df.shape[0] * 100

print(f"Teams with a positive turnover margin win {result_follows_turnovers_percent:.1f}% of the time.")
Note: We filtered the dataframe with multiple conditions by using the & operator. Another option is to use DataFrame.query, which often results in more readable code. The df2 and df3 assignments above could have been written this way:
df2 = df.query("turnover_margin > 0 & score_margin > 0")
df3 = df.query("turnover_margin < 0 & score_margin < 0")
The output:
Teams with a positive turnover margin win 77.6% of the time.
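If you’re curious whether the two filtering styles from the note above really agree, here is a quick sketch on four made-up games. The toy frame is invented, and I’ve written the query with the keyword and, which DataFrame.query also accepts.

```python
import pandas as pd

# Four made-up games; only the first has both margins positive
toy = pd.DataFrame({
    "turnover_margin": [1, -2, 2, -1],
    "score_margin": [7, -14, -3, 10],
})

# Boolean-mask style: each comparison needs its own parentheses
with_mask = toy[(toy["turnover_margin"] > 0) & (toy["score_margin"] > 0)]

# Query style: the whole condition is one string
with_query = toy.query("turnover_margin > 0 and score_margin > 0")

print(with_mask.equals(with_query))
```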
To answer the original question: turnovers are incredibly important in the NFL! In fact, if you adjust the above code to calculate how often a +1 turnover margin leads to victory, you’ll find it’s 66.5%. So being just one turnover ahead leads to roughly a 2-to-1 win rate. A +2 advantage pushes the rate well above 80%. There may be a team statistic that’s more strongly correlated with the final result, but I’m not aware of one.
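The exact adjustment isn’t shown in this post, so here is one possible sketch of it. The helper win_rate_at_margin and the toy numbers are my own assumptions. It treats a game as won by the team with the turnover edge when turnover margin and score margin share a sign, and it counts ties as non-wins, in line with the filtering used earlier.

```python
import pandas as pd


def win_rate_at_margin(games, margin):
    """Percent of games won by the team that finished exactly `margin` turnovers ahead."""
    subset = games[games["turnover_margin"].abs() == margin]
    # Same sign on both margins means the team with the turnover edge won
    edge_team_won = (subset["turnover_margin"] * subset["score_margin"]) > 0
    return edge_team_won.mean() * 100


# Made-up games for illustration only
toy = pd.DataFrame({
    "turnover_margin": [1, 1, -1, 2, -2, 0],
    "score_margin": [3, -7, -10, 14, 6, 3],
})

print(f"+1 margin: {win_rate_at_margin(toy, 1):.1f}%")
print(f"+2 margin: {win_rate_at_margin(toy, 2):.1f}%")
```

Run against the filtered NFL dataframe instead of the toy one, a function like this should let you check the +1 and +2 figures quoted above. If no games match the requested margin, it returns NaN.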
Enjoy the upcoming NFL season and make sure your team wins the turnover battle!
Full code:
import pandas as pd
from scipy.stats import linregress
import matplotlib.pyplot as plt


df = pd.read_csv("nfl_team_stats_2002-2020.csv", parse_dates=["date"])

# Both margins are positive when things go well for the home team
df.loc[:, "turnover_margin"] = df["turnovers_away"] - df["turnovers_home"]
df.loc[:, "score_margin"] = df["score_home"] - df["score_away"]

# Keep the most recent decade of games
df = df[df["date"] > pd.Timestamp("July 1 2011")]

slope, intercept, r_value, p_value, std_err = linregress(df["turnover_margin"], df["score_margin"])

# Two endpoints are enough to draw the regression line
x_regression = [min(df["turnover_margin"]), max(df["turnover_margin"])]
y_regression = [n * slope + intercept for n in x_regression]

plt.style.use("wollen_dark.mplstyle")

fig, ax = plt.subplots(figsize=(8, 8))

ax.scatter(df["turnover_margin"], df["score_margin"], color="#FFF", s=30, alpha=0.5)

ax.plot(x_regression, y_regression, color="#D50A0A", linewidth=3.0,
        label=f"y={slope:.2f}*x{intercept:+.2f}")

ax.set(xticks=range(-7, 8), xlim=(-7.25, 7.25),
       yticks=range(-60, 70, 10), ylim=(-62, 62),
       xlabel="Turnover Margin", ylabel="Score Margin")

ax.set_title("NFL Turnovers & Final Score | 2011–2020")

ax.text(4.5, -36, f"R² = {r_value**2:.4f}",
        {"fontname": "Ubuntu Condensed", "fontsize": 13, "color": "#000"},
        bbox={"boxstyle": "round", "facecolor": "#FFF", "linewidth": 0.25, "alpha": 0.9, "pad": 0.25})

plt.legend(loc="upper left")

plt.show()

# Drop games with an even turnover battle, then count games where
# the turnover winner also won on the scoreboard
df = df[df["turnover_margin"] != 0]

df2 = df[(df["turnover_margin"] > 0) & (df["score_margin"] > 0)]
df3 = df[(df["turnover_margin"] < 0) & (df["score_margin"] < 0)]

result_follows_turnovers_percent = (df2.shape[0] + df3.shape[0]) / df.shape[0] * 100

print(f"Teams with a positive turnover margin win {result_follows_turnovers_percent:.1f}% of the time.")