# How important are turnovers in the NFL?

The 2021-22 NFL season kicked off last night and I’m in a football mood. Let’s continue the streak of sports posts!

Football—that’s American football, not soccer—is often characterized as a game of inches. I’ve always taken that to mean that many seemingly small edges are actually crucial to the final score. And few advantages pack a punch like turnovers, which occur when one team takes the ball away from the other.

Exactly how much do turnovers matter? We can find out by utilizing simple linear regression. We’ll calculate:

- How many points each turnover is worth, on average.
- How often teams win when they earn a turnover advantage.

And of course we’ll plot the regression line.

#### 1. Prepare the data.

Begin with the imports. We’ve worked with *pandas* and *Matplotlib* many times on the blog. Making a new appearance is *SciPy*, a powerful and extensive Python library for scientific computing.

```python
import pandas as pd
from scipy.stats import linregress
import matplotlib.pyplot as plt
```

The dataset can be downloaded from Kaggle. It contains team stats from every NFL game going back to 2002.

Get started by reading the dataset and converting its `date` column to datetime format.

```python
df = pd.read_csv("nfl_team_stats_2002-2020.csv", parse_dates=["date"])
```

Since we’re interested in calculating how turnovers affect the final score, we’ll set up the regression like this:

- Independent variable: turnover margin
  - Away minus home.
  - Positive is good for the home team.
- Dependent variable: score margin
  - Home minus away.
  - Positive is good for the home team.

Notice we’ve arbitrarily decided to work from the perspective of the home team. This helps us avoid double-counting games!

Create two new columns to contain said variables.

```python
df.loc[:, "turnover_margin"] = df["turnovers_away"] - df["turnovers_home"]
df.loc[:, "score_margin"] = df["score_home"] - df["score_away"]
```
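To make the sign convention concrete, here is a minimal sketch on two made-up games. It assumes, as the "positive is good for the home team" convention implies, that `turnovers_home` counts turnovers committed by the home team. The numbers are invented for illustration and are not from the Kaggle dataset.

```python
import pandas as pd

# Two made-up games; column names match the ones used in this post.
# Assumption: turnovers_home / turnovers_away count turnovers each team committed.
games = pd.DataFrame({
    "turnovers_home": [1, 3],
    "turnovers_away": [3, 0],
    "score_home": [27, 10],
    "score_away": [17, 24],
})

# Away minus home: positive means the home team won the turnover battle.
games["turnover_margin"] = games["turnovers_away"] - games["turnovers_home"]
# Home minus away: positive means the home team won the game.
games["score_margin"] = games["score_home"] - games["score_away"]

print(games[["turnover_margin", "score_margin"]])
```

In the first game the home team is +2 in turnovers and wins by 10. In the second it is -3 and loses by 14.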

After that, restrict the dataframe to the most recent 10 years of games. Football fans know how much the sport has changed throughout its history. Limiting analysis to the most recent decade, while not perfect, should better represent the modern game while still providing plenty of data points.

```python
df = df[df["date"] > pd.Timestamp("July 1 2011")]
```
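A July cutoff falls in the offseason, so it separates one season from the next without splitting any season in two. The sketch below illustrates the filter on three made-up dates. The `label` column is hypothetical and exists only to make the result readable.

```python
import pandas as pd

# Three made-up game dates around the cutoff (labels are illustrative only).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-15", "2011-09-11", "2020-12-27"]),
    "label": ["before cutoff", "after cutoff", "well after cutoff"],
})

cutoff = pd.Timestamp("2011-07-01")

# Boolean indexing keeps only rows dated after the cutoff.
recent = toy[toy["date"] > cutoff]
print(recent["label"].tolist())
```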

#### 2. Linear regression.

Now it’s time to perform a simple linear regression. Doing regressions by hand is extremely tedious, but with Python it’s as easy as passing two iterables into `scipy.stats.linregress()`. This function returns five values but we’re interested in three:

`slope`

- The slope (rise over run) of the regression line.
- Coefficient $\beta_1$ in the equation below.

`intercept`

- The y-intercept of the regression line.
- Constant $\beta_0$ in the equation below.

`r_value`

- This is *r*, the correlation coefficient.
- We’ll use it to calculate *R²*, which essentially tells us how tightly data points fit the regression line.

$$y = \beta_0 + \beta_1 x$$

Here *x* is the turnover margin and *y* is the score margin.

Pass the appropriate dataframe columns into `scipy.stats.linregress()`:

```python
slope, intercept, r_value, p_value, std_err = linregress(df["turnover_margin"], df["score_margin"])
```

If you print the `slope` and `intercept` values you’ll find:

```text
slope = 4.53
intercept = 1.96
```

**This means each turnover is worth about 4.5 points to the final score margin.**

Be careful not to interpret this as a causal relationship. Turnovers almost certainly do *cause* a change in the final score, but claiming so on the basis of this regression would overstate what the method can show. Strictly speaking we’re only observing the variables’ correlation.

Also notice that `intercept` is clearly non-zero. That’s because we’re analyzing the data from the perspective of the home team. In the NFL, home field advantage is worth a couple of points. In a game between two equally matched teams with no turnover advantage, you’d expect the home team to be favored by approximately 2 points.
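To see what `linregress` is doing under the hood, the sketch below runs it on six made-up games and compares the result with the textbook least-squares formulas for slope and intercept. The data is invented for illustration. The names `beta_1` and `beta_0` are just labels for the slope and intercept, not part of SciPy.

```python
import numpy as np
from scipy.stats import linregress

# Made-up turnover and score margins for six games (not real NFL data).
x = np.array([-2, -1, 0, 0, 1, 2])
y = np.array([-10, 3, -1, 7, 6, 13])

result = linregress(x, y)

# Closed-form least-squares estimates for comparison:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
# intercept = mean_y - slope * mean_x
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(f"linregress: slope = {result.slope:.2f}, intercept = {result.intercept:.2f}")
print(f"by hand:    slope = {beta_1:.2f}, intercept = {beta_0:.2f}")
print(f"R² = {result.rvalue ** 2:.4f}")
```

Both approaches give a slope of 4.90 and an intercept of 3.00 on this toy data. The regression in this post works the same way, just with many more games.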

#### 3. Plot the data.

Next we’ll plot all data points along with the regression line. Start by creating a pair of lists to hold the regression line data. A straight line only really needs two ordered pairs. We can use `min()` and `max()` to easily match a line to the scatter plot.

```python
x_regression = [min(df["turnover_margin"]), max(df["turnover_margin"])]
y_regression = [n * slope + intercept for n in x_regression]
```

The *Matplotlib* code is fairly straightforward. We create a single *axes* and draw both the scatter plot and the regression line on it.

I like to use a square window for regressions: `figsize=(8, 8)`. And whenever possible, locate (0, 0) at the center of the window. In other words since the x-axis extends to +7, it should extend to -7 as well.
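If you’d rather not hard-code the limits, one option is to derive them from the data. The helper below is a hypothetical sketch (`symmetric_limits` and its `padding` parameter are names made up for this example). It finds the largest distance from zero and mirrors it on both sides.

```python
import math


def symmetric_limits(values, padding=0.25):
    """Return (low, high) axis limits centered on zero that cover all values."""
    # Largest distance from zero, rounded up to a whole number.
    extent = math.ceil(max(abs(v) for v in values))
    # Mirror the extent so that zero sits in the middle of the axis.
    return (-extent - padding, extent + padding)


limits = symmetric_limits([-5, -1, 0, 3, 7])
print(limits)  # (-7.25, 7.25)
```

The resulting tuple could be passed as `xlim` when configuring the axes.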

A few more notes about the figure:

- The regression equation is displayed by a legend.
- R² is placed manually as text with a *bbox* to aid visibility. It’s calculated by squaring the correlation coefficient.
- You can pass *pandas* dataframe columns directly into Matplotlib axes. There’s no need to convert data types.
- I like to add `alpha` (transparency) to scatter plots when there are many overlapping points.
- I use my custom `wollen_dark` style. It will be linked at the bottom of this post.

```python
plt.style.use("wollen_dark.mplstyle")

fig, ax = plt.subplots(figsize=(8, 8))
fig.subplots_adjust(left=0.082, bottom=0.073, top=0.953, right=0.972)

ax.scatter(df["turnover_margin"], df["score_margin"], color="#FFF", s=30, alpha=0.5)

ax.plot(x_regression, y_regression, color="#D50A0A", linewidth=3.0,
        label=f"y={slope:.2f}*x{intercept:+.2f}")

ax.set(xticks=range(-7, 8), xlim=(-7.25, 7.25), xlabel="Turnover Margin",
       yticks=range(-60, 70, 10), ylim=(-62, 62), ylabel="Score Margin")

ax.set_title("NFL Turnovers & Final Score | 2011–2020")

ax.text(4.5, -36, f"R² = {r_value**2:.4f}",
        {"fontname": "Ubuntu Condensed", "fontsize": 13, "color": "#000"},
        bbox={"boxstyle": "round", "facecolor": "#FFF", "linewidth": 0.25, "alpha": 0.9, "pad": 0.25})

plt.legend(loc="upper left")

plt.show()
```

**The output:**

#### 4. Analysis.

With the regression done, let’s check how often winning the turnover battle leads to victory.

First, filter out all games where turnover margin is zero, then check how often either of the following conditions is true:

- **Positive** turnover margin and **positive** score margin.
  - Indicating the **home** team won both the turnover battle and the game.
- **Negative** turnover margin and **negative** score margin.
  - Indicating the **away** team won both the turnover battle and the game.

Create new views of the original dataframe and check their size using the `shape` attribute. Calculate a percentage by dividing these sizes.

```python
df = df[df["turnover_margin"] != 0]

df2 = df[(df["turnover_margin"] > 0) & (df["score_margin"] > 0)]
df3 = df[(df["turnover_margin"] < 0) & (df["score_margin"] < 0)]

result_follows_turnovers_percent = (df2.shape[0] + df3.shape[0]) / df.shape[0] * 100

print(f"Teams with a positive turnover margin win {result_follows_turnovers_percent:.1f}% of the time.")
```

**Note:** We filtered the dataframe with multiple conditions by using the `&` operator. Another option is to use `DataFrame.query`, which often results in more readable code. The `df2` and `df3` lines above could have been written this way:

```python
df2 = df.query("turnover_margin > 0 & score_margin > 0")
df3 = df.query("turnover_margin < 0 & score_margin < 0")
```
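As a quick sanity check that the two styles select the same rows, here is a small sketch on made-up data. It uses the keyword `and` inside the query string, which `DataFrame.query` also accepts and which some readers find easier to scan than `&`.

```python
import pandas as pd

# Made-up margins for five games (not real NFL data).
toy = pd.DataFrame({
    "turnover_margin": [2, 1, -1, -3, 1],
    "score_margin": [10, -3, -7, 4, 14],
})

# Boolean-mask style: each condition needs its own parentheses.
masked = toy[(toy["turnover_margin"] > 0) & (toy["score_margin"] > 0)]

# Query style: the same filter written as a string expression.
queried = toy.query("turnover_margin > 0 and score_margin > 0")

same_rows = masked.equals(queried)
```

Here `same_rows` is `True` and both selections contain the same two games, so either form can feed the win-rate calculation.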

The output of the win-rate calculation:

```text
Teams with a positive turnover margin win 77.6% of the time.
```

**To answer the original question:** turnovers are incredibly important in the NFL! In fact if you adjust the above code to calculate how often a +1 turnover margin leads to victory, you’ll find it’s 66.5%. So being just one turnover ahead corresponds to roughly a 2-to-1 win rate. A +2 advantage pushes the rate well above 80%. There may be a team statistic that’s more correlated with the final result but I’m not aware of it.
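One possible way to make that adjustment is to group games by the size of the turnover edge and compute a win rate for each group. The sketch below uses made-up games, so its percentages are not the real figures quoted above. The column names `edge_team_won` and `edge_size` are hypothetical helpers introduced for this example.

```python
import pandas as pd

# Made-up games from the home team's perspective (not real NFL data).
toy = pd.DataFrame({
    "turnover_margin": [1, 1, 1, -1, -1, 2, -2, 0],
    "score_margin": [7, -3, 10, -6, 3, 14, -21, 4],
})

# Drop games where neither team had a turnover advantage.
toy = toy[toy["turnover_margin"] != 0].copy()

# The team with the turnover edge won if both margins share a sign.
# A tied game (score margin of zero) counts as not won.
toy["edge_team_won"] = (toy["turnover_margin"] * toy["score_margin"]) > 0
toy["edge_size"] = toy["turnover_margin"].abs()

# Win percentage for each size of turnover advantage.
win_rate = toy.groupby("edge_size")["edge_team_won"].mean() * 100
print(win_rate.round(1))
```

On this toy data the team with a one-turnover edge wins 3 of 5 games (60%) and the team with a two-turnover edge wins both of its games. Swapping in the real dataframe should yield the per-margin rates discussed above.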

Enjoy the upcoming NFL season and make sure your team wins the turnover battle!

**Full code:**

```python
import pandas as pd
from scipy.stats import linregress
import matplotlib.pyplot as plt


df = pd.read_csv("nfl_team_stats_2002-2020.csv", parse_dates=["date"])

df.loc[:, "turnover_margin"] = df["turnovers_away"] - df["turnovers_home"]
df.loc[:, "score_margin"] = df["score_home"] - df["score_away"]

df = df[df["date"] > pd.Timestamp("July 1 2011")]

slope, intercept, r_value, p_value, std_err = linregress(df["turnover_margin"], df["score_margin"])

x_regression = [min(df["turnover_margin"]), max(df["turnover_margin"])]
y_regression = [n * slope + intercept for n in x_regression]

plt.style.use("wollen_dark.mplstyle")

fig, ax = plt.subplots(figsize=(8, 8))
fig.subplots_adjust(left=0.082, bottom=0.073, top=0.953, right=0.972)

ax.scatter(df["turnover_margin"], df["score_margin"], color="#FFF", s=30, alpha=0.5)

ax.plot(x_regression, y_regression, color="#D50A0A", linewidth=3.0,
        label=f"y={slope:.2f}*x{intercept:+.2f}")

ax.set(xticks=range(-7, 8), xlim=(-7.25, 7.25),
       yticks=range(-60, 70, 10), ylim=(-62, 62),
       xlabel="Turnover Margin", ylabel="Score Margin")

ax.set_title("NFL Turnovers & Final Score | 2011–2020")

ax.text(4.5, -36, f"R² = {r_value**2:.4f}",
        {"fontname": "Ubuntu Condensed", "fontsize": 13, "color": "#000"},
        bbox={"boxstyle": "round", "facecolor": "#FFF", "linewidth": 0.25, "alpha": 0.9, "pad": 0.25})

plt.legend(loc="upper left")

plt.show()

df = df[df["turnover_margin"] != 0]

df2 = df[(df["turnover_margin"] > 0) & (df["score_margin"] > 0)]
df3 = df[(df["turnover_margin"] < 0) & (df["score_margin"] < 0)]

result_follows_turnovers_percent = (df2.shape[0] + df3.shape[0]) / df.shape[0] * 100

print(f"Teams with a positive turnover margin win {result_follows_turnovers_percent:.1f}% of the time.")
```