{"id":208,"date":"2021-05-07T05:30:58","date_gmt":"2021-05-07T10:30:58","guid":{"rendered":"https:\/\/wollen.org\/blog\/?p=208"},"modified":"2024-08-17T08:17:43","modified_gmt":"2024-08-17T13:17:43","slug":"violin-plots-and-the-nfl-combine","status":"publish","type":"post","link":"https:\/\/wollen.org\/blog\/2021\/05\/violin-plots-and-the-nfl-combine\/","title":{"rendered":"Violin plots and the NFL Combine"},"content":{"rendered":"<p>Following the NFL Draft I thought it would be fun to look at data from the Scouting Combine. For those unfamiliar, the <a href=\"https:\/\/www.nfl.com\/network\/events\/nfl-combine\" target=\"_blank\" rel=\"noopener\">NFL Combine<\/a> is an annual event where football&#8217;s brightest prospects are invited to show off their talents in front of scouts. Players run a gauntlet of tests that includes a 40-yard dash, vertical jump, bench press, and more, along with interviews where team managers and players can get to know each other.<\/p>\n<p>Unfortunately in the past few years many players have begun declining to participate in certain events when they believe it can only hurt their draft stock. And hey, they have my full support, but it makes it difficult to compare today&#8217;s Combine results with those from 10 or 20 years ago. Fortunately, at some point during the weekend, players&#8217; height and weight are still measured. That seems like a good place to start.<\/p>\n<hr \/>\n<h4>1. The background.<\/h4>\n<p>Often when we see one-dimensional data\u2014repeated measurements of a single variable\u2014it&#8217;s plotted on a histogram. The data is grouped into evenly sized bins and we count how many items land in each bin. 
From there it looks like a bar graph.<\/p>\n<p>We could plot our Combine data that way:<\/p>\n<p><a href=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight_histo-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1765 size-full\" src=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight_histo-1.png\" alt=\"\" width=\"1200\" height=\"650\" srcset=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight_histo-1.png 1200w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight_histo-1-300x163.png 300w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight_histo-1-1024x555.png 1024w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight_histo-1-768x416.png 768w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>But often violin plots provide more insight into a variable&#8217;s distribution. Violin plots are like a combination of a boxplot and a kernel density estimate (KDE). They display descriptive statistics\u2014median and quartiles\u2014and also visualize the probability density function.<\/p>\n<p>Of course in this case we have the entire population of data, i.e. measurements of every player who attended the Combine, so we don&#8217;t have to calculate the probability of a random player&#8217;s size. But now we can estimate the expected size of players at an infinitely large NFL Combine. In other words, what body types are NFL teams most interested in signing? Have their preferences changed over the past decade along with coaching strategy?<\/p>\n<hr \/>\n<h4>2. Prepare the data.<\/h4>\n<p>Start by creating 2010 and 2020 dataframes and add a <code>year<\/code> column to both. 
We&#8217;ll combine the dataframes in a moment.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">df2010 = pd.read_csv(\"data\/2010.csv\")\r\ndf2020 = pd.read_csv(\"data\/2020.csv\")\r\n\r\ndf2010.loc[:, \"year\"] = 2010\r\ndf2020.loc[:, \"year\"] = 2020<\/pre>\n<p>Do a quick check for any <em>NaN<\/em>s that may have sneaked into the data. We&#8217;re in luck. Every row includes height and weight values.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">weight_nans = df2010[\"weight\"].isna().sum() + df2020[\"weight\"].isna().sum()\r\nheight_nans = df2010[\"height\"].isna().sum() + df2020[\"height\"].isna().sum()\r\n\r\n# Both counts are zero when no values are missing.\r\nprint(weight_nans, height_nans)<\/pre>\n<p>Before combining the dataframes I want to print basic descriptive statistics with pandas&#8217; <code>describe<\/code> method. It&#8217;s always a good idea to get a feel for the data this way. It can help you avoid errors once you&#8217;re swimming in it. What range do the variables span? Are there any obvious outliers that require special treatment?<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">print(df2010[\"weight\"].describe())<\/pre>\n<p>The output for the 2010 dataframe is below. Note that the 50% row (the second quartile) is equivalent to <code>df2010[\"weight\"].median()<\/code>.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">count    326.000000\r\nmean     242.849693\r\nstd       43.826428\r\nmin      149.000000\r\n25%      209.000000\r\n50%      236.000000\r\n75%      270.750000\r\nmax      354.000000\r\nName: weight, dtype: float64<\/pre>\n<p>With that out of the way it&#8217;s time to concatenate the dataframes. 
Do it along <code>axis=0<\/code> so they&#8217;ll be combined vertically rather than side-by-side.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">df = pd.concat([df2010, df2020], axis=0)<\/pre>\n<p>Height in the dataset is in <em>feet-inches<\/em> format, which <em>pandas<\/em> reads as strings, so we need to convert the <code>height<\/code> column to inches.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">def parse_height(raw):\r\n    # Split a feet-inches string on the hyphen and convert to total inches.\r\n    feet, inches = raw.split(\"-\")\r\n    return int(feet) * 12 + int(inches)\r\n\r\ndf.loc[:, \"height_inches\"] = df[\"height\"].apply(parse_height)<\/pre>\n<hr \/>\n<h4>3. Plot the data.<\/h4>\n<p>Then it&#8217;s time to create violin plots for the height and weight variables. For this we&#8217;ll use <em>Seaborn<\/em>, which is a fantastic wrapper for <em>Matplotlib<\/em>. Although it trades away some control, <em>Seaborn<\/em> supports several specialized plots (like violin plots) and default styles that are much more presentable than pure <em>Matplotlib<\/em>.<\/p>\n<p>There are usually several ways to set style options but I like to take care of as many as possible up front with <code>sns.set<\/code>.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">sns.set(style=\"darkgrid\", palette=\"colorblind\", font=\"Ubuntu Condensed\", font_scale=1.1)\r\n\r\n# Columns are referenced by name because the dataframe is passed as data.\r\nvplot = sns.violinplot(x=\"year\", y=\"weight\", data=df)\r\n\r\nvplot.set_yticks(range(100, 450, 50))\r\nvplot.set_ylim(90, 410)\r\n\r\nplt.title(\"NFL Combine | Player Weight\", size=15)\r\nplt.ylabel(\"Weight (lbs.)\", size=13, labelpad=10)\r\nplt.xlabel(\"Year\", size=13)\r\n\r\nfig = plt.gcf()\r\nfig.set_size_inches(8, 8)\r\nfig.subplots_adjust(left=0.097, right=0.978, bottom=0.073, top=0.958)\r\n\r\nplt.savefig(\"nfl_combine_weight.png\", facecolor=\"#FEFEFE\")<\/pre>\n<p>My approach for the height plot is essentially identical. 
It can be found in the full code at the bottom of this page.<\/p>\n<p>In the plots below, the median is represented by a white dot near the center. The thicker vertical bar spans the middle two quartiles (the interquartile range), and the thinner line extends into the lower and upper quartiles on either side. The curved outline, which is like a probability density function rotated 90\u00b0 and mirrored, shows the estimated density of measurements at each value. In other words, the wider the violin at a given weight, the more common that weight is in the data.<\/p>\n<p>I think it goes without saying why these plots are named for violins.<\/p>\n<p><a href=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1766 size-full\" src=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight.png\" alt=\"\" width=\"800\" height=\"800\" srcset=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight.png 800w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight-300x300.png 300w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight-150x150.png 150w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_weight-768x768.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><\/p>\n<hr \/>\n<p><a href=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_height.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1767 size-full\" src=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_height.png\" alt=\"\" width=\"800\" height=\"800\" srcset=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_height.png 800w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_height-300x300.png 300w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_height-150x150.png 150w, 
https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/05\/nfl_combine_height-768x768.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><\/p>\n<p>It&#8217;s difficult to spot any difference in height from 2010 to 2020. It&#8217;s tempting to see the lower 2020 median weight and imagine some effect, like a shift from run-first to pass-first offenses leading to smaller players.<\/p>\n<p>But a quick two-sample Z-test on weight throws cold water on that idea.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">from statsmodels.stats.weightstats import ztest\r\n\r\n# One-sided test: is the 2010 mean weight larger than the 2020 mean?\r\nz_stat, p_val = ztest(df2010[\"weight\"], df2020[\"weight\"],\r\n                      value=0, alternative=\"larger\", ddof=1)\r\n\r\nprint(f\"z = {z_stat:.4f}\")\r\nprint(f\"p = {p_val:.4f}\")\r\n\r\n# z = 0.6256\r\n# p = 0.2658<\/pre>\n<p>With a p-value of about 0.27 we can&#8217;t reject the null hypothesis, so the data show no statistically significant drop in the mean weight of players invited to the NFL Combine between 2010 and 2020. Nothing wrong with a negative result.<\/p>\n<hr \/>\n<p><strong>Source: <a href=\"https:\/\/www.pro-football-reference.com\/draft\/2010-combine.htm\" target=\"_blank\" rel=\"noopener\">www.pro-football-reference.com<\/a><\/strong><\/p>\n<p><strong><a href=\"https:\/\/wollen.org\/misc\/nfl_combine_2010_2020.zip\">Download the data<\/a><\/strong>.<\/p>\n<p><strong>Full code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import pandas as pd\r\nimport seaborn as sns\r\nimport matplotlib.pyplot as plt\r\n\r\n\r\ndef parse_height(raw):\r\n    feet, inches = raw.split(\"-\")\r\n    return int(feet) * 12 + int(inches)\r\n\r\n\r\ndf2010 = pd.read_csv(\"data\/2010.csv\")\r\ndf2020 = pd.read_csv(\"data\/2020.csv\")\r\n\r\nweight_nans = df2010[\"weight\"].isna().sum() + df2020[\"weight\"].isna().sum()\r\nheight_nans = df2010[\"height\"].isna().sum() + 
df2020[\"height\"].isna().sum()\r\n\r\nprint(df2010[\"weight\"].describe())\r\nprint(df2020[\"weight\"].describe())\r\n\r\nprint(df2010[\"height\"].describe())\r\nprint(df2020[\"height\"].describe())\r\n\r\ndf2010.loc[:, \"year\"] = 2010\r\ndf2020.loc[:, \"year\"] = 2020\r\n\r\ndf = pd.concat([df2010, df2020], axis=0)\r\n\r\ndf.loc[:, \"height_inches\"] = df[\"height\"].apply(parse_height)\r\n\r\nsns.set(style=\"darkgrid\", palette=\"colorblind\", font=\"Ubuntu Condensed\", font_scale=1.1)\r\n\r\nvplot = sns.violinplot(x=\"year\", y=\"weight\", data=df)\r\n\r\nvplot.set_yticks(range(100, 450, 50))\r\nvplot.set_ylim(90, 410)\r\n\r\nplt.title(\"NFL Combine  |  Player Weight\", size=15)\r\nplt.ylabel(\"Weight  (lbs.)\", size=13, labelpad=10)\r\nplt.xlabel(\"Year\", size=13)\r\n\r\nfig = plt.gcf()\r\nfig.set_size_inches(8, 8)\r\nfig.subplots_adjust(left=0.097, right=0.978, bottom=0.073, top=0.958)\r\n\r\nplt.savefig(\"nfl_combine_weight.png\", facecolor=\"#FEFEFE\")\r\n\r\n# Close the weight figure so the height violins are drawn on fresh axes.\r\nplt.close(fig)\r\n\r\nvplot2 = sns.violinplot(x=\"year\", y=\"height_inches\", data=df)\r\n\r\nvplot2.set_yticks(range(62, 84, 2))\r\nvplot2.set_ylim(61.5, 82.5)\r\n\r\nplt.title(\"NFL Combine  |  Player Height\", size=15)\r\nplt.ylabel(\"Height  (inches)\", size=13, labelpad=16)\r\nplt.xlabel(\"Year\", size=13)\r\n\r\nfig = plt.gcf()\r\nfig.set_size_inches(8, 8)\r\n\r\nplt.savefig(\"nfl_combine_height.png\", facecolor=\"#FEFEFE\")<\/pre>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Following the NFL Draft I thought it would be fun to look at data from the Scouting Combine. 
For those unfamiliar, the NFL Combine is an annual event where football&#8217;s brightest prospects are invited to show off their talents in<\/p>\n","protected":false},"author":1,"featured_media":741,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[59],"tags":[39,38,24,37,30,25,36,35],"class_list":["post-208","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sports","tag-code","tag-histogram","tag-matplotlib","tag-nfl-combine","tag-pandas","tag-python","tag-seaborn","tag-violin-plot"],"_links":{"self":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts\/208","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/comments?post=208"}],"version-history":[{"count":39,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts\/208\/revisions"}],"predecessor-version":[{"id":1768,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts\/208\/revisions\/1768"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/media\/741"}],"wp:attachment":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/media?parent=208"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/categories?post=208"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/tags?post=208"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}