If you have any questions about the code here, feel free to reach out to me on Twitter or on Reddit.

Receiver Analysis: Who's underperforming their season?

In this part of the Learn Python with Fantasy Football series, we're going to use a linear regression model to find who is "underperforming" their season given their air yards and number of targets.

This is just going to be a fun implementation of regression to find those players who are not performing where they should be given a line of best fit on the data. Air yards and targets are somewhat predictive of fantasy football performance, but there's obviously much more to it than that. The players who are exceptionally good at creating yards after catch will outperform any model based on averages. Kamara has barely seen any air yards all season, yet he'd still be a top fantasy player if you only counted his fantasy points as a result of his receiving yards, catches, and receiving TDs.

To start off, we'll create some simple scatter plots to visualize the relationship between targets, air yards, and fantasy football performance. Then we'll move on to implementing the model and analyzing the residuals. Players with a large negative residual (actual fantasy points minus expected fantasy points) are said to be due for some positive regression. I'm writing this post as I'm writing the code, so I have no idea what to expect. Apologies now if the results are wacky. I have a feeling Marquez Valdes-Scantling is going to show up as an underperformer, but if you start him next week based on this post and he drops a doughnut, please don't email me.

Let's start off by installing the nflfastpy library into your Google Colab notebook.

Aaand importing our libraries as always. Some new ones here from sklearn. We are importing the LinearRegression class to actually implement our model, and the mean_absolute_error utility function to evaluate our results.

Let's load 2019 and 2020 data. We're going to be using 2019 data to train our model, and 2020 data to predict values.

Before we move on, the data loading process takes a few moments, so let's make some copies of our data so we don't have to run the cell block above again.

Now, let's write a function that's going to aggregate the data we need from the play by play data.

receiver_player_id receiver_player_name game_id targets catches air_yards yards_gained rec_td rec_fpts
0 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2019_01_NYG_DAL 4.0 3.0 3.0 15.0 1.0 10.5
1 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2019_02_DAL_WAS 4.0 4.0 15.0 25.0 1.0 12.5
2 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2019_03_MIA_DAL 4.0 3.0 42.0 54.0 0.0 8.4
3 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2019_04_DAL_NO 4.0 4.0 40.0 50.0 0.0 9.0
4 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2019_05_GB_DAL 4.0 3.0 57.0 29.0 0.0 5.9
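For reference, here is a minimal sketch of what an aggregation function like this could look like. It is not the exact function from the notebook. The play-by-play column names (`pass_attempt`, `complete_pass`, `pass_touchdown`) and the helper name `aggregate_receiving` are assumptions, and the toy rows are made up. The scoring rule (1 point per catch, 0.1 per yard, 6 per receiving TD) is inferred from the sample output above, where 3 catches, 15 yards, and 1 TD come out to 10.5 points.

```python
import pandas as pd

# Toy play-by-play rows (made up). Column names are assumed to follow the
# nflfastR play-by-play schema; check them against your own data.
pbp = pd.DataFrame({
    "receiver_player_id": ["a1", "a1", "a1", "b2"],
    "receiver_player_name": ["J.Witten", "J.Witten", "J.Witten", "A.Cooper"],
    "game_id": ["2019_01_NYG_DAL"] * 4,
    "pass_attempt": [1, 1, 1, 1],
    "complete_pass": [1, 1, 0, 1],
    "air_yards": [2.0, 4.0, 10.0, 20.0],
    "yards_gained": [5.0, 10.0, 0.0, 25.0],
    "pass_touchdown": [0, 1, 0, 0],
})

def aggregate_receiving(pbp_df):
    """Hypothetical helper: one row per receiver per game."""
    keys = ["receiver_player_id", "receiver_player_name", "game_id"]
    out = (
        pbp_df.dropna(subset=["receiver_player_id"])  # keep plays with a targeted receiver
        .groupby(keys, as_index=False)
        .agg(
            targets=("pass_attempt", "sum"),
            catches=("complete_pass", "sum"),
            air_yards=("air_yards", "sum"),
            yards_gained=("yards_gained", "sum"),
            rec_td=("pass_touchdown", "sum"),
        )
    )
    # Full PPR scoring, inferred from the sample output in this post
    out["rec_fpts"] = out["catches"] + 0.1 * out["yards_gained"] + 6 * out["rec_td"]
    return out

weekly = aggregate_receiving(pbp)
print(weekly)
```

Summing per play like this assumes every row with a receiver id is a target, which is good enough for a sketch but worth double checking on real data.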

We now have weekly data available for 2019. Let's visualize the relationship between air yards and receiving fantasy points.

We can see there is some sort of relationship between air yards/targets and receiving fantasy points output.
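If you want a number to go with the picture, a Pearson correlation is one rough way to describe "some sort of relationship". The sketch below uses ten player-games copied from the 2020 tables later in this post. That is a tiny, hand-picked sample, so it only shows how the call works and says nothing reliable about the real strength of the relationship.

```python
import pandas as pd

# Ten player-games copied from the 2020 tables later in this post.
# Tiny and hand-picked, so the result only demonstrates the call.
sample = pd.DataFrame({
    "targets":   [1.0, 1.0, 2.0, 2.0, 1.0, 6.0, 5.0, 11.0, 20.0, 14.0],
    "air_yards": [2.0, 3.0, 18.0, -1.0, 3.0, 63.0, 44.0, 137.0, 233.0, -8.0],
    "rec_fpts":  [1.2, 1.3, 9.8, 2.6, 1.6, 33.8, 28.4, 37.6, 53.0, 38.9],
})

# Pearson correlation of each feature with receiving fantasy points
corr = sample.corr()["rec_fpts"].drop("rec_fpts")
print(corr)
```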

Let's move on to implementing a model for fantasy points scored based on these relationships.

First, we are going to split up our features (X) and target (y). Our features, air yards and targets, are what we use to predict fantasy points, and our target is fantasy points.

We are also going to double check and make sure we don't have any null values in our data, which we probably don't.

Now that we know we have no null values, let's convert these DataFrames to numpy arrays using the values attribute.
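The three steps above (split, null check, convert) might look roughly like this. The rows are the five 2019 J.Witten games from the table earlier in the post, and the variable names are my own choice rather than the notebook's.

```python
import pandas as pd

# Five 2019 rows copied from the table earlier in this post
df = pd.DataFrame({
    "targets": [4.0, 4.0, 4.0, 4.0, 4.0],
    "air_yards": [3.0, 15.0, 42.0, 40.0, 57.0],
    "rec_fpts": [10.5, 12.5, 8.4, 9.0, 5.9],
})

# Features (X) and target (y)
X = df[["air_yards", "targets"]]
y = df["rec_fpts"]

# Count null values per feature column and in the target
print(X.isnull().sum())
print(y.isnull().sum())

# Convert to numpy arrays with the values attribute
X = X.values
y = y.values
print(X.shape, y.shape)
```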

Our model can be implemented as simply as the one-liner below.
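As a sketch of that one-liner, here is a `LinearRegression` fit on made-up data. The numbers are not from the 2019 season. They are built to follow an exact linear rule so you can see the model recover the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up features: columns are air_yards, targets
X = np.array([
    [10.0, 2.0],
    [50.0, 5.0],
    [20.0, 3.0],
    [90.0, 8.0],
    [35.0, 6.0],
    [60.0, 4.0],
])
# Made-up target that follows an exact rule: 1 + 0.05 * air_yards + 1 * targets
y = 1.0 + 0.05 * X[:, 0] + 1.0 * X[:, 1]

# The one-liner: create the model and fit it
model = LinearRegression().fit(X, y)

print(model.coef_, model.intercept_)
```

On real data the fit won't be exact like this, but `coef_` and `intercept_` are still the first things worth looking at after fitting.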

Now that we have a model fitted on 2019 data, let's use it to predict 2020 numbers (that already happened). Again, the point here is not to predict future performance. That would require us to know in advance how many targets and air yards a player will have next week. The point here is to ensure we have a decent model, use it to predict past values, and then see which players are underperforming or overperforming relative to what the model expects.

receiver_player_id receiver_player_name game_id targets catches air_yards yards_gained rec_td rec_fpts rec_fpts_pred
0 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_01_LV_CAR 1.0 1.0 2.0 2.0 0.0 1.2 1.589535
1 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_02_NO_LV 1.0 1.0 3.0 3.0 0.0 1.3 1.608743
2 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_04_BUF_LV 2.0 2.0 18.0 18.0 1.0 9.8 3.426626
3 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_05_LV_KC 2.0 2.0 -1.0 6.0 0.0 2.6 3.061687
4 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_07_TB_LV 1.0 1.0 3.0 6.0 0.0 1.6 1.608743
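Adding a prediction column like `rec_fpts_pred` could be sketched as follows. Both the "2019" training data and the "2020" rows here are made up for illustration, and the frame name `df_2020` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up "2019" training data (columns: air_yards, targets) with an exact linear rule
X_train = np.array([[10.0, 2.0], [50.0, 5.0], [20.0, 3.0], [90.0, 8.0], [35.0, 6.0]])
y_train = 1.0 + 0.05 * X_train[:, 0] + 1.0 * X_train[:, 1]
model = LinearRegression().fit(X_train, y_train)

# Made-up "2020" rows to score
df_2020 = pd.DataFrame({
    "air_yards": [20.0, 40.0],
    "targets": [2.0, 4.0],
    "rec_fpts": [3.5, 9.0],
})

# Same feature order as in training, then store predictions next to the actuals
df_2020["rec_fpts_pred"] = model.predict(df_2020[["air_yards", "targets"]].values)
print(df_2020)
```

The one thing to be careful about is keeping the feature columns in the same order you used when fitting.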

Let's see how good our model was at predicting 2020 values.

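Our model was off by about 3 fantasy points per game on average. That number is a mean absolute error, so it describes the typical size of a miss. It does not guarantee that any given prediction landed within 3 points of the actual result.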
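To make the metric concrete, here is `mean_absolute_error` applied to just the five J.Witten rows shown above. This is only those five rows, not the full 2020 evaluation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Actual and predicted points for the five J.Witten rows shown above
y_true = np.array([1.2, 1.3, 9.8, 2.6, 1.6])
y_pred = np.array([1.589535, 1.608743, 3.426626, 3.061687, 1.608743])

# Mean of the absolute differences
mae = mean_absolute_error(y_true, y_pred)

# Share of predictions that missed by no more than the MAE
within_mae = np.mean(np.abs(y_true - y_pred) <= mae)

print(round(mae, 3), within_mae)  # 1.508 0.8
```

Four of the five misses are under half a point, and the one TD game pulls the average up to about 1.5. That's the kind of skew to keep in mind when reading a single MAE number.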

We're going to create a new column now to calculate the difference between y_true and y_pred (our residual).

receiver_player_id receiver_player_name game_id targets catches air_yards yards_gained rec_td rec_fpts rec_fpts_pred residual
0 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_01_LV_CAR 1.0 1.0 2.0 2.0 0.0 1.2 1.589535 -0.389535
1 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_02_NO_LV 1.0 1.0 3.0 3.0 0.0 1.3 1.608743 -0.308743
2 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_04_BUF_LV 2.0 2.0 18.0 18.0 1.0 9.8 3.426626 6.373374
3 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_05_LV_KC 2.0 2.0 -1.0 6.0 0.0 2.6 3.061687 -0.461687
4 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_07_TB_LV 1.0 1.0 3.0 6.0 0.0 1.6 1.608743 -0.008743

Let's sort values by the residual column to see when our model was most wrong.

receiver_player_id receiver_player_name game_id targets catches air_yards yards_gained rec_td rec_fpts rec_fpts_pred residual
1188 32013030-2d30-3033-3337-353731d69d3c R.Tonyan 2020_04_ATL_GB 6.0 6.0 63.0 98.0 3.0 33.8 10.410052 23.389948
910 32013030-2d30-3033-3331-3130f1a1e2e4 T.Higbee 2020_02_LA_PHI 5.0 5.0 44.0 54.0 3.0 28.4 8.515338 19.884662
2307 32013030-2d30-3033-3633-32324e92bd12 J.Jefferson 2020_06_ATL_MIN 11.0 9.0 137.0 166.0 2.0 37.6 19.480263 18.119737
716 32013030-2d30-3033-3232-31312f766863 T.Lockett 2020_07_SEA_ARI 20.0 15.0 233.0 200.0 3.0 53.0 35.092131 17.907869
1284 32013030-2d30-3033-3339-3036f296898c A.Kamara 2020_03_GB_NO 14.0 13.0 -8.0 139.0 2.0 38.9 21.284519 17.615481
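The residual and sorting steps can be sketched together like this, using a handful of rows copied from the tables above. The residual is actual minus predicted, so a positive value means the player beat the model.

```python
import pandas as pd

# A few rows copied from the tables above
df = pd.DataFrame({
    "receiver_player_name": ["J.Witten", "J.Witten", "R.Tonyan", "A.Kamara"],
    "game_id": ["2020_01_LV_CAR", "2020_04_BUF_LV", "2020_04_ATL_GB", "2020_03_GB_NO"],
    "rec_fpts": [1.2, 9.8, 33.8, 38.9],
    "rec_fpts_pred": [1.589535, 3.426626, 10.410052, 21.284519],
})

# Residual: actual minus predicted
df["residual"] = df["rec_fpts"] - df["rec_fpts_pred"]

# Largest positive misses first
print(df.sort_values(by="residual", ascending=False).head())
```

Sorting with `ascending=True` instead would surface the biggest underperformers, which is where we're headed eventually.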

The first result makes sense: Robert Tonyan scored a TD on half of his targets. It looks like our model is pretty bad at predicting multiple TD games. Let's use a simple scatter plot to analyze further.

Yup, so our model got progressively worse the more TDs a player scored in a given game. It was alright for 0 and 1 TD games, but it could not handle 3 TD games.
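A numeric stand-in for the scatter plot is to average the absolute residual by number of TDs. The sketch below uses the ten rows printed in the tables above. Keep in mind that five of them were picked precisely because they were the model's worst misses, so the 2 and 3 TD averages here are inflated compared to the full data.

```python
import pandas as pd

# rec_td and residual for the ten rows printed in the tables above
df = pd.DataFrame({
    "rec_td": [0.0, 0.0, 1.0, 0.0, 0.0, 3.0, 3.0, 2.0, 3.0, 2.0],
    "residual": [-0.389535, -0.308743, 6.373374, -0.461687, -0.008743,
                 23.389948, 19.884662, 18.119737, 17.907869, 17.615481],
})

# Mean absolute residual for each TD count
mae_by_td = (
    df.assign(abs_residual=df["residual"].abs())
    .groupby("rec_td")["abs_residual"]
    .mean()
)
print(mae_by_td)
```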

Let's move on to joining some injury data to remove injured players and then finding some underperformers anyhow.

There's a new function in the nflfastpy module that lets you load 2020 roster data. It contains data on injured players too, which is pretty neat.

season team position depth_chart_position jersey_number status full_name first_name last_name birth_date ... weight college high_school gsis_id espn_id sportradar_id yahoo_id rotowire_id update_dt headshot_url
0 2020 ARI C C 52.0 Active Mason Cole Mason Cole 1996-03-28 ... 292.0 Michigan East Lake (FL) 00-0034785 3115972.0 53d25371-e3ce-4030-8d0a-82def5cdc600 31067.0 12795.0 2020-11-21T07:08:46Z https://a.espncdn.com/combiner/i?img=/i/headsh...

1 rows × 21 columns

receiver_player_id receiver_player_name game_id targets catches air_yards yards_gained rec_td rec_fpts rec_fpts_pred residual gsis_id status headshot_url
0 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_01_LV_CAR 1.0 1.0 2.0 2.0 0.0 1.2 1.589535 -0.389535 00-0022127 Active https://a.espncdn.com/combiner/i?img=/i/headsh...
1 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_02_NO_LV 1.0 1.0 3.0 3.0 0.0 1.3 1.608743 -0.308743 00-0022127 Active https://a.espncdn.com/combiner/i?img=/i/headsh...
2 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_04_BUF_LV 2.0 2.0 18.0 18.0 1.0 9.8 3.426626 6.373374 00-0022127 Active https://a.espncdn.com/combiner/i?img=/i/headsh...
3 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_05_LV_KC 2.0 2.0 -1.0 6.0 0.0 2.6 3.061687 -0.461687 00-0022127 Active https://a.espncdn.com/combiner/i?img=/i/headsh...
4 32013030-2d30-3032-3231-32373ce51f62 J.Witten 2020_07_TB_LV 1.0 1.0 3.0 6.0 0.0 1.6 1.608743 -0.008743 00-0022127 Active https://a.espncdn.com/combiner/i?img=/i/headsh...

Awesome, so we converted the receiver id to the old gsis id like we've done in previous posts, and then merged the roster data and filtered out inactive players. Let's move on to filtering out some unnecessary columns and preparing our data for more analysis.
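Here is one way the id conversion, merge, and filter could be written. The helper name `to_gsis_id` is hypothetical, and the slicing rule is an assumption that happens to reproduce the J.Witten pair shown above (`32013030-2d30-3032-3231-32373ce51f62` and `00-0022127`). The second player and the "Inactive" status label are made up so there is something to filter out.

```python
import pandas as pd

def to_gsis_id(receiver_player_id):
    """Hypothetical helper: decode the hex-style id into the old gsis id."""
    hex_str = receiver_player_id.replace("-", "")
    # Assumption: drop 4 leading and 8 trailing hex characters, decode the rest as ASCII
    return bytes.fromhex(hex_str[4:-8]).decode("ascii")

print(to_gsis_id("32013030-2d30-3032-3231-32373ce51f62"))  # 00-0022127

df = pd.DataFrame({
    "receiver_player_id": [
        "32013030-2d30-3032-3231-32373ce51f62",  # J.Witten, from the tables above
        "32013030-2d30-3030-3030-303100000000",  # made-up player
    ],
    "residual": [-0.389535, 2.0],
})
df["gsis_id"] = df["receiver_player_id"].apply(to_gsis_id)

# Toy roster; the "Inactive" label is a placeholder, not a real status value
roster = pd.DataFrame({
    "gsis_id": ["00-0022127", "00-0000001"],
    "status": ["Active", "Inactive"],
})

# Attach roster status, then keep active players only
merged = df.merge(roster[["gsis_id", "status"]], on="gsis_id", how="left")
active = merged[merged["status"] == "Active"]
print(active)
```

A left merge keeps players who are missing from the roster file (with a null status), and the filter then drops them along with the inactive ones.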

gsis_id receiver_player_name headshot_url rec_fpts rec_fpts_pred residual
0 00-0022127 J.Witten https://a.espncdn.com/combiner/i?img=/i/headsh... 1.2 1.589535 -0.389535
1 00-0022127 J.Witten https://a.espncdn.com/combiner/i?img=/i/headsh... 1.3 1.608743 -0.308743
2 00-0022127 J.Witten https://a.espncdn.com/combiner/i?img=/i/headsh... 9.8 3.426626 6.373374
3 00-0022127 J.Witten https://a.espncdn.com/combiner/i?img=/i/headsh... 2.6 3.061687 -0.461687
4 00-0022127 J.Witten https://a.espncdn.com/combiner/i?img=/i/headsh... 1.6 1.608743 -0.008743

Let's group by receiver and find the players who our model favored the most.

gsis_id receiver_player_name headshot_url rec_fpts rec_fpts_pred residual
72 00-0031381 D.Adams https://a.espncdn.com/combiner/i?img=/i/headsh... 27.014286 19.745273 7.269013
83 00-0031588 S.Diggs https://a.espncdn.com/combiner/i?img=/i/headsh... 18.760000 17.436972 1.323028
44 00-0030279 K.Allen https://a.espncdn.com/combiner/i?img=/i/headsh... 18.222222 17.433888 0.788334
55 00-0030564 D.Hopkins https://a.espncdn.com/combiner/i?img=/i/headsh... 18.820000 16.508142 2.311858
325 00-0035659 T.McLaurin https://a.espncdn.com/combiner/i?img=/i/headsh... 17.077778 16.440406 0.637372

Adams, Diggs, Allen, Hopkins, and McLaurin are our model's top expected receivers for this year. We can see that Adams has a large average residual, probably as a result of his TD production.
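The group-and-average step can be sketched on a few rows from the earlier tables. To keep it short I group on the player name only, whereas the table above also carries gsis_id and headshot_url through.

```python
import pandas as pd

# Five J.Witten games plus one game each for R.Tonyan and T.Higbee, from the tables above
df = pd.DataFrame({
    "receiver_player_name": ["J.Witten"] * 5 + ["R.Tonyan", "T.Higbee"],
    "rec_fpts": [1.2, 1.3, 9.8, 2.6, 1.6, 33.8, 28.4],
    "rec_fpts_pred": [1.589535, 1.608743, 3.426626, 3.061687, 1.608743,
                      10.410052, 8.515338],
})
df["residual"] = df["rec_fpts"] - df["rec_fpts_pred"]

# Per-player averages, sorted by what the model expected
per_player = (
    df.groupby("receiver_player_name", as_index=False)[
        ["rec_fpts", "rec_fpts_pred", "residual"]
    ]
    .mean()
    .sort_values(by="rec_fpts_pred", ascending=False)
)
print(per_player)
```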

We can also see that our model is really conservative. Let's get rid of the predicted values and just use a rank instead.

receiver_player_name headshot_url rec_fpts rec_fpts_pred residual
72 D.Adams https://a.espncdn.com/combiner/i?img=/i/headsh... 1.0 1.0 7.269013
83 S.Diggs https://a.espncdn.com/combiner/i?img=/i/headsh... 7.0 2.0 1.323028
44 K.Allen https://a.espncdn.com/combiner/i?img=/i/headsh... 8.0 3.0 0.788334
55 D.Hopkins https://a.espncdn.com/combiner/i?img=/i/headsh... 5.5 4.0 2.311858
325 T.McLaurin https://a.espncdn.com/combiner/i?img=/i/headsh... 15.0 5.0 0.637372
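Swapping values for ranks can be done with the DataFrame `rank` method. This sketch ranks only the five players shown above against each other, so the actual ranks come out differently from the table, which ranks across every receiver. The default tie handling averages tied ranks, which would explain a rank like 5.5.

```python
import pandas as pd

# Per-player averages copied from the earlier table
df = pd.DataFrame({
    "receiver_player_name": ["D.Adams", "S.Diggs", "K.Allen", "D.Hopkins", "T.McLaurin"],
    "rec_fpts": [27.014286, 18.760000, 18.222222, 18.820000, 17.077778],
    "rec_fpts_pred": [19.745273, 17.436972, 17.433888, 16.508142, 16.440406],
})

# Highest value gets rank 1
cols = ["rec_fpts", "rec_fpts_pred"]
df[cols] = df[cols].rank(ascending=False)
print(df)

# Ties share the average of the ranks they span
print(pd.Series([10.0, 8.0, 8.0]).rank(ascending=False).tolist())  # [1.0, 2.5, 2.5]
```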

So now we have expected rank and also actual rank. Finally, let's find the biggest differences between expected rank and actual rank.

receiver_player_name headshot_url rec_fpts rec_fpts_pred diff
85 D.Waller https://a.espncdn.com/combiner/i?img=/i/headsh... 37.0 19.0 18.0
191 C.Kupp https://a.espncdn.com/combiner/i?img=/i/headsh... 33.0 15.0 18.0
265 D.Chark Jr. https://a.espncdn.com/combiner/i?img=/i/headsh... 28.0 12.0 16.0
79 A.Cooper https://a.espncdn.com/combiner/i?img=/i/headsh... 23.0 8.0 15.0
294 D.Johnson https://a.espncdn.com/combiner/i?img=/i/headsh... 38.0 23.0 15.0

Darren Waller is our biggest underperformer for the 2020 season so far. He should be ranked 19 amongst all WRs and TEs, but he's only ranked 37.
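The rank difference itself is a simple subtraction. Here it is on the five rows from the table above, where `rec_fpts` holds the actual rank and `rec_fpts_pred` holds the expected rank.

```python
import pandas as pd

# Actual rank (rec_fpts) and expected rank (rec_fpts_pred) from the table above
df = pd.DataFrame({
    "receiver_player_name": ["A.Cooper", "D.Waller", "D.Chark Jr.", "C.Kupp", "D.Johnson"],
    "rec_fpts": [23.0, 37.0, 28.0, 33.0, 38.0],
    "rec_fpts_pred": [8.0, 19.0, 12.0, 15.0, 23.0],
})

# Positive diff: the player ranks worse than the model expects
df["diff"] = df["rec_fpts"] - df["rec_fpts_pred"]

underperformers = df.sort_values(by="diff", ascending=False)
print(underperformers)
```

Since rank 1 is best, a large positive diff means the actual rank is well behind the expected one, which is what we're calling underperforming.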

Let's style our DataFrame and call it a day.