Hi all, in this part of the intermediate series, I'm going to take you through part of a project we cover in the Python course offered on this site - only this post looks at the current season. The course offers a more in-depth analysis of this project, along with an hour-long video explanation.

Rushing TD Regression Candidates

We are going to be doing a TD regression analysis for RBs. Essentially, for each RB this season, we'll be looking at their number of carries and how far each carry was from the endzone. Based on that distance, we assign each carry an expected touchdown value, which is the historical probability of scoring a TD from that yardline. Once you add up each carry's expected TD value for an RB, you get an expected TD number for the entire season. For each RB, we'll compare that expected TD value to the actual TDs they posted. If a player scored more TDs than their expected TD value, we can say they "over performed" so far this season and are due for negative regression. The reverse holds if their expected TD value was higher than their actual TD total. This helps us assess mid-season whether players should have scored more TDs to this point, and gives us insight when making roster moves in fantasy football.
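To make the bookkeeping concrete, here is a minimal sketch in plain Python. The three probabilities are taken from the historical table shown later in this post (1, 2 and 5 yards out); the carries and the actual TD count are made up for illustration, and each carry is assumed to come on a different drive.

```python
# Historical TD probability by distance from the endzone. These three values
# come from the probability table later in this post; other yardlines omitted.
td_probability = {1: 0.560390, 2: 0.428058, 5: 0.206349}

# Hypothetical carries for one RB (yardline each carry started from),
# assumed to be on three separate drives.
carries = [5, 2, 1]
actual_touchdowns = 1  # hypothetical

# Expected TDs for the season is the sum of each carry's TD probability.
expected_touchdowns = sum(td_probability[yardline] for yardline in carries)

# Expected above actual suggests the player has under performed his opportunity.
positive_regression_candidate = expected_touchdowns > actual_touchdowns

print(round(expected_touchdowns, 3), positive_regression_candidate)
```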

There is actually quite a bit of statistics that goes into this. Since we have the probability of a score from a given yardline, does it make sense to simply add up the probabilities for all of a player's rush attempts in a single drive? One thing you may notice is that the TD probability for a rush attempted from the 1 yardline is greater than 50%. This means that if a player attempts 2 rushes from the 1 yardline on the same drive, their expected touchdowns add up to more than 1. Although the arithmetic is correct, a player cannot score more than 1 touchdown on a drive. To get around this, we cap expected touchdowns at 1 per drive.
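A quick sketch of that problem and the cap, in plain Python. The 1 yardline probability is the one from the table later in this post; the helper name expected_tds_for_drive is just something I made up for this illustration.

```python
# Probability of a rushing TD from the 1 yardline (from the table later in this post).
P_TD_FROM_1 = 0.560390


def expected_tds_for_drive(carry_probabilities, cap=1.0):
    """Sum the per-carry TD probabilities on one drive, limited to `cap`."""
    return min(sum(carry_probabilities), cap)


# Two carries from the 1 yardline on the same drive.
two_carries = [P_TD_FROM_1, P_TD_FROM_1]

uncapped = sum(two_carries)                   # adds up to more than 1 TD
capped = expected_tds_for_drive(two_carries)  # limited to 1 TD for the drive

print(uncapped, capped)
```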

No analysis is perfect - we are treating every RB as if they have the same talent level. Obviously some RBs have the special ability to break off a 60 yard TD at any moment, while others simply do not. RBs with that ability have a higher probability of scoring a TD per handoff than an average RB, so this sort of analysis underestimates their expected TDs. Conversely, it overestimates expected TDs for RBs without that ability. The analysis is still very valuable: by using league-average probabilities, it lets us compare the opportunity-based value of a player to their actual performance.

Feel free to scroll to the bottom to get a list of RB expected and actual touchdowns scored.

The code

With that out of the way, let's start work on the project. For those unfamiliar with my content, you'll need to run this code in a notebook environment, and the easiest solution is to set up a Google Colab runtime.

To start, we're going to be installing the nflfastpy library which is maintained by Fantasy Football Data Pros. It pulls from the nflfastR data package and exposes 20 years of play by play data.

Importing our libraries as always, and setting the style for our visualizations.

In this first block of code, we are going to be pulling data from the past 5 years of the NFL, excluding the most recent season, and concatenating the data together into one big DataFrame. We will be using this data to find the probability of scoring a touchdown X yards from the endzone, for each value of X from 0 to 100, and putting this all into a DataFrame we'll call rushing_df_probs.
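Here is a rough sketch of the concatenation step only. The load_season argument is a hypothetical stand-in for whatever function returns one season of play-by-play data as a DataFrame (with nflfastpy that would presumably be its play-by-play loader). A fake loader is used here so the sketch runs without downloading anything, and the 2016-2020 range is my assumption for "the past 5 years, excluding the most recent season".

```python
import pandas as pd


def load_seasons(years, load_season):
    """Call `load_season(year)` for each year and stack the results into one DataFrame."""
    frames = []
    for year in years:
        season = load_season(year).copy()
        season["season"] = year  # hypothetical helper column so rows stay traceable
        frames.append(season)
    return pd.concat(frames, ignore_index=True)


def fake_load_season(year):
    """Stand-in loader returning two synthetic plays per season (not real data)."""
    return pd.DataFrame({"yardline_100": [1.0, 25.0], "rush_touchdown": [1.0, 0.0]})


# Assumed range: five seasons before the current (2021) season.
pbp_df = load_seasons(range(2016, 2021), fake_load_season)
print(pbp_df.shape)
```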

Now we're going to take our big DataFrame and transform it the way I described. Like I said earlier, this post will not be as in depth as my course, so if the code here confuses a bit, it's covered a bit more in depth in the course. I've added some comments to each line to explain what we're doing.

   yardline_100  probability_of_touchdown
0           1.0                  0.560390
3           2.0                  0.428058
5           3.0                  0.336910
7           4.0                  0.304251
9           5.0                  0.206349
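For anyone following along without the course, one way a table shaped like the one above could be built is sketched below. This is not necessarily how the original code does it: since rush_touchdown is 0 or 1, I simply take its mean per yardline. The input rows are synthetic, so the resulting numbers are illustrative only.

```python
import pandas as pd

# Synthetic stand-in for the historical play-by-play DataFrame (not real data).
pbp_df = pd.DataFrame({
    "rush_attempt":   [1, 1, 1, 1, 1, 1, 0],
    "rush_touchdown": [1, 0, 1, 0, 0, 0, 0],
    "yardline_100":   [1, 1, 1, 1, 2, 2, 1],
})

# Keep only rushing plays.
rushes = pbp_df[pbp_df["rush_attempt"] == 1]

# rush_touchdown is 0/1, so its mean per yardline is the share of rushes
# from that yardline that scored.
rushing_df_probs = (
    rushes.groupby("yardline_100")["rush_touchdown"]
    .mean()
    .rename("probability_of_touchdown")
    .reset_index()
)

print(rushing_df_probs)
```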

Now we can plot the DataFrame you see above to visualize the probability of scoring a rushing touchdown X yards from the endzone. As anyone could have expected, the farther you get from the endzone, the less probable a TD becomes. We're more concerned with the values in the DataFrame, and using them to calculate expected TDs, than with any sort of revelation from visualizing this data.

Now we'll load in 2021 data. We're going to use the rushing_df_probs DataFrame to calculate expected TD values for each RB in the 2021 season, then compare that calculated value to actual TDs scored.

Below, we're going to filter, clean, and merge our data a bit. I've added comments above each operation.

    rusher_id           drive_id rusher_player_name  rush_attempt  rush_touchdown  yardline_100  probability_of_touchdown
0  00-0032764  2021_01_ARI_TEN_1            D.Henry           1.0             0.0          75.0                  0.003111
1  00-0035228  2021_01_ARI_TEN_2           K.Murray           1.0             0.0          23.0                  0.018553
2  00-0034681  2021_01_ARI_TEN_2          C.Edmonds           1.0             0.0           9.0                  0.081081
3  00-0032764  2021_01_ARI_TEN_3            D.Henry           1.0             0.0          80.0                  0.000969
4  00-0032764  2021_01_ARI_TEN_5            D.Henry           1.0             0.0          75.0                  0.003111
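The filter-and-merge step that produces a table like the one above might look roughly like this. It is a sketch on synthetic rows. I am assuming the play-by-play data has game_id and drive columns and that drive_id is built by joining the two, which matches the look of the ids above but is not confirmed by the post. The player X.Example is invented.

```python
import pandas as pd

# A few rows of the historical probability table from earlier in the post.
rushing_df_probs = pd.DataFrame({
    "yardline_100": [1.0, 2.0, 5.0],
    "probability_of_touchdown": [0.560390, 0.428058, 0.206349],
})

# Synthetic current-season plays (not real data); the last row is not a rush.
pbp_2021 = pd.DataFrame({
    "game_id": ["2021_01_ARI_TEN"] * 3,
    "drive": [1, 1, 2],
    "rusher_id": ["A", "A", None],
    "rusher_player_name": ["X.Example", "X.Example", None],
    "rush_attempt": [1.0, 1.0, 0.0],
    "rush_touchdown": [0.0, 1.0, 0.0],
    "yardline_100": [2.0, 1.0, 5.0],
})

# Keep only rushing plays.
rushes = pbp_2021[pbp_2021["rush_attempt"] == 1].copy()

# Assumed construction of drive_id: game id plus the drive number.
rushes["drive_id"] = rushes["game_id"] + "_" + rushes["drive"].astype(int).astype(str)

# Attach the historical TD probability for the yardline of each carry.
columns = ["rusher_id", "drive_id", "rusher_player_name",
           "rush_attempt", "rush_touchdown", "yardline_100"]
rushing_df = rushes[columns].merge(rushing_df_probs, how="left", on="yardline_100")

print(rushing_df)
```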

Now we have a DataFrame that contains each rush attempt for the season by an RB, with each play assigned an expected touchdown value. All that's left to do now is to group by rusher id and add up the actual touchdowns and the expected values, right? Recall that at the beginning of this post we discussed the problem I originally ran into when running this sort of analysis. Some RBs would receive an expected TD value greater than 1 on a given drive, which doesn't make sense.

I considered different ways to approach this problem. One option was conditional probability, but since we already know the outcome of the events, it isn't the most appropriate tool. In my opinion, the best thing we can do is cap the per drive TD maximum at 1. This made the most sense to me, since if an RB gets 3 rushes from the 1 yardline, I would say he should have scored based on his opportunity.

So, we need to limit expected TDs to 1 per drive, or at the very least, very close to 1. That was the reason for assigning a drive id column above. First, we will add up expected values for each RB and drive, without this cap.

We can see here that we have 18 instances where an RB was assigned an expected TD value greater than 1 on a single drive. Let's cap all drives at 0.999.
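In pandas terms, the per-drive sum, the count of drives over 1, and the cap could be sketched like this. The rows are synthetic, so the count here is 1 rather than the 18 found in the real data, and the variable names per_drive and n_over_one are my own.

```python
import pandas as pd

# Synthetic carries (not real data): rusher A has two carries from the
# 1 yardline on the same drive.
rushing_df = pd.DataFrame({
    "rusher_id": ["A", "A", "A", "B"],
    "drive_id": ["g1_1", "g1_1", "g1_4", "g1_2"],
    "rush_touchdown": [0.0, 1.0, 0.0, 0.0],
    "probability_of_touchdown": [0.560390, 0.560390, 0.003111, 0.206349],
})

# Add up actual and expected TDs for each rusher on each drive, with no cap yet.
per_drive = rushing_df.groupby(["rusher_id", "drive_id"], as_index=False).agg(
    actual_touchdowns=("rush_touchdown", "sum"),
    expected_touchdowns=("probability_of_touchdown", "sum"),
)

# How many rusher-drive combinations exceed 1 expected TD?
n_over_one = int((per_drive["expected_touchdowns"] > 1).sum())

# Cap every drive at 0.999 expected TDs.
per_drive["expected_touchdowns"] = per_drive["expected_touchdowns"].clip(upper=0.999)

print(n_over_one)
print(per_drive)
```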

Now that we've gotten that out of the way, let's group by again, this time only by rusher id. We'll assign a column called positive_regression_candidate which will be True when expected touchdowns > actual touchdowns, indicating a player may be due for positive regression. The delta column we assign here will be the absolute difference between expected and actual touchdowns.

      rusher_id rusher_player_name  actual_touchdowns  expected_touchdowns  positive_regression_candidate     delta
73   00-0032764            D.Henry               10.0             7.339330                          False  2.660670
209  00-0036223           J.Taylor                6.0             7.021710                           True  1.021710
186  00-0035657           D.Harris                6.0             5.897961                          False  0.102039
189  00-0035664        D.Henderson                5.0             5.003799                           True  0.003799
108  00-0033856        L.Fournette                4.0             4.867322                           True  0.867322
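A table with the same columns as the one above could be built along these lines. Again this is a sketch on synthetic per-drive rows with invented players. The absolute value in delta and the sort by expected touchdowns are inferred from the output shown above, not taken from the original code.

```python
import pandas as pd

# Synthetic per-drive totals after the cap (not real data).
per_drive = pd.DataFrame({
    "rusher_id": ["A", "A", "B"],
    "rusher_player_name": ["X.Example", "X.Example", "Y.Example"],
    "actual_touchdowns": [1.0, 1.0, 0.0],
    "expected_touchdowns": [0.999, 0.25, 0.75],
})

# Season totals per rusher.
season = per_drive.groupby(
    ["rusher_id", "rusher_player_name"], as_index=False
)[["actual_touchdowns", "expected_touchdowns"]].sum()

# True when the player has scored fewer TDs than his opportunity suggests.
season["positive_regression_candidate"] = (
    season["expected_touchdowns"] > season["actual_touchdowns"]
)

# Size of the gap between expected and actual, regardless of direction.
season["delta"] = (season["expected_touchdowns"] - season["actual_touchdowns"]).abs()

season = season.sort_values("expected_touchdowns", ascending=False)
print(season)
```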

All that's left now is to visualize the results, but I encourage you to mess around with this data because it's interesting and useful on its own. I've added some comments to the code below to guide you through it once again.

The Henry injury is devastating, but let's use him as an example to interpret this plot. Unsurprisingly, Derrick Henry finds himself at the top of the list in both expected and actual touchdowns. The plot suggests that Henry is over performing, but as I mentioned at the beginning of the post, we are not taking into account each individual rusher's ability, and this is Derrick Henry we are talking about. If an average rusher had gotten the opportunities Henry has gotten this season, I might think that the TD production would not be sustainable, but there is no reason to think Henry was over performing and would have regressed to the mean had he stayed healthy.

Someone like Sam Darnold, on the other hand, is a huge negative regression candidate, since he ran in 5 TDs in the first 4 weeks of the season. He has yet to score a rushing touchdown since then (this analysis is taking place after week 7), which means we are already seeing a sort of regression to the mean. It's safe to say we do not expect Darnold to be the leading TD rusher (or anywhere near it, like he was after week 4).

I anticipate Miles Sanders will get it going, at least more than he has so far this season, since he is still without a TD. Don't expect him to be a top 10 RB (or even top 15), since Hurts will continue to sap his value, but with Sanders under performing his expected TD value by nearly 3, I'd wager he will find the endzone sooner rather than later.

That's all for this post. Thank you for reading, you guys are awesome!

Appendix

Here are all the rushers from this season that have at least 1 expected touchdown.

      rusher_id rusher_player_name  actual_touchdowns  expected_touchdowns  positive_regression_candidate     delta
73   00-0032764            D.Henry               10.0             7.339330                          False  2.660670
209  00-0036223           J.Taylor                6.0             7.021710                           True  1.021710
186  00-0035657           D.Harris                6.0             5.897961                          False  0.102039
189  00-0035664        D.Henderson                5.0             5.003799                           True  0.003799
108  00-0033856        L.Fournette                4.0             4.867322                           True  0.867322
102  00-0033553           J.Conner                8.0             4.865928                          False  3.134072
83   00-0033045          E.Elliott                5.0             4.715967                          False  0.284033
92   00-0033293            A.Jones                3.0             4.640432                           True  1.640432
117  00-0033906           A.Kamara                2.0             4.557704                           True  2.557704
235  00-0036389            J.Hurts                4.0             4.183223                           True  0.183223
259  00-0036893           N.Harris                3.0             4.103959                           True  1.103959
105  00-0033699           A.Ekeler                5.0             3.922145                          False  1.077855
116  00-0033897            J.Mixon                5.0             3.900342                          False  1.099658
148  00-0034791            N.Chubb                4.0             3.681646                          False  0.318354
198  00-0035831         J.Robinson                5.0             3.407411                          False  1.592589
120  00-0033923             K.Hunt                5.0             3.320475                          False  1.679525
265  00-0036924          Mi.Carter                3.0             3.311899                           True  0.311899
149  00-0034796          L.Jackson                2.0             3.289903                           True  1.289903
11   00-0027966           M.Ingram                1.0             3.248947                           True  2.248947
211  00-0036251             Z.Moss                3.0             3.147690                           True  0.147690
218  00-0036275            D.Swift                3.0             3.134956                           True  0.134956
130  00-0034301        Da.Williams                4.0             3.050942                          False  0.949058
223  00-0036328           A.Gibson                3.0             3.009944                           True  0.009944
173  00-0035243          M.Sanders                0.0             2.891143                           True  2.891143
79   00-0032972           D.Booker                2.0             2.780498                           True  0.780498
192  00-0035700           J.Jacobs                5.0             2.615657                          False  2.384343
114  00-0033893             D.Cook                2.0             2.585961                           True  0.585961
66   00-0032426          A.Collins                2.0             2.579420                           True  0.579420
273  00-0036997         J.Williams                1.0             2.455934                           True  1.455934
59   00-0032144           M.Gordon                3.0             2.444273                          False  0.555727
158  00-0034857            J.Allen                2.0             2.398869                           True  0.398869
52   00-0031806            M.Brown                1.0             2.370514                           True  1.370514
121  00-0033948         J.Williams                2.0             2.288534                           True  0.288534
172  00-0035228           K.Murray                2.0             2.277217                           True  0.277217
183  00-0035537          T.Johnson                1.0             2.227564                           True  1.227564
89   00-0033280        C.McCaffrey                1.0             2.185380                           True  1.185380
193  00-0035710            D.Jones                2.0             2.160797                           True  0.160797
241  00-0036555          C.Hubbard                2.0             2.136616                           True  0.136616
159  00-0034869          S.Darnold                5.0             2.040697                          False  2.959303
136  00-0034414            B.Scott                3.0             1.984017                          False  1.015983
84   00-0033077         D.Prescott                0.0             1.946874                           True  1.946874
262  00-0036906          K.Herbert                1.0             1.881730                           True  0.881730
155  00-0034844          S.Barkley                2.0             1.849759                          False  0.150241
56   00-0032063            M.Davis                1.0             1.842768                           True  0.842768
145  00-0034681          C.Edmonds                1.0             1.839042                           True  0.839042
77   00-0032950            C.Wentz                1.0             1.718546                           True  0.718546
152  00-0034816            R.Jones                1.0             1.711764                           True  0.711764
205  00-0036096           J.Taylor                2.0             1.666519                          False  0.333481
31   00-0030578        C.Patterson                2.0             1.617028                          False  0.382972
32   00-0030874         D.Williams                2.0             1.572870                          False  0.427130
70   00-0032602         J.McKissic                1.0             1.560870                           True  0.560870
191  00-0035685       D.Montgomery                3.0             1.543478                          False  1.456522
242  00-0036567         E.Mitchell                3.0             1.532005                          False  1.467995
104  00-0033594           C.Carson                3.0             1.495154                          False  1.504846
276  00-0037012            T.Lance                1.0             1.355950                           True  0.355950
268  00-0036971         T.Lawrence                2.0             1.324875                          False  0.675125
156  00-0034845           S.Michel                1.0             1.307333                           True  0.307333
72   00-0032741           P.Barber                1.0             1.269330                           True  0.269330
96   00-0033357             T.Hill                3.0             1.264787                          False  1.735213
264  00-0036919         K.Gainwell                2.0             1.233825                          False  0.766175
28   00-0030513           L.Murray                4.0             1.227994                          False  2.772006
42   00-0031345        J.Garoppolo                3.0             1.227769                          False  1.772231
74   00-0032780           J.Howard                2.0             1.200711                          False  0.799289
215  00-0036265           A.Dillon                0.0             1.188196                           True  1.188196
41   00-0031285          D.Freeman                2.0             1.188181                          False  0.811819
146  00-0034750            R.Penny                0.0             1.117933                           True  1.117933
177  00-0035311           M.Gaskin                0.0             1.114066                           True  1.114066
0    00-0019596            T.Brady                1.0             1.109926                           True  0.109926
175  00-0035261          T.Pollard                1.0             1.103899                           True  0.103899
161  00-0034972         A.Mattison                0.0             1.082761                           True  1.082761
34   00-0031045             C.Hyde                0.0             1.070865                           True  1.070865
174  00-0035250       D.Singletary                1.0             1.040830                           True  0.040830
185  00-0035628          D.Johnson                2.0             1.016086                          False  0.983914