2021 Rushing Bar Chart Race

In this blog post we're going to walk through how to create an animated bar chart that shows how the rushing leaders changed over the course of the 2021 season. The chart displays the season-long rushing leaders as of each week of the regular season. We can do this in three main steps: data retrieval, data manipulation, and plotting. Let's get straight into it.

Data Retrieval

The first phase is data retrieval. We are using play by play data from the 2021 season, which we pull with NFLfastpy. As always, start with the installation (if you don't have NFLfastpy already), and then use the load_pbp_data function to load the data. Play by play data contains one row for every play in the requested time period, so we'll load the 2021 data and use every rush from the entire season.
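Here is a minimal sketch of this step. It assumes the package installs under the name nflfastpy and that load_pbp_data takes a year argument; the load_season helper is a made-up name used only for illustration.

```python
# Install first if needed (in a notebook cell): !pip install nflfastpy


def load_season(year):
    """Return one season of play by play data, one row per play."""
    # Imported inside the function so the helper can be defined
    # even before the package has been installed.
    import nflfastpy as nfl

    return nfl.load_pbp_data(year=year)
```

Calling df = load_season(2021) and then df.head() should give a preview of the first five plays of the season.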

play_id game_id old_game_id home_team away_team season_type week posteam posteam_type defteam ... out_of_bounds home_opening_kickoff qb_epa xyac_epa xyac_mean_yardage xyac_median_yardage xyac_success xyac_fd xpass pass_oe
0 1 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 NaN NaN NaN ... 0 1 NaN NaN NaN NaN NaN NaN NaN NaN
1 40 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 TEN home ARI ... 0 1 0.000000 NaN NaN NaN NaN NaN NaN NaN
2 55 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 TEN home ARI ... 0 1 -1.399805 NaN NaN NaN NaN NaN 0.491433 -49.143299
3 76 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 TEN home ARI ... 0 1 0.032412 1.165133 5.803177 4.0 0.896654 0.125098 0.697346 30.265415
4 100 2021_01_ARI_TEN 2021091207 TEN ARI REG 1 TEN home ARI ... 0 1 -1.532898 0.256036 4.147637 2.0 0.965009 0.965009 0.978253 2.174652

5 rows × 372 columns

Data Manipulation

We have worked with play by play data in the past, and this time around we are going to focus on a few key columns. To get the NFL rushing leaders for each week of 2021 we need the rushing yards for each player, grouped by week. To do this we use the groupby method followed by sum. To see what the result looks like, I selected Derrick Henry's rushing yards in each game. While doing this I noticed that our data includes playoff weeks (Henry has a row for week 20), so we will filter those out in the next step. I always like to check that the data makes sense in the real-life context it comes from. In this case Henry's regular season rows run from week 1 through week 8, after which he got hurt, so this checks out.
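A rough sketch of the grouping step is below. The column names (rusher_id, rusher, week, yards_gained) are the ones that appear in the output in this section; the weekly_rushing_yards helper and the tiny sample frame are made up for illustration and stand in for the real play by play data.

```python
import pandas as pd


def weekly_rushing_yards(pbp):
    """Total rushing yards per player per week."""
    # Plays with no rusher have a missing key, and groupby drops
    # missing keys by default, so only rushing plays are summed.
    return pbp.groupby(
        ["rusher_id", "rusher", "week"], as_index=False
    )["yards_gained"].sum()


# Made-up sample: three Henry carries, one Ekeler carry, one non-rush play.
sample = pd.DataFrame({
    "rusher_id": ["00-1", "00-1", "00-1", "00-2", None],
    "rusher": ["D.Henry", "D.Henry", "D.Henry", "A.Ekeler", None],
    "week": [1, 1, 2, 1, 1],
    "yards_gained": [5.0, 12.0, 7.0, 3.0, 20.0],
})

weekly = weekly_rushing_yards(sample)
henry = weekly[weekly["rusher"] == "D.Henry"]
print(henry)
```

Filtering on the rusher column afterwards, as in the last two lines, is how a single player's week by week totals can be pulled out for a sanity check.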

     rusher_id   rusher   week  yards_gained
550  00-0032764  D.Henry     1          58.0
551  00-0032764  D.Henry     2         182.0
552  00-0032764  D.Henry     3         115.0
553  00-0032764  D.Henry     4         157.0
554  00-0032764  D.Henry     5         130.0
555  00-0032764  D.Henry     6         143.0
556  00-0032764  D.Henry     7          86.0
557  00-0032764  D.Henry     8          68.0
558  00-0032764  D.Henry    20          62.0

Next we're going to work on formatting the data to be plotted. The bar_chart_race package that we use for plotting expects a specific layout, so we need a few steps to get our data into that shape. First we make a pivot table so that the index on the left is the NFL week and each column is a player, with that player's rushing yards for the week as the values.
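One way to sketch the pivot step is shown here. Two things are assumptions on my part: the week <= 18 cutoff for dropping playoff weeks (the 2021 regular season ran 18 weeks), and aggfunc="sum", which adds rows together if two players happen to share the same abbreviated name. The to_week_by_player helper and the sample numbers are for illustration only.

```python
import pandas as pd


def to_week_by_player(weekly, last_week=18):
    """Pivot weekly totals into one row per week, one column per player."""
    # Keep regular season weeks only (assumed cutoff: week 18).
    regular = weekly[weekly["week"] <= last_week]
    return regular.pivot_table(
        index="week", columns="rusher", values="yards_gained", aggfunc="sum"
    )


# Small sample, including one playoff row (week 20) that should be dropped.
weekly = pd.DataFrame({
    "rusher": ["D.Henry", "D.Henry", "D.Henry", "A.Ekeler"],
    "week": [1, 2, 20, 1],
    "yards_gained": [58.0, 182.0, 62.0, 57.0],
})

wide = to_week_by_player(weekly)
print(wide)
```

Weeks in which a player has no row come out as NaN, which is why the pivoted frame is full of NaN values.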

rusher A.Abdullah A.Armah A.Bachman A.Brewer A.Brown A.Collins A.Dalton A.Dillon A.Dulin A.Ekeler ... T.Tremble T.Williams V.Jefferson W.Gallman W.Smallwood Z.Ertz Z.Jones Z.Moss Z.Pascal Z.Wilson
week
1 4.0 NaN NaN NaN 6.0 NaN NaN 19.0 NaN 57.0 ... NaN 65.0 NaN NaN NaN NaN NaN NaN NaN 2.0
2 NaN NaN NaN NaN NaN 25.0 NaN 18.0 NaN 56.0 ... NaN 77.0 NaN NaN NaN NaN NaN 26.0 NaN -1.0
3 24.0 5.0 NaN NaN 3.0 8.0 NaN 18.0 -7.0 55.0 ... 7.0 22.0 NaN NaN NaN NaN NaN 60.0 NaN 2.0
4 NaN 2.0 NaN 0.0 NaN 44.0 NaN 81.0 NaN 117.0 ... NaN NaN NaN 29.0 NaN NaN NaN 61.0 NaN -4.0
5 2.0 NaN NaN NaN NaN 47.0 NaN 30.0 NaN 66.0 ... NaN 6.0 NaN 2.0 NaN NaN NaN 37.0 NaN NaN

5 rows × 370 columns

Now we have to fill the NaN values with 0. The NaN values mark weeks in which a player had no rushes, which means 0 rushing yards for that week. Lastly, we take a cumulative sum down each column so that every row holds each player's season total through that week. We can do this with the cumsum() method.
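As a small sketch, these two steps chain together in one line. The sample frame here is hand-built for illustration, using the first three weekly values for A.Ekeler and Z.Moss from the pivot table in this post.

```python
import pandas as pd

nan = float("nan")

# Weekly rushing yards, weeks as rows and players as columns.
wide = pd.DataFrame(
    {"A.Ekeler": [57.0, 56.0, 55.0], "Z.Moss": [nan, 26.0, 60.0]},
    index=pd.Index([1, 2, 3], name="week"),
)

# Missing weeks count as 0 yards, then add up week over week per player.
cumulative = wide.fillna(0).cumsum()
print(cumulative)
```

The order matters: cumsum() skips NaN values and leaves them as NaN in the result, so filling with 0 first is what gives every player a value in every week.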

rusher A.Abdullah A.Armah A.Bachman A.Brewer A.Brown A.Collins A.Dalton A.Dillon A.Dulin A.Ekeler ... T.Tremble T.Williams V.Jefferson W.Gallman W.Smallwood Z.Ertz Z.Jones Z.Moss Z.Pascal Z.Wilson
week
1 4.0 0.0 0.0 0.0 6.0 0.0 0.0 19.0 0.0 57.0 ... 0.0 65.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
2 4.0 0.0 0.0 0.0 6.0 25.0 0.0 37.0 0.0 113.0 ... 0.0 142.0 0.0 0.0 0.0 0.0 0.0 26.0 0.0 -1.0
3 28.0 5.0 0.0 0.0 9.0 33.0 0.0 55.0 -7.0 168.0 ... 7.0 164.0 0.0 0.0 0.0 0.0 0.0 86.0 0.0 2.0
4 28.0 7.0 0.0 0.0 9.0 77.0 0.0 136.0 -7.0 285.0 ... 7.0 164.0 0.0 29.0 0.0 0.0 0.0 147.0 0.0 -4.0
5 30.0 7.0 0.0 0.0 9.0 124.0 0.0 166.0 -7.0 351.0 ... 7.0 170.0 0.0 31.0 0.0 0.0 0.0 184.0 0.0 0.0

5 rows × 370 columns

Next we cut down the size of the dataframe by keeping only the columns (players) that will actually show up in the bar chart. To do this we loop through the weeks and keep every player who is in the top 10 in at least one week. This step is not strictly necessary, but I thought it was best to clean up our dataframe and get it plot ready.
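The loop could look something like the sketch below. The keep_weekly_leaders helper is a hypothetical name, and the sample uses n=2 on four made-up players so the result is easy to check by hand.

```python
import pandas as pd


def keep_weekly_leaders(cumulative, n=10):
    """Keep only players who rank in the top n in at least one week."""
    leaders = []
    for week in cumulative.index:
        # Top n season totals as of this week.
        for player in cumulative.loc[week].nlargest(n).index:
            if player not in leaders:
                leaders.append(player)
    return cumulative[leaders]


cumulative = pd.DataFrame(
    {
        "A": [10.0, 20.0, 30.0],
        "B": [5.0, 25.0, 26.0],
        "C": [8.0, 9.0, 40.0],
        "D": [0.0, 1.0, 2.0],
    },
    index=pd.Index([1, 2, 3], name="week"),
)

top = keep_weekly_leaders(cumulative, n=2)
print(top)
```

Because players move in and out of the top 10 during the season, the result usually has more than 10 columns.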

rusher C.Edwards-Helaire A.Ekeler A.Kamara D.Cook M.Gordon N.Chubb D.Harris D.Montgomery A.Jones C.Edmonds ... N.Harris E.Elliott E.Mitchell C.Carson J.Mixon D.Henderson M.Ingram L.Fournette J.Taylor J.Robinson
week
1 43.0 57.0 83.0 61.0 101.0 83.0 100.0 108.0 9.0 63.0 ... 45.0 33.0 104.0 91.0 127.0 70.0 85.0 32.0 56.0 25.0
2 89.0 113.0 88.0 192.0 132.0 178.0 131.5 169.0 76.0 109.0 ... 83.0 104.0 146.0 122.0 196.0 123.0 126.0 84.0 83.0 72.0
3 189.0 168.0 177.0 192.0 192.0 262.0 145.5 203.0 158.0 135.0 ... 123.0 199.0 146.0 202.0 286.0 123.0 147.0 92.0 116.0 160.0
4 291.0 285.0 297.0 226.0 248.0 362.0 141.5 309.0 206.0 255.0 ... 185.0 342.0 146.0 232.0 353.0 212.0 171.0 184.0 167.5 238.0
5 304.0 351.0 368.0 226.0 282.0 523.0 199.5 309.0 309.0 270.0 ... 307.0 452.0 189.0 232.0 386.0 294.0 212.0 251.0 220.5 387.0

5 rows × 26 columns

Plotting

Finally we're ready to plot our data. To do this I found a package, bar_chart_race, that fits our use case perfectly. It takes in a dataframe that is indexed over time and gives the user the flexibility to change the time labels and transitions. First we install the package and then plot our data. I recommend playing around with some of the options to see how the plot reacts. If you are doing this in Google Colab you should see an MP4 file pop up in the file tab on the left with our final result.
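A minimal sketch of the plotting call is below. The make_race helper, the output filename, the title text, and the specific timing values are my own placeholders rather than the settings used for the final video, and writing an MP4 requires ffmpeg to be available.

```python
# Install first if needed (in a notebook cell): !pip install bar_chart_race


def make_race(leaders, filename="rushing_leaders_2021.mp4"):
    """Render the week by week leaders as an animated bar chart."""
    # Imported inside the function so the helper can be defined
    # even before the package has been installed.
    import bar_chart_race as bcr

    bcr.bar_chart_race(
        df=leaders,            # rows are weeks, columns are players
        filename=filename,     # where the video is written
        n_bars=10,             # show the top 10 at each point in time
        steps_per_period=20,   # frames used to move from one week to the next
        period_length=1000,    # milliseconds spent on each week
        title="2021 NFL Rushing Leaders",
    )
```

Raising steps_per_period makes the transitions smoother, and raising period_length slows the whole animation down, so those are good options to experiment with first.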

I was really happy with this result, and I think visualization is one of the most important parts of data science. I've watched it many times over and I notice something new every time. As I mentioned earlier, Derrick Henry dominated over the first 8 weeks and then got injured, which is really cool to see in this type of plot, and he still stayed in the top 10!

I hope you like this type of blog where we create a cool visual, and as always let me know if you have any questions.