2021 Rushing Bar Chart Race

In this blog post we're going to walk through how to create an animated bar chart that shows how the rushing leaders changed over the course of the 2021 season. The chart displays the season-long rushing leaders as of each week of the regular season. We can do this in three main steps: data retrieval, data manipulation, and plotting. Let's get straight into it.

Data Retrieval

The first phase is data retrieval. We are using play-by-play data from the 2021 season, which we pull with NFLfastpy. As always, start by installing the package (if you don't have NFLfastpy already), then use the load_pbp_data method to load the data. Play-by-play data contains a row for every play in the requested time period, so we'll load the 2021 season and use every rush from it.
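A minimal sketch of this step is below. It assumes the package installs with `pip install nflfastpy` and that `load_pbp_data` accepts a `year` argument; the helper names `load_season_pbp` and `rushing_plays` are made up for illustration, and the download needs network access.

```python
import pandas as pd


def load_season_pbp(year: int = 2021) -> pd.DataFrame:
    """Download one season of play-by-play data (requires network access)."""
    # Assumption: nflfastpy is installed and exposes load_pbp_data(year=...).
    import nflfastpy

    return nflfastpy.load_pbp_data(year=year)


def rushing_plays(pbp: pd.DataFrame) -> pd.DataFrame:
    """Keep only the plays that have a named rusher."""
    return pbp[pbp["rusher"].notna()]


# Typical use in a notebook:
# df = load_season_pbp(2021)
# df.head()
```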

	play_id	game_id	old_game_id	home_team	away_team	season_type	week	posteam	posteam_type	defteam	...	home_opening_kickoff	qb_epa	xyac_epa	xyac_mean_yardage	xyac_median_yardage	xyac_success	xyac_fd	xpass	pass_oe
0	1	2021_01_ARI_TEN	2021091207	TEN	ARI	REG	1	NaN	NaN	NaN	...	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	40	2021_01_ARI_TEN	2021091207	TEN	ARI	REG	1	TEN	home	ARI	...	1	0.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	55	2021_01_ARI_TEN	2021091207	TEN	ARI	REG	1	TEN	home	ARI	...	1	-1.399805	NaN	NaN	NaN	NaN	NaN	0.491433	-49.143299
3	76	2021_01_ARI_TEN	2021091207	TEN	ARI	REG	1	TEN	home	ARI	...	1	0.032412	1.165133	5.803177	4.0	0.896654	0.125098	0.697346	30.265415
4	100	2021_01_ARI_TEN	2021091207	TEN	ARI	REG	1	TEN	home	ARI	...	1	-1.532898	0.256036	4.147637	2.0	0.965009	0.965009	0.978253	2.174652

5 rows × 372 columns

Data Manipulation

We have worked with play-by-play data in the past, and this time around we are going to focus on a few key columns. To get the NFL rushing leaders for each week of 2021, we need each player's rushing yards grouped by week, which we get with the groupby method followed by sum. To see what the result looks like, I selected Derrick Henry's rushing yards in each game. While doing this I noticed that our data includes playoff weeks (the week 20 row below), so we will filter those out in the next step. I always like to check that the data makes sense in the real-life context it comes from: here Henry's regular-season rows run only from week 1 to week 8, after which he got hurt, so this checks out.
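The grouping step can be sketched as follows. The column names match the output shown in this post, but the frame here is a toy stand-in with made-up players and yardage, not the real data.

```python
import pandas as pd

# Toy stand-in for the play-by-play frame (values are invented).
pbp = pd.DataFrame({
    "rusher_id": ["00-A", "00-A", "00-A", "00-B", None],
    "rusher": ["A.Back", "A.Back", "A.Back", "B.Runner", None],
    "week": [1, 1, 2, 1, 1],
    "yards_gained": [5.0, 12.0, 3.0, 7.0, 20.0],
})

# Sum yards per player per week. Plays with no rusher have missing keys,
# and groupby drops those rows by default.
weekly = (
    pbp.groupby(["rusher_id", "rusher", "week"], as_index=False)["yards_gained"]
    .sum()
)

# Spot-check a single player, the same way the post inspects D.Henry.
print(weekly[weekly["rusher"] == "A.Back"])
```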

	rusher_id	rusher	week	yards_gained
550	00-0032764	D.Henry	1	58.0
551	00-0032764	D.Henry	2	182.0
552	00-0032764	D.Henry	3	115.0
553	00-0032764	D.Henry	4	157.0
554	00-0032764	D.Henry	5	130.0
555	00-0032764	D.Henry	6	143.0
556	00-0032764	D.Henry	7	86.0
557	00-0032764	D.Henry	8	68.0
558	00-0032764	D.Henry	20	62.0

Next we're going to format the data for plotting. The bar_chart_race package that we use for plotting requires a specific data layout, so it takes a few steps to get there. First we make a pivot table so that the index on the left is the NFL week and each column is a player, holding that player's rushing yards for the week.
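Here is one way the playoff filter and the pivot might look, on a toy frame with invented values. Filtering on `week <= 18` is an assumption about how to drop the playoffs (the 2021 regular season had 18 weeks); filtering on the `season_type` column before grouping would be another option.

```python
import pandas as pd

# Toy weekly totals; week 20 stands in for a playoff game.
weekly = pd.DataFrame({
    "rusher": ["A.Back", "A.Back", "B.Runner", "A.Back"],
    "week": [1, 2, 2, 20],
    "yards_gained": [17.0, 3.0, 7.0, 40.0],
})

# Keep regular-season weeks only.
regular = weekly[weekly["week"] <= 18]

# Wide layout: one row per week, one column per player.
wide = regular.pivot_table(
    index="week", columns="rusher", values="yards_gained", aggfunc="sum"
)
print(wide)
```

`aggfunc="sum"` is passed explicitly because `pivot_table` averages by default, which would matter if two players shared the same abbreviated name.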

rusher	A.Abdullah	A.Armah	A.Bachman	A.Brewer	A.Brown	A.Collins	A.Dalton	A.Dillon	A.Dulin	A.Ekeler	...	T.Tremble	T.Williams	V.Jefferson	W.Gallman	W.Smallwood	Z.Ertz	Z.Jones	Z.Moss	Z.Pascal	Z.Wilson
week
1	4.0	NaN	NaN	NaN	6.0	NaN	NaN	19.0	NaN	57.0	...	NaN	65.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.0
2	NaN	NaN	NaN	NaN	NaN	25.0	NaN	18.0	NaN	56.0	...	NaN	77.0	NaN	NaN	NaN	NaN	NaN	26.0	NaN	-1.0
3	24.0	5.0	NaN	NaN	3.0	8.0	NaN	18.0	-7.0	55.0	...	7.0	22.0	NaN	NaN	NaN	NaN	NaN	60.0	NaN	2.0
4	NaN	2.0	NaN	0.0	NaN	44.0	NaN	81.0	NaN	117.0	...	NaN	NaN	NaN	29.0	NaN	NaN	NaN	61.0	NaN	-4.0
5	2.0	NaN	NaN	NaN	NaN	47.0	NaN	30.0	NaN	66.0	...	NaN	6.0	NaN	2.0	NaN	NaN	NaN	37.0	NaN	NaN

5 rows × 370 columns

Now we have to fill the NaN values with 0. A NaN appears wherever a player had no rushes in a given week, which means he gained 0 rushing yards that week. Lastly, we take a cumulative sum down each column so that every player's rushing yards add up week over week. We can do this with the cumsum() method.
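On a small made-up frame, the fill-and-accumulate step looks like this:

```python
import pandas as pd

nan = float("nan")

# Toy wide frame: rows are weeks, columns are players (values invented).
wide = pd.DataFrame(
    {"A.Back": [17.0, nan, 30.0], "B.Runner": [nan, 7.0, 12.0]},
    index=pd.Index([1, 2, 3], name="week"),
)

# A missing week counts as 0 yards; cumsum then runs down each column,
# so every row holds the season total through that week.
cumulative = wide.fillna(0).cumsum()
print(cumulative)
```

The order matters: calling `cumsum()` first would leave NaN in the weeks a player sat out, because `cumsum` skips missing values but keeps them as NaN in the output.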

rusher	A.Abdullah	A.Armah	A.Bachman	A.Brewer	A.Brown	A.Collins	A.Dalton	A.Dillon	A.Dulin	A.Ekeler	...	T.Tremble	T.Williams	V.Jefferson	W.Gallman	W.Smallwood	Z.Ertz	Z.Jones	Z.Moss	Z.Pascal	Z.Wilson
week
1	4.0	0.0	0.0	0.0	6.0	0.0	0.0	19.0	0.0	57.0	...	0.0	65.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0
2	4.0	0.0	0.0	0.0	6.0	25.0	0.0	37.0	0.0	113.0	...	0.0	142.0	0.0	0.0	0.0	0.0	0.0	26.0	0.0	-1.0
3	28.0	5.0	0.0	0.0	9.0	33.0	0.0	55.0	-7.0	168.0	...	7.0	164.0	0.0	0.0	0.0	0.0	0.0	86.0	0.0	2.0
4	28.0	7.0	0.0	0.0	9.0	77.0	0.0	136.0	-7.0	285.0	...	7.0	164.0	0.0	29.0	0.0	0.0	0.0	147.0	0.0	-4.0
5	30.0	7.0	0.0	0.0	9.0	124.0	0.0	166.0	-7.0	351.0	...	7.0	170.0	0.0	31.0	0.0	0.0	0.0	184.0	0.0	0.0

5 rows × 370 columns

Next we shrink the dataframe by keeping only the columns (players) that will show up in the bar chart. To do this we loop through the weeks and keep only the players who are in the top 10 in any given week. This step is not strictly necessary, but I thought it was best to clean up our dataframe and get it plot-ready.
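One possible version of that loop is sketched below. The helper name `keep_ever_top_n` is made up, and it reads "top 10 at any given week" as "ranked in the top 10 of the cumulative totals in at least one week", which is an interpretation rather than the exact code behind the table.

```python
import pandas as pd


def keep_ever_top_n(cumulative: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Keep the columns that rank in the top n in at least one row (week)."""
    keep = []
    for _, totals in cumulative.iterrows():
        # nlargest returns the n biggest totals for this week.
        for player in totals.nlargest(n).index:
            if player not in keep:
                keep.append(player)
    return cumulative[keep]


# Toy example with n=1: A leads week 1, B leads weeks 2 and 3, C never leads.
toy = pd.DataFrame(
    {"A": [10.0, 10.0, 10.0], "B": [5.0, 12.0, 20.0], "C": [1.0, 2.0, 3.0]},
    index=pd.Index([1, 2, 3], name="week"),
)
print(keep_ever_top_n(toy, n=1).columns.tolist())
```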

rusher	C.Edwards-Helaire	A.Ekeler	A.Kamara	D.Cook	M.Gordon	N.Chubb	D.Harris	D.Montgomery	A.Jones	C.Edmonds	...	N.Harris	E.Elliott	E.Mitchell	C.Carson	J.Mixon	D.Henderson	M.Ingram	L.Fournette	J.Taylor	J.Robinson
week
1	43.0	57.0	83.0	61.0	101.0	83.0	100.0	108.0	9.0	63.0	...	45.0	33.0	104.0	91.0	127.0	70.0	85.0	32.0	56.0	25.0
2	89.0	113.0	88.0	192.0	132.0	178.0	131.5	169.0	76.0	109.0	...	83.0	104.0	146.0	122.0	196.0	123.0	126.0	84.0	83.0	72.0
3	189.0	168.0	177.0	192.0	192.0	262.0	145.5	203.0	158.0	135.0	...	123.0	199.0	146.0	202.0	286.0	123.0	147.0	92.0	116.0	160.0
4	291.0	285.0	297.0	226.0	248.0	362.0	141.5	309.0	206.0	255.0	...	185.0	342.0	146.0	232.0	353.0	212.0	171.0	184.0	167.5	238.0
5	304.0	351.0	368.0	226.0	282.0	523.0	199.5	309.0	309.0	270.0	...	307.0	452.0	189.0	232.0	386.0	294.0	212.0	251.0	220.5	387.0

5 rows × 26 columns

Plotting

Finally we're ready to plot our data. To do this I found a package, bar_chart_race, that fits our use case perfectly. It takes in a dataframe that is partitioned over time and gives the user flexibility to change timestamps and transitions. First we install the package and then plot our data. I recommend playing around with some of the options to see how the plot reacts. If you are doing this in Google Colab, you should see an MP4 file pop up in the Files tab on the left with our final result.
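A hedged sketch of the plotting call is below. It assumes the package is installed with `pip install bar_chart_race` and that ffmpeg is available for writing MP4 files; the wrapper name `render_race`, the output filename, the title, and the specific option values are illustrative choices, not the settings used for the video in this post.

```python
import pandas as pd


def render_race(cumulative: pd.DataFrame, filename: str = "rushing_race.mp4") -> None:
    """Write an animated bar chart race for a wide, cumulative dataframe."""
    # Imported here so the function can be defined without the package installed.
    import bar_chart_race as bcr

    bcr.bar_chart_race(
        df=cumulative,              # rows = weeks, columns = players
        filename=filename,          # .mp4 output needs ffmpeg
        n_bars=10,                  # show the top 10 at each step
        steps_per_period=20,        # frames between consecutive weeks
        period_length=1000,         # milliseconds per week
        period_fmt="Week {x:.0f}",  # label drawn on the chart
        title="2021 NFL Rushing Leaders",
    )


# Typical use in a notebook, with the frame built in the previous section:
# render_race(cumulative)
```

Raising `steps_per_period` makes the bars slide more smoothly, while `period_length` controls how long each week stays on screen.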

I was really happy with this result; I think visualization is one of the most important parts of data science. I've watched it many times over, and every time I notice something new. As I mentioned earlier, seeing Derrick Henry dominate over the first 8 weeks and then get injured is really cool in this type of plot, and he still stayed in the top 10!

I hope you like this type of blog where we create a cool visual, and as always let me know if you have any questions.

