Georgia Tech Football: Advanced Stats Week 2022 – Data and its Environs

Unlike Robert’s work this week, my contributions will be less focused on specific data and more on a meta-narrative about that data. This is where the debate about the current state of the college football analytics world begins and ends: what data we have, what types of data we have, and how we can access it.

This seems like a very basic thing, right? Of course data matters – no data means no nerds posting infographics on the Bird™️ app, and we just can’t have that. But there’s a nuance in this detail: data guides insight. You can slice and dice a spreadsheet any way you want to create a tweetable graphic, but:

  1. Well, you need the actual data table (the literal data), and
  2. You need to know what the spreadsheet contains (what kind of data you have).

Now, with that in mind, let’s get down to business.

How do we get the data?

This is one of the two tricky parts when it comes to creating the charts you see on Twitter (the other part is data cleaning). For college football specifically, there are several different types of data you might want to get:

  • Scores and tables
  • Team stats
  • Player stats
  • Play-by-play
  • Recruit visits

Some examples for you to see:

In general, these things can be found from the usual suspects, Football Reference among them, and even subscription sites like Pro Football Focus (PFF) and Sports Info Solutions (SIS) – all places where you can manually copy numbers and text to create a spreadsheet at your leisure.

But what if you wanted to automate this? After all, copying and pasting all of this stuff into Excel every time you want to create a chart or table is not exactly an efficient process.

Well, now you’re speaking my language – programming language, that is. With a little R or Python programming, one can create repeatable pipelines to access all of this data and build visualizations based on it. I’ll leave the details of this approach to the experts (for R and Python) if you’re interested, but in short, these tools tap (after some layers of abstraction) a public (but undocumented and mostly hidden) data feed that ESPN uses to populate its sites. The tools retrieve data and organize it into data frames (i.e., tables) that can be manipulated in code, and as part of that automation, they can clean and simplify the data so users can process it more easily. This preprocessing also means that these tools can layer supplementary information on top of the basics from ESPN to better aid user analysis (e.g., expected points added).
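To make the retrieve, organize, and clean steps a little more concrete, here is a minimal Python sketch. It does not call any real feed: the payload, its field names, and the expected points values are all invented for illustration and are not ESPN's actual format. The only real definition in it is that expected points added is the expected points after a play minus the expected points before it.

```python
import json

# Hypothetical payload: the field names and numbers below are invented for
# illustration and are NOT ESPN's actual feed format.
RAW_FEED = json.loads("""
{
  "drives": [
    {"offense": "Georgia Tech", "plays": [
      {"id": "101", "text": " Pass complete for 12 yards ", "yards": "12",
       "ep_before": 1.1, "ep_after": 2.3},
      {"id": "102", "text": "Rush for no gain", "yards": "0",
       "ep_before": 2.3, "ep_after": 1.8}
    ]},
    {"offense": "Opponent", "plays": [
      {"id": "201", "text": "Pass incomplete", "yards": null,
       "ep_before": 0.9, "ep_after": 0.4}
    ]}
  ]
}
""")


def flatten_plays(feed):
    """Turn the nested drive -> play structure into flat, table-like rows."""
    rows = []
    for drive_number, drive in enumerate(feed["drives"], start=1):
        for play in drive["plays"]:
            yards = play["yards"]
            rows.append({
                # Cleaning: cast string IDs and yardage to integers.
                "play_id": int(play["id"]),
                "drive": drive_number,
                "offense": drive["offense"],
                # Cleaning: trim stray whitespace from the description.
                "text": play["text"].strip(),
                # Assumption: missing yardage is treated as 0.
                "yards": int(yards) if yards is not None else 0,
                # Supplementary column: expected points added (after - before).
                "epa": round(play["ep_after"] - play["ep_before"], 2),
            })
    return rows


plays = flatten_plays(RAW_FEED)
for row in plays:
    print(row)
```

The R and Python packages mentioned above do far more than this, but the shape of the work is the same: nested feed in, tidy rows out, with a few derived columns added along the way.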

What data do we get?

Now that we have the data – whether by manual assembly or programmed pipelines – what have we actually retrieved? Intuitively, we’ll focus first on the data structure: do we have numeric or logical columns? Text columns? Maybe we have alphanumeric identifiers for teams or players that we need to match against full information records? These are all (relatively) easy questions to answer by looking at what is available to us.
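Those structural questions can be answered with a few lines of code. The sketch below is an assumption-laden illustration: the rows, the team identifiers (`T001`, `T999`), and the lookup table are hypothetical stand-ins, not values from any real data source. It reports which types appear in each column and then matches team identifiers against a fuller record, flagging any that fail to match.

```python
# Hypothetical play rows: column names and IDs are invented for illustration.
PLAYS = [
    {"play_id": 1, "team_id": "T001", "yards": 12, "scoring": False,
     "text": "Pass complete"},
    {"play_id": 2, "team_id": "T001", "yards": 0, "scoring": False,
     "text": "Rush for no gain"},
    {"play_id": 3, "team_id": "T999", "yards": 25, "scoring": True,
     "text": "Touchdown pass"},
]

# Hypothetical lookup table of full team records, keyed by identifier.
TEAMS = {
    "T001": {"name": "Georgia Tech", "conference": "ACC"},
}


def column_types(rows):
    """Report which Python types appear in each column."""
    types = {}
    for row in rows:
        for column, value in row.items():
            types.setdefault(column, set()).add(type(value).__name__)
    return {column: sorted(names) for column, names in types.items()}


def attach_team_names(rows, teams):
    """Add a team_name column; collect identifiers with no matching record."""
    matched, unmatched_ids = [], set()
    for row in rows:
        team = teams.get(row["team_id"])
        if team is None:
            unmatched_ids.add(row["team_id"])
        matched.append({**row, "team_name": team["name"] if team else None})
    return matched, sorted(unmatched_ids)


types_by_column = column_types(PLAYS)
joined, unmatched = attach_team_names(PLAYS, TEAMS)

print(types_by_column)
print(unmatched)
```

Keeping the unmatched identifiers around, rather than silently dropping those rows, makes it obvious when the lookup table is incomplete.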

But there is a descriptive way to address this question as well, so let’s take it on with an example: take a look at this Jahmyr Gibbs play from last year’s Kennesaw State game and make some mental notes of what’s going on:

Got those notes in mind? Good. Let’s compare them to the way ESPN describes this play in its play-by-play feed:

One sentence: that’s all we get. We only get the simplest explanation of the end state of this play. Other stat keepers may note the multiple broken tackles, but even given that, what we’re seeing here is the destination, not the journey, so to speak. We only get the events and the players involved in them (maybe a third if we’re lucky: the tackler), not the whole story. It’s like reading a book through CliffsNotes that are half redacted.

This is what I’m getting at when I encourage you to question the types of data you’re working with: do you only get the events and results, or do you get the whole picture? Twenty-two people are on the field at any given moment during a college football game: what are they doing? Where are they going on the field? To make a long series of rhetorical questions short: how do they contribute to the success or failure of that particular play for their team?

These are things you don’t get from the usual event data, and they are an important missing link in how we analyze the game. Now, you might offer that film study (such as that done by college football programs with armies of analysts) covers this gap, and you’d most likely be right. Studying the film reveals much of the process behind a play (coverages, formations, and routes) and provides the opportunity to explore alternative outcomes as an intellectual exercise (example: “What if Gibbs had cut back inside earlier?”).

But here’s the catch – while film study reveals more semantic concepts about the game, it suffers from two major flaws compared to the analysis of event data:

  1. It takes a long time to go through the same clip of the same play X number of times and jot down something different each time.
  2. Because it is time consuming, it is not scalable.

Let me frame this: what is the probability that Gibbs breaks those tackles based on the pursuit angles of the potential tacklers? Was there a different receiver with more separation from his defenders that Jordan Yates could have targeted on this play? What is the catch probability of that receiver based on said separation and his route? Based on the defensive line’s speed and trajectory, how much time did Yates have to make this throw? From the film, we now understand how we got from point A to point B, but what was the chance of that outcome? What other point Bs (points B?) did we not get to, and why? Bottom line: did Yates make the best throw here, or was there a better decision available in the time he had to make his read? There is no way we can answer these questions just by looking at the tape.
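As a rough illustration of how one of the questions above turns into a calculation, here is a minimal Python sketch of the separation question. It assumes a single snapshot of player positions in yards; the player labels and coordinates are hypothetical, not taken from any real tracking feed, and separation is defined here simply as the straight-line distance to the nearest defender. The probability questions (broken tackles, catch probability) would need models trained on many such snapshots and are not attempted here.

```python
import math

# Hypothetical single frame of tracking data: (x, y) positions in yards.
# Labels and coordinates are invented for illustration.
FRAME = [
    {"player": "WR1", "side": "offense", "x": 30.0, "y": 10.0},
    {"player": "WR2", "side": "offense", "x": 35.0, "y": 40.0},
    {"player": "RB1", "side": "offense", "x": 22.0, "y": 27.0},
    {"player": "CB1", "side": "defense", "x": 31.0, "y": 10.0},
    {"player": "CB2", "side": "defense", "x": 38.0, "y": 44.0},
    {"player": "LB1", "side": "defense", "x": 25.0, "y": 30.0},
]


def separations(frame):
    """Distance from each offensive player to his nearest defender."""
    defenders = [p for p in frame if p["side"] == "defense"]
    result = {}
    for player in frame:
        if player["side"] != "offense":
            continue
        result[player["player"]] = min(
            math.dist((player["x"], player["y"]), (d["x"], d["y"]))
            for d in defenders
        )
    return result


separation_by_player = separations(FRAME)
# The "most open" target is the one with the largest separation.
most_open = max(separation_by_player, key=separation_by_player.get)

for name, yards in separation_by_player.items():
    print(f"{name}: {yards:.2f} yards of separation")
print("Most open:", most_open)
```

Run this over every frame of every play and the scale argument becomes clear: the computer can answer the same question thousands of times in the time it takes to rewatch one clip.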

So, from the play-by-play data, we know what happened, and from the film, we know how it happened. Now, with improvements in computing and machine learning technologies, software can bring these two areas together: by employing player tracking data, analysts can empower coaches and field personnel with information on player tendencies, coverages, and scheme tendencies that would take years to glean via traditional film study. Now, it’s data analysts – and not just film junkies – who can analyze the intermediate states of plays and evaluate unrealized outcomes. A program that effectively uses this data can change the game (pun unintended) in how it prepares for its opponents and dissects its own performance: traditional schematic and semantic conversations can now become more process-driven, rather than strictly results-driven.

In short: the game is all about blocking and tackling, and the coach’s job is to find the little edges that improve their team’s ability to do these things and prevent their opponents from doing them. You take past data, learn from it, and adjust your plan – this iterative process is essential to the success of a football team. Every percentage point, every fraction, every significant figure: all of these things matter when 1) talent margins are tight in a competitive league and 2) a ball bouncing the wrong way at the wrong time causes someone (or multiple someones) to lose their job. Every piece of data is critical to improving team success. Now, companies have given programs* a virtual galleon full of player tracking data to comb through and find more ways to win.


* Well, if the program pays for it. You’ll notice that I haven’t touched much on paid data services like StatsBomb, Pro Football Focus (PFF), or Sports Info Solutions (SIS) when discussing which data sources to pull from. These companies are an important part of the data ecosystem, especially at the enterprise (read: FBS) scale, but based on what I’ve read and seen, PFF and SIS mainly provide structured data (read: what you can get from studying film) and metrics built on top of it. StatsBomb seems to be the only one that has figured out college-wide tracking data – more about them and their project (along with others) soon.