Background
I am a lover of soccer. Have been since I started playing goalie between two trees when I was 10 years old. I remember growing up watching A.C. Milan destroy opponents with finesse and skill (Sadly those days are past but I hope for a return someday).
I also love data. We are becoming all too aware as a society of how influential even the smallest amount of data can be. This is why data science as a general concept appeals to me so much. The science of data. Figuring out what data can show us and how it can guide us, often far better than we could on our own.
So I decided to see if I could put two of my loves together. Soccer is often described as a deeply strategic game, and I agree. Yes, there is individual skill, but even Barcelona or Real Madrid have to build a strategy around Messi or Ronaldo.
But what are these strategies? Are they simple or complex? Do managers have one they stick to or does it depend on one of many different factors? This is what I hoped to answer with data.
I wanted to see if a simple dataset of soccer statistics could help determine the types of strategies teams employ.
There were four main steps to determining this:
- Data Cleaning (finding the data from whatever sources are available and converting it into something usable for modeling)
- EDA (standing for Exploratory Data Analysis, which is looking at the data and seeing what changes must be made to ensure the final results won't be skewed by any abnormalities)
- Modeling (the piece where we choose a specific model to use and apply it to the data)
- Evaluation (looking at the results of the modeling piece and deciding how, if at all, it answers our initial question)
Let's see how it went!
Data Cleaning
Thankfully and unsurprisingly I wasn't the first person to think soccer data could hold some value (shocked face incoming).
The data I used came from many different sources around the web. Some were databases downloaded directly from the internet (in SQL, CSV, and JSON formats), others were 'scraped' manually (meaning I had to create a process to find the data I needed on the web and essentially copy and paste it into my own database), and others were requested through APIs (interfaces that let a program request data from a website, similar to downloading it directly).
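To give a rough idea of what the 'scraping' step looks like, here is a minimal sketch in Python using requests and BeautifulSoup. The URL and table layout below are placeholders, not my actual sources, which each needed their own parsing logic.

```python
# Minimal scraping sketch; the URL and table structure below are placeholders,
# not the actual sources used for the project.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://example.com/premier-league/results"  # hypothetical results page
soup = BeautifulSoup(requests.get(url).text, "html.parser")

rows = []
for tr in soup.select("table tr")[1:]:        # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 4:                       # keep only well-formed rows
        rows.append(cells)

games_raw = pd.DataFrame(rows, columns=["date", "home_team", "away_team", "score"])
```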
The first question was what scale to look at. Should I look at individual players? A whole season? Game by game? This was relatively easy to answer: there are very limited player-level statistics available, and anything higher level than game by game would create too small a dataset. Game by game it is.
The next piece was to get as much game-by-game information as possible, with some scope constraints. To begin with I wanted to focus on only one league, so I chose the English Premier League and only took games within that league. This was to make sure differences between leagues didn't create too much noise for my future models to work effectively. Using GitHub, Kaggle, and other websites, I was able to create a dataset of over 1600 games which included which stadium each game was played in, the teams, and a host of other soccer-specific metrics such as number of shots, possession (how much of the ball each team had during the game), passing accuracy, etc.
To get all these games into one dataset a lot of rearranging had to take place: making sure numeric columns were considered numeric, creating 'dummy' columns for categorical data (the models we will discuss later only take numerical data as inputs so categories were converted to columns with a 1 or 0 if the game had that information), merging datasets together, and other data cleaning or 'munging' techniques.
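A minimal sketch of what those cleaning steps look like in pandas is below; the file name and column names are made up for illustration and aren't my real schema.

```python
# Sketch of the cleaning steps with pandas; the file and column names here
# are illustrative, not the actual schema of my dataset.
import pandas as pd

games = pd.read_csv("premier_league_games.csv")   # hypothetical merged file

# Make sure numeric columns are actually numeric (unparseable values become NaN).
for col in ["home_shots", "away_shots", "home_possession", "away_possession"]:
    games[col] = pd.to_numeric(games[col], errors="coerce")

# Create 'dummy' columns for categorical data so the models only see numbers.
games = pd.get_dummies(games, columns=["stadium", "home_team", "away_team"], dtype=int)
```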
An interesting angle was to see if the weather had an effect. Do teams play differently when it is hot or cold? When it's raining? To include this I downloaded over ten years of weather data from Great Britain's weather stations, with readings in increments as fine as 20 minutes. This data included the temperature, wind speed, visibility, etc.
To merge this data with the soccer dataset, I used the latitude and longitude coordinates of the stadium where each game was played. I then selected the weather station closest to those coordinates and merged in its associated weather readings.
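Here is a rough sketch of that nearest-station matching using the haversine (great-circle) distance; the station data and column names below are made up for the example.

```python
# Sketch of matching each stadium to its nearest weather station by
# great-circle (haversine) distance; the station data below is illustrative.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

def nearest_station(stadium, stations):
    """Return the id of the weather station closest to a stadium."""
    d = haversine_km(stadium["lat"], stadium["lon"],
                     stations["lat"].values, stations["lon"].values)
    return stations.iloc[np.argmin(d)]["station_id"]

stadiums = pd.DataFrame({"stadium": ["Anfield"], "lat": [53.43], "lon": [-2.96]})
stations = pd.DataFrame({"station_id": ["A", "B"],           # made-up stations
                         "lat": [53.35, 51.48], "lon": [-2.28, -0.45]})

stadiums["station_id"] = stadiums.apply(nearest_station, axis=1, stations=stations)
# The games can then be merged with the weather data on station_id and kickoff time.
```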
EDA
Before choosing a model to find the kinds of groupings I was after (what I'll be calling the 'strategies'), I had to look at and visualize the data to make sure no oddities were included.
The graphs below show a sample of the type of things I looked at: did each stadium have enough games (sorry Cardiff fans, you just weren't in the Premier League long enough), was there enough actual data to use (e.g. precipitation had to be removed because, surprisingly, there wasn't much of it for my games, or more likely that data just wasn't available from the weather source I used), etc.
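These checks are mostly one-liners in pandas. A couple of examples, assuming a `games` table like the one from the cleaning sketch that still has its raw `stadium` column:

```python
# Two of the quick EDA checks described above (column names are assumptions).

# How many games does each stadium contribute? Thinly represented stadiums
# may need to be dropped or grouped.
print(games["stadium"].value_counts())

# What share of each column is missing? Mostly empty columns (precipitation,
# in my case) aren't worth keeping.
print(games.isna().mean().sort_values(ascending=False).head(10))
```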
EDA is a process which has no end. You can spend forever looking at different angles for the data, grouping things together differently, or creating different visualizations. So I had to stop myself at some point when I was comfortable with the data I had and move on to the modeling piece.
Except for one last thing....PCA.
Principal Component Analysis, or PCA, is the process of going from many variables to a few without losing too much of the information. The process essentially weights each original variable and combines those weights into new variables. The graph above shows how much of the variance in the old variables (shots, possession, etc.) is explained by the new variables. You can see that the first four new variables (called PC1 through PC4) explain ~25% of the variance in the old variables, each using a different set of weights.
The above is just a sample of how the old variables are weighted to create PC1. This process is great when you have a large number of variables for each data point (as I do) and want to make it easier for the modeling process. Models will have a much better time grouping four variables than 100.
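With scikit-learn this step is only a few lines. A minimal sketch, assuming the cleaned `games` table from the earlier sketches:

```python
# Minimal PCA sketch with scikit-learn; `games` is the cleaned table from the
# earlier sketches, and four components is the choice discussed above.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

numeric = games.select_dtypes("number")
scaled = StandardScaler().fit_transform(numeric)      # PCA is sensitive to scale

pca = PCA(n_components=4)
components = pca.fit_transform(scaled)                # PC1..PC4 values for every game

print(pca.explained_variance_ratio_.sum())            # share of the original variance retained
loadings = pd.DataFrame(pca.components_,
                        index=["PC1", "PC2", "PC3", "PC4"],
                        columns=numeric.columns)
print(loadings.loc["PC1"].sort_values())              # how each original variable feeds PC1
```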
The main two downsides here are: 25% is still a whole lot less than 100%, and we're now a step removed from our original data so explaining the final groupings (if there are any) will be harder. I hear that, but my approach does bear fruit (I promise!) and does better at explaining the data than if we had left all 100 variables in. Ideally we wouldn't have to do this but data science, and unsupervised learning especially, is all about what we can get out of the data.
Modeling
Now to the modeling. The main decision is which model to use for this process. Unsupervised learning (the branch of machine learning where there is no target variable to predict) has many algorithms which can create clusters from varying amounts of data. However, one key part of my question was that I had no prior idea of how many groups should come out: the strategies could be few or many and anywhere in between. Therefore I wanted a model which does not require setting the number of groups in advance. This led me to Density-Based Spatial Clustering of Applications with Noise, known as DBSCAN. It handles clusters of varying shapes and sizes well and explicitly accounts for noise.
However, one piece that must be created first is something called a 'distance matrix'. This is simply a square matrix which shows how close each data point is to every other based on some formula. Usually the Euclidean distance can be used when all the variables are continuous. Since my dataset includes different types of data (continuous, discrete, and the 'dummy' variables mentioned previously), another distance formula was needed: the Gower distance, which compares data points variable by variable and produces a 'similarity coefficient' between 0 and 1, 1 being identical and 0 being completely different.
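One way to build that matrix in Python is the third-party gower package (one implementation among several). A minimal sketch, assuming `games` is the cleaned, mixed-type game table from earlier:

```python
# Sketch of the Gower distance matrix using the third-party `gower` package;
# `games` stands in for the mixed-type game-level feature table.
import gower

# gower_matrix returns an n x n matrix of dissimilarities in [0, 1]
# (0 = identical games, 1 = completely different), i.e. 1 minus the similarity.
distance_matrix = gower.gower_matrix(games)
```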
The double-edged sword of unsupervised learning is that there is no clear way of determining how well your model works. This is great since there are many more options, but it does require a manual process for fine-tuning the inputs and parameters of the model. For example, when deciding which group a data point falls into, you can set the maximum distance allowed for a point to be part of a group. You can also set the minimum number of points that can be called a group. This requires trial and error to find a grouping that makes sense.
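In scikit-learn those two knobs are eps and min_samples, and the trial and error can be a simple loop over the precomputed Gower distances. The values tried below are just illustrative starting points:

```python
# Trying a few DBSCAN settings on the precomputed Gower distances;
# the eps and min_samples values below are illustrative, not tuned results.
from sklearn.cluster import DBSCAN

for eps in (0.05, 0.1, 0.2):                 # maximum distance within a group
    for min_samples in (5, 10, 20):          # minimum points to form a group
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="precomputed").fit_predict(distance_matrix)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 marks noise
        print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters")
```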
Another option I used was a process known as hierarchical clustering. This is another model that doesn't need the number of groups as input and usually produces good results, but it is super inefficient. Instead of passing through the data and constructing what it thinks are the best groups (like DBSCAN), hierarchical clustering starts either bottom up (every data point is its own group and the most similar groups are slowly merged) or top down (there is only one group....called Highlander :) which is iteratively split apart), building a whole tree of possible groupings. A visualization, known as a dendrogram, will show this process later on.
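A minimal sketch of the bottom-up version with SciPy, reusing the Gower distances from the sketch above (the choice of average linkage is my assumption here):

```python
# Agglomerative (bottom-up) hierarchical clustering on the Gower distances,
# plus the dendrogram mentioned above. Average linkage is one reasonable
# choice for a precomputed, non-Euclidean distance matrix.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

condensed = squareform(distance_matrix, checks=False)   # linkage wants the condensed form
Z = linkage(condensed, method="average")

dendrogram(Z, no_labels=True)
plt.xlabel("games")
plt.ylabel("distance")
plt.show()
```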
Evaluation
So how did DBSCAN do?
Well...as you can see from the above...not well. Even after adjusting the parameters, DBSCAN continually either grouped almost all the games into one 'strategy' or none at all, labeling each point its own strategy (which makes zero sense). For some reason DBSCAN just couldn't pick out any patterns within the data.
Crap. Maybe it isn't that easy to find strategies within soccer data and it's a much more complex process? But how did the hierarchical clustering do?
For that we can take a look at the dendrogram I mentioned previously.
Is this good? Bad? With hierarchical clustering it's not that straightforward. You can see above a tree that keeps splitting into more and more branches. The splits at each level are the many candidate groupings. We have to look at these different options and see which makes sense.
So I did, and I decided to use the groups at distance 60 on the dendrogram, yielding three different groups. The split above that was too high level and didn't make meaningful distinctions, and the splits below it produced more and more narrowly focused groups which only affected 10 or so games each. You could take one of those finer splits as your final groups instead, and that's a fair choice! This is where judgement and knowing your data is key. Do these groups make sense? Are they distinctive? Unsupervised learning doesn't give us a quick metric to base these answers on, so we just have to trust our knowledge. Or at least gamble on it.
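Cutting the tree at a fixed distance is one line with SciPy's fcluster. The threshold of 60 is the value read off my dendrogram, so on a differently scaled distance matrix (like the 0-to-1 Gower sketch above) it would need to be rescaled:

```python
# Cut the linkage tree at a fixed distance to get flat cluster labels.
# t=60 matches the dendrogram described above; adjust it if your distances
# live on a different scale (e.g. Gower distances fall between 0 and 1).
import pandas as pd
from scipy.cluster.hierarchy import fcluster

cluster_labels = fcluster(Z, t=60, criterion="distance")
print(pd.Series(cluster_labels).value_counts())   # how many games fall in each group
```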
So what are these three groups I discovered?
Easy! Makes sense right? ;-) Yeah, that's because I did that whole PCA thing before. So these are actually the values of those four new variables, which are combinations of all our original variables. Confusing, yes...but productive. Because now we have distinctive groups.
Using the weights within each of the four principal components, I was able to glean the overall differences in how the original data looked between each group. Below are how I like to describe them:
Cluster 1 or Barcelona: Strategy where the home team tries to dominate the game but at an even pace. This means keeping possession, not giving away fouls, but also not trying to kill the away team off immediately and potentially leaving room for counters. Slow build up, high possession stats, finesse over physicality.
Cluster 2 or Leicester City (I guess 2015-2016 Leicester specifically): The home team does not try to keep possession or control the game but attempts to burst the away team down before half time. It still tries not to give away fouls, but is more likely to be physical on defense and tries to reduce the number of chances for both teams. This is a counter-attacking strategy that gives up possession for speed, which is why killing off the game before half-time is important: the players won't be able to keep up those defensive responsibilities all game.
Cluster 3 or Atletico Madrid: This is the one strategy geared toward the away team, which tries to kill the game by making it more physical. That reduces the home team's possession and its ability to put the game away before half-time. These games have the highest number of fouls and cards (PC4), with limited home attacking stats (PC1) and overall shots (PC3). The away team ramps up the physicality, limits the home team's opportunities, and tries to take the few chances it gets to get ahead and stay ahead.
Are these completely different? No. There is obviously a lot of overlap in values, especially when our principal components only cover 25% of the original variance. But does it make sense? Yes. Do these groups cover the data well? Below is a 3-D (whooooah) visualization of all the games, with the first three principal component values as the axes (I wanted all four, but I'm not sure yet how to do a 4-D visualization).
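For reference, this kind of plot is only a few lines with matplotlib, reusing the PCA components and cluster labels from the sketches above:

```python
# 3-D scatter of the first three principal components, coloured by the
# hierarchical cluster labels from the earlier sketch.
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(components[:, 0], components[:, 1], components[:, 2],
           c=cluster_labels, cmap="viridis", s=10)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
plt.show()
```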
Not bad right? It seems that these groups do in some sense separate the games well into three distinctive groups. We did it!
Conclusion
Our original question was whether intuitive groupings, ones that could be translated into strategies, could be created from the soccer statistics available.
Given a bunch of assumptions, caveats, and some tweaking of our data, it looks like we can get some strategies out of this data.
Done!
Except not....remember that double-edged sword? Well here's the good part. We can work on this forever! There are still many things we can do:
- Expand our dataset to include other leagues and countries
- Compare strategies with managers to see if Arsene Wenger really doesn't know how to defend
- See if these strategies can help determine the scores or outcome of games (now moving into that predictive machine learning)
I want to tackle all of these and will be updating regularly as I find new interesting things and maybe not so interesting things. Thank you all for reading and feel free to ask any questions or comment on my git repo!