Project Background

Finding Replacement Players in the Transfer Market

Using K-means clustering in R to find younger, cheaper replacements based on statistical profiles

TL;DR

I used K-means clustering in R to find potential replacements for three aging Premier League players: Giroud, Fernandinho, and Vardy. By grouping younger players (born after 1997) based on their Goals+Assists and Expected Goals+Assists per 90 minutes, the algorithm identified which young talents have the most similar statistical profiles to these veterans. The analysis found Phil Foden, Ferran Torres, and Eddie Nketiah as the closest matches for Giroud.

The Objective

Football clubs constantly face the challenge of replacing aging players. Finding someone who can fill a specific role is difficult because raw stats alone do not tell the full story. A player can under- or overperform their underlying numbers, and overperformance in particular leads to inflated transfer fees for moves that are likely to fail.

Goal: Use clustering analysis to group players by statistical output, then identify which young players fall into the same cluster as established veterans.

Getting transfers right saves clubs millions. A data-driven approach can narrow down a shortlist, filter out numbers distorted by over- or underperformance, and give scouts a stronger assessment to work from.

Data & Context

Source: FBRef (free, publicly available football statistics)
Scope: Players born after 1997 in the same positions as the target players
Key Variables: Goals+Assists per 90, Expected Goals+Assists per 90
Limitations: Only attacking metrics were available. Defensive stats are missing, which affects players in non-attacking positions like Fernandinho

Metrics Analyzed

I focused on two key metrics that capture attacking output while also factoring in expected performance, plus two filters that define the player pool:

G+A / 90: Goals plus assists per 90 minutes played
xG+xA / 90: Expected goals plus expected assists per 90
Position: Forward or Midfielder filter
Birth Year: Born after 1997 for youth focus
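
To make the per-90 definitions concrete, here is a minimal sketch in base R. FBRef already reports these rates, so this step was likely not needed in the original analysis. The column names, player labels, and season totals below are assumptions made up for illustration.

```r
# Hypothetical season totals; labels and numbers are illustrative only.
players <- data.frame(
  player  = c("Player A", "Player B", "Player C"),
  minutes = c(1800, 900, 2700),
  goals   = c(8, 3, 12),
  assists = c(4, 2, 6),
  xg      = c(6.5, 3.4, 10.0),
  xa      = c(3.5, 1.6, 5.0)
)

# Convert a season total into a rate per 90 minutes played.
per90 <- function(total, minutes) total / minutes * 90

players$ga_per90   <- per90(players$goals + players$assists, players$minutes)
players$xgxa_per90 <- per90(players$xg + players$xa, players$minutes)

print(players[, c("player", "ga_per90", "xgxa_per90")])
```

Expressing both metrics per 90 minutes puts regular starters and rotation players on the same footing, which matters when young players with limited minutes are compared to established veterans.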

My Approach

I chose K-means clustering because it groups players by similarity across multiple variables at once. This reveals natural groupings in the data rather than relying on arbitrary cutoffs.

1. Filtered the dataset: Selected only players in the same position as each target and born after 1997, and included the veteran player for comparison.
2. Selected clustering variables: Extracted G+A/90 and xG+xA/90 for each player. These two metrics balance actual output with expected performance.
3. Determined optimal clusters: Used the elbow method (within-cluster sum of squares) to find that 2 clusters gave the best separation without overfitting.
4. Applied K-means algorithm: Ran the clustering with k=2 and visualized which players landed in the same group as the veterans.
5. Created labeled scatter plots: Built visualizations with player names so I could easily identify the closest statistical matches.
Key decision: I used only 2 clusters based on the elbow method results. More or fewer clusters would have led to either overfitting or underfitting.
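
The filtering and clustering steps can be sketched roughly as follows. This is not the original script: it uses base R rather than dplyr so it runs without extra packages, and the data frame, its column names, and all values are synthetic stand-ins for the FBRef export.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Synthetic stand-in for the FBRef data; every value here is made up.
players <- data.frame(
  player     = c("Veteran", paste("Young", 1:8), "Older 1", "Young MF"),
  position   = c(rep("FW", 10), "MF"),
  birth_year = c(1986, 1998, 1999, 2000, 1998, 2001, 1999, 2000, 1998, 1992, 1999),
  ga_per90   = c(0.56, 0.60, 0.52, 0.58, 0.15, 0.10, 0.20, 0.12, 0.18, 0.40, 0.30),
  xgxa_per90 = c(0.50, 0.55, 0.50, 0.48, 0.18, 0.12, 0.15, 0.14, 0.20, 0.35, 0.28)
)
target <- "Veteran"

# Filter: same position as the target and born after 1997, plus the target.
target_pos <- players$position[players$player == target]
pool <- players[players$player == target |
                (players$position == target_pos & players$birth_year > 1997), ]

# Keep only the two clustering variables.
features <- pool[, c("ga_per90", "xgxa_per90")]

# K-means with k = 2; nstart reruns the algorithm from several random starts.
fit <- kmeans(features, centers = 2, nstart = 25)
pool$cluster <- fit$cluster

# Young players that share the veteran's cluster.
target_cluster <- pool$cluster[pool$player == target]
matches <- pool$player[pool$cluster == target_cluster & pool$player != target]
print(matches)
```

Both features are per-90 rates on a similar scale, so this sketch skips standardisation. With variables on different scales, running `scale()` on the features first would be worth considering.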

Figure 1: Finding the Optimal Number of Clusters. The elbow appears at k=2, meaning two clusters capture most of the variance without creating artificial groupings.

Figure 2: Player Distribution by Performance Metrics. Players cluster naturally into two groups. Those near Giroud in the top right represent the closest statistical matches.

Figure 3: Close-Up View of Top Performers. Foden, Torres, and Nketiah appear closest to Giroud's position, making them the strongest candidates based on statistical similarity.
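
For readers unfamiliar with the elbow method behind Figure 1, a minimal sketch is shown here. It runs on synthetic data with two obvious groups, not the real player data, and the range k = 1 to 6 is an arbitrary choice for illustration.

```r
set.seed(1)

# Synthetic two-metric data with two clear groups; all values are made up.
features <- rbind(
  cbind(rnorm(20, mean = 0.55, sd = 0.04), rnorm(20, mean = 0.50, sd = 0.04)),
  cbind(rnorm(20, mean = 0.15, sd = 0.04), rnorm(20, mean = 0.15, sd = 0.04))
)
colnames(features) <- c("ga_per90", "xgxa_per90")

# Total within-cluster sum of squares (WSS) for each candidate k.
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})

# How much WSS falls each time one more cluster is added.
drops <- -diff(wss)
print(data.frame(k = ks, wss = round(wss, 4)))
print(round(drops, 4))
```

The elbow is usually judged by eye from a plot of `wss` against `ks`, for example with `plot(ks, wss, type = "b")`. It sits where the drops stop being large. On data this cleanly separated, the step from k=1 to k=2 dominates, but real player data rarely gives such a sharp bend.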

Key Outcomes

Three Replacement Candidates for Giroud

Phil Foden (0.65 G+A/90), Ferran Torres (0.49 G+A/90), and Eddie Nketiah (0.55 G+A/90) all clustered closest to Giroud's profile of 0.56 G+A/90.
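
Cluster membership alone does not rank candidates, so one possible way to quantify "closest" is the straight-line (Euclidean) distance to the veteran in the two-metric space. The original analysis appears to have judged this visually from the scatter plot, so treat this as an illustrative alternative. The player labels and values are synthetic.

```r
# Synthetic members of the veteran's cluster; labels and numbers are made up.
cluster_members <- data.frame(
  player     = c("Veteran", "Young 1", "Young 2", "Young 3"),
  ga_per90   = c(0.56, 0.60, 0.52, 0.70),
  xgxa_per90 = c(0.50, 0.55, 0.50, 0.45)
)

target <- cluster_members[cluster_members$player == "Veteran", ]
others <- cluster_members[cluster_members$player != "Veteran", ]

# Euclidean distance from each young player to the veteran.
others$distance <- sqrt((others$ga_per90 - target$ga_per90)^2 +
                        (others$xgxa_per90 - target$xgxa_per90)^2)

# Smallest distance first: the most similar statistical profile.
ranked <- others[order(others$distance), c("player", "distance")]
print(ranked)
```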

xG Reveals Overperformers

Foden's actual output (0.65) exceeds his expected output (0.52), suggesting he may be overperforming this season. Nketiah's numbers align more closely with expectations, making him a more reliable recommendation for scouts despite not being the highest performer in the cluster.
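
The over- or underperformance check is a simple subtraction of expected from actual output. A small sketch, using only Foden's figures quoted above and a hypothetical helper name:

```r
# Gap between actual and expected output per 90.
# Positive values mean the player produced more than the chances suggested.
overperformance <- function(ga_per90, xgxa_per90) ga_per90 - xgxa_per90

# Foden's G+A/90 and xG+xA/90 as quoted in the text.
foden_gap <- overperformance(0.65, 0.52)
print(round(foden_gap, 2))  # 0.13
```

A gap near zero, as described for Nketiah, suggests output that is in line with the quality of chances and therefore more likely to be sustained.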

Limited Defensive Insight

For Fernandinho, the analysis was less useful since the dataset lacked defensive metrics. Midfielders need different variables for accurate clustering.

Recommendation: Clubs should use this as a starting point for scouting, not a final answer. Nketiah appears the safest choice since his actual output matches his expected numbers, suggesting sustainable performance.

Reflection

This project was my first time experimenting with machine learning, and I would improve almost every aspect of it. In particular, the elbow method pointing to only 2 clusters suggests the model was not forming clusters the way I expected it to.

Future improvements would include improving the quality of my data, for example by adding more bespoke metrics for each position. I would also complete some feature selection analysis to see which variables have the most impact on clustering. Finally, I would experiment with different clustering algorithms beyond K-means to see if they yield better groupings.
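
As one example of trying another algorithm, the sketch below compares K-means with hierarchical clustering from base R. The choice of Ward linkage and the synthetic data are assumptions for illustration. This was not part of the original project.

```r
set.seed(7)

# Synthetic two-metric data with two loose groups; all values are made up.
features <- rbind(
  cbind(rnorm(15, 0.55, 0.04), rnorm(15, 0.50, 0.04)),
  cbind(rnorm(15, 0.15, 0.04), rnorm(15, 0.15, 0.04))
)
colnames(features) <- c("ga_per90", "xgxa_per90")

# K-means with two clusters, as in the analysis.
km_labels <- kmeans(features, centers = 2, nstart = 25)$cluster

# Hierarchical clustering with Ward linkage, cut into the same number of groups.
hc <- hclust(dist(features), method = "ward.D2")
hc_labels <- cutree(hc, k = 2)

# Cross-tabulate the two labelings. Label numbers are arbitrary, so full
# agreement shows up as exactly one non-zero cell per row.
agreement <- table(kmeans = km_labels, hclust = hc_labels)
print(agreement)
```

If the two methods disagree on many players in real data, that is a hint the groupings are not very stable and that extra variables may be needed.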

Tools & Technologies

R, K-means Clustering, ggplot2, dplyr, Elbow Method