Using K-means clustering in R to find younger, cheaper replacements based on statistical profiles
I used K-means clustering in R to find potential replacements for three aging Premier League players: Giroud, Fernandinho, and Vardy. By grouping younger players (born after 1997) based on their Goals+Assists and Expected Goals+Assists per 90 minutes, the algorithm identified which young talents have the most similar statistical profiles to these veterans. The analysis found Phil Foden, Ferran Torres, and Eddie Nketiah as the closest matches for Giroud.
Football clubs constantly face the challenge of replacing aging players. Finding someone who can fill a specific role is difficult because raw stats alone do not tell the full story. A player can underperform or overperform relative to their underlying numbers, which leads to inflated transfer fees for signings that are likely to fail.
Goal: Use clustering analysis to group players by statistical output, then identify which young players fall into the same cluster as established veterans.
Getting transfers right saves clubs millions. A data-driven approach can narrow down a shortlist, discount output that is running above or below expectation, and give scouts a stronger basis for assessment.
I focused on two key metrics that capture attacking output while also factoring in expected performance: Goals + Assists per 90 minutes (G+A/90) and Expected Goals + Assists per 90 minutes.
I chose K-means clustering because it groups players by similarity across multiple variables at once. This reveals natural groupings in the data rather than relying on arbitrary cutoffs.
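The clustering step can be sketched roughly as follows. This is a minimal illustration, not the original analysis code: the data is synthetic, the column names (`ga_per90`, `xga_per90`, `birth_year`), the choice of 3 clusters, and the way the "veteran" row is picked are all assumptions made for the example. Scaling the features and ranking candidates by distance are also additions that the original write-up does not confirm.

```r
# Synthetic, illustrative data only; NOT the dataset used in the analysis.
set.seed(42)
n <- 60
players <- data.frame(
  player     = paste0("Player_", seq_len(n)),           # hypothetical names
  birth_year = sample(1985:2002, n, replace = TRUE),
  ga_per90   = round(runif(n, 0.05, 0.90), 2)           # Goals + Assists per 90
)
# Expected output tracks actual output with some noise (assumption)
players$xga_per90 <- round(pmax(players$ga_per90 + rnorm(n, 0, 0.08), 0), 2)

# Scale both metrics so neither dominates the Euclidean distance
features <- scale(players[, c("ga_per90", "xga_per90")])

# centers = 3 is an arbitrary choice for this sketch;
# nstart = 25 reruns K-means from several random starts and keeps the best fit
fit <- kmeans(features, centers = 3, nstart = 25)
players$cluster <- fit$cluster

# Treat the oldest player as the veteran to replace (hypothetical choice)
veteran_idx     <- which.min(players$birth_year)
veteran_cluster <- players$cluster[veteran_idx]

# Young players (born after 1997) that share the veteran's cluster
is_candidate <- players$cluster == veteran_cluster & players$birth_year > 1997
candidates   <- players[is_candidate, ]

# Rank candidates by distance to the veteran in the scaled feature space
diffs <- sweep(features[is_candidate, , drop = FALSE], 2, features[veteran_idx, ])
candidates$distance <- sqrt(rowSums(diffs^2))
candidates <- candidates[order(candidates$distance), ]

print(head(candidates, 5))
```

Sharing a cluster only says two players are broadly similar, so sorting by distance inside the cluster is one way to turn the grouping into a ranked shortlist.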
Phil Foden (0.65 G+A/90), Ferran Torres (0.49 G+A/90), and Eddie Nketiah (0.55 G+A/90) all clustered closest to Giroud's profile of 0.56 G+A/90.
Foden's actual output (0.65) exceeds his expected output (0.52), suggesting he may be overperforming this season. Nketiah's numbers align more closely with expectations, making him a more reliable recommendation for scouts despite not being the highest performer in the cluster.
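The gap between actual and expected output can be made explicit with a small calculation. The sketch below uses only the two Foden figures quoted above; the helper name `performance_gap` is hypothetical and introduced just for illustration.

```r
# Hypothetical helper: positive values mean output above expectation
performance_gap <- function(actual, expected) actual - expected

# Figures quoted in the text for Foden (per 90 minutes)
foden_actual   <- 0.65
foden_expected <- 0.52

gap   <- performance_gap(foden_actual, foden_expected)
ratio <- foden_actual / foden_expected

print(round(gap, 2))    # 0.13 above expected output per 90
print(round(ratio, 2))  # 1.25, i.e. about 25% above expectation
```

A gap close to zero is what "aligns with expectations" means in practice, which is the property that makes a player's output look more sustainable.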
For Fernandinho, the analysis was less useful since the dataset lacked defensive metrics. Midfielders need different variables for accurate clustering.
Recommendation: Clubs should use this as a starting point for scouting, not a final answer. Nketiah appears the safest choice since his actual output matches his expected numbers, suggesting sustainable performance.
This project was my first time experimenting with machine learning, and I would improve almost every aspect of it. In particular, the elbow method indicated that, with only 2 clusters, the model was not grouping players the way I thought it was.
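For reference, an elbow check in R usually looks something like the sketch below. It runs on synthetic data with assumed column names and an arbitrary range of k from 1 to 8, so it shows the mechanics rather than reproducing the result from this project.

```r
# Synthetic, illustrative data only; NOT the dataset used in the analysis.
set.seed(1)
x <- scale(cbind(
  ga_per90  = runif(80, 0, 1),   # assumed column name
  xga_per90 = runif(80, 0, 1)    # assumed column name
))

# Total within-cluster sum of squares for each candidate number of clusters
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is the k after which adding clusters stops reducing wss sharply
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

If the curve has no clear bend, that is itself a sign that the chosen features do not separate players into well-defined groups.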
Future improvements would include improving the quality of my data, for example by adding more bespoke metrics for each position. I would also carry out feature selection to see which variables have the most impact on the clustering. Finally, I would experiment with clustering algorithms beyond K-means to see if they yield better groupings.
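As one possible starting point for trying another algorithm, the sketch below compares K-means with hierarchical clustering from base R. The data is synthetic, the column names and the choice of 3 clusters are assumptions, and Ward linkage is just one of several options that could be tried.

```r
# Synthetic, illustrative data only; NOT the dataset used in the analysis.
set.seed(7)
x <- scale(cbind(
  ga_per90  = runif(50, 0, 1),   # assumed column name
  xga_per90 = runif(50, 0, 1)    # assumed column name
))

# K-means with an arbitrary k = 3
km_clusters <- kmeans(x, centers = 3, nstart = 25)$cluster

# Hierarchical clustering on Euclidean distances with Ward linkage,
# then cut the tree into the same number of groups
hc          <- hclust(dist(x), method = "ward.D2")
hc_clusters <- cutree(hc, k = 3)

# Cross-tabulate the two assignments to see how far they agree
# (cluster labels are arbitrary, so look at the pattern, not the diagonal)
agreement <- table(kmeans = km_clusters, hclust = hc_clusters)
print(agreement)
```

If the two methods put the same players together, the grouping is less likely to be an artifact of one algorithm.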