Premier League Prediction

Predicting the Premier League Season Finish

Using bookmaker odds and 50,000 Monte Carlo simulations to forecast final standings

TL;DR

By combining live betting odds with Monte Carlo simulation, I modelled 50,000 possible end-of-season outcomes for every Premier League club based on their next two fixtures. The simulation confirmed Arsenal as near-certain champions (100%) with Manchester City locked into second (99%), while revealing a much tighter battle in the middle of the table and a two-horse relegation fight at the bottom, where Wolves and Burnley each finish in one of the two relegation spots in 93.8% of simulations.

The Objective

Late in a Premier League season, the title race and relegation battle are often mathematically tight. Points gaps are small, remaining fixtures vary in difficulty, and a single result can reshape the entire table. Standard league table views tell you where teams stand right now — they don't tell you where they are likely to end up.

Goal: Build a simulation-based model that uses current league position and upcoming match odds to generate a probability distribution over every possible finishing position for all 20 clubs.

Rather than making a single deterministic prediction, the goal was to produce a full probability matrix — answering questions like "what is the chance Tottenham get relegated?" or "how likely is a top-four finish for Aston Villa?" in a way that a simple league table cannot.

Data & Context

Source 1: BBC Sport — live Premier League table scraped via BeautifulSoup, capturing current position, points, wins, draws, losses, and goal difference for all 20 clubs
Source 2: The Odds API — upcoming match odds (decimal format) from UK bookmakers, fetched for each team's next two fixtures, covering win, draw, and lose markets
Scope: All 20 Premier League clubs; odds averaged across multiple UK bookmakers to reduce individual bookmaker bias
Limitations: Only the next two fixtures are simulated — teams with three or more games remaining introduce uncertainty not captured here; odds shift in real time as news breaks
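The odds collection can be sketched as below. The endpoint path, query parameters, and response shape follow The Odds API's public v4 conventions as best I recall them (verify against the current documentation before use); the sample event, bookmaker keys, and all prices are invented for illustration:

```python
import requests

ODDS_API_URL = "https://api.the-odds-api.com/v4/sports/soccer_epl/odds"

def fetch_epl_odds(api_key):
    """Fetch decimal head-to-head odds for upcoming EPL fixtures."""
    params = {"apiKey": api_key, "regions": "uk",
              "markets": "h2h", "oddsFormat": "decimal"}
    resp = requests.get(ODDS_API_URL, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

def average_odds(event, outcome_name):
    """Average one outcome's decimal odds across all listed bookmakers."""
    prices = [o["price"]
              for bm in event["bookmakers"]
              for mkt in bm["markets"] if mkt["key"] == "h2h"
              for o in mkt["outcomes"] if o["name"] == outcome_name]
    return sum(prices) / len(prices)

# Illustrative event in the v4 response shape (all numbers invented):
sample_event = {
    "home_team": "Arsenal", "away_team": "Everton",
    "bookmakers": [
        {"key": "bookie_a", "markets": [{"key": "h2h", "outcomes": [
            {"name": "Arsenal", "price": 1.40},
            {"name": "Everton", "price": 8.00},
            {"name": "Draw", "price": 5.00}]}]},
        {"key": "bookie_b", "markets": [{"key": "h2h", "outcomes": [
            {"name": "Arsenal", "price": 1.44},
            {"name": "Everton", "price": 7.50},
            {"name": "Draw", "price": 4.80}]}]},
    ],
}

print(average_odds(sample_event, "Arsenal"))  # mean of 1.40 and 1.44 -> 1.42
```

Averaging across bookmakers before converting to probabilities is what dampens any single firm's pricing bias.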

My Approach

I chose Monte Carlo simulation because it naturally handles uncertainty without forcing a single deterministic outcome. Rather than predicting one result per match, the model runs thousands of possible season endings and reports how often each scenario occurs.

1. Collect live data. The current league table was scraped from the BBC Sport website, capturing each club's points tally and position in real time. Upcoming match odds were pulled from The Odds API — specifically the head-to-head market for each team's next two fixtures, averaged across all available UK bookmakers.
2. Convert odds to implied probabilities. Decimal odds were inverted (probability = 1 ÷ odds) to produce raw implied probabilities for a win, draw, or loss in each game. Because bookmakers build in a margin (overround), the three raw probabilities for any match always sum to more than 100%.
3. Normalise to remove the bookmaker margin. Each team's raw win, draw, and lose probabilities were divided by their total sum, producing a normalised set of probabilities that correctly sum to 100%. This strips out the bookmaker's profit margin and gives a fairer estimate of the true match probability.
4. Run 50,000 season simulations. In each simulation, every team's two remaining fixtures were independently simulated using their normalised probabilities — drawing a random number and comparing it against the win/draw/lose thresholds to determine the result. Points were added accordingly (3 for a win, 1 for a draw, 0 for a loss) and the table was re-ranked by final points total.
5. Aggregate into a probability matrix. Across all 50,000 simulations, the frequency with which each team finished in each position (1st through 20th) was tallied and converted to a percentage. This produced a 20×20 probability heatmap — the core output of the analysis.
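Steps 2 through 5 can be sketched end-to-end as below. This is a minimal illustration rather than the production code: the three-team league, its points, and its odds are invented, and ties on points are broken arbitrarily instead of by goal difference:

```python
import random
from collections import defaultdict

def normalised_probs(win_odds, draw_odds, lose_odds):
    """Invert decimal odds and strip out the bookmaker overround."""
    raw = [1 / win_odds, 1 / draw_odds, 1 / lose_odds]
    total = sum(raw)  # > 1.0 because of the bookmaker margin
    return [p / total for p in raw]

def simulate_fixture(probs, rng):
    """Return points from one simulated match: 3 win, 1 draw, 0 loss."""
    r = rng.random()
    if r < probs[0]:
        return 3
    if r < probs[0] + probs[1]:
        return 1
    return 0

def run_simulations(teams, n_sims=50_000, seed=42):
    """teams maps name -> (current_points, list_of_fixture_probs).
    Returns name -> {finishing_position: probability}."""
    rng = random.Random(seed)
    counts = defaultdict(lambda: defaultdict(int))
    for _ in range(n_sims):
        finals = {name: pts + sum(simulate_fixture(p, rng) for p in fixtures)
                  for name, (pts, fixtures) in teams.items()}
        # Rank by points only; the real league breaks ties on goal difference.
        ranked = sorted(finals, key=finals.get, reverse=True)
        for pos, name in enumerate(ranked, start=1):
            counts[name][pos] += 1
    return {name: {pos: c / n_sims for pos, c in by_pos.items()}
            for name, by_pos in counts.items()}

# Toy three-team league with invented points and odds:
teams = {
    "Leaders":    (80, [normalised_probs(1.50, 4.50, 7.00)] * 2),
    "Chasers":    (75, [normalised_probs(2.00, 3.60, 4.00)] * 2),
    "Strugglers": (30, [normalised_probs(5.00, 4.00, 1.70)] * 2),
}
matrix = run_simulations(teams)
print(f"Leaders finish 1st in {matrix['Leaders'][1]:.1%} of simulations")
```

The real analysis applies the same loop to all 20 clubs and renders the resulting 20×20 matrix as a seaborn heatmap.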
Key decision: Using bookmaker odds as the probability input rather than building a statistical model from scratch. Betting markets are highly efficient — they already encode recent form, injuries, home/away advantage, squad depth, and market consensus into a single number. For a short-horizon simulation (2 games), this is a robust and well-calibrated starting point that avoids the complexity of feature engineering a predictive model.
Premier League season finish probability heatmap
Figure 1: Finishing Position Probability Heatmap (50,000 simulations). Each cell shows the percentage chance of a team finishing in that league position. The heatmap uses a magma colour scale, with brighter cells indicating higher probability. The diagonal structure is clear for the top and bottom clubs, reflecting near-certainty in their final positions, while mid-table teams show broader, flatter distributions spanning several positions.

Findings

Title Race: Effectively Settled

Arsenal finish 1st in 100% of simulations, with Manchester City at 99% for 2nd place. With only two games remaining and a sufficient points gap, no realistic combination of results changes the top two — the title is Arsenal's to lose.

The Real Battle: 3rd to 5th

Manchester United (68.5% for 3rd) and Aston Villa (67.8% for 4th) are locked in a tight contest for European spots. Liverpool sit at 75.8% for 5th but face a 17.2% chance of slipping to 6th — a result that could have significant implications for European competition qualification.

Tottenham in Danger

The simulation places Tottenham at 53.8% to finish 17th and 30.1% to finish 18th — a combined 83.9% chance of ending the season in one of those two positions. Despite their historical stature, the odds for their remaining fixtures reflect a team in serious form trouble heading into the final weeks.

Relegation Resolved at the Bottom

Wolverhampton Wanderers and Burnley each finish in the bottom two (19th or 20th) in 93.8% of simulations. Their relegation is not yet mathematically confirmed, but the model treats it as near-certain given remaining fixture difficulty.

Bookmaker Overround: ~5–8%

Before normalisation, raw implied probabilities for each match summed to roughly 105–108%, reflecting the typical bookmaker margin. Stripping this out meaningfully shifts win probabilities for underdogs, making the simulation fairer and better calibrated than using raw odds directly.
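A quick worked example makes the normalisation concrete. The odds below are invented but sit in the typical range for a favourite/draw/underdog market:

```python
# Illustrative decimal odds for one match (invented numbers):
win, draw, lose = 1.55, 4.00, 6.00

raw = [1 / win, 1 / draw, 1 / lose]
overround = sum(raw)                 # ~1.062, i.e. a ~6.2% bookmaker margin
fair = [p / overround for p in raw]  # now sums to exactly 1.0

print(f"raw sum: {overround:.3f}")
print(f"fair probabilities: {[round(p, 3) for p in fair]}")
```

Note how the underdog's probability shrinks the most in relative terms once the margin is stripped out, which is why skipping this step systematically distorts the simulation.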

Simulation Stability at 50,000 Runs

Running 50,000 iterations ensures the probability estimates are stable to within roughly ±0.5 percentage points. Early test runs at 1,000 and 10,000 simulations showed noticeable variance in mid-table probabilities; 50,000 produced consistent, reproducible results across runs.
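The ±0.5 percentage-point figure is consistent with the binomial standard error of a Monte Carlo estimate, sqrt(p(1-p)/n). A quick check at the worst case, p = 0.5:

```python
import math

def mc_standard_error(p, n):
    """Standard error of a probability estimated from n simulations."""
    return math.sqrt(p * (1 - p) / n)

for n in (1_000, 10_000, 50_000):
    se = mc_standard_error(0.5, n)  # p = 0.5 maximises the error
    print(f"n={n:>6}: 95% interval ~ +/-{1.96 * se * 100:.2f} pp")
```

At 50,000 runs the 95% interval is about ±0.44 percentage points, matching the observed stability; at 1,000 runs it is roughly ±3 points, which explains the noisy mid-table estimates in the early test runs.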

Recommendation: For fans, analysts, or clubs monitoring the table, this approach provides a more honest view of uncertainty than headline points gaps alone. A team four points behind with a favourable fixture might have a 35% chance of overtaking — or only 5% — and the simulation quantifies that distinction clearly.

Reflection

This project reinforced how powerful Monte Carlo methods are for communicating uncertainty. Rather than stating "Arsenal will win the league," the model produces a distribution that is both more honest and more informative. The normalisation step — removing the bookmaker overround — was a small but important detail that required understanding why raw implied probabilities can't be used directly.

If I revisited this analysis, I would extend it to simulate all remaining fixtures in the season rather than just the next two, and would incorporate goal difference tiebreakers for cases where teams are level on points. A natural extension would be to re-run the simulation daily as odds update and new results come in — turning this into a live season tracker rather than a single snapshot.

Tools & Technologies

Python · pandas · BeautifulSoup · requests · The Odds API · Monte Carlo Simulation · seaborn · Probability Normalisation · Web Scraping