Bayesian A/B Testing

A beginner's guide — from "what even is an A/B test?" to making confident decisions from data

TL;DR

Bayesian A/B testing replaces cryptic p-values with a direct answer: "there's a 94% probability that Version B is genuinely better." By combining prior beliefs with observed data, it produces a posterior distribution — a full picture of all the plausible truths your data supports. This guide walks through every step: the four core concepts, how to design and run a test properly, how to read the results, and the five mistakes that trip up even experienced practitioners.

1 — Why A/B Testing Exists

Imagine you run an online shop. Your "Buy Now" button is currently grey. Someone on your team says, "I bet if we made it orange, more people would click it." Should you just switch? What if it makes things worse? How would you even know?

This is exactly the problem A/B testing solves: making decisions based on evidence, not gut feeling.

The Core Problem: You Can't Know Without Trying

Every day, companies make decisions — which headline to use, which button colour to pick, which email subject line to send. Most people just guess or go with what "feels right." But feelings are unreliable. An A/B test replaces guessing with a controlled experiment.

🛒
Real scenario: Your shop gets 1,000 visitors a day. Your grey button converts at 3% (30 purchases). You think orange might be better — but you're not sure. If you're wrong and you switch, you might lose sales. If you're right and you don't switch, you're leaving money on the table.

How an A/B Test Works

You split your visitors into two groups — randomly, like flipping a coin for each visitor:

 1,000 visitors arrive today
         │
         ▼
    ┌────┴────┐
    │  SPLIT  │  (random 50/50)
    └────┬────┘
    ┌────┴────────────────┐
    ▼                     ▼
┌─────────┐         ┌─────────┐
│ GROUP A │         │ GROUP B │
│  (500)  │         │  (500)  │
│  Grey   │         │ Orange  │
│ button  │         │ button  │
└────┬────┘         └────┬────┘
     │                   │
     ▼                   ▼
  15 buy              22 buy
  (3.0%)              (4.4%)

Now you have evidence. Group B (orange) looks better. But wait — could this just be random luck? This is where Bayesian statistics comes in.

Why "Bayesian"?

  Frequentist
    Asks:       "If there's NO real difference, how weird is this result?"
    Tells you:  A p-value (hard to interpret!)
    Downside:   Doesn't directly tell you the probability that B is better

  Bayesian
    Asks:       "Given this data, how probable is it that B beats A?"
    Tells you:  "There's a 94% chance orange is better"
    Downside:   Requires you to specify starting beliefs (actually a feature!)
🎯
The key insight: A/B testing doesn't just tell you which button performed better in your sample — it tells you how confident you can be that it'll keep performing better in the future.
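The whole Bayesian computation for the button scenario above fits in a few lines of Python. This is a minimal sketch assuming flat Beta(1, 1) priors, using the numbers from the split diagram:

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed data from the split-test diagram: conversions / visitors
a_conv, a_n = 15, 500   # grey button
b_conv, b_n = 22, 500   # orange button

# Flat Beta(1, 1) prior + binomial data -> Beta posterior (conjugacy)
post_a = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=200_000)
post_b = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=200_000)

# P(B > A): fraction of posterior draws where orange beats grey
p_b_wins = (post_b > post_a).mean()
print(f"P(B > A) = {p_b_wins:.1%}")
```

With these counts the answer lands in the high 80s: promising, but not yet conclusive. The next sections unpack each piece of this computation.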

2 — The Core Concepts

Bayesian statistics is built on four ideas. Don't be scared by the names — each one maps onto something completely intuitive.

Concept 1 · Prior
Your belief about something before you collect any data.
Imagine you're about to flip a coin you've never seen before. Before it's flipped, you'd probably say "it's roughly 50/50." That's your prior. If someone hands you a coin they tell you is weighted, your prior shifts. You haven't flipped it yet; you're just encoding what you already believe.
  Your prior belief about conversion rate
  (before running the test):

  Probability
      │
  High│    ╭────╮
      │   ╭╯    ╰╮
      │  ╭╯      ╰╮
  Low │╭─╯        ╰────────
      └──────────────────────▶ Conversion rate
        0%   2%   5%   10%  20%

  "I think it's probably around 3–5%, but I'm not sure."
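One way to encode that hunch in code is a Beta distribution. The specific choice Beta(4, 96) is purely illustrative: it has mean 4% and leaves plenty of spread, matching "probably around 3–5%, but I'm not sure":

```python
from scipy import stats

# Illustrative weak prior: mean 4%, wide spread
prior = stats.beta(4, 96)
print(f"prior mean = {prior.mean():.1%}")

# Where does 95% of the prior's belief sit?
lo, hi = prior.ppf([0.025, 0.975])
print(f"95% of prior mass between {lo:.1%} and {hi:.1%}")
```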
Concept 2 · Likelihood
The signal in your data — how consistent the results are with different possible truths.
You run your experiment: 22 out of 500 people clicked. The likelihood answers: "If the true conversion rate were X%, how likely would we be to see exactly this result?" It's a measurement of what the data says, not a belief.
  Likelihood: "The data saw 22/500 clicks"

  How likely is this result if the true rate were...

  True rate │ Likelihood
  ──────────┼───────────────────────────────────
    2%      │ ██ (unlikely — we'd expect only 10)
    3%      │ ████ (possible)
    4.4%    │ ██████████ (most likely! 22/500 = 4.4%)
    6%      │ ████ (possible)
    10%     │ ██ (unlikely — we'd expect 50)

  The data "points to" 4.4% most strongly.
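The likelihood table above can be reproduced directly with `scipy.stats.binom`, which gives the probability of seeing exactly 22 clicks in 500 trials under each candidate rate:

```python
from scipy import stats

# Likelihood: P(exactly 22 clicks in 500 trials | true rate)
rates = [0.02, 0.03, 0.044, 0.06, 0.10]
liks = {r: stats.binom.pmf(22, 500, r) for r in rates}
for r, lik in liks.items():
    print(f"true rate {r:>5.1%}: likelihood {lik:.4f}")
```

The likelihood peaks at 4.4%, exactly as the bar chart shows, because 22/500 = 4.4% is the rate the data points to most strongly.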
Concept 3 · Posterior
Your updated belief after combining your prior with the data you collected.
You start with what you believe (prior), collect evidence (likelihood), and blend them to get a smarter, updated belief (posterior). It's like being a detective who starts with a hunch, then updates it as clues come in. The more data you collect, the more the posterior is shaped by the data.
  ┌──────────────────────────────────────────┐
  │  Prior     +    Data    =   Posterior    │
  │  (belief)    (evidence)   (new belief)   │
  └──────────────────────────────────────────┘

  Before test:          After test:
  "Probably 3-5%"       "Almost certainly 3.5–5.3%"

  Probability           Probability
      │                     │
  ████│                     │    ████
  ████│    ╭──╮             │   ██████
  ████│   ╭╯  ╰╮            │  ████████
  ████│ ──╯    ╰──          │ ─╯        ╰─
      └──────────▶          └─────────────▶
       0%  5%  10%            0%  5%  10%

  Wide & uncertain          Narrow & confident
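For conversion data, this prior-plus-data blend has a convenient closed form: a Beta prior combined with binomial data gives a Beta posterior (a property called conjugacy). A sketch, reusing an illustrative Beta(4, 96) prior and the 22-of-500 result:

```python
from scipy import stats

# Illustrative prior: Beta(4, 96), mean 4%
a, b = 4, 96
prior = stats.beta(a, b)

# Data: 22 conversions out of 500 visitors
conv, n = 22, 500

# Conjugate update: add successes to a, failures to b
posterior = stats.beta(a + conv, b + n - conv)   # Beta(26, 574)

print(f"prior mean     = {prior.mean():.1%},  sd = {prior.std():.4f}")
print(f"posterior mean = {posterior.mean():.1%},  sd = {posterior.std():.4f}")
```

The posterior's standard deviation is much smaller than the prior's: exactly the "wide and uncertain becomes narrow and confident" picture above.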
Concept 4 · Credible Interval
A range of values that you're, say, 95% sure contains the true answer.
After your orange button test, your posterior might say the true conversion rate is between 3.5% and 5.3% with 95% probability. Unlike a frequentist "confidence interval," a Bayesian credible interval means exactly what it sounds like: "We're 95% certain the true value is in here."
  Orange button posterior distribution:

  Prob.
    │
    │           ╭────╮
    │          ╭╯    ╰╮
    │         ╭╯      ╰╮
    │      ───╯        ╰───
    └───────────────────────▶ True conversion rate
         3%  3.5%  4.4%  5.3%  6%
              └──────────┘
             95% Credible Interval
           "The truth is probably here"

  "We're 95% sure the orange button's
  true conversion rate is between 3.5% and 5.3%"
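The credible interval falls straight out of the posterior's quantiles. A sketch using the orange button's full-sample numbers (88 conversions from 2,000 visitors) with a flat Beta(1, 1) prior:

```python
from scipy import stats

# Orange button: 88 conversions out of 2,000 visitors, flat Beta(1, 1) prior
posterior = stats.beta(1 + 88, 1 + 2000 - 88)

# The middle 95% of posterior belief
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: {lo:.1%} to {hi:.1%}")
```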

How They All Fit Together

  ┌────────────────────────────────────────────────────┐
  │                                                    │
  │   PRIOR          LIKELIHOOD        POSTERIOR       │
  │  (before)     ×  (from data)   =  (after)          │
  │                                                    │
  │  "I think it's   "The data        "Now I think     │
  │   about 3–5%"     points to 4.4%"  it's 3.5–5.3%"  │
  │                                                    │
  │  ← What you      ← What the       ← Your updated  │
  │    believed        data showed       smart belief  │
  │    going in                                        │
  └────────────────────────────────────────────────────┘
🧠
The "detective" analogy: Your prior is your initial hunch. The likelihood is the evidence you find at the scene. The posterior is your updated conclusion after considering both. The credible interval is how confident you are in that conclusion.

3 — Designing the Test

Before you flip the switch, you need a plan. A badly designed test wastes time and money, and can lead to the wrong conclusion. Here are the three key design decisions.

Decision 1: How Many Visitors Do You Need?

The golden rule: the smaller the difference you want to detect, the more data you need.

📏
Rule of thumb: If your current conversion rate is around 5%, and you want to detect a 1 percentage point improvement (to 6%), you'll need roughly 5,000 visitors per variant — so 10,000 total. If you only care about catching a 2pp improvement, you can get away with about 1,300 per variant.
  How sample size relates to detectable difference:
  (assuming ~5% baseline conversion rate)

  Difference     Visitors per        Test duration
  you want       variant needed      (500/day total,
  to detect                          250 per variant)
  ─────────────────────────────────────────────────
    0.5%         ~20,000             80 days  ████████████████
    1.0%         ~5,000              20 days  ████
    2.0%         ~1,300               5 days  █
    3.0%         ~600                 2 days  ▌

  Trying to detect tiny differences? You'll need patience!
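A standard normal-approximation sizing formula gives numbers in the same ballpark as the table. It's a sketch, not a definitive calculator: the exact answer depends on the error rates and power you choose (80% power at a 5% significance level here, which yields somewhat larger counts than the rule of thumb above):

```python
import math

def visitors_per_variant(baseline, mde, z_alpha=1.96, z_power=0.84):
    """Rough sample size per variant via the normal approximation.
    z_alpha ~ two-sided 5% level, z_power ~ 80% power (common defaults)."""
    p1, p2 = baseline, baseline + mde
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * var / mde ** 2)

for mde in [0.005, 0.01, 0.02, 0.03]:
    print(f"MDE {mde:.1%}: ~{visitors_per_variant(0.05, mde):,} per variant")
```

Note the inverse-square relationship: halving the effect you want to detect roughly quadruples the required sample, which is exactly the pattern in the table.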

Decision 2: What Prior Should You Use?

  Brand new shop, no history
    Prior type:  Weak (uninformative)
    Looks like:  "Any rate from 0–20% seems equally plausible"
    Effect:      Data does most of the work — you need more of it

  Established shop, 6 months of data
    Prior type:  Strong (informative)
    Looks like:  "I know my rate is usually 4–6%"
    Effect:      Prior pulls estimates toward the known range — less data needed

Decision 3: What Does "Winning" Mean?

Define these before the test starts:

  Pre-test checklist:
  ┌─────────────────────────────────────────┐
  │  ☐  Baseline conversion rate: 3%        │
  │  ☐  Minimum Detectable Effect: 1%       │
  │  ☐  Required sample per variant: 5,000  │
  │  ☐  Win threshold: P(B > A) ≥ 95%       │
  │  ☐  Test duration: ~20 days             │
  │  ☐  One change only (button colour)     │
  └─────────────────────────────────────────┘
  Write this down BEFORE you start. Don't change
  the rules once you see results!

4 — Running the Test

Let's follow a real-ish story. You run a newsletter, and you're testing two subject lines: a plain announcement (Version A) against a curiosity-driven teaser (Version B).

You're measuring open rate. Prior belief: rates hover around 22%. MDE: 3 percentage points. Stop when you have 2,000 recipients per variant — about 14 days.

☁️
The weather forecast analogy: Your prior is Monday's forecast. Each day of data is new weather information. The posterior is today's revised forecast. The more data rolls in, the more accurate and narrow the forecast becomes.

Day-by-Day: How the Posterior Evolves

  DAY 1 — Only 100 recipients each, small sample
  ─────────────────────────────────────────────
  A: 21 opens / 100 sent  =  21%
  B: 26 opens / 100 sent  =  26%

  A: ─────░░░░░░░░░░░░░░░░░░░░░─────
          10%      22%       35%

  B: ──────▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓────
          10%      25%       40%

  Big overlap. P(B > A) = ~68%  →  "Not confident yet"
  DAY 7 — 700 recipients each, moderate data
  ─────────────────────────────────────────────
  A: 151 opens / 700 sent  =  21.6%
  B: 182 opens / 700 sent  =  26.0%

  A:    ────░░░░░░░░░░░░░────
             18%  22%  26%

  B:      ──▓▓▓▓▓▓▓▓▓▓▓▓──
             22%  26%  30%

  P(B > A) = ~89%  →  "Strong signal, but not quite there"
  DAY 14 — 2,000 recipients each, full sample
  ─────────────────────────────────────────────
  A: 432 opens / 2,000 sent  =  21.6%
  B: 520 opens / 2,000 sent  =  26.0%

  A:       ──░░░░░░──
            20% 22% 24%

  B:             ──▓▓▓▓▓▓──
                 24% 26% 28%

  Minimal overlap!
  P(B > A) = ~97%  →  "Call it. B wins."
📬
Plain English conclusion: The curiosity-driven subject line (Version B) gets about 4–5 more opens per 100 emails sent. We're 97% confident this isn't a fluke. At 10,000 sends/week, that's 400–500 extra opens every single week.
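The day-by-day update can be reproduced in a few lines of NumPy. This sketch assumes a flat Beta(1, 1) prior, so the exact percentages won't match the illustrated ones (those depend on how strong a prior you use), but the trajectory — growing confidence as data accumulates — is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cumulative opens / sends from the story above
days = [
    ("Day 1",  (21, 100),   (26, 100)),
    ("Day 7",  (151, 700),  (182, 700)),
    ("Day 14", (432, 2000), (520, 2000)),
]

p_by_day = []
for day, (ka, na), (kb, nb) in days:
    # Flat Beta(1, 1) prior -> Beta posterior per variant
    a = rng.beta(1 + ka, 1 + na - ka, size=100_000)
    b = rng.beta(1 + kb, 1 + nb - kb, size=100_000)
    p = (b > a).mean()
    p_by_day.append(p)
    print(f"{day}: P(B > A) = {p:.1%}")
```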

5 — Reading the Results

Your test has run. The data is in. Reading Bayesian results is refreshingly straightforward compared to traditional statistics.

Reading a Posterior Distribution

  How to read a posterior curve:

  Probability
    │
    │           ╭──╮          ← Peak = most likely value
    │          ╭╯  ╰╮
    │         ╭╯    ╰╮
    │        ╭╯      ╰╮
    │  ──────╯        ╰──────
    └──────────────────────────▶  Conversion rate
      3%  3.5%  4.4%  5.3%  6%
           └──────────┘
           95% credible interval
           "The true rate is almost certainly in here"

  Narrow peak  = lots of data (certain)
  Wide peak    = less data (uncertain)

A Sample Results Table

  Variant                       Visitors   Conv.   Rate    95% Credible Interval   P(Wins)
  ─────────────────────────────────────────────────────────────────────────────────────────
  A — Grey button (control)      2,000      60     3.0%    2.3% – 3.8%               3%
  B — Orange button (variant)    2,000      88     4.4%    3.5% – 5.3%              97%

The credible intervals barely overlap — a very strong sign that B is genuinely better. P(B wins) = 97% clears our pre-set threshold of 95%. Call it: B wins. Ship the orange button.
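The table's figures can be checked with scipy. This sketch uses flat Beta(1, 1) priors, so the results may differ slightly from the table, whose exact values depend on the prior:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Results-table data under flat Beta(1, 1) priors
post_a = stats.beta(1 + 60, 1 + 2000 - 60)   # grey:   60 / 2,000
post_b = stats.beta(1 + 88, 1 + 2000 - 88)   # orange: 88 / 2,000

for name, post in [("A", post_a), ("B", post_b)]:
    lo, hi = post.ppf([0.025, 0.975])
    print(f"{name}: 95% credible interval {lo:.1%} to {hi:.1%}")

# Monte Carlo estimate of the win probability
draws_a = post_a.rvs(200_000, random_state=rng)
draws_b = post_b.rvs(200_000, random_state=rng)
print(f"P(B > A) = {(draws_b > draws_a).mean():.1%}")
```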

When to Call a Test "Done"

⚠️
Never stop early just because it looks good! Imagine flipping a coin: after 5 flips, you might see 4 heads and think "this coin is biased!" — but after 100 flips it evens out. Always honour your pre-planned stopping point.

6 — Common Mistakes

Knowing the theory is half the battle. The other half is avoiding the landmines that trip up even experienced practitioners.

🔭 Peeking at results early and stopping when it "looks good"
Random variation can make results look conclusive early on, only to regress as more data arrives. Stop too soon and you'll "find" winners that don't exist — a false positive.
Fix: Set your sample size before the test. Don't look at results until you've hit it — or use a sequential testing method designed for early stopping.
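The cost of peeking is easy to demonstrate with a simulation: run many A/A tests (both variants identical, so any declared "winner" is a false positive) and compare checking only at the end against checking every 200 visitors. All numbers here (3% rate, 4,000-visitor horizon, 2,000 simulations) are this sketch's own choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def p_b_beats_a(ka, na, kb, nb):
    """Normal approximation to P(B > A) under flat Beta(1, 1) priors."""
    ma, mb = (ka + 1) / (na + 2), (kb + 1) / (nb + 2)
    va, vb = ma * (1 - ma) / (na + 3), mb * (1 - mb) / (nb + 3)
    return stats.norm.cdf((mb - ma) / np.sqrt(va + vb))

rate, horizon, n_sims = 0.03, 4000, 2000
looks = np.arange(200, horizon + 1, 200)          # peek every 200 visitors

fp_fixed = fp_peek = 0
for _ in range(n_sims):
    # Both variants convert at the same true rate
    ca = np.cumsum(rng.random(horizon) < rate)[looks - 1]
    cb = np.cumsum(rng.random(horizon) < rate)[looks - 1]
    ps = [p_b_beats_a(ka, n, kb, n) for ka, kb, n in zip(ca, cb, looks)]
    fp_fixed += ps[-1] >= 0.95                    # only look at the end
    fp_peek += max(ps) >= 0.95                    # stop at the first "win"

print(f"false positive rate, fixed horizon: {fp_fixed / n_sims:.1%}")
print(f"false positive rate, with peeking:  {fp_peek / n_sims:.1%}")
```

Peeking turns a modest false positive rate into a much larger one, which is exactly why the stopping rule belongs in the pre-test checklist.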
🎯 Testing too many things at once
If you test 20 variants simultaneously, pure chance means ~1 will look like a winner even if nothing works. More variants = more noise.
Fix: Test one change at a time (or use a proper multi-armed bandit / multi-variant approach with appropriate corrections).
📅 Running the test for too short a time
People behave differently on different days (Monday vs. Friday). A test that only runs Tuesday–Thursday captures a skewed audience.
Fix: Run tests for at least one full week — ideally two — regardless of how fast you hit your sample size target.
🔀 Changing the test mid-way
If you change the page, audience targeting, or anything else while the test runs, you corrupt the data — it becomes impossible to know what caused any effect.
Fix: Treat the test as sacred. No changes while it runs. If something must change, stop the test, make the change, and restart from scratch.
📊 Confusing statistical significance with practical significance
With enough data, even a 0.01% improvement becomes "statistically significant" — but is it worth implementing?
Fix: Always sanity-check against your MDE. "Is this improvement large enough to justify the engineering effort, risk, and opportunity cost?"

Quick Reference Cheatsheet

  ┌──────────────────────────────────────────────────────┐
  │         BAYESIAN A/B TESTING — QUICK REFERENCE       │
  ├──────────────────┬───────────────────────────────────┤
  │  Before test     │  Define MDE, sample size,         │
  │                  │  prior, win threshold             │
  ├──────────────────┼───────────────────────────────────┤
  │  During test     │  Don't peek! Let it run.          │
  │                  │  No changes to page or targeting  │
  ├──────────────────┼───────────────────────────────────┤
  │  Stopping        │  Hit pre-set sample size AND      │
  │                  │  run at least 1 full week         │
  ├──────────────────┼───────────────────────────────────┤
  │  Reading results │  P(B > A) ≥ 95%?                  │
  │                  │  Credible intervals don't overlap?│
  │                  │  Effect ≥ MDE? → Ship it          │
  ├──────────────────┼───────────────────────────────────┤
  │  If inconclusive │  The test worked! "No difference  │
  │                  │  found" is a valid, useful result │
  └──────────────────┴───────────────────────────────────┘
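The reading-results row of the cheatsheet can be captured in a tiny helper. The function name, wording, and thresholds are this sketch's own, not a library API:

```python
def decision(p_b_wins, effect, mde, win_threshold=0.95):
    """Turn the cheatsheet's reading-results checks into a verdict."""
    if p_b_wins < win_threshold:
        return "inconclusive -- keep the control (still a useful result)"
    if effect < mde:
        return "wins, but below MDE -- weigh effort vs. payoff"
    return "ship it"

# Orange-button numbers: P(B > A) = 97%, effect = 1.4pp, MDE = 1pp
print(decision(0.97, 0.014, 0.01))
```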
🎉
You're ready. You now understand more about A/B testing than most people who run them. You know what a prior is, how data updates your beliefs, how to design a test that won't mislead you, and how to read results with confidence. The best way to cement this is to run a real test — pick one thing, form a hypothesis, set it up properly, and watch Bayesian inference do its magic.

Tools & Technologies

Python · scipy.stats · NumPy · Matplotlib · Bayesian Inference · Beta Distribution · A/B Testing