A beginner's guide — from "what even is an A/B test?" to making confident decisions from data
Bayesian A/B testing replaces cryptic p-values with a direct answer: "there's a 94% probability that Version B is genuinely better." By combining prior beliefs with observed data, it produces a posterior distribution — a full picture of all the plausible truths your data supports. This guide walks through every step: the four core concepts, how to design and run a test properly, how to read the results, and the five mistakes that trip up even experienced practitioners.
Imagine you run an online shop. Your "Buy Now" button is currently grey. Someone on your team says, "I bet if we made it orange, more people would click it." Should you just switch? What if it makes things worse? How would you even know?
This is exactly the problem A/B testing solves: making decisions based on evidence, not gut feeling.
Every day, companies make decisions — which headline to use, which button colour to pick, which email subject line to send. Most people just guess or go with what "feels right." But feelings are unreliable. An A/B test replaces guessing with a controlled experiment.
You split your visitors into two groups — randomly, like flipping a coin for each visitor:
       1,000 visitors arrive today
                    │
                    ▼
               ┌────┴────┐
               │  SPLIT  │  (random 50/50)
               └────┬────┘
          ┌─────────┴─────────┐
          ▼                   ▼
     ┌─────────┐         ┌─────────┐
     │ GROUP A │         │ GROUP B │
     │  (500)  │         │  (500)  │
     │  Grey   │         │ Orange  │
     │ button  │         │ button  │
     └────┬────┘         └────┬────┘
          │                   │
          ▼                   ▼
        15 buy              22 buy
        (3.0%)              (4.4%)
Now you have evidence. Group B (orange) looks better. But wait — could this just be random luck? This is where Bayesian statistics comes in.
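That "luck" question has a direct computational answer. Here is a minimal sketch (the function name and the flat Beta(1, 1) priors are our illustrative choices, not the article's): sample each group's conversion rate from its Beta posterior many times and count how often B comes out ahead.

```python
# Monte Carlo estimate of P(B > A): with a flat Beta(1, 1) prior, a group
# with k conversions out of n visitors has a Beta(k + 1, n - k + 1)
# posterior over its true conversion rate.
import random

random.seed(42)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = random.betavariate(conv_b + 1, n_b - conv_b + 1)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

p = prob_b_beats_a(15, 500, 22, 500)   # the grey vs. orange data above
print(f"P(B > A) ≈ {p:.0%}")
```

With this data the probability lands somewhere in the high 80s: suggestive, but below the 95% bar most teams set, which is exactly why you keep collecting data.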
| Approach | What it asks | What it tells you | Downside |
|---|---|---|---|
| Frequentist | "If there's NO real difference, how weird is this result?" | A p-value (hard to interpret!) | Doesn't directly tell you probability that B is better |
| Bayesian | "Given this data, how probable is it that B beats A?" | "There's a 94% chance orange is better" | Requires you to specify starting beliefs (actually a feature!) |
Bayesian statistics is built on four ideas. Don't be scared by the names — each one maps onto something completely intuitive.
Your prior belief about conversion rate
(before running the test):
Probability
    │
High│      ╭────╮
    │     ╭╯    ╰╮
    │    ╭╯      ╰╮
Low │ ╭──╯        ╰─────────
    └───────────────────────▶ Conversion rate
    0%  2%   5%   10%   20%
"I think it's probably around 3–5%, but I'm not sure."
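A belief like this can be written down as a Beta distribution, whose two parameters control both the centre and the confidence: the larger they are, the tighter the belief. The sketch below uses Beta(16, 384), an illustrative choice with mean 4%, not a value from the article.

```python
# Summarise a Beta(16, 384) prior by sampling: its mean, and the range
# holding 95% of its probability mass.
import random
import statistics

random.seed(0)
samples = [random.betavariate(16, 384) for _ in range(100_000)]
cuts = statistics.quantiles(samples, n=40)   # cut points at 2.5%, 5%, ..., 97.5%
prior_mean, lo, hi = statistics.mean(samples), cuts[0], cuts[-1]
print(f"Prior mean ≈ {prior_mean:.1%}")
print(f"95% of prior mass between ≈ {lo:.1%} and {hi:.1%}")
```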
Likelihood: "We observed 22 purchases out of 500 visitors"
How likely is this result if the true rate were...

True rate │ Likelihood
──────────┼───────────────────────────────────
   2%     │ ██         (unlikely — we'd expect only 10)
   3%     │ ████       (possible)
   4.4%   │ ██████████ (most likely! 22/500 = 4.4%)
   6%     │ ████       (possible)
  10%     │ ██         (unlikely — we'd expect 50)

The data "points to" 4.4% most strongly.
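The bars above come straight from the binomial formula. A quick stdlib-only sketch (the candidate rates are the ones in the table; the helper name is our own):

```python
# Likelihood of 22 conversions in 500 visitors at each candidate true
# rate, via the binomial pmf: C(n, k) * p**k * (1 - p)**(n - k).
import math

def binom_likelihood(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

for rate in (0.02, 0.03, 0.044, 0.06, 0.10):
    print(f"true rate {rate:>5.1%}: likelihood {binom_likelihood(22, 500, rate):.2e}")
```

The printed values peak at 4.4%, the observed rate, and fall away on both sides, exactly as the bar chart shows.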
┌──────────────────────────────────────────┐
│    Prior   +    Data     =   Posterior   │
│   (belief)    (evidence)    (new belief) │
└──────────────────────────────────────────┘
Before test:                  After test:
"Probably 3–5%"               "Almost certainly 3.5–5.3%"

Probability                   Probability
    │                             │
    │                             │     ████
    │    ╭──╮                     │    ██████
    │   ╭╯  ╰╮                    │   ████████
    │ ──╯    ╰──                  │ ──╯      ╰──
    └──────────▶                  └─────────────▶
    0%   5%  10%                  0%    5%   10%

Wide & uncertain              Narrow & confident
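For conversion data, this update has an exact closed form: a Beta prior plus binomial data gives a Beta posterior (a property called conjugacy). A sketch with illustrative numbers (a Beta(16, 384) prior, and 22 conversions in 500 visitors):

```python
# Conjugate update: Beta(a, b) prior + (k successes, n - k failures)
# of data  →  Beta(a + k, b + n - k) posterior.
a_prior, b_prior = 16, 384      # "probably around 3–5%" (illustrative)
k, n = 22, 500                  # observed data

a_post, b_post = a_prior + k, b_prior + (n - k)
post_mean = a_post / (a_post + b_post)
print(f"Posterior: Beta({a_post}, {b_post}), mean ≈ {post_mean:.2%}")
```

Note that the posterior mean sits between the prior mean (4.0%) and the raw data rate (4.4%): the update is literally a compromise between belief and evidence.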
Orange button posterior distribution:

 Prob.
   │
   │          ╭────╮
   │        ╭╯     ╰╮
   │      ╭╯        ╰╮
   │ ─────╯          ╰──────
   └─────────────────────────▶ True conversion rate
   3%   3.5%    4.4%    5.3%   6%
        └──────────────────┘
         95% Credible Interval
       "The truth is probably here"

"We're 95% sure the orange button's
 true conversion rate is between 3.5% and 5.3%"
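An interval like this can be reproduced by sampling. Assuming a flat prior and the full test's tallies (88 conversions in 2,000 visitors), the posterior is Beta(89, 1913), and the 2.5th and 97.5th percentiles of its samples bracket the 95% credible interval:

```python
# 95% credible interval for the orange button: flat Beta(1, 1) prior +
# 88 conversions in 2,000 visitors → Beta(89, 1913) posterior.
import random

random.seed(1)
samples = sorted(random.betavariate(89, 1913) for _ in range(100_000))
lo = samples[int(0.025 * len(samples))]
hi = samples[int(0.975 * len(samples))]
print(f"95% credible interval ≈ {lo:.1%} – {hi:.1%}")
```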
┌────────────────────────────────────────────────────┐
│                                                    │
│   PRIOR          LIKELIHOOD         POSTERIOR      │
│   (before)   ×   (from data)    =   (after)        │
│                                                    │
│   "I think it's  "The data          "Now I think   │
│   about 3–5%"    points to 4.4%"    it's 3.5–5.3%" │
│                                                    │
│   ← What you     ← What the         ← Your updated │
│     believed       data showed        smart belief │
│     going in                                       │
└────────────────────────────────────────────────────┘
Before you flip the switch, you need a plan. A badly designed test wastes time, money, and can lead to wrong conclusions. Here are the three key design decisions.
The golden rule: the smaller the difference you want to detect, the more data you need.
How sample size relates to detectable difference:
(assuming ~5% baseline conversion rate)

Difference   Visitors per     Test duration
you want     variant needed   (at 500/day)
to detect
─────────────────────────────────────────────────
   0.5%        ~20,000        80 days   ████████████████
   1.0%         ~5,000        20 days   ████
   2.0%         ~1,300         5 days   █
   3.0%           ~600         2 days   ▌
Trying to detect tiny differences? You'll need patience!
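Figures like the table's can be sanity-checked by simulation: pick a true uplift, simulate many tests at a given sample size, and count how often P(B > A) clears 95%. The sketch below uses flat priors and our own function names, and keeps trial counts small, so expect rough numbers rather than an exact match with the table.

```python
# Fraction of simulated tests that reach P(B > A) ≥ 95% when the true
# rates are p_a and p_b and each variant gets n_per_variant visitors.
import random

random.seed(7)

def prob_b_beats_a(k_a, n_a, k_b, n_b, draws=2_000):
    return sum(
        random.betavariate(k_b + 1, n_b - k_b + 1)
        > random.betavariate(k_a + 1, n_a - k_a + 1)
        for _ in range(draws)
    ) / draws

def chance_of_detecting(p_a, p_b, n_per_variant, trials=200):
    hits = 0
    for _ in range(trials):
        # simulate one test: draw each visitor's conversion at random
        k_a = sum(random.random() < p_a for _ in range(n_per_variant))
        k_b = sum(random.random() < p_b for _ in range(n_per_variant))
        if prob_b_beats_a(k_a, n_per_variant, k_b, n_per_variant) >= 0.95:
            hits += 1
    return hits / trials

# A 1-point uplift (5% → 6%) at ~5,000 visitors per variant
power = chance_of_detecting(0.05, 0.06, 5_000)
print(f"Detected in ≈ {power:.0%} of simulated tests")
```

Shrink the uplift to half a point and the detection rate collapses unless you raise `n_per_variant`: the golden rule in action.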
| Your situation | Prior type | What it looks like | Effect on test |
|---|---|---|---|
| Brand new shop, no history | Weak (uninformative) | "Any rate from 0–20% seems equally plausible" | Data does most of the work — you need more of it |
| Established shop, 6 months of data | Strong (informative) | "I know my rate is usually 4–6%" | Prior pulls estimates toward known range — less data needed |
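The "pull" in the right-hand column is plain Beta-binomial arithmetic: with a Beta(a, b) prior and k conversions in n visitors, the posterior mean is (a + k) / (a + b + n). A sketch comparing a weak and a strong prior on the same data (the prior parameters are illustrative; the strong one encodes "about 5%, based on history"):

```python
# Posterior mean under a Beta(a, b) prior after k conversions in n visits.
def posterior_mean(a, b, k, n):
    return (a + k) / (a + b + n)

k, n = 22, 500   # same data for both priors
print(f"weak   Beta(1, 1)    prior → {posterior_mean(1, 1, k, n):.2%}")
print(f"strong Beta(50, 950) prior → {posterior_mean(50, 950, k, n):.2%}")
```

The weak prior lands close to the raw 4.4%; the strong prior drags the estimate toward its 5% centre.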
Define upfront before the test starts:
Pre-test checklist:
┌─────────────────────────────────────────┐
│ ☐ Baseline conversion rate: 3%          │
│ ☐ Minimum Detectable Effect: 1%         │
│ ☐ Required sample per variant: 5,000    │
│ ☐ Win threshold: P(B > A) ≥ 95%         │
│ ☐ Test duration: ~20 days               │
│ ☐ One change only (button colour)       │
└─────────────────────────────────────────┘

Write this down BEFORE you start. Don't change the rules once you see results!
Let's follow a real-ish story. You run a newsletter, and you're testing two subject lines (call them A and B).
You're measuring open rate. Prior belief: rates hover around 22%. MDE: 3 percentage points. Stop when you have 2,000 recipients per variant — about 14 days.
DAY 1 — Only 100 recipients each, small sample
─────────────────────────────────────────────
A: 21 opens / 100 sent = 21%
B: 26 opens / 100 sent = 26%

A: ─────░░░░░░░░░░░░░░░░░░░░░─────
   10%           22%          35%

B: ──────▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓────
   10%           25%          40%

Big overlap. P(B > A) = ~68% → "Not confident yet"

DAY 7 — 700 recipients each, moderate data
─────────────────────────────────────────────
A: 151 opens / 700 sent = 21.6%
B: 182 opens / 700 sent = 26.0%

A: ────░░░░░░░░░░░░░────
   18%      22%      26%

B:         ──▓▓▓▓▓▓▓▓▓▓▓▓──
           22%    26%    30%

P(B > A) = ~89% → "Strong signal, but not quite there"

DAY 14 — 2,000 recipients each, full sample
─────────────────────────────────────────────
A: 432 opens / 2,000 sent = 21.6%
B: 520 opens / 2,000 sent = 26.0%

A: ──░░░░░░──
   20%  22%  24%

B:           ──▓▓▓▓▓▓──
             24%  26%  28%

Minimal overlap!
P(B > A) = ~97% → "Call it. B wins."
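The checkpoint probabilities can be recomputed from the tallies. A sketch with flat priors (the exact percentages depend on the prior used, so these will differ somewhat from the story's figures, but the pattern, confidence climbing as data accumulates, is the same):

```python
# P(B > A) at each checkpoint, Monte Carlo over Beta posteriors
# (flat Beta(1, 1) priors for both variants).
import random

random.seed(3)

def prob_b_beats_a(k_a, n, k_b, draws=50_000):
    return sum(
        random.betavariate(k_b + 1, n - k_b + 1)
        > random.betavariate(k_a + 1, n - k_a + 1)
        for _ in range(draws)
    ) / draws

checkpoints = {1: (21, 100, 26), 7: (151, 700, 182), 14: (432, 2000, 520)}
results = {day: prob_b_beats_a(k_a, n, k_b)
           for day, (k_a, n, k_b) in checkpoints.items()}
for day, p in results.items():
    print(f"Day {day:>2}: P(B > A) ≈ {p:.0%}")
```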
Your test has run. The data is in. Reading Bayesian results is refreshingly straightforward compared to traditional statistics.
How to read a posterior curve:

Probability
   │
   │         ╭──╮    ← Peak = most likely value
   │       ╭╯    ╰╮
   │     ╭╯       ╰╮
   │   ╭╯          ╰╮
   │ ──╯            ╰───────
   └──────────────────────────▶ Conversion rate
   3%   3.5%    4.4%    5.3%   6%
        └──────────────────┘
        95% credible interval
   "The true rate is almost certainly in here"

Narrow peak = lots of data  (certain)
Wide peak   = less data     (uncertain)
| Variant | Visitors | Conversions | Observed Rate | 95% Credible Interval | P(Variant Wins) |
|---|---|---|---|---|---|
| A — Grey button (control) | 2,000 | 60 | 3.0% | 2.3% – 3.8% | 3% |
| B — Orange button (variant) | 2,000 | 88 | 4.4% | 3.5% – 5.3% | 97% |
The credible intervals don't overlap — a very strong sign that B is genuinely better. P(B wins) = 97% clears our pre-set threshold of 95%. Call it: B wins. Ship the orange button.
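That decision rule is simple enough to encode. A sketch with illustrative defaults matching this test's plan (the function and its thresholds are our own, not a standard API):

```python
# Decision rule: ship B when P(B > A) clears the win threshold AND the
# observed uplift is at least the minimum detectable effect; keep A when
# B is almost certainly worse; otherwise call it inconclusive.
def decide(p_b_beats_a, observed_uplift, threshold=0.95, mde=0.01):
    if p_b_beats_a >= threshold and observed_uplift >= mde:
        return "ship B"
    if p_b_beats_a <= 1 - threshold:
        return "keep A"
    return "inconclusive"

print(decide(0.97, 0.044 - 0.030))   # the button test's final numbers → ship B
```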
Knowing the theory is half the battle. The other half is avoiding the landmines that trip up even experienced practitioners.
┌──────────────────────────────────────────────────────┐
│ BAYESIAN A/B TESTING — QUICK REFERENCE               │
├──────────────────┬───────────────────────────────────┤
│ Before test      │ Define MDE, sample size,          │
│                  │ prior, win threshold              │
├──────────────────┼───────────────────────────────────┤
│ During test      │ Don't peek! Let it run.           │
│                  │ No changes to page or targeting   │
├──────────────────┼───────────────────────────────────┤
│ Stopping         │ Hit pre-set sample size AND       │
│                  │ run at least 1 full week          │
├──────────────────┼───────────────────────────────────┤
│ Reading results  │ P(B > A) ≥ 95%?                   │
│                  │ Credible intervals don't overlap? │
│                  │ Effect ≥ MDE? → Ship it           │
├──────────────────┼───────────────────────────────────┤
│ If inconclusive  │ The test worked! "No difference   │
│                  │ found" is a valid, useful result  │
└──────────────────┴───────────────────────────────────┘