How big does our list need to be to run meaningful A/B tests?

For open rate tests, at least 5,000 contacts per variant. For click rate, around 10,000. For donation conversion, usually 20,000 plus, or you need to test repeatedly over several campaigns to accumulate signal.

How long should a test run before reading results?

Wait until at least 80 percent of opens have happened, which typically takes 48 hours, and allow a full week before drawing conclusions about donations. Reading results at four hours is a recipe for false positives.

Should we test on every campaign?

No. Testing on emergency appeals and time-critical sends is a bad trade. Test on regular newsletter cadence and standing welcome journeys, where the audience is large, predictable and consequence-light.

A/B Testing Charity Emails the Right Way in 2026

Most charity A/B tests are not really tests. They are two emails sent to small lists, with results read too early and conclusions drawn too confidently. Here is a disciplined approach that actually moves opens, clicks and donations over a year.

Most charity email A/B testing produces noise dressed up as insight. Two subject lines, sent to small samples, results read at four hours, winner declared, lesson logged, nothing changes. Done at scale over a year, this kind of testing actively makes the email programme worse: every false positive teaches the team something untrue.

Disciplined A/B testing is a different exercise. Done well, it compounds: tests inform each other, the team builds a real understanding of what their audience responds to, and the email programme genuinely improves campaign after campaign.

The five rules of disciplined charity email testing

1. Test one thing at a time

Subject line OR preview text OR send time OR layout. Not three of them in the same test. If the winning variant changes three things, you do not know which change drove the result, and your next test starts from confusion.

2. Use proper sample sizes

5,000 contacts per variant for open rate tests, 10,000 for click rate, 20,000 plus for donation conversion. Below those thresholds, observed differences are usually within the margin of natural variance. Smaller charities are not excluded; they just need to test repeatedly over multiple campaigns and aggregate results.
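As a rough illustration of where thresholds like these come from, the sketch below uses the standard two-proportion sample size approximation. The function name, the 25 percent baseline open rate, the 10 percent relative lift, and the 5 percent significance and 80 percent power settings are all assumptions chosen for the example, not figures from your own programme.

```python
from math import ceil
from statistics import NormalDist


def contacts_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate contacts needed per variant to detect a relative lift.

    Uses the normal approximation for comparing two proportions.
    baseline_rate: expected rate for the control, e.g. 0.25 for a 25% open rate.
    relative_lift: smallest lift worth detecting, e.g. 0.10 for +10%.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)          # chance of catching a real effect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Assumed example: 25% baseline open rate, hoping to detect a 10% relative lift.
needed = contacts_per_variant(0.25, 0.10)
print(f"Contacts needed per variant: {needed}")
```

With these assumed inputs the answer lands a little under 5,000 per variant. Lower baseline rates, such as click or donation rates, or smaller lifts push the number up quickly, which is why the thresholds rise for clicks and donations.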

3. Wait for results

Open data settles within 48 hours, click data within 72 hours, donation data within a week. Reading and acting on partial results is the most common testing mistake in charity email, and it produces false confidence in moves that did not actually work.
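One way to stop anyone reading results early is to work out the earliest read time when the test is sent. The sketch below encodes the three windows above; the names `READ_WINDOWS`, `earliest_read_time` and `ready_to_read` are hypothetical and not part of any email platform.

```python
from datetime import datetime, timedelta

# Read windows taken from the rule above.
READ_WINDOWS = {
    "open": timedelta(hours=48),
    "click": timedelta(hours=72),
    "donation": timedelta(days=7),
}


def earliest_read_time(sent_at, metric):
    """Return the first moment the given metric should be read."""
    return sent_at + READ_WINDOWS[metric]


def ready_to_read(sent_at, metric, now):
    """True once the read window for the metric has fully elapsed."""
    return now >= earliest_read_time(sent_at, metric)


# Assumed example send time.
sent = datetime(2026, 1, 6, 9, 0)
for metric in READ_WINDOWS:
    print(metric, "results readable from", earliest_read_time(sent, metric))
```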

4. Document the hypothesis, not just the result

Every test should have a written hypothesis explaining what you expect to happen and why. Without it, results are just trivia. With it, the team builds a body of evidence about what the audience values.

5. Test things that compound

Subject line tests compound less than expected because every campaign has a different subject. Preview text patterns, layout patterns, CTA wording patterns and journey timing all compound across many campaigns. Prioritise these.

What to actually test

Subject line patterns, not single subjects

Rather than testing two individual subjects, test patterns: "question vs statement", "named beneficiary vs cause-led", "urgency framing vs invitation framing". Run the same pattern test on several campaigns and look at the pattern-level effect.
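A minimal sketch of reading the pattern-level effect, assuming each campaign split its list evenly between the two patterns so that simple pooling of counts is reasonable. The `CampaignResult` structure and every figure below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CampaignResult:
    name: str
    sent_a: int   # contacts who received pattern A (e.g. question)
    opens_a: int
    sent_b: int   # contacts who received pattern B (e.g. statement)
    opens_b: int


def pattern_level_effect(results):
    """Pool counts across campaigns; return (rate_a, rate_b, absolute difference)."""
    sent_a = sum(r.sent_a for r in results)
    opens_a = sum(r.opens_a for r in results)
    sent_b = sum(r.sent_b for r in results)
    opens_b = sum(r.opens_b for r in results)
    rate_a = opens_a / sent_a
    rate_b = opens_b / sent_b
    return rate_a, rate_b, rate_b - rate_a


# Illustrative data only: three campaigns, 2,000 contacts per variant each.
results = [
    CampaignResult("January newsletter", 2000, 480, 2000, 520),
    CampaignResult("February newsletter", 2000, 500, 2000, 530),
    CampaignResult("March newsletter", 2000, 520, 2000, 570),
]
rate_a, rate_b, diff = pattern_level_effect(results)
print(f"Pattern A {rate_a:.1%}, pattern B {rate_b:.1%}, difference {diff * 100:.1f} points")
```

Pooling like this gives three small campaigns the combined sample of one larger test. If the split between variants differed from campaign to campaign, a stratified comparison would be safer than adding up raw counts.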

Preview text discipline

Test whether setting a deliberate preview text (the snippet shown beside the subject line in the inbox) lifts open rate compared with relying on the inbox default, which is usually the first line of the email. The answer is almost always yes, often by 10 percent or more.

Send time clusters

Test broad windows (Tuesday morning vs Sunday evening, etc.) rather than precise minutes. Within a window the difference is noise; between windows it can be substantial.

CTA placement and wording

Test single CTA vs three CTAs. Test wording ("Donate now" vs "Help fund a place at the centre"). Test button colour only if you must, but expect the effect to be tiny.

Journey timing

In automated sequences, test the gap between emails (1 day vs 3 days vs 7 days). The right gap is supporter-dependent and worth testing because it compounds across every contact in the journey forever.

What not to test

Emergency appeals

An appeal launched in response to a disaster is not the time to discover that variant B converts worse. Use battle-tested patterns and ship.

One-off campaigns

Annual gala invitations, one-off matched funding announcements. Insufficient repeat opportunities to learn from the test, and the cost of getting it wrong outweighs the marginal learning.

Cosmetic changes with no hypothesis

Blue button vs green button without a reason produces noise. Save your testing budget for changes that, if confirmed, would change something material in your programme.

The testing calendar

Run one disciplined test per month, focused on a single learning question. Document the hypothesis, the variants and the result. Quarterly, review the cumulative learnings and convert the confirmed patterns into defaults for future campaigns. Annually, revisit the documented patterns and run the tests again to check the audience has not moved.

Twelve tests a year, with proper sample sizes and proper read times, produce more learning than fifty rushed tests. The compounding effect over two years is significant.

Reporting tests honestly

Show the confidence level

Most email platforms display a confidence indicator on tests. Anything below 95 percent should be reported as inconclusive, not as a win.
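Platforms calculate their confidence indicators in different ways, so treat the following as one rough illustration rather than a reproduction of any vendor's method. It runs a two-proportion z-test and reports one minus the two-sided p-value as the "confidence"; the function names and example counts are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def confidence_level(conv_a, sent_a, conv_b, sent_b):
    """Approximate confidence that variants A and B truly differ.

    Two-proportion z-test with a pooled standard error; returns
    1 minus the two-sided p-value, as a value between 0 and 1.
    """
    rate_a = conv_a / sent_a
    rate_b = conv_b / sent_b
    pooled = (conv_a + conv_b) / (sent_a + sent_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    if se == 0:
        return 0.0  # no variation at all, so nothing can be concluded
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


def verdict(confidence, threshold=0.95):
    """Apply the 95 percent reporting rule."""
    return "conclusive" if confidence >= threshold else "inconclusive"


# Assumed example: 5,000 contacts per variant, 1,250 opens vs 1,300 opens.
conf = confidence_level(1250, 5000, 1300, 5000)
print(verdict(conf))  # inconclusive
```

A one-point gap on 5,000 contacts per variant looks like a win on a dashboard, yet in this example it falls well short of the 95 percent bar.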

Show the absolute lift, not just the relative

A 20 percent uplift on a 5 percent click rate is a 1-percentage-point change, from 5 percent to 6 percent. Report both numbers so trustees and team members understand what the test actually demonstrated.
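The arithmetic can be sketched as a small helper that always reports both figures together. The function name `describe_lift` is hypothetical.

```python
def describe_lift(control_rate, variant_rate):
    """Summarise a result with both the relative and the absolute lift."""
    absolute_points = (variant_rate - control_rate) * 100
    relative_percent = (variant_rate - control_rate) / control_rate * 100
    return (
        f"{relative_percent:.0f} percent relative uplift, "
        f"{absolute_points:.1f} percentage points absolute "
        f"({control_rate:.1%} to {variant_rate:.1%})"
    )


print(describe_lift(0.05, 0.06))
# 20 percent relative uplift, 1.0 percentage points absolute (5.0% to 6.0%)
```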

Log the failures

Tests that show no significant difference are useful evidence: they tell you the change does not matter. Log them with the same discipline as the wins.

The point of testing is not to prove every campaign is improving. It is to build, over a year, a small library of patterns the team can rely on without arguing.

The one-page testing template

Hypothesis (what you expect to happen and why).
Variant A and Variant B definitions, with only one variable changed.
Sample size per variant and total send size.
Primary metric, secondary metrics and read window.
Result, confidence level, and decision (adopt / reject / re-test).
Implication for future campaigns.
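For teams that keep the log in a script or spreadsheet export rather than a document, the six lines above could be captured as a simple record. This is a sketch only: the class name, field names and example values are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_DECISIONS = {"adopt", "reject", "re-test"}


@dataclass
class EmailTestRecord:
    hypothesis: str                 # what you expect to happen and why
    variant_a: str                  # definitions differ in one variable only
    variant_b: str
    sample_per_variant: int
    primary_metric: str
    read_window_hours: int
    secondary_metrics: List[str] = field(default_factory=list)
    result: str = ""
    confidence: float = 0.0         # 0.95 means 95 percent
    decision: str = "re-test"       # adopt / reject / re-test
    implication: str = ""           # what changes in future campaigns

    def __post_init__(self):
        if self.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"decision must be one of {sorted(ALLOWED_DECISIONS)}")

    @property
    def total_send_size(self):
        # Two variants of equal size.
        return 2 * self.sample_per_variant


# Illustrative entry with invented values.
record = EmailTestRecord(
    hypothesis="A named beneficiary in the preview text lifts opens because it feels personal.",
    variant_a="Preview text names a beneficiary",
    variant_b="Preview text describes the cause",
    sample_per_variant=5000,
    primary_metric="open rate",
    read_window_hours=48,
)
print(record.total_send_size)  # 10000
```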

Six lines per test. Stored in a shared document. Reviewed quarterly. That is the entire system the average UK charity needs to convert email testing from theatre into compounding programme improvement.
