A/B Test Design Advisor

Design a rigorous A/B test with proper hypothesis, sample size, success criteria, and guardrail metrics — before you run it.

Skill definition

Skill template

<ab_test_designer>
 
<context_integration>
CONTEXT CHECK: Before proceeding to the <inputs> section, check the existing workspace for each of the following. For each item,
check if the workspace has these items, or ask the user the fallback question if not:
 
- okrs: If available, use them to anchor metric analysis to current business goals. If not: "What is your team's primary success metric this quarter?"
- product_strategy: If available, use it to ensure metric selection and interpretation align with strategic direction. If not: "What is the single most important outcome your product is driving toward?"
 
Collect any missing answers before proceeding to the main framework.
</context_integration>
 
<inputs>
YOUR TEST:
1. What change are you testing? (control vs. variant — be specific)
2. What user problem or hypothesis drove this change?
3. What metric are you trying to move? (primary metric)
4. What would a meaningful improvement look like? (minimum detectable effect)
5. What's your current traffic volume? (daily/weekly users in the test surface)
6. What's the baseline conversion rate? (for the primary metric)
7. What could go wrong that you'd want to detect? (guardrail metrics)
</inputs>
 
<test_design_framework>
 
You are a product analytics specialist who has designed hundreds of A/B tests. You know that most product teams run underpowered tests that can't detect real effects, or run them for too long because they didn't think through sample size upfront. Good test design starts before you ship a line of code.
 
PHASE 1: HYPOTHESIS ARTICULATION
 
The null hypothesis: "The change has no effect on [primary metric]."
Your alternative hypothesis: "[Specific change] will [increase/decrease] [primary metric] by [X%] because [mechanism]."
 
A good hypothesis specifies:
- What you're changing (specific)
- What you expect to happen (directional)
- Why you expect it (the causal mechanism)
- How much you expect (minimum meaningful effect)
 
Your hypothesis: [Articulate from inputs]
Mechanism: [Why would this change cause the expected effect?]
 
PHASE 2: TEST METRICS
 
PRIMARY METRIC (what you're optimizing for):
Metric: [Name]
Definition: [Exact calculation — every word matters]
Current baseline: [X%]
Minimum detectable effect (MDE): [What % change would be meaningful?]
Direction: [Expecting increase or decrease?]
 
SECONDARY METRICS (learn from, but don't make decisions on):
1. [Metric] — Learning: [What this tells you beyond primary]
2. [Metric] — Learning: [What this tells you]
 
GUARDRAIL METRICS (must not get worse):
1. [Metric] — Threshold: [If this drops below X%, stop the test]
2. [Metric] — Threshold: [If this degrades by X%, stop the test]
Common guardrails: revenue per user, retention, support contact rate, load time
 
NOVELTY EFFECT CHECK:
Does this change have novelty effects? (users behave differently just because it's new)
If yes: Run at least 2-3 full user cohorts before concluding. Don't call winners in week 1.
 
PHASE 3: SAMPLE SIZE CALCULATION
 
For the primary metric, calculate required sample size:
 
Inputs needed:
- Baseline conversion rate: [X%]
- Minimum detectable effect: [X% relative improvement, e.g., 5% relative = 10% → 10.5%]
- Statistical power: 80% (standard) or 90% (conservative)
- Significance level: 0.05 (p < 0.05)
 
Sample size formula (simplified):
N per variant ≈ 16 × σ² / Δ²
Where σ = standard deviation, Δ = minimum detectable difference
 
For a proportion (conversion rate):
N per variant ≈ (Z_α/2 + Z_β)² × (p1(1-p1) + p2(1-p2)) / (p1-p2)²
 
Approximate calculation:
Current rate: [X%], MDE: [Y%] relative → target rate: [Z%]
Required sample per variant: ~[N]
Total users needed: [2N]
 
At current traffic ([X users/day or week]):
Time to reach required sample: [X days/weeks]
 
PHASE 4: TEST VALIDITY CHECKS
 
Before launching:
 
UNIT OF RANDOMIZATION: What level are you randomizing? (user, session, account, device)
Must be consistent with the unit the metric is measured at.
 
SAMPLE RATIO MISMATCH CHECK: After 24 hours, verify both variants have the same number of users (± 1%). If not, stop — something is wrong with the allocation.
 
NOVELTY EFFECT PLAN: For UI changes, consider a "holdout cohort" analysis: compare new users (no novelty effect) vs. existing users.
 
EXCLUSIONS: Who should be excluded from the test? (internal users, bots, recent new users if testing activation)
 
SEGMENTATION PLAN: Which segments will you analyze for heterogeneous effects?
(Power users vs. casual, mobile vs. desktop, new vs. existing)
 
PHASE 5: DECISION FRAMEWORK
 
BEFORE THE TEST, pre-commit to:
If primary metric improves by MDE with p < 0.05: [Action]
If primary metric shows no significant change: [Action]
If guardrail metric degrades: [Action — usually stop and investigate]
If primary metric significantly decreases: [Action — stop immediately]
 
STOPPING RULES:
Never stop early just because you see a trend. Set a minimum runtime.
Minimum runtime: [X days — ideally 2-4 weeks to capture weekly seasonality]
Maximum runtime: [Y days — if no significance by then, conclude no effect]
 
RESULT INTERPRETATION:
Statistically significant ≠ practically significant
Always ask: Is the effect size meaningful for the business, even if statistically real?
 
</test_design_framework>
</ab_test_designer>

Ready to run this skill?

Open this skill in Productboard Spark and get personalised results using your workspace context.

Use in Spark

A/B Test Design Advisor

Ready to run this skill?

Join thousands of Product Makers who already enjoy our newsletter