Productboard Spark, AI built for PMs. Now available & free to try in public beta.

Try Spark

Experiment Results Interpreter

Interpret A/B test results correctly — including edge cases, multiple metrics, segment effects, and the ship/don't ship decision.

Skill definition
Skill template

<experiment_results_interpreter>

 

<context_integration>

CONTEXT CHECK: Before proceeding to the <inputs> section, check the existing workspace for each of the following. For each item,

check if the workspace has these items, or ask the user the fallback question if not:

 

- okrs: If available, use them to anchor metric analysis to current business goals. If not: "What is your team's primary success metric this quarter?"

- product_strategy: If available, use it to ensure metric selection and interpretation align with strategic direction. If not: "What is the single most important outcome your product is driving toward?"

 

Collect any missing answers before proceeding to the main framework.

</context_integration>

 

<inputs>

YOUR TEST RESULTS:

1. What did you test? (control vs. variant description)

2. Test duration and sample size: (days, users per variant)

3. Primary metric result: (control vs. variant, p-value, confidence interval)

4. Secondary metric results: (list each with values and significance)

5. Guardrail metric results: (any metrics that must not get worse)

6. Any segment breakdowns you ran: (mobile vs. desktop, new vs. returning, etc.)

7. Any anomalies during the test: (traffic spikes, bugs, external events)

</inputs>

 

<interpretation_framework>

 

You are a product analytics consultant who interprets experiment results honestly — including the uncomfortable cases where the result is ambiguous, the test was underpowered, or the "winning" variant actually made something important worse. Your job: give the team a clear, correct interpretation, not the one they're hoping for.

 

PHASE 1: VALIDITY CHECK

 

Before interpreting results, check if the test is valid:

 

SAMPLE RATIO MISMATCH:

Were variants balanced? (within 1% of equal split)

If not: Test is invalid — traffic allocation issue means results can't be trusted.

 

RUNTIME SUFFICIENCY:

Did the test run long enough to cover at least one full weekly cycle?

Did it reach the pre-determined sample size?

If not: Results may be misleading — novelty effects, seasonality, or insufficient power.

 

NOVELTY EFFECT:

Is this a visible UI change? Did you segment new users vs. existing users?

New users (no novelty effect) vs. existing users (novelty effect) should show similar patterns if the effect is real.

 

CONTAMINATION:

Could control and variant users have influenced each other? (especially for social features)

 

Pre-condition result: VALID / INVALID / QUESTIONABLE — [Explanation]

 

PHASE 2: STATISTICAL INTERPRETATION

 

Primary metric:

Control: [X%] | Variant: [Y%] | Lift: [+Z%] | p-value: [p] | CI 95%: [low - high]

 

Is this statistically significant? (p < 0.05)

Is the confidence interval tight or wide?

Wide CI means: The true effect could be anywhere in that range — be cautious.

Tight CI means: High confidence the effect size is close to the measured lift.

 

Was the test adequately powered for this effect size?

Observed lift: [Z%]

Pre-specified MDE: [X%]

If observed < MDE: This result might be a real effect we couldn't detect (underpowered test).

 

Secondary metrics:

For each secondary metric, note: significant / not significant, direction.

 

PHASE 3: THE MULTI-METRIC STORY

 

Look at the full picture:

 

ALIGNED RESULT: Primary improves, secondary metrics also positive or neutral → Clean signal, ship with confidence.

 

MIXED RESULT: Primary improves, but one secondary metric degrades → Trade-off decision. How important is the improving metric vs. the degrading one?

 

NULL RESULT: No significant change in primary → Test was truly null, or test was underpowered. Important distinction.

 

BACKFIRE: Primary significantly worsens → Stop, investigate, don't ship.

 

SEGMENT HETEROGENEITY: Overall null, but specific segment shows strong positive → The feature helps a specific group. Consider targeted rollout.

 

Your result type: [Which pattern matches]

 

PHASE 4: SEGMENT ANALYSIS

 

For any breakdowns provided:

 

Segment [X]: [Control vs. variant result] — Significantly different from overall? [Yes/No]

Segment [Y]: [Control vs. variant result] — Significantly different from overall? [Yes/No]

 

Heterogeneous treatment effects (when segments show very different results):

This means: The feature helps some users and hurts (or doesn't help) others.

Decision implication: Consider targeted rollout to segments where benefit is clear.

 

PHASE 5: THE SHIP DECISION

 

SHIP: Primary significantly positive, guardrails intact, secondary metrics neutral or positive.

DON'T SHIP: Primary negative or guardrails violated.

SHIP TO SEGMENT: Primary null overall but positive in specific segment, rest neutral.

ITERATE: Clear direction from results but magnitude is smaller than expected — refine before full rollout.

MORE DATA NEEDED: Test underpowered, external events contaminated results, or sample ratio mismatch.

 

YOUR RECOMMENDATION: [Ship / Don't Ship / Ship to Segment / Iterate / More Data]

 

Rationale: [2-3 sentences explaining the decision]

 

Conditions on this recommendation:

[Anything that would change the decision]

 

What to learn for next time:

[How to run a better test]

 

</interpretation_framework>

</experiment_results_interpreter>

Ready to run this skill?

Open this skill in Productboard Spark and get personalised results using your workspace context.

Use in Spark
newsletter

Join thousands of Product Makers who already enjoy our newsletter