How to prioritise experiments with ICE (and when it lies to you)

Most CRO teams have a backlog problem. Ideas pile up faster than the bandwidth to run them, and the queue rots into a graveyard of "we should test that someday". ICE scoring is the most widely used way out of that gridlock: it gives every idea a single comparable number so you can sequence your work in minutes, not meetings.

The catch is that ICE is also one of the most abused frameworks in the field. Scores get inflated, important experiments get buried, and the number starts to feel like a decision when it is really just a prompt for one. This article shows you how to use ICE well and, more importantly, where it lies to you and what to do about it.

What ICE actually measures

ICE stands for Impact, Confidence, and Ease. You score each dimension from 1 to 10 and multiply them. The product surfaces tests that are likely to move the metric, that you have real evidence for, and that you can ship without heroics.

Impact: how much will this move your conversion metric if it wins? A headline test on the pricing page that carries your entire upgrade flow scores higher than a button-colour test on a low-traffic internal page.
Confidence: how strong is your evidence that the change will lift performance? Direct user research, a session replay showing friction, or a well-established UX principle all raise it. A hunch does not.
Ease: how little engineering and design does this need? A one-line copy change is a 9. A multi-step checkout redesign is a 2.

The output is a ranked list. High-ICE experiments run first; low-ICE ideas wait in the backlog until you have capacity or new evidence to revisit them. This maps directly to the prioritise step in the CRO process. It is how you decide what to act on once research surfaces your opportunities.
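As a minimal sketch of the multiply-and-sort step described above, the snippet below scores a small backlog and ranks it. The `Idea` class, the `rank` helper, and the three backlog entries with their scores are hypothetical, chosen only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    def ice(self) -> int:
        # ICE is the product of the three 1-10 scores, so it ranges from 1 to 1000.
        for score in (self.impact, self.confidence, self.ease):
            if not 1 <= score <= 10:
                raise ValueError(f"{self.name}: scores must be between 1 and 10")
        return self.impact * self.confidence * self.ease


def rank(backlog: list[Idea]) -> list[Idea]:
    # Highest ICE first; sorted() is stable, so ties keep their backlog order.
    return sorted(backlog, key=lambda idea: idea.ice(), reverse=True)


# Hypothetical backlog entries, for illustration only.
backlog = [
    Idea("Multi-step checkout redesign", impact=9, confidence=7, ease=2),
    Idea("Button colour on internal page", impact=2, confidence=3, ease=9),
    Idea("Pricing-page headline test", impact=8, confidence=6, ease=9),
]

for idea in rank(backlog):
    print(f"{idea.ice():>4}  {idea.name}")
```

Note how the checkout redesign has the highest Impact in this made-up list yet lands in the middle once Ease is multiplied in.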

See it as a 2x2 before you trust a number

The multiplication hides a useful picture. Before you sort by score, plot ideas on Impact against Ease. The quadrant tells you the kind of bet each idea is. The number alone never does.

The Impact-Ease matrix behind every ICE score. Each dot is an illustrative experiment; positions are relative, not measured.

Quick wins (top-right) are where ICE earns its keep: high impact, low effort, easy to justify. Big bets (top-left) are where ICE quietly fails you: high impact but hard, so the multiplication drags them down even when they matter most. Fill-ins (bottom-right) are easy but low impact, so save them for spare capacity. The bottom-left, hard and low impact, is where ideas go to be archived.
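If you keep scores in a spreadsheet or script, the quadrant can be derived from the same Impact and Ease numbers. The sketch below is one way to do it; the `quadrant` function and the cut-off of 6 between "low" and "high" are assumptions, not part of ICE itself, so adjust the threshold to your own rubric.

```python
def quadrant(impact: int, ease: int, threshold: int = 6) -> str:
    """Label an idea by its position on the Impact-Ease matrix.

    The threshold separating low from high is an assumed value.
    """
    high_impact = impact >= threshold
    high_ease = ease >= threshold
    if high_impact and high_ease:
        return "quick win"   # top-right
    if high_impact:
        return "big bet"     # top-left: high impact, hard to ship
    if high_ease:
        return "fill-in"     # easy, but low impact
    return "archive"         # low impact and hard


# Hypothetical scores, for illustration only.
print(quadrant(impact=8, ease=9))
print(quadrant(impact=9, ease=2))
```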

How to score each dimension honestly

The failure mode is treating ICE as a gut-feel exercise. Scoring should be calibrated, not aspirational. Score relative to your own backlog: a 10 means "best idea I have", not "theoretically perfect".

Scoring Impact. Anchor Impact to your actual traffic and funnel math, not optimism. Ask: if this wins and we roll it out, what is the realistic change in conversions per month? A 10 is a test on your highest-traffic, highest-drop-off page. A 5 is a meaningful page with moderate traffic. A 1 is a low-traffic edge case. One technique: rank Impact across the whole backlog before assigning numbers. Pick your single highest-impact opportunity, call it a 10, then score everything relative to it.
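The rank-then-score technique can be sketched in a few lines. This version assumes you estimate extra conversions per month as visitors × baseline conversion rate × expected relative lift, then scale linearly so the largest estimate becomes a 10. The linear scaling, the `relative_impact` helper, and every figure in the example are assumptions for illustration.

```python
def relative_impact(extra_conversions: dict[str, float]) -> dict[str, int]:
    """Scale estimated extra conversions per month onto a 1-10 Impact score.

    The largest estimate is pinned to 10; everything else is scored
    relative to it and floored at 1.
    """
    top = max(extra_conversions.values())
    if top <= 0:
        raise ValueError("at least one idea needs a positive estimate")
    return {
        name: max(1, round(10 * value / top))
        for name, value in extra_conversions.items()
    }


# Hypothetical funnel math: monthly visitors * baseline rate * expected relative lift.
estimates = {
    "Pricing-page headline": 40_000 * 0.03 * 0.10,
    "Mid-traffic landing page": 12_000 * 0.05 * 0.10,
    "Low-traffic edge case": 500 * 0.02 * 0.10,
}

for name, score in relative_impact(estimates).items():
    print(score, name)
```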

Scoring Confidence. Confidence reflects the quality of evidence behind the hypothesis, not how much you like the idea. Score against where your evidence sits on a hierarchy: own A/B data beats a competitor teardown beats a hunch.

Evidence behind the idea	Confidence band
Controlled A/B result from your own site	9-10
Session replay or heatmap showing a specific friction point	7-8
On-site survey responses naming the problem	6-7
Established public UX research (e.g. Baymard Institute on checkout)	5-6
Industry best practice or analogy to a similar site	3-4
Gut feel or internal opinion	1-2
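The table above works well as a lookup that flags Confidence scores the evidence cannot support. The sketch below encodes the same bands; the dictionary keys and the `confidence_ceiling` and `is_supported` helpers are hypothetical names introduced for illustration.

```python
# Confidence bands from the table above, keyed by (hypothetical) evidence labels.
CONFIDENCE_BANDS = {
    "own_ab_result": (9, 10),
    "session_replay_or_heatmap": (7, 8),
    "onsite_survey": (6, 7),
    "public_ux_research": (5, 6),
    "best_practice_or_analogy": (3, 4),
    "gut_feel": (1, 2),
}


def confidence_ceiling(evidence: list[str]) -> int:
    # The strongest piece of evidence sets the highest defensible score.
    return max(CONFIDENCE_BANDS[kind][1] for kind in evidence)


def is_supported(score: int, evidence: list[str]) -> bool:
    # A score above the ceiling is optimism rather than evidence.
    return score <= confidence_ceiling(evidence)


print(confidence_ceiling(["gut_feel", "onsite_survey"]))
print(is_supported(9, ["best_practice_or_analogy"]))
```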

Scoring Ease. Ease is the inverse of effort, scored against your team’s real capacity, not an abstract notion of "simple". For a solo founder touching the site weekly, a new landing-page section might be a 5; for a team with a dedicated front-end engineer, an 8. Be honest about dependencies: if an experiment needs engineering, design, legal review, and a content update, do not call it a 7 because each piece looks fast on its own. Cascading dependencies kill velocity.

Rule of thumb: if you cannot explain in one sentence why an idea scored above 6 on any dimension, the score is optimism, not evidence.

Running a scored backlog

A score you write once and never revisit is decoration. ICE becomes useful only as a living queue. Here is a minimal workflow that works without heavy tooling. A spreadsheet is plenty.

Log every idea: capture from research, stakeholder requests, and your own observations. Do not filter at the door; bad ideas score themselves out.
Score on entry: assign I, C, and E while the idea is fresh. Batch-scoring a stale list is where bias and false precision creep in.
Sort and review weekly: look at the top ten. Ask whether anything blocks the top three from running; fix the blocker or move down the list.
Archive, never delete: evidence changes. A 3 from last quarter can become a 7 after new research surfaces.
Re-score anything older than ~60 days: your traffic, your site, and your evidence base all move. Stale scores rank stale ideas.

Treat the one-line justification beside each score as mandatory, not optional. It is what turns a number you can game into a claim someone can challenge.
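If the backlog lives in a script rather than a spreadsheet, the weekly review can look roughly like this. It is a sketch under stated assumptions: the `Entry` fields, the `weekly_review` helper, and the example entries and dates are hypothetical, and entries without a justification are simply left out of the ranking.

```python
from dataclasses import dataclass
from datetime import date, timedelta

RESCORE_AFTER = timedelta(days=60)  # the ~60-day re-score rule


@dataclass
class Entry:
    name: str
    ice: int
    scored_on: date
    justification: str = ""  # one-line reason; empty means the score is provisional
    archived: bool = False   # archived ideas stay in the log, never deleted


def weekly_review(backlog: list[Entry], today: date, top_n: int = 10):
    active = [e for e in backlog if not e.archived]
    # Only scores with a written justification are allowed to rank.
    rankable = [e for e in active if e.justification.strip()]
    top = sorted(rankable, key=lambda e: e.ice, reverse=True)[:top_n]
    # Anything scored more than 60 days ago is due for a re-score.
    stale = [e for e in active if today - e.scored_on > RESCORE_AFTER]
    return top, stale


# Hypothetical entries and dates, for illustration only.
today = date(2024, 6, 1)
backlog = [
    Entry("Shorten signup form", 432, date(2024, 5, 20), "Replay shows drop-off at phone field"),
    Entry("New hero video", 300, date(2024, 5, 25)),  # no justification yet
    Entry("Trust badges at checkout", 210, date(2024, 3, 1), "Survey responses mention payment safety"),
]

top, stale = weekly_review(backlog, today)
print("Top of queue:", [e.name for e in top])
print("Needs re-score:", [e.name for e in stale])
```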

Where ICE lies to you

This is the section that matters most. ICE is a heuristic, not a truth machine, and four failure modes surface again and again. The point is not to abandon the framework. It is to know exactly where to stop trusting the number and start using judgement.

Subjectivity masquerading as rigour. A single person scoring their own backlog is scoring their own biases with a veneer of maths. The experiment a founder personally believes in reliably earns generous Impact and Confidence. Peer review fixes this: have someone challenge any score above 7 and argue it down a point. If they cannot, it stays.

Score-gaming. Once people know ICE decides what gets built, they optimise for the score instead of the idea. Whoever wants the new checkout flow will talk up its Ease and wave away the complexity. The antidote is the written justification: scores without a documented reason are provisional and do not rank.

Learning value is invisible. ICE captures expected lift, not the value of information. Sometimes a low-ICE experiment is the right call because it resolves a strategic question faster than anything else in the queue. Imagine a SaaS team unsure whether to position on feature depth or ease of use: a quick five-second test on two headlines might score a 4 yet be the highest-value experiment of the quarter because it ends a six-month debate. Add a Learning column for ideas whose answer cascades into other decisions.

Strategic bets get buried. High-Ease, moderate-Confidence ideas score reliably well, so they dominate ICE-sorted lists. The long-horizon, high-complexity bets, the ones that could change your trajectory, sit at the bottom forever. ICE has a structural bias toward incrementalism. Counter it by ring-fencing capacity: reserve a slice of bandwidth for strategic experiments regardless of score. A roughly 20/80 split (a fifth on big bets, the rest on the highest-ICE queue) is a reasonable starting point to revisit quarterly.
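Ring-fencing is easy to make mechanical. The sketch below reserves a share of the available test slots for ideas flagged as strategic, then fills the rest from the ICE-sorted queue. The `plan_capacity` helper, the tuple layout, the five-slot quarter, and all example ideas are assumptions for illustration; only the roughly 20/80 starting split comes from the text above.

```python
def plan_capacity(ideas, slots, strategic_share=0.2):
    """Split test slots between strategic bets and the highest-ICE queue.

    ideas: list of (name, ice, is_strategic) tuples.
    """
    by_ice = sorted(ideas, key=lambda idea: idea[1], reverse=True)
    strategic_slots = round(slots * strategic_share)
    # Strategic ideas get their reserved slots regardless of how they score.
    big_bets = [idea for idea in by_ice if idea[2]][:strategic_slots]
    # Everything else competes on ICE for the remaining slots.
    queue = [idea for idea in by_ice if idea not in big_bets][: slots - len(big_bets)]
    return big_bets, queue


# Hypothetical backlog, for illustration only.
ideas = [
    ("Copy tweak A", 432, False),
    ("Copy tweak B", 400, False),
    ("CTA wording", 350, False),
    ("Form field order", 300, False),
    ("Badge placement", 280, False),
    ("Pricing model test", 90, True),
    ("Onboarding rebuild", 60, True),
]

big_bets, queue = plan_capacity(ideas, slots=5)
print("Ring-fenced:", [name for name, _, _ in big_bets])
print("ICE queue:", [name for name, _, _ in queue])
```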

ICE tells you what is easy to justify. It does not tell you what is worth doing.

Notice that two of these four failures are about people gaming the system, and two are about the formula being blind by design. They need different fixes: process discipline for the first pair, deliberate overrides for the second.

Calibration: do this, not that

Calibration is what keeps scores comparable over weeks and across people. Without it, a 7 from you and a 7 from a teammate are different currencies and the ranking is meaningless.

Do this

Write a 1-10 rubric down. A 7 on Confidence should mean the same thing to everyone scoring.
Anchor to outcomes, not activities: "improve the checkout" is not an Impact-9; "fix the exact field where replay shows the highest drop-off" might be.
Score Impact before you fall in love with a specific solution.
Track predicted vs actual after each test, so you learn where your scoring runs hot or cold.

Not this

Score from memory and vibes, with no shared definition of the numbers.
Reward effort and motion instead of expected lift.
Anchor the score to how excited you are about your own idea.
Score once, ship, and never check whether the prediction held.

That predicted-vs-actual habit is the same systematic, bias-removal mindset behind a good friction audit: you are correcting the instrument, not just one reading.
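One lightweight way to track predicted vs actual is to compare win rates across Confidence bands: if high-Confidence tests win no more often than low-Confidence ones, the scoring is running hot. The sketch below assumes you log each finished test as a (Confidence score, won) pair; the three bands, the `win_rate_by_confidence` helper, and the sample results are hypothetical.

```python
from collections import defaultdict


def win_rate_by_confidence(results):
    """Group finished tests by Confidence band and return the win rate per band.

    results: iterable of (confidence_score, won) pairs.
    The band boundaries are an assumed grouping, not part of ICE.
    """
    tally = defaultdict(lambda: [0, 0])  # band -> [wins, tests]
    for confidence, won in results:
        if confidence <= 4:
            band = "low (1-4)"
        elif confidence <= 7:
            band = "mid (5-7)"
        else:
            band = "high (8-10)"
        tally[band][0] += int(won)
        tally[band][1] += 1
    return {band: wins / tests for band, (wins, tests) in tally.items()}


# Hypothetical test log, for illustration only.
results = [(9, True), (8, False), (6, True), (5, False), (6, False), (2, False)]

for band, rate in win_rate_by_confidence(results).items():
    print(f"{band}: {rate:.0%} of tests won")
```

With only a handful of tests per band the rates are noisy, so treat them as a prompt for discussion rather than a verdict.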

Rule of thumb: if your top-ten ICE experiments are all copy tweaks and button changes, you have an incrementalism problem, not an ICE problem.

When to override the score

ICE should inform the decision, not make it. Override it deliberately when:

A strategic bet needs de-risking. If a major product or positioning change is coming, run the tests that answer the most pressing unknowns first, whatever they score.
Confidence is unresolvable without running the test. Sometimes the only way to generate evidence is to ship. A low-Confidence, low-Ease experiment that produces real learning is worth more than its number suggests.
There is a hard deadline. A seasonal campaign, a pricing change, or a launch creates a time-bound window. ICE has no concept of time sensitivity. You have to supply it.
A high-score idea has a flaw you cannot score. Ease does not capture political friction, third-party dependencies, or regulatory review. If you know an experiment needs three months of approvals, score it down or move it to a blocked list.

The discipline is to make every override explicit: "we ran this instead of the higher-ICE idea because [reason]" should live in the backlog. If you override regularly without recording why, the framework is not doing its job, and neither are you.

ICE is the start of a conversation, not a substitute for one. Use it to convert more, guess less. But keep your judgement firmly in the loop.

Frequently asked questions

Is ICE better than RICE or PIE for CRO?

They are close cousins. RICE adds a Reach term and divides by Effort; PIE scores Potential, Importance, and Ease. For a focused conversion backlog, ICE is usually enough because Impact already absorbs reach via your traffic and funnel math. Use the simplest model your team will actually keep current: a maintained ICE beats an abandoned RICE.
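To make the structural difference concrete, here are the two formulas side by side as a minimal sketch. The function names and example inputs are hypothetical, and the scales teams use for RICE inputs vary, so the two outputs are not comparable with each other.

```python
def ice(impact: float, confidence: float, ease: float) -> float:
    # ICE: three scores multiplied together.
    return impact * confidence * ease


def rice(reach: float, impact: float, confidence: float, effort: float) -> float:
    # RICE: adds a Reach term and divides by Effort instead of multiplying by Ease.
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort


# Hypothetical inputs, for illustration only.
print(ice(impact=8, confidence=6, ease=9))
print(rice(reach=1000, impact=2, confidence=0.8, effort=4))
```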

Should one person or the whole team score?

One person can run ICE, but solo scoring is exactly where subjectivity hides. If you are a solo operator, build in a synthetic check: justify every score above 7 in writing and argue it down a point before it locks. With a team, have a second person review high scores rather than letting the author set them unchallenged.

How often should I re-score the backlog?

Score each idea on entry, then re-score anything that has waited more than about 60 days. Your traffic, site, and evidence base all shift, and a stale score quietly ranks a stale idea. A quick weekly pass over the top ten keeps the queue honest without turning scoring into a second job.

Does a high ICE score mean the test will win?

No. ICE estimates how worthwhile an experiment is to run, not whether it will succeed. Plenty of high-ICE tests lose. That is the point of testing. Validate outcomes with proper A/B testing and statistical significance; ICE only decides what you put in front of your users next.

For what happens after you pick the experiment, A/B testing explained covers how to structure and read the test itself.
