A/B EXPERIMENTS

Test a change. Read what it did to players.

Split your players across two or more variants of a behavior, ship, and let Playloop tell you which one played better. You define the variants and how to split traffic; Playloop assigns each player, measures retention, engagement, crashes, and friction per variant, and writes a plain-English summary of what actually changed. No bucketing code in your game.

OVERVIEW

The server does the bucketing. Your game just asks.

An experiment is a measured rollout of two or more variants of some behavior: a new tutorial, a different HUD layout, a tweaked difficulty curve. You create it on the Experiments tab of any game, give each variant a key, and decide how to split traffic. Once it's running, your game asks Playloop which variant a player should see and renders accordingly.

Assignment happens on Playloop's servers, not in your game, so every SDK (Unity, Unreal, Godot, Python, TypeScript) agrees on the same answer for the same player. The SDK fetches the player's variant map once per session, caches it, and tags every event it sends with the variants in effect. That tagging is what lets the dashboard split all of your existing metrics (retention, engagement, crashes, friction clusters, feedback quotes) by variant.
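To make the fetch-once, cache, and tag pattern concrete, here is a minimal sketch. It is not the Playloop SDK API: the class name, the method names, the injected fetch function, and the shape of the variant map (experiment key to variant key) are all assumptions chosen for illustration. The fallback to a default for an experiment that is missing from the map mirrors how a game would behave when a player is outside an experiment's audience.

```python
from typing import Callable, Dict, Optional


class VariantSession:
    """Hypothetical session-scoped cache; NOT the Playloop SDK API."""

    def __init__(self, fetch_variant_map: Callable[[], Dict[str, str]]):
        # fetch_variant_map stands in for the server call (assumed shape:
        # {experiment_key: variant_key}).
        self._fetch = fetch_variant_map
        self._variants: Optional[Dict[str, str]] = None

    def variants(self) -> Dict[str, str]:
        # Fetch once per session, then serve every later lookup from the cache.
        if self._variants is None:
            self._variants = dict(self._fetch())
        return self._variants

    def variant_for(self, experiment: str, default: str) -> str:
        # An experiment missing from the map falls back to the game's default.
        return self.variants().get(experiment, default)

    def tag_event(self, event: dict) -> dict:
        # Attach the variants in effect so metrics can be split by variant.
        return {**event, "variants": dict(self.variants())}


calls = []


def fake_fetch() -> Dict[str, str]:
    # Stand-in for the network request; records how often it is called.
    calls.append(1)
    return {"tutorial-test": "tutorial-v2"}


session = VariantSession(fake_fetch)
print(session.variant_for("tutorial-test", default="control"))  # tutorial-v2
print(session.variant_for("hud-test", default="control"))       # control
print(len(calls))                                                # 1
```

The real lookup call for each engine is documented on that engine's install page; the sketch only shows why a single cached map is enough to answer every lookup and tag every event in a session.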

The structured comparison is free on every tier. The optional AI digest, a written read on what changed between variants, uses your managed-AI session quota (or your own AI key). See Cost & quota.

CREATE AN EXPERIMENT

Variants, allocations, audience.

From a game's Experiments tab, click New experiment, name it, and define its pieces:

Variants

Between two and eight variants. Each has a key (letters, digits, _ and -, up to 40 characters, e.g. control or tutorial-v2), a display name, and an allocation. The key is what your game passes to read the assignment, and what every event gets tagged with, so pick something stable and readable. The first variant you declare is the baseline; every other variant is compared against it.

Allocations

An allocation is the relative weight of a variant: how much of your traffic lands on it. They don't have to add up to 100; [50, 30] splits traffic 62.5% / 37.5% by relative weight. If your weights don't sum to 100, Playloop shows a confirmation before you start (“your allocations sum to N; we'll normalize to 100”) so the split is never silently rewritten behind your back.
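The normalization is plain proportional scaling. As a quick sketch (the function name is illustrative, and it assumes every weight is a positive number):

```python
def normalize_allocations(weights):
    """Convert relative weights into percentage shares that sum to 100."""
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")
    total = sum(weights)
    # Each variant's share is its weight divided by the sum of all weights.
    return [100 * w / total for w in weights]


print(normalize_allocations([50, 30]))  # [62.5, 37.5]
```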

Audience (optional)

By default an experiment applies to everyone in the game. Attach an audience to scope it to a subset, for example “players on build 1.4.0 and up.” A player who doesn't match the audience is skipped for that experiment entirely (your game falls back to its default behavior). Audiences are reusable: build one once and point several experiments at it.

HOW PLAYERS ARE ASSIGNED

Deterministic, by device.

Assignment is keyed on the player's device id, the same stable, anonymous identifier the SDK already uses to group a returning tester's sessions. The same device always lands on the same variant for the life of an experiment, so a player's experience stays consistent across sessions. Different experiments are independent: being in the treatment arm of one says nothing about which arm you land on in another.
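Playloop does not document the exact assignment algorithm here, but the properties described above (stable per device, independent across experiments, proportional to weights) are what a common hash-based bucketing scheme gives you. The sketch below is one such scheme under stated assumptions: the hash function, the way the experiment id is mixed in, and the function name are illustrative, not Playloop's implementation.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate


def assign_variant(experiment_id, device_id, variants):
    """Pick a variant key for a device.

    variants is a list of (key, weight) pairs. The choice depends only on
    the experiment id and the device id, so it is repeatable.
    """
    # Mixing in the experiment id makes buckets in different experiments
    # unrelated to each other.
    digest = hashlib.sha256(f"{experiment_id}:{device_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a point in [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64
    keys = [key for key, _ in variants]
    cumulative = list(accumulate(weight for _, weight in variants))
    # Find the first cumulative weight that lies above the scaled point.
    index = bisect_right(cumulative, point * cumulative[-1])
    # Guard against float rounding pushing the point onto the upper edge.
    return keys[min(index, len(keys) - 1)]


variants = [("control", 50), ("tutorial-v2", 30)]
first = assign_variant("exp-1", "device-abc", variants)
second = assign_variant("exp-1", "device-abc", variants)
print(first == second)  # True
```

Because nothing is stored per device, a scheme like this needs no lookup table. It also shows why a reinstall (which produces a new device id) can land the same person in a different variant.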

For QA, you can pin a specific device to a specific variant from the dashboard, which bypasses the split entirely. That's how you force yourself onto the variant you want to test.

Known limitations

Because assignment is per-device, two cases are worth knowing about when you read the numbers:

Reinstalls. A player who uninstalls and reinstalls gets a fresh device id, and may be re-assigned to a different variant. For typical experiment durations this is noise, but if your game has a high reinstall rate (mobile first-time-user experiments especially), interpret retention numbers knowing some reshuffling happens.
Multiple devices. A player on two devices, a phone and a PC, has two device ids and may see a different variant on each. They count as two players for analytics. This is rare in practice for most game contexts.

QA pins are also per-device: a tester who wants the same variant on several machines sets up one pin per device.

LIFECYCLE

Draft, running, stopped.

An experiment moves through three statuses:

Draft. You're still configuring it. Nothing is being assigned to players yet, and you can edit the variants, allocations, and audience freely.
Running. Click Start and the experiment goes live: players begin getting assigned and their sessions get tagged. The comparison and digest populate as data arrives.
Stopped. Click Stop and assignment ends. The variant fetch no longer returns this experiment, so new sessions stop being tagged, while the data you've already collected stays for analysis. You can Resume a stopped experiment to start assigning again, and you can record which variant you decided to ship as the winner.

Why config locks once you start

The moment an experiment leaves draft, its variants, allocations, and audience are locked. This keeps the comparison statistically honest: you can't re-weight the split or change who's eligible halfway through and still trust the side-by-side numbers. Names and descriptions stay editable throughout. To run a fresh version with different config, clone the experiment and start the new one.

READING THE COMPARISON

Every metric, split by variant.

The detail page shows one card per variant, side by side. Each card carries:

Session count for the variant.
Retention at day 1, day 2, and day 7, each shown as a rate plus its retained / eligible denominator (see cohort-eligible below).
Engagement %, active time over wall-clock time. This one is informational only: it never carries a confidence label.
Crash rate and crash count.
Top friction and top praise clusters, the recurring themes the analyzer pulled from each variant's sessions.
Sample feedback quotes, a few verbatim free-text answers from players in that variant.

The baseline variant is badged as such and shows no confidence labels (there's nothing to compare it against). With three or more variants, a headline at the top names the strongest non-baseline variant by retention and its top-line label.

CONFIDENCE LABELS

Plain words instead of a p-value.

Next to each retention window and the crash rate on a non-baseline variant, Playloop shows a one-word label that says how much to trust the difference. The labels come from a Bayesian comparison of the two variants; you don't configure anything:

Label	What it means
`insufficient`	Not enough data yet. A window stays here until at least 15 players are eligible on both the variant and the baseline.
`inconclusive`	There is data, but the two variants look too close to call. The most likely direction is below roughly 65% probability.
`promising`	One variant is leading at roughly 65 to 85% probability. Worth watching, not yet worth shipping on.
`likely`	A stronger lead, roughly 85 to 95% probability. The direction is fairly trustworthy.
`clear`	At least 95% probability AND the credible range of the difference excludes zero. This is the only label that calls a real, separated result.

The label is direction-agnostic. For retention, higher is better; for crash rate, lower is better. Either way, the label describes the strength of the difference, not which way it points. The rate itself tells you the direction.
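The thresholds in the table can be read as a small decision procedure. The sketch below is one way such a comparison could work, not Playloop's actual model: it assumes a Beta(1, 1) prior on each rate, Monte Carlo sampling of the two posteriors, and a 95% credible range for the difference, none of which this page specifies. It also assumes that a result at 95% or more whose credible range still includes zero falls back to `likely`.

```python
import random


def confidence_label(base_hits, base_n, var_hits, var_n,
                     draws=20000, seed=0, min_eligible=15):
    """Label the difference between a variant rate and the baseline rate."""
    if base_n < min_eligible or var_n < min_eligible:
        return "insufficient"
    rng = random.Random(seed)  # fixed seed keeps the label repeatable
    # Sample the difference (variant minus baseline) from the two
    # Beta posteriors, then sort so percentiles can be read by index.
    diffs = sorted(
        rng.betavariate(1 + var_hits, 1 + var_n - var_hits)
        - rng.betavariate(1 + base_hits, 1 + base_n - base_hits)
        for _ in range(draws)
    )
    p_higher = sum(d > 0 for d in diffs) / draws
    # Direction-agnostic: use the probability of the more likely direction.
    p_direction = max(p_higher, 1 - p_higher)
    low = diffs[int(0.025 * draws)]
    high = diffs[int(0.975 * draws) - 1]
    excludes_zero = low > 0 or high < 0
    if p_direction >= 0.95 and excludes_zero:
        return "clear"
    if p_direction >= 0.85:
        return "likely"
    if p_direction >= 0.65:
        return "promising"
    return "inconclusive"


print(confidence_label(5, 10, 9, 10))          # insufficient
print(confidence_label(100, 1000, 300, 1000))  # clear
```

Here `hits` is the retained (or crashed) count and `n` is the eligible count for the window being compared.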

Cohort-eligible retention

Retention is measured only over players who've had a fair chance to come back. A player is cohort-eligible for the day-N window only if their first session was at least N days ago. You can't know whether someone who arrived yesterday will return on day 7, so they're left out of the day-7 denominator until enough time passes.

That's why each retention figure shows retained / eligible: the eligible count is usually smaller than the variant's total players, and it grows as the experiment runs. Early on, a window can read insufficient simply because not enough players are eligible yet, even if the point estimate looks decisive. Give it time.
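A small worked example shows how the retained / eligible pair comes about. The eligibility rule is the one stated above. The definition of "retained" is an assumption made for this sketch (a session on the calendar day exactly N days after the first session), since this page does not define it, and the data layout and function name are illustrative.

```python
from datetime import date, timedelta


def day_n_retention(session_days_by_player, n, today):
    """Return (retained, eligible) for the day-n window.

    session_days_by_player maps a player id to the set of calendar days
    on which that player had at least one session.
    """
    retained = eligible = 0
    for days in session_days_by_player.values():
        first = min(days)
        if (today - first).days < n:
            continue  # first session too recent: not cohort-eligible yet
        eligible += 1
        # Assumed definition: the player came back exactly n days later.
        if first + timedelta(days=n) in days:
            retained += 1
    return retained, eligible


players = {
    "a": {date(2024, 1, 1), date(2024, 1, 8)},
    "b": {date(2024, 1, 2)},
    "c": {date(2024, 1, 9), date(2024, 1, 10)},
}
today = date(2024, 1, 10)
print(day_n_retention(players, 7, today))  # (1, 2)
print(day_n_retention(players, 1, today))  # (1, 3)
```

Player c arrived one day before `today`, so c counts toward day 1 but is left out of the day-7 denominator, which is why the eligible count differs between windows.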

AI DIGEST

A written read on what changed.

Below the comparison cards, the AI digest turns the numbers into a short narrative: a recommendation, a per-variant headline with a sentiment read, the themes the variants share, and how sentiment shifted between them. It's the “so what?” layer on top of the raw stats: what the change did to how players experienced your game, not just which number moved.

Generate or refresh it with the Regenerate button. Until you do, the structured comparison cards above are fully usable on their own. The digest is an addition, not a gate.

The calibration promise

The digest will never claim more certainty than the data supports. Its language is tied to the confidence label: when a result is only promising or inconclusive, the digest won't call a “winner” or say one variant is “clearly” better. If a draft ever overreaches, Playloop rewrites it to match the evidence, erring toward underclaiming rather than overclaiming.

It also won't invent quotes or themes. Every cluster the digest cites and every verbatim quote it includes is checked against your real session data first. If that check can't be satisfied, Playloop shows the structured stats only rather than surface unverifiable prose.

COST & QUOTA

One regen equals one session.

Building the experiment, splitting players, and reading the structured comparison are all free, on every tier, with no cap. The only part that costs anything is the AI digest.

Each time you generate or regenerate a digest, it counts as one session-equivalent against your monthly managed-AI quota, the same currency a session analysis uses. One number to think about, not two.

Using your own AI key (any tier): regenerating runs on your provider account, no managed quota involved.
On managed AI: each regen draws one session-equivalent from your monthly allowance. If you've hit the cap, regeneration is paused until your billing cycle resets. Add your own AI key to keep going in the meantime.
To prevent runaway cost, a digest can be regenerated at most once per hour per experiment, whether you click the button or it's on an automatic schedule.

The structured comparison never counts against any quota. Only the written digest does.

AUTOMATIC REGENERATION

Off by default. Opt in per game.

By default the digest only refreshes when you click Regenerate, so you're never surprised by managed-AI usage you didn't ask for. If you'd rather keep it current automatically, you can opt a game in to a fixed refresh cadence:

Off: manual only (the default).
Daily: refresh once a day.
Every 6 hours: four times a day.
Weekly: refresh once a week.

The cadence is a fixed list on purpose: it keeps the cost story predictable. Each automatic refresh counts the same one session-equivalent as a manual one and obeys the same once-per-hour limit, so picking a cadence is the same as agreeing to that many session-equivalents per running experiment over time.
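As a rough way to size that commitment, the arithmetic can be sketched as follows. The cadence keys and the function name are illustrative, and the estimate assumes a 30-day window in which every scheduled refresh actually runs for every running experiment.

```python
# Hours between refreshes for each cadence (keys are illustrative names).
CADENCE_HOURS = {"off": None, "every_6_hours": 6, "daily": 24, "weekly": 168}


def scheduled_session_equivalents(cadence, running_experiments, days=30):
    """Estimate session-equivalents used by automatic refreshes."""
    interval = CADENCE_HOURS[cadence]
    if interval is None:
        return 0  # manual only: nothing is spent automatically
    refreshes = (days * 24) // interval
    # One session-equivalent per refresh, per running experiment.
    return refreshes * running_experiments


for cadence in CADENCE_HOURS:
    print(cadence, scheduled_session_equivalents(cadence, running_experiments=2))
# off 0
# every_6_hours 240
# daily 60
# weekly 8
```

Manual regenerations come on top of this figure, and games that use their own AI key do not draw on the managed quota at all.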

Read the variant from your game

Each SDK exposes a one-call lookup for the player's assigned variant. See your engine's install page under SDKs.

The variant-assignment endpoint

The HTTP contract the SDKs use to fetch a player's variant map lives in the API reference.