Benchmark · Twin-2K-500 · N = 2,058 real people

MNAI beats the published state of the art at predicting real people.

89.0%

of the human ceiling: how close MNAI gets to the best score anyone could achieve. That is above the ~85% reported by Park et al. (2024).

What MNAI does

MNAI is a digital twin. Give it a person's profile and it predicts how that specific individual would answer a new question: their choices, judgments, and preferences. This report measures how often those predictions match what 2,058 real people actually said.

How it compares

Everything is measured against the same yardstick: how consistently a person answers the same question over time. That self-consistency is the best any predictor can do.

Bar chart (axis 0% to 100%):

Bar	Label	Value
Human · test–retest	raw self-agreement	73%*
Park et al. 2024	reference	85%
MNAI	this study	89.0%

* The Human bar is raw self-agreement: how often a person reproduces their own earlier answer (≈73% exact, ≈83% allowing near-misses). The other two are normalized to that ceiling (each system's share of the predictable signal). Two related measures on one axis; MNAI reaches 89% of the human ceiling.

01 The idea

Predicting a person is only possible up to how predictable that person is.

Ask someone the same question twice, weeks apart, and they often answer differently. This is not carelessness: people are naturally a little inconsistent. That self-inconsistency sets a hard ceiling: no tool can predict a person more reliably than the person predicts themselves. So the honest question isn't “is MNAI 100% right?” but “how close does MNAI get to that human ceiling?” The answer is 89%.

02 How the test works

Build a persona, hide the answers, predict, and score against reality.

Each of the 2,058 people completed several survey waves. We build MNAI's persona from their earlier waves, demographics and psychometric profile, and hold out a later wave as unseen questions. The real answers are stripped from the prompt (so MNAI can't peek), and the held-out questions don't appear in the persona at all: MNAI has to genuinely generalize.
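The hold-out procedure described above can be sketched roughly as follows. This is an illustration only: the data layout (one dict of question id to answer per wave) and the function name `build_eval_split` are assumptions made for the example, not the study's actual pipeline.

```python
def build_eval_split(waves):
    """Split one person's survey waves into persona inputs and held-out items.

    waves: list of dicts mapping question id -> answer, oldest wave first
           (a hypothetical layout chosen for this sketch).
    Returns (persona, questions, truth).
    """
    *earlier, held_out = waves

    # The persona is built only from earlier waves.
    persona = {}
    for wave in earlier:
        persona.update(wave)

    # Drop any held-out question from the persona so nothing can leak.
    for qid in held_out:
        persona.pop(qid, None)

    questions = sorted(held_out)  # question ids only, answers stripped
    truth = dict(held_out)        # kept aside for scoring
    return persona, questions, truth


waves = [{"q1": 3, "q2": 5}, {"q3": 2}, {"q1": 4, "q4": 1}]
persona, questions, truth = build_eval_split(waves)
print(persona)    # {'q2': 5, 'q3': 2}
print(questions)  # ['q1', 'q4']
```

Note that `q1` is removed from the persona because it reappears in the held-out wave, which mirrors the requirement that held-out questions do not appear in the persona at all.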

Formally, for person i and question j we have the person's true answer y_ij, MNAI's prediction ŷ_ij, and the question's answer-scale range R_j (e.g. R_j = 4 for a 1–5 scale). Predictions run on Mnemonic's Frontier Model at temperature 0, so ŷ_ij is fully deterministic and the benchmark repeats exactly.

03 Scoring one answer

Right on the nose scores 1; close counts for partial credit.

Each prediction is scored by how far it lands from the real answer, relative to the scale. An exact hit scores 1; being one step off on a five-point scale still earns most of the credit; a wild miss scores 0.

$$a_{ij} = \max\!\left(0,\; 1 - \frac{\left|\hat{y}_{ij} - y_{ij}\right|}{R_j}\right) \in [0, 1]$$

In plain terms: One minus how far off MNAI was, scaled by the size of the answer scale. Predict 4 when the truth was 5 on a 1–5 item → score 0.75. This is the dataset authors' own metric, not one we invented.
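As a minimal sketch, the per-answer score can be written in a few lines of Python. The function name `answer_score` is hypothetical; the formula is the one stated above.

```python
def answer_score(predicted: float, actual: float, scale_range: float) -> float:
    """Partial-credit score: 1 minus the absolute error as a share of the scale range."""
    if scale_range <= 0:
        raise ValueError("scale_range must be positive")
    return max(0.0, 1.0 - abs(predicted - actual) / scale_range)


# A 1-5 scale has range R_j = 4.
print(answer_score(4, 5, 4))  # 0.75  (one step off)
print(answer_score(5, 5, 4))  # 1.0   (exact hit)
print(answer_score(1, 5, 4))  # 0.0   (opposite end of the scale)
```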

04 MNAI's overall accuracy

Average the per-answer scores across every person and question.

$$A_{\mathrm{MNAI}} = \frac{1}{|S|}\sum_{(i,j)\in S} a_{ij} = 74.2\%$$

In plain terms: A_MNAI is simply MNAI's average score over all 191,406 held-out answers from all 2,058 people.
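The averaging step might look like the sketch below, assuming each held-out answer is available as a (prediction, truth, scale range) tuple. That record layout and the name `mean_accuracy` are assumptions for illustration, and the four toy records are invented.

```python
from statistics import fmean


def mean_accuracy(records):
    """Average partial-credit score over (predicted, actual, scale_range) records."""
    scores = [max(0.0, 1.0 - abs(p - y) / r) for p, y, r in records]
    if not scores:
        raise ValueError("need at least one record")
    return fmean(scores)


# Four toy answers on 1-5 scales (range 4); scores are 0.75, 1.0, 0.0, 0.75.
toy = [(4, 5, 4), (3, 3, 4), (1, 5, 4), (2, 3, 4)]
print(mean_accuracy(toy))  # 0.625
```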

05 The human ceiling

Score people against their own earlier answers, the exact same way.

Because the later wave repeats earlier questions, we can score each person against their own past answer: how often they agree with themselves. That is the ceiling.

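$$a^{H}_{ij} = \max\!\left(0,\; 1 - \frac{\left|y^{(\mathrm{earlier})}_{ij} - y^{(\mathrm{later})}_{ij}\right|}{R_j}\right), \qquad A_{\mathrm{human}} = \frac{1}{|S|}\sum_{(i,j)\in S} a^{H}_{ij} = 83.4\%$$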

In plain terms: People reproduce their own earlier answer only about 73% of the time exactly, rising to 83% once near-misses count. That 83.4% is the best score anyone could get, MNAI included.
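To show how the exact-match and partial-credit figures differ, here is a small sketch that computes both from pairs of earlier and later answers. The name `self_agreement` and the toy pairs are hypothetical; only the scoring rule comes from the text.

```python
def self_agreement(pairs):
    """Return (exact_rate, partial_credit_rate) for (earlier, later, scale_range) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    n = len(pairs)
    exact = sum(1 for earlier, later, _ in pairs if earlier == later) / n
    partial = sum(
        max(0.0, 1.0 - abs(earlier - later) / r) for earlier, later, r in pairs
    ) / n
    return exact, partial


# Four toy repeats on 1-5 scales: two exact, one a step off, one two steps off.
toy = [(5, 5, 4), (4, 5, 4), (2, 2, 4), (3, 1, 4)]
print(self_agreement(toy))  # (0.5, 0.8125)
```

Partial credit is never lower than the exact-match rate, which is why the ceiling rises once near-misses count.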

06 The headline number

MNAI's accuracy as a share of the human ceiling.

$$\mathcal{N} = \frac{A_{\mathrm{MNAI}}}{A_{\mathrm{human}}} = \frac{74.2\%}{83.4\%} = 89.0\%$$

In plain terms: MNAI captures 89% of the predictable signal: it's 89% of the way to matching how well people predict their own future answers. A score above 100% would actually be a red flag: it would mean MNAI is more rigidly consistent than a real human.
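The arithmetic is a single division, sketched here with the rounded figures quoted in this report. The helper name `share_of_ceiling` is an assumption for illustration.

```python
def share_of_ceiling(model_accuracy: float, human_accuracy: float) -> float:
    """Model accuracy expressed as a fraction of the human test-retest ceiling."""
    if human_accuracy <= 0:
        raise ValueError("human_accuracy must be positive")
    return model_accuracy / human_accuracy


n = share_of_ceiling(0.742, 0.834)
print(f"{n:.1%}")  # 89.0%
```

Because the inputs here are rounded to one decimal place, the result can differ in the second decimal from a figure computed on unrounded accuracies.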

07 How sure are we?

With 2,058 people, the estimate is tight.

Answers from the same person are correlated, so we compute uncertainty by resampling people (a cluster bootstrap, 4,000 draws) rather than individual answers. The confidence interval sits comfortably above the ~85% reference from Park et al.

𝒩 = 89.03% (95% CI [88.66%, 89.40%])

In plain terms: If we re-ran the study on fresh samples of people, we'd expect the score to stay between 88.7% and 89.4% about 19 times out of 20.
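A cluster bootstrap of this kind can be sketched with the standard library alone. The version below is an assumption-laden illustration rather than the study's code: it assumes a percentile interval, assumes the model and human scores are resampled together for each drawn person, and uses invented toy data and a hypothetical function name.

```python
import random


def cluster_bootstrap_ci(people, draws=4000, alpha=0.05, seed=0):
    """Percentile CI for (mean model score) / (mean human score), resampling people.

    people: list of (model_scores, human_scores) pairs, one pair per person,
            each a list of per-answer scores in [0, 1].
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    # Collapse each person to sums and counts so every draw is cheap.
    totals = [(sum(m), len(m), sum(h), len(h)) for m, h in people]

    def ratio(sample):
        m_sum = sum(t[0] for t in sample)
        m_n = sum(t[1] for t in sample)
        h_sum = sum(t[2] for t in sample)
        h_n = sum(t[3] for t in sample)
        return (m_sum / m_n) / (h_sum / h_n)

    point = ratio(totals)
    # Each draw picks whole people with replacement, keeping their answers together.
    stats = sorted(
        ratio(rng.choices(totals, k=len(totals))) for _ in range(draws)
    )
    lower = stats[int(alpha / 2 * draws)]
    upper = stats[min(draws - 1, int((1 - alpha / 2) * draws))]
    return point, lower, upper


toy_people = [
    ([0.75, 1.0, 0.5], [1.0, 0.75, 0.75]),
    ([1.0, 0.75, 0.75], [1.0, 1.0, 0.75]),
    ([0.5, 0.5, 1.0], [0.75, 1.0, 1.0]),
    ([0.75, 0.75, 1.0], [1.0, 0.75, 1.0]),
]
point, lower, upper = cluster_bootstrap_ci(toy_people, draws=1000)
print(f"{point:.3f} [{lower:.3f}, {upper:.3f}]")
```

Resampling people rather than answers keeps each person's correlated answers together, so the interval reflects person-to-person variation instead of treating every answer as independent.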

08 Did we measure it right?

We reproduced the published numbers before scoring ourselves.

To be sure we're comparing like with like, we first re-computed the dataset authors' own published results with our code. They matched to within 0.01 on both quantities. So our 89% is measured on the field's own yardstick, not a re-defined one.

Quantity	Our code	Published
Human ceiling	0.831	0.827
Reference LLM twin	0.746	0.737

09 The result

This is the number MNAI delivers.

What MNAI delivers

89.0%

95% confidence: 88.7% – 89.4%

MNAI predicts a person's answers 89% as accurately as that person predicts their own future answers, and that human self-consistency is the ceiling nobody can beat. This is the headline figure: it beats the ~85% state of the art (Park et al.).

89.0% = MNAI's accuracy (74.2%) ÷ the human ceiling (83.4%)

Across 2,058 people · 191,406 answers · 0 failures · MNAI accuracy 74.2% (±0.15) · human ceiling 83.4% (±0.14)

The same score, by type of question

MNAI's share of the human ceiling (the headline number), broken out by topic. Higher means closer to the best score anyone could achieve.

Reasoning & judgment

Logic, estimates, and everyday decisions

93.8%

of the human ceiling

Economic choices

Pricing, risk, and preferences

84.0%

of the human ceiling

MNAI is strongest on reasoning and judgment, nearly matching human self-consistency, and solid on economic choices.

10 What we're careful about

Partial credit is a choice. Scoring near-misses as “close” is the dataset's convention; we also report strict exact-match (where MNAI reaches 81% of a lower, ~71% human ceiling).
One deterministic run. Temperature 0 makes results repeatable but means we don't measure run-to-run variation; our confidence interval is across people.