MNAI is a digital twin. Give it a person's profile and it predicts how that specific individual would answer a new question: their choices, judgments, and preferences. This report measures how often those predictions match what 2,058 real people actually said.
Everything is measured against the same yardstick: how consistently a person reproduces their own answers over time, which is the best any predictor can do.
* The Human bar is raw self-agreement: how often a person reproduces their own earlier answer (≈73% exact, ≈83% allowing near-misses). The other two bars are normalized to that ceiling, showing each system's share of the predictable signal. The chart therefore puts two related measures on one axis; MNAI reaches 89% of the human ceiling.
Predicting a person is only possible up to how predictable that person is.
Ask someone the same question twice, weeks apart, and they often answer differently. This is not carelessness: people are naturally a little inconsistent. That self-inconsistency sets a hard ceiling: no tool can predict a person more reliably than the person predicts themselves. So the honest question isn't “is MNAI 100% right?” but “how close does MNAI get to that human ceiling?” The answer is 89%.
Build a persona, hide the answers, predict, and score against reality.
Each of the 2,058 people completed several survey waves. We build MNAI's persona from their earlier waves, demographics and psychometric profile, and hold out a later wave as unseen questions. The real answers are stripped from the prompt (so MNAI can't peek), and the held-out questions don't appear in the persona at all: MNAI has to genuinely generalize.
Formally, for person i and question j we have the person's true answer y_ij, MNAI's prediction ŷ_ij, and the question's answer-scale range R_j (e.g. R_j = 4 for a 1–5 scale). Predictions run on Mnemonic's Frontier Model at temperature 0, so ŷ_ij is fully deterministic and the benchmark repeats exactly.
Right on the nose scores 1; close counts for partial credit.
Each prediction is scored by how far it lands from the real answer, relative to the scale. An exact hit scores 1; being one step off on a five-point scale still earns most of the credit; a wild miss scores 0.
In plain terms: one minus how far off MNAI was, divided by the size of the answer scale. Predicting 4 when the truth was 5 on a 1–5 item scores 0.75. This is the dataset authors' own metric, not one we invented.
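The per-answer score described above can be sketched in a few lines of Python. This is a minimal illustration built from the plain-terms description; the function name `item_score` and its parameter names are ours, not the dataset authors'.

```python
def item_score(y_true, y_pred, scale_range):
    """Score one prediction: 1 minus the absolute error divided by the scale range.

    scale_range is the width of the answer scale (4 for a 1-5 item).
    """
    if scale_range <= 0:
        raise ValueError("scale_range must be positive")
    return 1.0 - abs(y_true - y_pred) / scale_range


# The worked example from the text: predict 4 when the truth is 5 on a 1-5 item.
print(item_score(y_true=5, y_pred=4, scale_range=4))  # 0.75
```

An exact hit gives 1.0, and a prediction at the opposite end of the scale gives 0.0.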
Average the per-answer scores across every person and question.
In plain terms: A_MNAI is simply MNAI's average score over all 191,406 held-out answers from all 2,058 people.
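Written out, and assuming the average pools all held-out answers equally (as the plain-terms description suggests) and uses the per-answer score of one minus the scaled absolute error, the accuracy would be:

```latex
A_{\mathrm{MNAI}} \;=\; \frac{1}{N} \sum_{(i,j)} \left( 1 - \frac{\lvert y_{ij} - \hat{y}_{ij} \rvert}{R_j} \right),
\qquad N = 191{,}406
```

Here the sum runs over every held-out (person, question) pair, and N is the total number of held-out answers.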
Score people against their own earlier answers, the exact same way.
Because the later wave repeats earlier questions, we can score each person against their own past answer: how often they agree with themselves. That is the ceiling.
In plain terms: people reproduce their own earlier answer only about 73% of the time exactly, rising to 83% once near-misses count. That 83.4% is the best score anyone could get, MNAI included.
MNAI's accuracy as a share of the human ceiling.
In plain terms: MNAI captures 89% of the predictable signal. It is 89% of the way to matching how well people predict their own future answers. A score above 100% would actually be a red flag: it would mean MNAI is more rigidly consistent than a real human.
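The normalization is a single division. The sketch below uses the two accuracies reported in this document (74.2% for MNAI, 83.4% for the human ceiling); the function name `share_of_ceiling` is illustrative.

```python
def share_of_ceiling(model_accuracy, human_ceiling):
    """Return the model's accuracy as a fraction of the human self-consistency ceiling."""
    if human_ceiling <= 0:
        raise ValueError("human_ceiling must be positive")
    return model_accuracy / human_ceiling


share = share_of_ceiling(model_accuracy=0.742, human_ceiling=0.834)
print(f"{share:.1%}")  # 89.0%
```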
With 2,058 people, the estimate is tight.
Answers from the same person are correlated, so we compute uncertainty by resampling people (a cluster bootstrap, 4,000 draws) rather than individual answers. The confidence interval sits comfortably above the ~85% reference from Park et al.
In plain terms: if we re-ran the study on fresh samples of people, we'd expect the score to stay between 88.7% and 89.4% about 19 times out of 20.
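A cluster bootstrap of this kind can be sketched as follows. This is an illustration under stated assumptions, not the report's actual code: each person is represented by a hypothetical `(model_sum, human_sum)` pair, meaning the summed per-answer scores for MNAI and for the person's own self-agreement over the same held-out questions, and the interval is taken from simple percentiles of the resampled ratios.

```python
import random


def cluster_bootstrap_ci(people, n_draws=4000, alpha=0.05, seed=0):
    """Percentile confidence interval for the share of the human ceiling.

    people: list of (model_sum, human_sum) pairs, one per person (assumed layout).
    """
    rng = random.Random(seed)
    n_people = len(people)
    ratios = []
    for _ in range(n_draws):
        # Resample whole people with replacement so that each person's
        # correlated answers stay together in every draw.
        sample = rng.choices(people, k=n_people)
        model_total = sum(p[0] for p in sample)
        human_total = sum(p[1] for p in sample)
        # Both averages cover the same answers, so the count cancels in the ratio.
        ratios.append(model_total / human_total)
    ratios.sort()
    lower = ratios[int((alpha / 2) * n_draws)]
    upper = ratios[min(n_draws - 1, int((1 - alpha / 2) * n_draws))]
    return lower, upper
```

Resampling individual answers instead of people would treat correlated answers as independent and make the interval look narrower than it should be.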
We reproduced the published numbers before scoring ourselves.
To be sure we're comparing like with like, we first re-computed the dataset authors' own published results with our code. They matched to within 0.01, as the table below shows. So our 89% is measured on exactly the field's yardstick, not a re-defined one.
| Quantity | Our code | Published |
|---|---|---|
| Human ceiling | 0.831 | 0.827 |
| Reference LLM twin | 0.746 | 0.737 |
This is the number MNAI delivers.
MNAI predicts a person's answers 89% as accurately as that person predicts their own future answers, and the human's own consistency is the ceiling nobody can beat. This is the headline figure, and it exceeds the ~85% state of the art (Park et al.).
89.0% = MNAI's accuracy (74.2%) ÷ the human ceiling (83.4%)
MNAI's share of the human ceiling (the headline number), broken out by topic. Higher means closer to the best score anyone could achieve.
MNAI is strongest on reasoning and judgment, nearly matching human self-consistency, and solid on economic choices.