Benchmark release · July 2026

Obvious mistakes are now a product tradeoff.

ObviousBench measures whether models avoid simple, visible mistakes across repeated attempts—and what that reliability costs.


—model/config rows

144 private held-out items

8 visible-failure families

pass^3: all three sampled answers correct

Large language models can now do things that would have sounded absurd a few years ago. They can summarize thousands of pages in seconds, write simple applications in minutes, and plan entire businesses in hours.

And yet the same class of system can show up in a flagship product and get basic arithmetic wrong, contradict itself in obvious ways, or even fail to spell its own name. These are the kinds of mistakes that can cause users to lose trust in the technology, the product, and the company behind it.

Larger models and deeper reasoning can reduce these failures, but they also add cost and latency. Product teams still need to know which visible mistakes become more likely when they choose a smaller model, a cheaper route, or a shorter reasoning budget.

ObviousBench is not a smart-versus-dumb ranking. It is a way to compare capability, visible brittleness, and cost before those choices reach users.

What this measures

Plain tasks with objective answers: literal counting, spelling transforms, ordering, negation, formatting, arithmetic, word counting, and simple constraint awareness.

How to read it

Answer pass^3 is not pass@3. Pass@3 credits an item when at least one of three sampled answers is correct; answer pass^3 requires all three sampled answers to be correct. Strict pass^3 stays available as a formatting diagnostic.
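The difference between the two metrics can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code: the function names, the item identifiers, and the per-attempt results are all hypothetical.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]]) -> float:
    """Share of items where every sampled answer is correct (pass^k)."""
    return sum(all(attempts) for attempts in results.values()) / len(results)


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Share of items where at least one sampled answer is correct (pass@k)."""
    return sum(any(attempts) for attempts in results.values()) / len(results)


# Hypothetical grading results: three sampled answers per item.
results = {
    "item-1": [True, True, True],     # counts for both metrics
    "item-2": [True, False, True],    # counts for pass@3 only
    "item-3": [False, False, False],  # counts for neither
    "item-4": [True, True, True],     # counts for both metrics
}

print(pass_hat_k(results))  # 0.5
print(pass_at_k(results))   # 0.75
```

A single miss on an item is invisible to pass@3 but removes that item from pass^3, which is why the repeated-reliability number is the stricter of the two.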

What it is not

Not a global intelligence ranking, not a shame board, and not a claim that one visible miss makes a model generally bad.

Arithmetic: What is 27 - 9 + 4? → 22
Character counting: How many lowercase “a” characters are in “abracadabra”? → 5
Spelling transforms: Replace every e in freezer with 3. → fr33z3r
Ordering: Sort q, m, z, h, r alphabetically. → h, m, q, r, z
Negation: Choose the word without the letter e: peach, melon, plum, cherry. → plum
Format compliance: Reply with exactly this string and nothing else. → receipt-204
Word counting: How many words are in “Fresh bread cooled on the rack”? → 6
Constraint awareness: The car wash is 100m away. If I want my car washed, should I walk there or drive there? → drive there
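Most of the sample items have answers that can be recomputed mechanically, which is what makes them objective. The sketch below rechecks six of the examples with ordinary Python. The function and key names are hypothetical, and this is not the benchmark's grader; the constraint-awareness item is left out because it depends on judgment about the situation, not on a computation.

```python
def recompute_examples() -> dict:
    """Recompute the expected answers for the mechanically checkable samples."""
    words = ["peach", "melon", "plum", "cherry"]
    return {
        "arithmetic": 27 - 9 + 4,
        "character_counting": "abracadabra".count("a"),
        "spelling_transform": "freezer".replace("e", "3"),
        "ordering": ", ".join(sorted(["q", "m", "z", "h", "r"])),
        "negation": [w for w in words if "e" not in w],
        "word_counting": len("Fresh bread cooled on the rack".split()),
    }


def is_exact_reply(reply: str, expected: str) -> bool:
    """Format compliance: the reply must equal the expected string exactly."""
    return reply == expected


answers = recompute_examples()
print(answers["arithmetic"])          # 22
print(answers["spelling_transform"])  # fr33z3r
print(is_exact_reply("receipt-204", "receipt-204"))        # True
print(is_exact_reply("Sure! receipt-204", "receipt-204"))  # False
```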

The aggregate thesis

The results form a cost–reliability frontier.

Pass rates should be read alongside cost and reported reasoning effort. The preferred region is the upper-right: higher repeated reliability at lower estimated full-run cost.

Legend: color = provider; diamond = reported reasoning; circle = no or minimal reported reasoning; black line = Pareto frontier.

The plot includes all public-surface configurations. Quiet points show the field; the connected points are the configurations that no other configuration beats on both score and cost.
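One way to compute such a frontier is a single sweep over rows sorted by cost. The sketch below assumes each row carries a cost and a score; the `Row` type, the row names, and the numbers are hypothetical and are not taken from the benchmark data.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    name: str
    cost: float   # estimated full-run cost (lower is better)
    score: float  # answer pass^3 in percent (higher is better)


def pareto_frontier(rows: List[Row]) -> List[Row]:
    """Return rows that no other row beats on both cost and score.

    Rows are visited from cheapest to most expensive; a row is kept only if
    it scores strictly higher than every cheaper (or equally cheap) row kept
    so far. Exact duplicates of a frontier point are dropped.
    """
    frontier: List[Row] = []
    best_score = float("-inf")
    for row in sorted(rows, key=lambda r: (r.cost, -r.score)):
        if row.score > best_score:
            frontier.append(row)
            best_score = row.score
    return frontier


# Hypothetical rows, for illustration only.
rows = [
    Row("A", 1.0, 60.0),
    Row("B", 2.0, 55.0),  # costs more than A and scores lower
    Row("C", 3.0, 80.0),
    Row("D", 5.0, 80.0),  # same score as C at higher cost
    Row("E", 8.0, 95.0),
]

print([r.name for r in pareto_frontier(rows)])  # ['A', 'C', 'E']
```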

A decision table, not a podium

What it costs to cross a reliability bar.

Because the benchmark is intentionally saturatable, the practical view is the cheapest setting in each family that clears a chosen quality threshold.

#	Model	Effort	Pass^3	Cost	Tokens	Weights
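The decision-table view can be sketched as a filter followed by a per-family minimum. The `Config` fields loosely mirror the table columns, but the type, the family names, and every number below are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Config(NamedTuple):
    family: str
    effort: str
    pass3: float  # answer pass^3 in percent
    cost: float   # estimated full-run cost


def cheapest_clearing(configs: List[Config], bar: float) -> Dict[str, Config]:
    """For each family, pick the cheapest configuration with pass^3 >= bar."""
    best: Dict[str, Config] = {}
    for cfg in configs:
        if cfg.pass3 < bar:
            continue  # does not clear the chosen reliability bar
        current = best.get(cfg.family)
        if current is None or cfg.cost < current.cost:
            best[cfg.family] = cfg
    return best


# Hypothetical rows, for illustration only.
configs = [
    Config("alpha", "none", 40.0, 0.10),
    Config("alpha", "low", 85.0, 0.30),
    Config("alpha", "high", 95.0, 1.20),
    Config("beta", "none", 70.0, 0.20),
    Config("beta", "medium", 88.0, 0.90),
]

at_80 = cheapest_clearing(configs, bar=80.0)
at_90 = cheapest_clearing(configs, bar=90.0)
print({family: cfg.effort for family, cfg in at_80.items()})
print({family: cfg.effort for family, cfg in at_90.items()})
```

Families that never clear the bar are simply absent from the result, so raising the bar both raises the price and shrinks the list of candidates.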

One-model proof

The same model can be brittle or reliable.

The benchmark’s central result is visible inside a single model family. Turning up reasoning does not merely add a few points for GPT‑5.4 nano; it changes the apparent product risk.

GPT‑5.4 nano moves from — answer pass^3 with no explicit reasoning to — at low, — at medium, — at high, and — at xhigh.

The first jumps are enormous. Later settings still improve reliability, but the larger points show the progressively higher run cost and reported reasoning-token use required to buy those gains.

Answer pass^3 across five GPT‑5.4 nano effort settings. The line shows repeated-answer reliability; the blue bars and right axis show estimated full-run cost for 144 items and 432 attempts.

A ceiling by design

Saturation is evidence, not a defect.

While most benchmarks seek to test harder and harder capabilities, ObviousBench aims to be saturatable by top models. It is designed to provide contrast between model sizes and reasoning depths, not to remain unsolved at the frontier or to obscure false negatives.

Once several systems solve nearly every item, the useful question changes. Rank matters less than the cost and reasoning budget required to reach the same reliability bar.


Each model family contributes its best public-surface configuration, grouped by answer pass^3 band.
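A grouping like this can be sketched in two steps: keep each family's highest-scoring configuration, then assign it to a band. The band edges, the reading of "best" as highest answer pass^3, the function names, and the rows below are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical band floors (percent), checked from highest to lowest.
BANDS = [(95.0, "95-100"), (90.0, "90-95"), (80.0, "80-90"), (0.0, "below 80")]


def band_for(score: float) -> str:
    """Return the label of the first band whose floor the score reaches."""
    for floor, label in BANDS:
        if score >= floor:
            return label
    raise ValueError(f"score out of range: {score}")


def group_best_by_band(rows: List[Tuple[str, str, float]]) -> Dict[str, List[str]]:
    """rows are (family, config, pass3); returns band label -> sorted families."""
    best: Dict[str, float] = {}
    for family, _config, score in rows:
        if family not in best or score > best[family]:
            best[family] = score
    groups: Dict[str, List[str]] = defaultdict(list)
    for family, score in best.items():
        groups[band_for(score)].append(family)
    return {label: sorted(families) for label, families in groups.items()}


# Hypothetical rows, for illustration only.
rows = [
    ("alpha", "none", 40.0),
    ("alpha", "high", 96.5),
    ("beta", "none", 70.0),
    ("beta", "medium", 88.0),
    ("gamma", "low", 91.0),
]

print(group_best_by_band(rows))
```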

Secondary stories

Useful cuts through the same result surface.

The same aggregate data tells several product stories: newer rows are not always cheaper reliability improvements, early reasoning models remain surprisingly strong, and open-weight rows can be unusually efficient.

OpenAI history

No-thinking performance has improved unevenly.

The progress of OpenAI's no-thinking models has been uneven, with GPT‑4.1 matching GPT‑5.4 performance at 56% of the cost. GPT‑5.6 Sol improves on GPT‑5.5 from 84.0% to 88.2% at essentially the same measured run cost, while Terra reaches 77.1% at roughly half Sol's cost.

The useful product question is not just whether the newest label is better. It is whether each generation shifts the efficient frontier, or asks teams to pay more for similar visible-risk exposure.

Selected public-surface rows with no reported reasoning telemetry. The line shows answer pass^3; bars show measured run cost.

Appendix

Data.

The launch story uses curated, defensible cuts. This appendix keeps the complete aggregate surface available for checking alternative questions.



Reasoning settings do not guarantee reasoning use.

Gemini Flash improved, then became more expensive.

Strong reasoning became dramatically cheaper.

Opus improves, regresses, then recovers unevenly.

Sonnet gains reliability, but not in a perfectly monotonic line.

Haiku has made limited progress.
