Public Dataset
Release link pending. No unpublished item cards or private human-baseline data are committed here.
Where obvious tasks still break language models
Draft: This website and result table are placeholders until the paper-sweep artifacts are frozen.
ObviousBench measures short, objective prompts that a careful human can solve directly but that language models still answer incorrectly.
The headline score is answer correctness. Format and strict-compliance scores are reported separately, so a correct answer wrapped in extra prose does not count against the headline score.
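To make the split between correctness and strict compliance concrete, here is a minimal sketch of how the two scores could be computed side by side. It is not the benchmark's scorer: `Score`, `normalize`, `score_response`, and the whole-token matching rule are hypothetical names and choices made for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    correct: bool  # headline score: the expected answer is present
    strict: bool   # reported separately: the response is only the answer


def normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop trailing '.' or '!'."""
    return text.strip().lower().rstrip(".!")


def score_response(response: str, expected: str) -> Score:
    """Score one response against one expected answer (assumed rules)."""
    want = normalize(expected)
    strict = normalize(response) == want
    # Assumed lenient rule: the expected answer appears as a whole token,
    # so "13" does not count as containing the answer "3".
    pattern = rf"(?<!\w){re.escape(want)}(?!\w)"
    correct = strict or re.search(pattern, response.lower()) is not None
    return Score(correct=correct, strict=strict)


print(score_response("The answer is 3.", "3"))
print(score_response("3", "3"))
```

Under these assumed rules, a response such as "The answer is 3." is correct but not strict, while a bare "3" passes both. A lenient rule this simple would also accept a response that lists several candidate answers, so a real scorer would likely need a stricter answer-extraction step.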
The public site will link only frozen, release-safe artifacts after the report, dataset, and code paths are intentionally published.
Draft placeholder layout. Replace with frozen paper-sweep results before public launch.
| Rank | Model | Correct | 95% CI | Strict | Cost |
|---|---|---|---|---|---|
| 1 | Example model | -- | -- | -- | -- |

| Rank | Model | Strict | Correct | Notes |
|---|---|---|---|---|
| 1 | Example model | -- | -- | Draft placeholder |
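The correctness table includes a 95% CI column, but this page does not state how the interval is computed. As one plausible option, the sketch below uses a Wilson score interval for a proportion of correct answers. The function name `wilson_interval` and the choice of the Wilson method are assumptions, not the benchmark's documented procedure.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(successes=50, total=100)
print(f"{low:.3f} to {high:.3f}")
```

The Wilson interval behaves sensibly when a model scores near 0% or 100%, which is likely on a benchmark of short, objective prompts. A bootstrap over items would be a reasonable alternative.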
Current task families include character counting, spelling transforms, arithmetic, word counting, ordering, format compliance, negation, and constraint awareness.
Deterministic scorers are used for the release-safe benchmark artifacts.
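As an illustration of what a deterministic scorer can look like for one task family, the sketch below handles character counting: the reference answer is computed directly from the prompt inputs, and the response is checked against it with no model or human judgment involved. The names `count_char` and `score_count`, the case-insensitive counting, and the "first integer in the response" rule are all assumptions for this example, not the released scorers.

```python
import re


def count_char(text: str, char: str) -> int:
    """Reference answer: case-insensitive occurrences of char in text."""
    return text.lower().count(char.lower())


def score_count(response: str, text: str, char: str) -> bool:
    """True when the first integer in the response equals the reference count."""
    match = re.search(r"\d+", response)
    return match is not None and int(match.group()) == count_char(text, char)


print(count_char("strawberry", "r"))
print(score_count("There are 3 of them.", "strawberry", "r"))
```

Because both the reference answer and the comparison are pure functions of the inputs, the same response always receives the same score, which is what makes results reproducible across reruns.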
A lightweight demo can be added after the public prompt examples are selected. Until then, this page intentionally avoids exposing unpublished benchmark items.
The report link will be added after the arXiv-ready PDF and citation metadata are frozen.
Release link pending. The public code path will be linked only after the publication repo or branch is approved.
BibTeX pending until arXiv metadata exists.