More than Multiple Choice

How do we measure whether an LLM is growing in knowledge? Historically, the answer has been simple: give it a standardized test. We've measured progress by a model's ability to ace various domain-specific exams (bar, medical boards, trivia, etc.). But as existing question/answer evaluation benchmarks begin to saturate, we may reach a point where static knowledge tests are no longer sufficient to gauge true intelligence. We have to stop asking what a model knows, and start measuring what it can do.

Knowledge Benchmark Saturation

One goal of evaluating LLMs is to understand where they approach 'expert level' capability in a given domain. In the ancient age of five years ago, we measured text model performance with multiple choice knowledge questions, like MMLU:

MMLU examples — Examples from "Massive Multitask Language Understanding" (MMLU, 2021)

As of March 2026, frontier models are saturating benchmarks like MMLU, with the top 10 models reporting between 82% and 93% on Global-MMLU-Lite, a subset of the original MMLU evaluation focused on global knowledge and global context:

Global-MMLU-Lite March 2026 — Global MMLU Lite, March 2026 (Artificial Analysis)

Perhaps it is unsurprising that models 'know' the answers to MMLU evaluations - these questions are well answered via direct memorization. As model size grows (either via monolithic, dense models, or larger mixture-of-experts architectures), the models can store more static, factual information. This saturation is driving a push to develop new benchmarks that go beyond simple information recall. For instance, in 2024, MMLU-Pro introduced a more difficult, 10-answer format (instead of MMLU's 4 multiple-choice answers), and added more 'reasoning-based' questions, like this one:

Example 'challenging question' from MMLU-Pro — Table 10 Case 3 from MMLU-Pro Paper - Example "Reasoning" question

These 'more difficult' questions in MMLU-Pro are hard to answer via memorization, forcing models to 'think through' the given data to reach the correct answer. When MMLU-Pro was published, no frontier model exceeded a 73% success rate on the benchmark. Failure cases included reasoning errors, question understanding errors, calculation errors, and lack of specific subject knowledge. Notably, models that used chain-of-thought (CoT) prompting (e.g. "Think step-by-step...") instead of generating an answer 'directly' gained significantly more on MMLU-Pro than on MMLU - indicating that MMLU-Pro tested something beyond simple 'answer regurgitation'.

Models utilizing Chain of Thought perform better on MMLU-Pro
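To make the 'direct' versus CoT distinction concrete, here is a minimal sketch of how a multiple-choice question might be prompted in each style and how the final letter might be pulled out for scoring. The prompt wording, the "The answer is (X)" convention, and the function names are assumptions for illustration, not the exact setup used by MMLU or MMLU-Pro.

```python
import re

LETTERS = "ABCDEFGHIJ"  # enough letters for a 10-option question


def build_prompt(question, options, use_cot):
    """Format a multiple-choice question in direct or chain-of-thought style."""
    lines = [question]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    if use_cot:
        # Hypothetical CoT instruction: reason first, then commit to a letter.
        lines.append("Think step-by-step, then finish with 'The answer is (X)'.")
    else:
        # Hypothetical direct instruction: commit to a letter immediately.
        lines.append("Reply with only 'The answer is (X)'.")
    return "\n".join(lines)


def extract_choice(response, n_options):
    """Return the last letter stated in 'answer is (X)' form, or None."""
    valid = LETTERS[:n_options]
    matches = re.findall(rf"answer is \(?([{valid}])\)?", response)
    return matches[-1] if matches else None


def accuracy(predictions, gold):
    """Fraction of questions where the extracted letter matches the key."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

One side effect of the wider format is worth noting: with four options, random guessing scores 25% on average, while with ten options it scores 10%, so lucky guesses contribute less to the final number.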

Today, multiple frontier models score above 80% on MMLU-Pro, with Gemini 3 Flash (not even 3.1!) pushing 90%:

mmlu pro march 2026 results — MMLU Pro, March 2026 (Artificial Analysis)

While benchmarks like MMLU still have value for identifying unexpected 'general knowledge' losses in smaller model ablation or distillation scenarios, to challenge the current frontier, new evaluations are emerging. Humanity's Last Exam claims to offer "a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage", and the results are compelling - no frontier model currently even reaches 45% success rate.

HLE results March 2026 artificial analysis — Humanity's Last Exam, March 2026 (Artificial Analysis)

As an analogy: if the original MMLU represented pub trivia, and MMLU-Pro pushed into undergraduate course territory, HLE focuses on PhD-level knowledge, generated with the help of almost 1000 professor- and researcher-level contributors across 500 institutions globally. Humanity's Last Exam tries to require both multi-step thinking and extremely deep domain knowledge (not generally 'solved' on the broader internet). Some example questions from HLE help illustrate this:

HLE evaluation question examples — Examples from the HLE Corpus

But, what happens if (when?) models eventually saturate 'expert level questions' like what HLE presents above - are models done improving? Intuitively, there is a sense that expert level work goes beyond 'getting the answer right'. A human PhD doesn't just stare at a piece of paper and generate an answer. They formulate hypotheses, run experiments, write and test code, search the internet, use external software, make mistakes, and correct their approach over days or weeks. How might one test this kind of capability, then?

Give Models the Tools

If traditional evaluations (MMLU, MMLU-Pro, even HLE) measure 'what a model knows' parametrically, agentic evaluations (indirectly) measure 'how well can a model use tools to accomplish a given task'. The first approach represents crystallized intelligence, but agentic evaluations move towards measuring fluid intelligence (the ability to solve problems abstractly, independently of having seen them before). For example, if we ask a question about how many patients in a hospital are wearing shoes, there are at least two distinct 'approaches' to the problem. The first approach applies high school algebra (the 'crystallized' approach), and the second involves a solution that bypasses mental math altogether.

In an agentic workflow, the model doesn't need to perfectly compute the algebraic steps within its own neural network weights, essentially 'memorizing' formulas. Instead, it relies on its fluid intelligence: it writes a quick Python script, executes it in a secure terminal, reviews the output, and uses that definitive result to answer your question, or to refine the script and reason through its own work. Instead of memorized formulas, the model leverages tool use, iteration, and verification.

Using learned algebraic tools, or general logic
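The write, execute, review, refine cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions: a fixed list of candidate scripts stands in for the model (a real agent would write each new attempt after reading the previous error), the function names are hypothetical, and a plain subprocess is not a secure sandbox.

```python
import subprocess
import sys


def run_script(source, timeout_s=10):
    """Run Python source in a separate interpreter; return (ok, output)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    ok = proc.returncode == 0
    # On success report stdout; on failure report the error text instead.
    return ok, (proc.stdout if ok else proc.stderr).strip()


def solve_with_tool(candidate_scripts):
    """Try scripts in order until one runs cleanly (write, run, refine)."""
    for attempt, source in enumerate(candidate_scripts, start=1):
        ok, output = run_script(source)
        if ok:
            return attempt, output
    return None, None


# The first attempt crashes; the "refined" second attempt succeeds.
attempt, result = solve_with_tool(["print(1/0)", "print(sum(range(5)))"])
```

The point of the design is that the final answer comes from executed code whose output the model can inspect, rather than from arithmetic carried out inside the weights.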

SWE-Bench: Real World Code Issues

So, what does a practical agentic evaluation look like? Let's look at SWE-Bench, which asks models to generate a patch that will fix an actual issue from a real codebase. The issues were generated from real-world GitHub PRs that resolved at least one GitHub issue, and updated tests as part of the PR. For each issue, the benchmark clones the code repository into a Docker container, and asks the LLM to resolve the issue. Once the LLM finishes its work, the benchmark runs a set of (hidden from the LLM) unit tests to confirm that the issue is successfully resolved.

SWE-bench Teaser Image — Overview of SWE-Bench Workflow
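The pass/fail decision at the end of that workflow can be sketched as a check over hidden test results. This is a simplified sketch, not the benchmark's actual harness: the split into tests the patch must fix and tests it must not break, along with every function and test name here, is an assumption for illustration.

```python
def is_resolved(results, must_now_pass, must_still_pass):
    """Decide whether a patch resolved an issue.

    results maps a test name to True (passed) or False (failed) after the
    model's patch is applied. A test missing from results counts as failed.
    """
    fixed = all(results.get(name, False) for name in must_now_pass)
    unbroken = all(results.get(name, False) for name in must_still_pass)
    return fixed and unbroken


def resolve_rate(outcomes):
    """Fraction of issues marked resolved, given a list of booleans."""
    return sum(outcomes) / len(outcomes)


# Hypothetical test names for one issue.
results = {"test_bug_repro": True, "test_existing_api": True}
outcome = is_resolved(results, ["test_bug_repro"], ["test_existing_api"])
```

Because the model never sees these tests, a patch that merely looks plausible does not count; only behavior that survives execution does.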

When SWE-Bench was first introduced, models struggled to solve even single-digit percentages of the issues. A simple 'this is the answer to the issue' is insufficient - the answer required comprehension of the codebase, and reasonable decision making about the correct changes to make to resolve the issue.

Today, however, frontier models and specialized coding agents are excelling on SWE-Bench. On a verified subset of questions from the benchmark, using a common test harness, top models generally exceed 70% resolve rates, at a 'price per issue resolved' between $0.30 and $0.75 USD (with Kimi K2.5 and MiniMax M2.5 as extremely low-cost outliers).
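As a rough illustration of how a 'price per issue resolved' figure can be derived, the sketch below divides total spend by the number of issues actually resolved. The run size, cost, and resolve count are made-up numbers chosen only to show the arithmetic, and the function name is hypothetical.

```python
def price_per_resolved(total_cost_usd, resolved_count):
    """Total spend across all attempts divided by issues actually resolved."""
    if resolved_count <= 0:
        raise ValueError("no resolved issues, so the price is undefined")
    return total_cost_usd / resolved_count


# Hypothetical run: 500 issues attempted, 350 resolved (70%), $175.00 spent.
attempted = 500
resolved = 350
example_price = price_per_resolved(175.00, resolved)  # 0.50 USD per resolved issue
example_rate = resolved / attempted                    # 0.70
```

Note that failed attempts still cost money, so a model with a lower resolve rate pays more per resolved issue even at the same per-attempt cost.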

So, if SWE-Bench evaluated the 'fluid intelligence' of a specific aspect of software engineering (straightforward resolution of bugs/issues), how might similar evaluations benchmark models on other aspects of knowledge work?

OfficeQA Pro - Grounded Enterprise Reasoning

In March 2026, Databricks published a new evaluation, OfficeQA Pro, to measure performance on "navigating large, heterogeneous document corpora, identifying and retrieving relevant materials and then performing grounded analysis", or what they define as 'grounded reasoning'. While there are existing attempts to measure how well models can perform 'economically valuable' tasks (like GDPVal), these are typically much more 'closed world' evaluations, with only a handful of artifacts which are provided to the model up front - the model isn't responsible for gathering the data.

OfficeQA Pro goes beyond this, to a messier enterprise data environment. None of the frontier models from Anthropic, Google, or OpenAI exceed 50% on the benchmark, with only Anthropic's Opus 4.6 breaking 40%:

frontier models failing to exceed 50% on OfficeQA Pro — Frontier Model results from OfficeQA Pro Paper

The dataset for OfficeQA Pro is US Treasury Bulletins from 1932 onward, which are typically 100-200 pages each, and vary dramatically based on when they were published:

OfficeQA Pro dataset examples — Example Data in the OfficeQA Pro Dataset

From this dataset, the authors constructed questions that could not be answered by frontier model parametric knowledge alone. The questions are unambiguous and have one correct numerical answer, which cannot be found simply by 'looking up' a value in a table - they require complex reasoning over documents from multiple reports.

OfficeQA Evaluation Questions — Pro/Easy Questions from the OfficeQA Evaluation

To answer the questions, each agent is given three tools:

Web Search (via DuckDuckGo) for supplementary information
Python Execution/REPL for computations and data manipulation
File Search for exploring parsed PDF files

File search tool for OfficeQA Pro — Custom file search tools given to the agent in OfficeQA Pro
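The three tools above can be pictured as a small dispatch table that an agent calls in a loop until it commits to an answer or runs out of rounds. The sketch below is a hypothetical harness, not the one from the OfficeQA Pro paper: the tool functions are stubs, the scripted policy stands in for the model, and all names are assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str, str]  # (tool name, argument, observation)


def run_agent(
    policy: Callable[[List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_rounds: int,
) -> Optional[str]:
    """Call tools chosen by the policy until it answers or rounds run out."""
    history: List[Step] = []
    for _ in range(max_rounds):
        name, arg = policy(history)
        if name == "answer":
            return arg  # the policy committed to a final answer
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        history.append((name, arg, observation))
    return None  # round budget exhausted without an answer


# Stub tools standing in for web search, a Python REPL, and file search.
stub_tools = {
    "web_search": lambda query: "stub web result",
    "python": lambda code: "stub REPL output",
    "file_search": lambda query: "stub page text",
}


def scripted_policy(history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the model: search files once, then answer."""
    if not history:
        return ("file_search", "hypothetical query")
    return ("answer", history[-1][2])


final = run_agent(scripted_policy, stub_tools, max_rounds=5)
```

Returning None when the budget is exhausted keeps 'ran out of rounds' distinct from 'answered incorrectly', which is useful when analyzing failures.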

The agent is then given a maximum of 200 rounds of tool actions to reach an answer. Agents are evaluated in two scenarios: first, they are given the specific pages (oracle pages) that are required to get the answer. Next, they're asked to answer the question with only the full corpus provided. Performance drops when the agent has to find the relevant information in the full document corpus, but interestingly, in a subset of 30 test scenarios, the top agents outperform or tie humans at the task, even if the allowable error in the result is increased to 5%:

ai exceeding human capability on officeQA tasks treasury reports

On these tasks, the authors report that human annotators take approximately four times longer than the average agent, and are about 28% less accurate in the 0.0% allowable error scenario. And, given that the top models are still below 50% correctness on the entire OfficeQA Pro dataset at 0.0% allowable error, there is room to grow.
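One plausible reading of 'allowable error' is a relative tolerance around the single correct numerical answer. The sketch below assumes that interpretation; the paper's exact scoring rule is not described here, so the formula and function name should be treated as assumptions.

```python
def within_tolerance(predicted, expected, allowable_error=0.0):
    """True if the relative error is at most allowable_error (0.05 means 5%)."""
    if expected == 0:
        # Relative error is undefined at zero; require an exact match instead.
        return predicted == expected
    return abs(predicted - expected) / abs(expected) <= allowable_error


# A prediction of 104 against a true value of 100 is off by 4%.
strict = within_tolerance(104.0, 100.0, allowable_error=0.0)    # False
lenient = within_tolerance(104.0, 100.0, allowable_error=0.05)  # True
```

Under this reading, the 0.0% setting demands an exact match, which is why a small extraction or rounding slip anywhere in a multi-step calculation is enough to mark an answer wrong.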

The era of treating LLMs like prodigy test-takers is coming to a close. As models inevitably saturate traditional benchmarks, the goalposts must shift from crystallized knowledge to fluid execution.

If these systems will eventually take on longer-running delegated 'real world' tasks, our evaluations must mirror the messy, unstructured reality of actual work. In the real world, the answer isn't a multiple-choice bubble; it's a code change that implements a feature, a financial report that is factually correct and aligned with the company style, and requires use of the same systems of record that humans use. We are no longer just building encyclopedias; we are building reasoning engines. And to measure their true capabilities, we finally have to grade them on what they can actually do.

Unalarming
