GPT‑5.5 set a new SOTA (state of the art) on OfficeQA Pro, Databricks’ benchmark for complex enterprise agent tasks.

Company size: Enterprise
Region: North America
Industry: Technology
Product: Codex
50%
Accuracy on the OfficeQA Pro benchmark (SOTA)
46%
Reduction in error rate on the OfficeQA Pro benchmark compared with GPT-5.4
Databricks is rolling out GPT‑5.5 to customer agent workflows after the model set a new state of the art on OfficeQA Pro, the company’s benchmark for complex enterprise document tasks.
OfficeQA Pro evaluates how well models handle parsing, retrieval, and grounded reasoning in workflows involving scanned PDFs, legacy documents, and long-context documents. These are tasks that often cause agentic systems to fail in production.
In the agent-harness scenario, GPT‑5.5 reduced the error rate by 46% compared with GPT‑5.4 and became the first model to exceed 50% accuracy on OfficeQA Pro.
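For readers who want to see how a relative error-rate reduction relates to accuracy, the short Python sketch below works through the arithmetic. The function name and the sample accuracies are illustrative assumptions, not figures published by Databricks or OpenAI.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Fraction by which the error rate (1 - accuracy) shrinks relative to the baseline."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies chosen for easy arithmetic (NOT benchmark results):
# the error rate falls from 0.80 to 0.40, which is a 50% relative reduction.
example = relative_error_reduction(baseline_accuracy=0.20, new_accuracy=0.60)
print(f"{example:.0%}")
```

In this made-up case accuracy rises by 40 percentage points while the error rate is cut in half, which is why a relative error reduction and an accuracy figure are reported as separate numbers.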
“Codex with 5.5 is now state of the art among all agents and models.”
– Arnav Singhvi, Research Engineer
SOTA performance on OfficeQA Pro
OfficeQA Pro includes a large number of scanned or legacy enterprise documents, where even small extraction errors during parsing can cascade through downstream workflows. “If you can’t extract a number or a value, that changes the entire trajectory of what the agent is going to do later,” Singhvi explained.
Databricks found that GPT‑5.5 delivered its biggest gains in these parsing-heavy workflows. “Earlier models like 5.4 couldn’t get all the numbers right, but 5.5 seems to have made a leap in parsing old documents and scanned PDFs,” Singhvi said.
The team also observed improvements in orchestrating multi-step tasks. “One thing we saw with 5.4 is that it would sometimes go off on unnecessary retrieval tangents, which could lead to very inefficient trajectories,” Singhvi said.
Compared with earlier models, GPT‑5.5 is more reliable at retrieving relevant context and completing complex workflows without additional supervision.
Bringing GPT‑5.5 into production workflows
Databricks is now offering GPT‑5.5 through AI Unity Gateway, where customers can use the model in workflows built with AgentBricks and the Agent Supervisor API. In these systems, GPT‑5.5 handles orchestration across specialized agents for parsing, retrieval, and execution.
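The supervisor pattern described here can be pictured with a minimal Python sketch. It does not use any Databricks or OpenAI API: the `Supervisor` class, the three stand-in agents, and the fixed plan are all hypothetical names invented for illustration. In a real deployment the supervising model would decide which agent to call next rather than follow a hard-coded list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type for illustration only: an agent maps a state dict to a new state dict.
Agent = Callable[[dict], dict]


def parse_agent(state: dict) -> dict:
    # Stand-in parser: pull "key: value" pairs out of the raw document text.
    fields = {}
    for line in state["document"].splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    return {**state, "fields": fields}


def retrieval_agent(state: dict) -> dict:
    # Stand-in retriever: look up the field named in the question.
    wanted = state["question"].strip().lower()
    return {**state, "context": state["fields"].get(wanted)}


def execution_agent(state: dict) -> dict:
    # Stand-in executor: return the retrieved value, or report that it is missing.
    context = state["context"]
    return {**state, "answer": context if context is not None else "not found"}


@dataclass
class Supervisor:
    agents: Dict[str, Agent]
    plan: List[str]  # fixed order here; a model would choose steps dynamically
    trace: List[str] = field(default_factory=list)

    def run(self, document: str, question: str) -> dict:
        state = {"document": document, "question": question}
        for step in self.plan:
            state = self.agents[step](state)
            self.trace.append(step)  # record which agent ran, in order
        return state


supervisor = Supervisor(
    agents={"parse": parse_agent, "retrieve": retrieval_agent, "execute": execution_agent},
    plan=["parse", "retrieve", "execute"],
)
result = supervisor.run("Revenue: 42\nRegion: EMEA", "revenue")
print(result["answer"], supervisor.trace)
```

Because each agent only reads what earlier agents wrote into the shared state, a value the parser fails to extract surfaces downstream as a missing answer, which mirrors the cascading-error concern raised earlier in the article.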
“We’re seeing a lot of customers use AgentBricks and the Agent Supervisor API to build custom agent workflows,” Singhvi said. “It’s really exciting to have GPT‑5.5 supervise those workflows.”
“GPT-5.5 is extremely good at knowledge work augmentation. It’s a quantum shift in how we can do knowledge work for us.”
– Arnav Singhvi, Research Engineer