Databricks Brings GPT-5.5 to Enterprise Agent Workflows

GPT‑5.5 set a new SOTA (state of the art) on OfficeQA Pro, Databricks’ benchmark for complex enterprise agent tasks.

Company size: Enterprise

Region: North America

Industry: Technology

Product: Codex

50%

Accuracy on the OfficeQA Pro benchmark (SOTA)

46%

Reduction in error rate on the OfficeQA Pro benchmark compared with GPT-5.4

Databricks is rolling out GPT‑5.5 to customer agent workflows after the model set a new state of the art on OfficeQA Pro, the company’s benchmark for complex enterprise document tasks.

OfficeQA Pro evaluates how well models handle parsing, retrieval, and grounded reasoning in workflows involving scanned PDFs, legacy documents, and long-context documents. These tasks often cause agentic systems to fail in production.

In the agent-harness scenario, GPT‑5.5 reduced the error rate by 46% compared with GPT‑5.4 and became the first model to exceed 50% accuracy on OfficeQA Pro.
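An error-rate reduction is a relative measure, so it is not the same as a gain in accuracy points. The sketch below shows one common way to compute it, assuming the error rate is simply one minus accuracy. The function name and the example accuracies are hypothetical and chosen only to illustrate the arithmetic; they are not Databricks' reported figures.

```python
def relative_error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Return the fraction by which the error rate shrinks from old to new.

    Assumes error rate = 1 - accuracy, with accuracies given as fractions in [0, 1].
    """
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    if old_error <= 0:
        raise ValueError("old_accuracy must be below 1.0")
    return (old_error - new_error) / old_error


# Hypothetical accuracies, not benchmark results.
reduction = relative_error_reduction(old_accuracy=0.20, new_accuracy=0.40)
print(f"{reduction:.0%}")  # prints 25%
```

In this made-up case, accuracy rises by 20 points while the error rate falls by 25%, because the reduction is measured against the old error rate of 80%.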

“Codex with 5.5 is now state of the art among all agents and models.”

– Arnav Singhvi, Research Engineer

SOTA performance on OfficeQA Pro

OfficeQA Pro includes a large number of scanned or legacy enterprise documents, where even small extraction errors during parsing can cascade through downstream workflows. “If you can’t extract a number or a value, that changes the entire trajectory of what the agent is going to do later,” Singhvi explained.

Databricks found that GPT‑5.5 delivered its biggest gains in these parsing-heavy workflows. “Earlier models like 5.4 couldn’t get all the numbers right, but 5.5 seems to have made a leap in parsing old documents and scanned PDFs,” Singhvi said.

The team also observed improvements in orchestrating multi-step tasks. “One thing we saw with 5.4 is that it would sometimes go off on unnecessary retrieval tangents, which could lead to very inefficient trajectories,” Singhvi said.

Compared with earlier models, GPT‑5.5 is more reliable at retrieving relevant context and completing complex workflows without additional supervision.

Bringing GPT‑5.5 into production workflows

Databricks is now offering GPT‑5.5 through AI Unity Gateway, where customers can use the model in workflows built with AgentBricks and the Agent Supervisor API. In these systems, GPT‑5.5 handles orchestration across specialized agents for parsing, retrieval, and execution.

“We’re seeing a lot of customers use AgentBricks and the Agent Supervisor API to build custom agent workflows,” Singhvi said. “It’s really exciting to have GPT‑5.5 supervise those workflows.”
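The supervisor pattern described here can be pictured with a small sketch. It is an illustration only and does not use the Agent Supervisor API or any Databricks library: every name in it (`TaskState`, `run_supervisor`, the three agent functions) is hypothetical, and a fixed plan stands in for the decisions a supervising model would make at run time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TaskState:
    """Shared state passed from one specialized agent to the next."""
    question: str
    notes: Dict[str, str] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


Agent = Callable[[TaskState], str]


def parsing_agent(state: TaskState) -> str:
    # Stand-in for document parsing, e.g. extracting a value from a scanned PDF.
    return "revenue=120"


def retrieval_agent(state: TaskState) -> str:
    # Stand-in for retrieval; it reads what the parsing step produced.
    return f"context for {state.notes['parse']}"


def execution_agent(state: TaskState) -> str:
    # Stand-in for the final step that answers from the retrieved context.
    return f"answer based on {state.notes['retrieve']}"


def run_supervisor(question: str, agents: Dict[str, Agent], plan: List[str]) -> TaskState:
    """Call each agent named in the plan, in order, recording outputs and the path taken."""
    state = TaskState(question=question)
    for name in plan:
        state.notes[name] = agents[name](state)
        state.trace.append(name)
    return state


agents = {"parse": parsing_agent, "retrieve": retrieval_agent, "execute": execution_agent}
state = run_supervisor("What was reported revenue?", agents, ["parse", "retrieve", "execute"])
print(state.trace)
print(state.notes["execute"])
```

Because each step reads the notes left by earlier steps, a wrong value from the parsing stand-in would flow into everything after it, which is the cascade Singhvi describes.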

“GPT-5.5 is extremely good at knowledge work augmentation. It’s a quantum shift in how we can do knowledge work for us.”

– Arnav Singhvi, Research Engineer