Databricks and GPT-5.5 for Enterprise Agent Workflows

GPT‑5.5 has set a new state of the art (SOTA) on OfficeQA Pro, the benchmark Databricks uses for complex enterprise agent tasks.



Company size: Enterprise

Region: North America

Industry: Technology

Product: Codex

50%

Accuracy on the OfficeQA Pro benchmark (SOTA)

46%

Reduction in error rate on the OfficeQA Pro benchmark compared with GPT-5.4


Following GPT‑5.5’s new SOTA performance on OfficeQA Pro, Databricks is rolling out the model in customer agent workflows. OfficeQA Pro is the benchmark the company uses for complex enterprise document tasks.

OfficeQA Pro evaluates how models parse, retrieve, and reason with evidence in workflows involving scanned PDFs, legacy documents, and long-context documents. These are the kinds of tasks that often break real-world agent systems.

In agent-harness scenarios, GPT‑5.5 reduced the error rate by 46% compared with GPT‑5.4 and became the first model to exceed 50% accuracy on OfficeQA Pro.
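To make the link between accuracy and error-rate reduction concrete, here is a minimal sketch in Python. It assumes the 46% figure is a relative reduction in error rate; the article does not say whether the figure is relative or absolute. The function name and the accuracy values in the usage lines are hypothetical placeholders, not reported OfficeQA Pro scores.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Return the fractional drop in error rate from the baseline to the new model.

    Both arguments are accuracies expressed as fractions between 0 and 1.
    """
    for value in (baseline_accuracy, new_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must be fractions between 0 and 1")

    baseline_error = 1.0 - baseline_accuracy
    if baseline_error == 0.0:
        raise ValueError("baseline has no errors to reduce")

    new_error = 1.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
reduction = relative_error_reduction(0.40, 0.55)
print(f"{reduction:.0%}")  # prints "25%"
```

In this illustration a 15-point accuracy gain corresponds to a 25% relative drop in errors, which is why an error-rate reduction and an accuracy gain should not be read as the same number.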

“Codex powered by 5.5 is state of the art among all agents and models.”

– Arnav Singhvi, Research Engineer


SOTA performance on OfficeQA Pro

OfficeQA Pro contains a large volume of scanned and legacy enterprise documents, and even a small extraction error during parsing can cascade into downstream workflow failures. “Just failing to extract a number or value can change the entire trajectory of the agent afterward,” Singhvi explains.

Databricks found that GPT‑5.5 showed its biggest gains in these parsing-heavy workflows. “With earlier models like 5.4, it wasn’t able to parse every number correctly, but 5.5 seems to show a dramatic improvement in parsing old documents and scanned PDFs,” Singhvi says.

The team also saw improvements in multi-step task orchestration. “One thing we saw with 5.4 was that it would sometimes take unnecessary detours in retrieval, which led to very inefficient trajectories,” Singhvi says.

Compared with earlier models, GPT‑5.5 is more reliable at retrieving relevant context and completing complex workflows without additional supervision.

Bringing GPT‑5.5 into production workflows

Databricks now offers GPT‑5.5 through AI Unity Gateway, allowing customers to use the model in workflows built with AgentBricks and Agent Supervisor API. In these systems, GPT‑5.5 orchestrates parsing, retrieval, and execution across specialized agents.
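The orchestration pattern described here can be sketched in plain Python. This is an illustrative sketch only: `Supervisor`, `parse_agent`, `retrieve_agent`, and `execute_agent` are hypothetical names and are not part of AgentBricks, Agent Supervisor API, or any Databricks or OpenAI SDK. The stub agents stand in for model calls, and the fixed step order is a simplification of a supervisor that would decide which agent to call next.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# An agent is any callable that reads the shared state and returns updates to it.
Agent = Callable[[Dict[str, object]], Dict[str, object]]


@dataclass
class Supervisor:
    """Hypothetical supervisor: runs registered agents in order over shared state."""

    steps: List[Tuple[str, Agent]] = field(default_factory=list)

    def register(self, name: str, agent: Agent) -> None:
        self.steps.append((name, agent))

    def run(self, task: str) -> Dict[str, object]:
        state: Dict[str, object] = {"task": task}
        trace: List[str] = []
        for name, agent in self.steps:
            state.update(agent(state))  # each agent contributes new keys
            trace.append(name)
        state["trace"] = trace
        return state


def parse_agent(state: Dict[str, object]) -> Dict[str, object]:
    # Stand-in for document parsing: extract whole numbers from the task text.
    tokens = str(state["task"]).split()
    return {"numbers": [int(tok) for tok in tokens if tok.isdigit()]}


def retrieve_agent(state: Dict[str, object]) -> Dict[str, object]:
    # Stand-in for retrieval: keep values at or above an arbitrary threshold.
    return {"relevant": [n for n in state["numbers"] if n >= 50]}


def execute_agent(state: Dict[str, object]) -> Dict[str, object]:
    # Stand-in for execution: compute the final answer from retrieved values.
    return {"answer": sum(state["relevant"])}


supervisor = Supervisor()
supervisor.register("parse", parse_agent)
supervisor.register("retrieve", retrieve_agent)
supervisor.register("execute", execute_agent)

result = supervisor.run("Invoice totals: 120 80 45")
print(result["answer"])  # prints 200
print(result["trace"])   # prints ['parse', 'retrieve', 'execute']
```

Because each step consumes the previous step's output, a single number missed by the parsing stub would change the final answer, which is the kind of cascade Singhvi describes.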

“Many customers are using AgentBricks and Agent Supervisor API to build custom agent workflows,” Singhvi says. “Being able to supervise those workflows with GPT‑5.5 is incredibly compelling.”

“GPT-5.5 is exceptionally good at augmenting knowledge work. It’s a paradigm shift for how we operate our intellectual work.”

– Arnav Singhvi, Research Engineer