Zhixiang Future Releases HiDream-O1-Image-Pro, a 200+ Billion Parameter Image Model, as Funding Accelerates

· QbitAI (量子位)

Moving toward a world model.

On May 19 in Beijing, Zhixiang Future held its first Open Day under the theme “Imaging the World.” At the event, the company officially unveiled HiDream-O1-Image-Pro, an image foundation model built on its next-generation native omni-modal architecture, Unified Transformer (UiT). With more than 200 billion parameters, the model set new SOTA records across multiple benchmarks and marks Zhixiang Future’s advance into the “native omni-modal” stage of unified modeling for image, video, text, audio, and other modalities.

At the same time, Zhixiang Future announced the completion of a new round of financing worth hundreds of millions of yuan, with participation from Sequoia China, Jinpu Investment, Caixin Capital, Fujubi Capital, and others. This is Zhixiang Future’s second financing round completed within half a month, underscoring the capital market’s continued optimism about native omni-modal foundation models. As visual generation, embodied intelligence, and other frontier technologies accelerate their convergence, world models have become an important direction for AI evolution. Zhixiang Future’s continued breakthroughs in underlying model architecture, productization capabilities, and industrial ecosystem development have also won further market recognition.

200B+ Parameter Image Foundation Model HiDream-O1-Image-Pro Released, Native Omni-Modal Architecture Upgraded Across the Board

Today, image generation models are moving from the traditional U-Net architecture into the era of Diffusion Transformers (DiT). Mainstream approaches, represented by latent diffusion models (LDMs), compress images with a VAE and encode text with an independent language model, and they have made notable gains in efficiency and generation quality. However, encoding images and text separately also creates inherent bottlenecks in complex semantic understanding, high-fidelity detail restoration, precise text rendering, and multi-task generalization.

To address this challenge, Zhixiang Future has officially released HiDream-O1-Image-Pro, a closed-source image foundation model with more than 200 billion parameters based on a native omni-modal architecture. Unlike traditional encoding paradigms that stitch together separate modules, HiDream-O1-Image-Pro brings raw image pixels, discrete text tokens, and task conditions into a unified, continuous, shared token space, enabling deep fusion of images, text, and multi-task conditions at the representational level. According to the company, this architectural change further unlocks the model’s generation and generalization capabilities, allowing it to achieve new SOTA performance in general text-to-image generation, high-fidelity text rendering, diverse scene generation, and image editing, and demonstrating Zhixiang Future’s leading exploration of native omni-modal foundation model architecture.
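To make the idea of a single shared token space more concrete, the following is a minimal, illustrative sketch and not Zhixiang Future’s implementation, which is closed-source and not described in detail here. It assumes that raw pixels are split into patches and linearly projected, that text tokens and task conditions are looked up in embedding tables, and that all three are concatenated into one sequence of equal-width vectors. Every name, dimension, and the random weights below are hypothetical.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values of one pixel
            patches.append(flat)
    return patches


def linear(vec, weight):
    """Project a vector with a (len(vec) x d_model) weight matrix."""
    d_model = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(d_model)]


def random_matrix(rows, cols, rng):
    """Stand-in for learned parameters (hypothetical, untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_unified_sequence(image, text_ids, task_id, params, patch):
    """Map pixels, text tokens, and a task condition into one token sequence."""
    pixel_tokens = [linear(p, params["pixel_proj"]) for p in patchify(image, patch)]
    text_tokens = [list(params["text_embed"][i]) for i in text_ids]
    task_token = [list(params["task_embed"][task_id])]
    # One sequence, one vector width: a single transformer could attend across all of it.
    return task_token + text_tokens + pixel_tokens


rng = random.Random(0)
D_MODEL, PATCH, VOCAB, TASKS = 8, 2, 16, 3  # toy sizes, chosen only for illustration
params = {
    "pixel_proj": random_matrix(PATCH * PATCH * 3, D_MODEL, rng),
    "text_embed": random_matrix(VOCAB, D_MODEL, rng),
    "task_embed": random_matrix(TASKS, D_MODEL, rng),
}
image = [[[rng.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]  # 4 x 4 RGB
sequence = build_unified_sequence(image, text_ids=[5, 9, 2], task_id=0, params=params, patch=PATCH)
print(len(sequence), len(sequence[0]))  # 1 task + 3 text + 4 pixel tokens, each of width D_MODEL
```

In this toy setup the image contributes four patch tokens, the prompt three text tokens, and the task one condition token, all of the same width. By contrast, the LDM-style pipeline described earlier would pass the image through a VAE and the text through a separate language model before the two representations meet.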

Mei Tao, Founder and CEO of Zhixiang Future, said the company chose the native omni-modal path because of the team’s long-term judgment on the integration of visual generation and the physical world: “Many so-called ‘multimodal foundation models’ today are still, in essence, just ‘single-modal stitching.’ Native multimodality, by contrast, means building the ‘rules of the world’ into the model from the very beginning—it knows physical laws, spatial relationships, and causal logic, so it can truly understand the world, reason about the world, and reconstruct the world, rather than merely ‘generate content.’ That’s why we believe native omni-modality is the necessary path to AGI.”

Yao Ting, Co-founder and CTO of Zhixiang Future, said that the 8B open-source version of HiDream-O1-Image, built on the native omni-modal architecture, recently topped the global open-source rankings on the well-known independent evaluation platform Artificial Analysis. It outperformed mainstream open-source models such as Z-Image Turbo, Qwen-Image, and FLUX.2 [dev], and was the model with the smallest publicly disclosed parameter count among the top 20 entries on the leaderboard. The newly released HiDream-O1-Image-Pro is the closed-source version, with more than 200 billion parameters. It has set new SOTA results in complex text rendering, instruction-based editing, and multi-subject personalization tasks, which the company says validates the architecture’s scalability.

Yao Ting said: “Under the native omni-modal (UiT) architecture, all modalities grow up together from the very beginning. The benefit is that once all modalities are connected, the model can truly achieve Any to Any, supporting any input and any output. That is exactly the capability a world model needs—to understand, generate, and predict different states of the real world within a unified architecture.”

From Visual Generation to World Models: The Industry Discusses the Key Path to AGI

Today, the competition among foundation models is shifting from language understanding and content generation to understanding, generating, and predicting the real physical world. Many technical approaches to world models have emerged across the industry, but they share the same goal: to make AI do more than generate content by building internal representations of the state of the world and the laws governing its changes.

At the Open Day roundtable, Wang Bing, Partner at Orient Fusion Capital; Fu Jianlong, Chief Researcher at Microsoft Research Asia; Ning Jiangbin, Senior Solutions Director at Alibaba Cloud; Pan Yingwei, Technology Partner at Zhixiang Future; and Hong Hu, Founder of AI Nao, held a discussion on “From Multimodality to Omni-Modality: Building World Models and Moving Toward AGI.” From the perspectives of AI investment, embodied intelligence, AI infrastructure, and native omni-modal technology practice, the guests shared their views on the development path of world models.

The participants agreed that AI is moving from “generating content” toward “understanding the world.” The convergence of visual generation, agents, embodied intelligence, and multimodal models points to the same key capability: whether the model can understand environmental states across different modalities, predict how those states change, and form a unified cross-modal representation.

Therefore, visual generation is not merely a content production tool. It naturally needs to learn spatial structures, object relationships, motion trajectories, and state changes, and it also has the foundation to evolve into a world model. The value of a native omni-modal architecture lies precisely in providing a unified modeling framework for images, videos, text, audio, and even actions and embodied data, enabling the model to move from single-modal capability to a more complete world-modeling ability.

Multiple Financing Rounds Completed Within Half a Month, Three Major Agent Products Continue Expanding the Commercial Ecosystem

Not long ago, Zhixiang Future announced the completion of more than 500 million yuan in financing, with investors including Anhui Provincial Investment and Operation, Hefei Investment and Operation, Orient Fusion Capital, and other top-tier investment institutions. At the Open Day, Zhixiang Future revealed that financing continues to accelerate, with a new round completed within half a month, involving Sequoia China, Jinpu Investment, Caixin Capital, and Fujubi Capital.

Public information shows that Jinpu Investment is the manager of the Shanghai Financial Development Investment Fund. In the first phase of the fund, 13 portfolio companies have gone public through IPOs or M&A. The firm has made deep investments across multiple frontier AI areas, including compute infrastructure, foundation models, and agent applications. Caixin Capital is the core industrial investment platform under Caixin Group, a state-owned enterprise under Changde City. It is committed to serving the real economy and driving technological innovation through capital, with a focus on investing in hard-tech areas with clear industrialization prospects, such as artificial intelligence and embodied intelligence. Fujubi Capital focuses on value discovery among leading companies in frontier niche sectors and has broad investments in strategic emerging industries such as intelligent manufacturing, new energy, new materials, biopharmaceuticals, and artificial intelligence. With the entry of new investors such as Sequoia China, Jinpu Investment, Caixin Capital, and Fujubi Capital, Zhixiang Future has formed a diversified capital base supported by continued backing from industrial funds in Anhui, Shanghai, Hunan, and Hangzhou, as well as participation from leading market-oriented VCs including Sequoia China, Orient Fusion Capital, Fenghua Capital, and Dunhong Capital.

As the pace of financing accelerates, Zhixiang Future has established a “model + agent” dual-engine strategy, using models as the foundation and agent applications as the vehicle to drive commercialization. It has formed a clear “1+1+3” business structure: the base layer is 1 HiDream series foundation model, the middle layer is 1 capability platform (the HiHarness enterprise service platform), and the upper-layer agent applications cover 3 core scenarios: commercial marketing, film and television creation, and social media creation.

At the Open Day, the heads of Zhixiang Future’s three agent products presented their progress, demonstrating the company’s “battle-ready” commercialization capabilities. The commercial marketing agent HiBurst now covers scenarios such as cross-border e-commerce content marketing, media operations, and app globalization, and supports mainstream platforms including TikTok, Meta, Douyin, and Xiaohongshu. It has become an official TikTok Top 5 service provider and produces more than one million e-commerce marketing videos annually, with associated GMV of more than 100 million yuan. “Frame Praise,” which the company describes as the world’s first professional-grade AI film and television creation and collaboration agent, offers professional production teams a collaborative tool that combines high-quality output with high efficiency, drawing on film-level image generation quality and an end-to-end workflow from “idea to storyboard to final cut.” To date, the platform has produced more than 5,000 minutes of short-form comic dramas, and more than 1,000 professional teams and ecosystem partners have joined it. The social media creation agent vivago recently completed a product upgrade and, with its stable end-to-end long-thinking capability for minute-level story video generation, quickly rose to No. 1 on the Product Hunt daily chart. vivago currently serves more than 40 million professional and individual users across more than 100 countries and regions.

At the event, Zhixiang Future announced strategic partnerships with Shanghai Film Group’s Shanghai Film New Vision Fund; BlueFocus, China’s largest marketing communications group; Beijing Jetsen Century, a leader in AI film and television; and Bei’er Health, a leading cross-border healthcare services company. The parties will work together on foundation model capability integration, agent application development, and joint development of industry scenarios, driving the industrial deployment of native omni-modal foundation models across film and television creation, commercial marketing, cross-border e-commerce, IP operations, healthcare, and other sectors.

From Visual Generation to Building the World

From the release of HiDream-O1-Image-Pro, to the deployment of three major agent products, to ecosystem collaboration with industry partners, Zhixiang Future is forming a clear path: based on a native omni-modal architecture, it will continue to enhance visual generation capabilities and further evolve toward the unified understanding, generation, and prediction capabilities required by world models.

This is also what Zhixiang Future means by “Imaging the World”: not stopping at “generating visual content,” but enabling AI to gradually gain the ability to understand the world, generate the world, and build the world through native omni-modal modeling. Going forward, Zhixiang Future will continue to focus on the UiT native omni-modal architecture, driving the co-evolution of models, agents, and industry scenarios, and moving toward a more complete world model.