Zhixiang Future Announces “HiDream-O1-Image-Pro,” a 200+B Parameter Large-Scale Image Model, with Funding Momentum Continuing to Accelerate

量子位 (QbitAI)

Moving forward toward world models.

May 19, Beijing — Zhixiang Future held its first Open Day under the theme “Imaging the World.” At the event, the company officially launched HiDream-O1-Image-Pro, an image foundation model built on its new-generation native omni-modal model architecture, Unified Transformer (UiT). This native omni-modal image foundation model, with over 200 billion parameters, not only set new SOTA records across multiple benchmarks, but also signaled that Zhixiang Future is moving into the “native omni-modal” stage of integrated modeling across image, video, text, and audio.

At the same time, Zhixiang Future announced that it had recently completed another funding round worth several hundred million yuan, with participation from multiple institutions including Shenzhen Capital Group, Jinpu Investment, Caixin Capital, and Fujuju Capital. This is the company’s second completed round within half a month, underscoring the capital market’s continued confidence in the native omni-modal foundation model direction. As technologies such as visual generation and embodied intelligence converge more quickly, world models have become a key direction for AI evolution, and the market has again recognized Zhixiang Future’s progress in foundation model architecture, productization, and industrial ecosystem building.

Launching HiDream-O1-Image-Pro, an Image Foundation Model with Over 200 Billion Parameters, and Overhauling the Native Omni-Modal Architecture

Today, image generation models are shifting from the traditional U-Net architecture to the era of diffusion Transformer (DiT). The mainstream approach represented by latent diffusion models (LDMs) has made major progress in both efficiency and generation quality by compressing images with VAE and encoding text with a separate language model. However, the practice of separately encoding image and text has imposed structural limitations on models in areas such as complex semantic understanding, high-fidelity detail reproduction, accurate text rendering, and multi-task generalization.

To address these challenges, Zhixiang Future officially launched HiDream-O1-Image-Pro, a closed image foundation model with over 200 billion parameters, built on a native omni-modal architecture. Unlike the conventional fragmented encoding paradigm that stitches together multiple modules, HiDream-O1-Image-Pro unifies raw image pixels, discrete text tokens, and task conditions into a continuous shared token space, deeply integrating image, text, and multi-task conditions at the representational level. This architectural breakthrough further unlocks the model’s generation and generalization capabilities, pushing it to new SOTA levels in tasks such as general text-to-image generation, high-fidelity text rendering, diverse scene generation, and image editing. It also demonstrates Zhixiang Future’s leading exploration of native omni-modal foundation model architecture.
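To make the idea of a shared token space concrete, the following is a minimal sketch, not HiDream-O1-Image-Pro's actual implementation, which the article does not disclose. It assumes the common Transformer practice of cutting an image into pixel patches, projecting each patch to a fixed width, and looking up text and task embeddings of the same width so that all three can sit in one sequence. The function names (`patchify`, `build_sequence`), the patch size, the embedding width, and the random weights standing in for learned parameters are all hypothetical.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append this pixel's channel values
            patches.append(flat)
    return patches


def linear(vec, weight):
    """Project a length-n vector with an n x d weight matrix."""
    dim = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(dim)]


def random_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_sequence(image, text_ids, task_id, patch, dim, vocab, n_tasks, seed=0):
    """Return one token sequence: [task token] + text tokens + image patch tokens."""
    rng = random.Random(seed)
    channels = len(image[0][0])
    pixel_proj = random_matrix(patch * patch * channels, dim, rng)
    text_table = random_matrix(vocab, dim, rng)
    task_table = random_matrix(n_tasks, dim, rng)

    task_tokens = [task_table[task_id]]
    text_tokens = [text_table[i] for i in text_ids]
    image_tokens = [linear(p, pixel_proj) for p in patchify(image, patch)]
    # Every token now has the same width, so a single Transformer could attend over all of them.
    return task_tokens + text_tokens + image_tokens


# Hypothetical usage: an 8 x 8 RGB image, three text token ids, and one task id.
demo_image = [[[0.5, 0.5, 0.5] for _ in range(8)] for _ in range(8)]
demo_sequence = build_sequence(demo_image, text_ids=[5, 9, 2], task_id=1,
                               patch=4, dim=16, vocab=32, n_tasks=4)
print(len(demo_sequence), len(demo_sequence[0]))
```

In this toy setup the pixels enter the sequence directly as patches, with no separate VAE or standalone text encoder in front of the model, which is the contrast with the latent diffusion pipeline that the article draws.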

Mei Tao, founder and CEO of Zhixiang Future, said that the decision to pursue the native omni-modal path was based on a long-standing judgment about how to connect visual generation with the physical world. “Many of today’s ‘multimodal foundation models’ are, in essence, just ‘collections of unimodal models.’ Native multimodality, by contrast, is about baking the ‘rules of the world’ into the model from the very beginning. Only by understanding physical laws, spatial relationships, and causal logic can a model truly understand, reason about, and reconstruct the world. It is not just about ‘generating content.’ That is why we believe native omni-modal is the only path to AGI.”

Yao Ting, cofounder and CTO of Zhixiang Future, noted that the 8B open-source version of the recently released HiDream-O1-Image, which adopts a native omni-modal architecture, ranked No. 1 among open-source models on the text-to-image leaderboard of Artificial Analysis, a globally recognized independent evaluation platform, outperforming mainstream open-source models such as Z-Image Turbo, Qwen-Image, and FLUX.2 [dev]. Among the top 20 models on that leaderboard, it also had the smallest publicly disclosed parameter count. The newly announced HiDream-O1-Image-Pro is the closed-source version, with over 200 billion parameters. It has established new SOTA results across tasks such as complex text rendering, instruction-based editing, and multi-subject personalization, demonstrating the scalability of the native omni-modal architecture.

Yao Ting said: “In the native omni-modal (UiT) architecture, all modalities are like childhood friends who grew up together from the very beginning. The advantage is that, once all modalities are connected, we can truly achieve Any to Any — that is, any input can produce any output. This is exactly the capability required by a world model: understanding, generating, and predicting the various states of the real world within a unified architecture.”

From Visual Generation to World Models: An Important Path Toward AGI Widely Debated Across the Industry

Today, the focus of large model competition is shifting from language understanding and content generation to understanding, generating, and predicting the real physical world. Multiple technical approaches to world models have emerged across the industry, but they share a common goal: to move AI beyond being merely a content generator and give it internal representational capabilities for world states and the laws governing their changes.

At the Open Day roundtable forum, Wang Bing, Partner at Oriental Fortune Capital (東方富海), Fu Jianlong, Chief Researcher at Microsoft Research Asia, Ning Jiangbin, Senior Solution Director at Alibaba Cloud, Pan Yingwei, Technology Partner at Zhixiang Future, and Hong Hu, founder of AI 闹, discussed the theme “From multimodal to omni-modal: building world models and moving toward AGI.” From their respective perspectives, the speakers shared views on the development path of world models, covering AI investment, embodied intelligence, AI infrastructure, and the practice of native omni-modal technology.

The participants agreed that AI is moving from the stage of “generating content” to the stage of “understanding the world.” The convergence of visual generation, agents, embodied intelligence, and multimodal models all points to the same key capability: whether a model can understand environmental states across different modalities, predict how those states change, and form a unified cross-modal representation.

For that reason, visual generation is not merely a content creation tool. It inherently requires learning spatial structures, object relationships, motion trajectories, and state transitions, giving it a natural foundation for evolving into world models. The value of native omni-modal architecture lies in its ability to provide a unified modeling framework for images, video, text, audio, and even actions and embodied data, enabling models to evolve from single-modality capabilities toward more complete world modeling capabilities.

Completing Multiple Funding Rounds Within Half a Month, and the Three Core Agent Products Continue to Expand the Commercial Ecosystem

Recently, Zhixiang Future announced that it had completed a funding round of over 500 million yuan, with participation from leading investment institutions such as Anhui State Capital Investment, Hefei State Capital Investment, and Oriental Fortune Capital (東方富海). At Open Day, the company further revealed that fundraising was accelerating, and that it had completed another round within half a month, with participation from Shenzhen Capital Group, Jinpu Investment, Caixin Capital, and Fujuju Capital.

According to public information, Jinpu Investment is the management company of the Shanghai Financial Development Investment Fund, and 13 companies from its first fund portfolio have gone public through IPOs or M&A. It has made deep investments in multiple frontier AI areas, including computing infrastructure, foundation models, and intelligent agent applications. Caixin Capital is the core industrial investment platform under Caixin Group, a state-owned enterprise in Changde, and is committed to supporting the real economy and driving technological innovation with capital. It focuses on hard-tech sectors with clear industrialization prospects, such as artificial intelligence and embodied intelligence. Fujuju Capital concentrates on discovering value in leading companies in cutting-edge niche sectors, with a broad presence in strategic emerging industries such as smart manufacturing, new energy, new materials, biopharmaceuticals, and artificial intelligence.

With the addition of Shenzhen Capital Group, Jinpu Investment, Caixin Capital, and Fujuju Capital, Zhixiang Future has formed a diversified investor base that includes industrial funds from Anhui, Shanghai, Hunan, Hangzhou, and other regions, as well as top-tier market-oriented VCs such as Shenzhen Capital Group, Oriental Fortune Capital (東方富海), Fenghua Capital, and Dunhong Capital.

As fundraising accelerates, Zhixiang Future has launched a dual-engine strategy of “models + agents,” using model development as the foundation and intelligent agent applications as the wheel driving implementation and monetization. It has also established a clearer “1+1+3” business structure: the bottom layer is one HiDream series foundation model; the middle layer is one capability base, the HiHarness enterprise service platform; and the upper-layer agent applications cover three core scenarios: commercial marketing, video production, and social media creation.

At Open Day, three product leads from Zhixiang Future presented the progress of the company’s agent applications, showcasing its “ready-to-deploy” commercialization capabilities.

The commercial marketing agent, HiBurst, covers scenarios such as cross-border e-commerce content marketing, media operations, and overseas app expansion, and supports major platforms including TikTok, Meta, Douyin, and Xiaohongshu. It has been selected as a Top 5 official service provider by TikTok, generates more than 1 million e-commerce marketing videos annually, and already covers more than 1 billion yuan in GMV.

Frizan, described as the world’s first professional-grade AI video production and co-creation agent, gives professional video teams a high-quality, high-efficiency co-creation tool built on cinematic image generation and a core workflow connecting ideation, storyboarding, and finished video. To date, it has produced more than 5,000 minutes of short-form comic dramas, and more than 1,000 professional teams and ecosystem partners have joined the platform.

The social media creation agent vivago recently completed a product upgrade. Thanks to its end-to-end long-thinking capability, it can stably generate story videos several minutes long, and it quickly reached No. 1 on Product Hunt’s daily ranking. Today, vivago is used by more than 40 million professional and individual users across over 100 countries and regions.

At the event, Zhixiang Future also announced strategic partnerships with Shanghai Film Group and Shanghai Film New Vision Fund; BlueFocus and Copico, the largest marketing communications group in China; Beijing Jetsen Century, a leading AI video company; and Bei’er Health, an advanced enterprise in cross-border medical services. The parties will deepen cooperation in areas such as foundation model capability integration, agent application development, and co-building industry scenarios, jointly promoting the industrialization of native omni-modal foundation models in video production, commercial marketing, cross-border e-commerce, IP operations, and healthcare.

From Visual Generation to Building the World

From the launch of HiDream-O1-Image-Pro, to the deployment of three agent products, to ecosystem collaboration with industrial partners, Zhixiang Future is gradually forming a clear roadmap: continuously strengthening visual generation capabilities on the basis of a native omni-modal architecture, and further evolving toward the unified understanding, generation, and prediction capabilities required by world models.

This is what Zhixiang Future means by “Imaging the World.” It is not just about “generating visual content,” but about using native omni-modal modeling to gradually give AI the ability to understand the world, generate the world, and build the world. Going forward, Zhixiang Future will continue to center on the UiT native omni-modal architecture, driving the co-evolution of models, agents, and industry scenarios, and moving toward a more complete world model.