Baidu unveils ERNIE 5, which outperforms GPT-5 in charts, document understanding and more.

Just hours after OpenAI updated its flagship GPT-5 base model to GPT-5.1, promising reduced overall token usage and a nicer personality with more predefined options, Chinese search giant Baidu unveiled its next-generation base model, ERNIE 5.0, along with a suite of AI product upgrades and strategic international expansions.
The goal: to position itself as a global competitor in the increasingly competitive enterprise AI market.
Announced at the company’s Baidu World 2025 event, ERNIE 5.0 is a proprietary, natively omnimodal model designed to jointly process and generate content across text, images, audio and video.
Unlike Baidu’s recently released ERNIE-4.5-VL-28B-A3B-Thinking, which is open source under the permissive, business-friendly Apache 2.0 license, ERNIE 5.0 is proprietary and available only through Baidu’s ERNIE Bot website (I had to manually select it from the model selector drop-down list) and, for enterprise customers, through the Qianfan cloud platform’s application programming interface (API).
Along with the model launch, Baidu introduced major updates to its digital human platform, no-code tools, and general-purpose AI agents, all aimed at expanding its AI footprint beyond China.
The company also introduced ERNIE 5.0 Preview 1022, a variant optimized for text-intensive tasks, alongside the General Preview model that balances all modalities.
Baidu highlighted that ERNIE 5.0 represents a shift in how intelligence is deployed at scale, with CEO Robin Li saying: “When you internalize AI, it becomes a native capability and transforms intelligence from a cost to a source of productivity.”
Where ERNIE 5.0 outperforms GPT-5 and Gemini 2.5 Pro
ERNIE 5.0 benchmark results suggest that Baidu has achieved parity, or near parity, with top Western foundation models across a wide range of tasks.
In public benchmark slides shared at the Baidu World 2025 event, ERNIE 5.0 Preview outperformed or matched OpenAI’s GPT-5-High and Google’s Gemini 2.5 Pro in multimodal reasoning, document understanding and image-based question answering, while demonstrating strong language modeling and code execution capabilities.
The company emphasized its ability to manage joint inputs and outputs across modalities, rather than relying on post-hoc modality fusion, which it touted as a technical differentiator.
On visual tasks, ERNIE 5.0 achieved the highest scores on OCRBench, DocVQA and ChartQA, three benchmarks that test recognition, understanding and reasoning of structured document data.
Baidu claims the model beat GPT-5-High and Gemini 2.5 Pro on these document- and chart-based benchmarks, areas it describes as critical to enterprise applications such as automated document processing and financial analysis.
In image generation, ERNIE 5.0 matched or exceeded Google’s Veo3 in all categories, including semantic alignment and image quality, according to internal evaluation based on Baidu’s GenEval. Baidu claimed that the model’s multimodal integration allows it to generate and interpret visual content with greater contextual awareness than models relying on modality-specific encoders.
For audio and speech tasks, ERNIE 5.0 demonstrated competitive results on the MM-AU and TUT2017 audio comprehension tests, as well as on question answering from spoken-language input. Its audio performance, while less emphasized than vision or text, suggests a broad capability footprint intended to support full-spectrum, multimodal applications.
In language tasks, the model showed excellent results in following instructions, answering factual questions, and mathematical reasoning, essential areas that define the usefulness of large language models for business.
The Preview 1022 variant of ERNIE 5.0, designed for text performance, showed even stronger language-specific results during developer early access. Although Baidu does not claim broad superiority in general linguistic reasoning, its internal evaluations suggest that ERNIE 5.0 Preview 1022 closes the gap with leading English-language models and surpasses them in Chinese-language performance.
Although Baidu has not publicly released full benchmark details or raw scores, its performance positioning suggests a deliberate attempt to present ERNIE 5.0 not as a niche multimodal system, but as a flagship model competitive with the larger closed models in general reasoning.
Where Baidu claims a clear lead is in structured document understanding, visual chart reasoning, and the integration of multiple modalities into a single, native modeling architecture. Independent verification of these results remains pending, but the breadth of claimed capabilities positions ERNIE 5.0 as a serious alternative in the multimodal foundation model landscape.
Business Pricing Strategy
ERNIE 5.0 is positioned at the premium end of Baidu’s model pricing structure. The company released specific pricing for API usage on its Qianfan platform, bringing the cost in line with other leading offerings from Chinese competitors like Alibaba.
| Model | Input cost (per 1,000 tokens) | Output cost (per 1,000 tokens) | Source |
| --- | --- | --- | --- |
| ERNIE 5.0 | $0.00085 (¥0.006) | $0.0034 (¥0.024) | Qianfan |
| ERNIE 4.5 Turbo (ex.) | $0.00011 (¥0.0008) | $0.00045 (¥0.0032) | Qianfan |
| Qwen3 (ex. encoder) | $0.00085 (¥0.006) | $0.0034 (¥0.024) | Qianfan |
The cost contrast between ERNIE 5.0 and earlier models such as ERNIE 4.5 Turbo highlights Baidu’s strategy of differentiating between high-volume, low-cost models and high-capacity models designed for complex tasks and multimodal reasoning.
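To make that contrast concrete, the per-1,000-token prices listed above can be converted to the per-million-token basis that U.S. vendors typically quote. The following is a minimal Python sketch using only the Qianfan figures from this article; the function name and dictionary layout are illustrative assumptions, not part of any Baidu tooling.

```python
# USD prices per 1,000 tokens, copied from the Qianfan figures in this article.
PRICES_PER_1K_USD = {
    "ERNIE 5.0": {"input": 0.00085, "output": 0.0034},
    "ERNIE 4.5 Turbo": {"input": 0.00011, "output": 0.00045},
}


def per_million(price_per_1k: float) -> float:
    """Convert a per-1,000-token price to a per-1,000,000-token price."""
    # Rounding only hides floating-point noise; it does not change the value.
    return round(price_per_1k * 1000, 6)


# Same prices, expressed per 1M tokens.
per_million_usd = {
    model: {kind: per_million(price) for kind, price in prices.items()}
    for model, prices in PRICES_PER_1K_USD.items()
}

# How many times more expensive ERNIE 5.0 is than ERNIE 4.5 Turbo.
input_ratio = (
    PRICES_PER_1K_USD["ERNIE 5.0"]["input"]
    / PRICES_PER_1K_USD["ERNIE 4.5 Turbo"]["input"]
)
output_ratio = (
    PRICES_PER_1K_USD["ERNIE 5.0"]["output"]
    / PRICES_PER_1K_USD["ERNIE 4.5 Turbo"]["output"]
)

for model, prices in per_million_usd.items():
    print(f"{model}: ${prices['input']:.2f} in / ${prices['output']:.2f} out per 1M tokens")
print(f"ERNIE 5.0 vs 4.5 Turbo: {input_ratio:.1f}x input, {output_ratio:.1f}x output")
```

On these list prices, ERNIE 5.0 works out to $0.85 per million input tokens and $3.40 per million output tokens, between roughly 7.5 and 7.8 times the ERNIE 4.5 Turbo rates.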
Compared with U.S. alternatives, its pricing remains mid-range:
| Model | Input (/1M tokens) | Output (/1M tokens) | Source |
| --- | --- | --- | --- |
| GPT-5.1 | $1.25 | $10.00 | OpenAI |
| ERNIE 5.0 | $0.85 | $3.40 | Qianfan |
| ERNIE 4.5 Turbo (ex.) | $0.11 | $0.45 | Qianfan |
| Claude Opus 4.1 | $15.00 | $75.00 | Anthropic |
| Gemini 2.5 Pro | $1.25 (≤200,000) / $2.50 (>200,000) | $10.00 (≤200,000) / $15.00 (>200,000) | Google Vertex AI pricing |
| Grok 4 (grok-4-0709) | $3.00 | $15.00 | xAI API |
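Per-token list prices are easier to compare when applied to a fixed workload. The sketch below, in Python, does that for a subset of the models priced above. The workload size (10 million input tokens and 2 million output tokens) is a purely hypothetical assumption chosen for illustration, Gemini 2.5 Pro is priced at its lower (≤200,000-token) tier, and the helper name `workload_cost` is invented here.

```python
# USD per 1M tokens as (input, output), copied from the prices in this article.
# Gemini 2.5 Pro uses its <=200,000-token tier; other tiers would cost more.
PRICES_PER_1M_USD = {
    "GPT-5.1": (1.25, 10.00),
    "ERNIE 5.0": (0.85, 3.40),
    "ERNIE 4.5 Turbo": (0.11, 0.45),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Grok 4": (3.00, 15.00),
}

# ASSUMPTION: a hypothetical monthly workload, not a figure from Baidu.
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD for the given token counts."""
    input_price, output_price = PRICES_PER_1M_USD[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


costs = {
    model: workload_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    for model in PRICES_PER_1M_USD
}

# Print cheapest first.
for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model:<16} ${cost:,.2f}")
```

Under these assumptions ERNIE 5.0 comes to about $15.30, against $32.50 for GPT-5.1 and $2.00 for ERNIE 4.5 Turbo. The gap comes mainly from output pricing, so output-heavy workloads would widen it and input-heavy ones would narrow it.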
Global expansion: products and platforms
Alongside the launch of the model, Baidu is expanding internationally:
- GenFlow 3.0, which now has over 20 million users, is the company’s largest general-purpose AI agent and offers enhanced memory and multimodal task management.
- Famou, a self-evolving agent capable of dynamically solving complex problems, is now commercially available by invitation.
- MeDo, the international version of Baidu’s no-code builder Miaoda, is available worldwide via medo.dev.
- Oreate, a productivity workspace supporting documents, slides, images, videos and podcasts, has reached over 1.2 million users worldwide.
Baidu’s digital human platform, already deployed in Brazil, is also part of this global initiative. According to company data, 83% of livestreamers at this year’s “Double 11” shopping event in China used Baidu’s digital human technology, contributing to a 91% increase in gross merchandise volume (GMV).
Meanwhile, Baidu’s autonomous ride-hailing service Apollo Go has surpassed 17 million rides, operating driverless fleets in 22 cities and claiming the title of the world’s largest robotaxi network.
Open source vision language model attracts industry attention
Two days before the ERNIE 5.0 flagship event, Baidu also released an open source multimodal model under the Apache 2.0 license: ERNIE-4.5-VL-28B-A3B-Thinking.
As my colleague Michael Nuñez of VentureBeat reported, the model only activates 3 billion parameters while retaining a total of 28 billion, using a mixture of experts (MoE) architecture for efficient inference.
The main technical innovations include:
- “Thinking with Images,” which enables dynamic, zoom-based visual analysis
- Support for chart interpretation, document comprehension, visual grounding, and temporal awareness in video
- The ability to run on a single 80GB GPU, making it accessible to mid-sized organizations
- Full compatibility with the Transformers, vLLM and FastDeploy toolkits
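The parameter counts and the single-GPU claim can be sanity-checked with simple arithmetic. The Python sketch below uses the 28 billion total and 3 billion active parameters and the 80GB GPU size reported here. The weight precision is an assumption: Baidu's serving precision is not stated in this article, so 16-bit weights (2 bytes per parameter) are assumed purely for illustration.

```python
# Figures reported in this article.
TOTAL_PARAMS = 28e9    # total parameters in the MoE model
ACTIVE_PARAMS = 3e9    # parameters activated per token
GPU_MEMORY_GB = 80     # capacity of the single GPU cited

# ASSUMPTION: 16-bit weights (e.g. bf16/fp16), 2 bytes per parameter.
# The actual serving precision is not given in the article.
BYTES_PER_PARAM = 2

# Share of the model that does work on any one token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# In a typical MoE deployment all experts stay resident in memory, so the
# footprint follows total parameters, not active ones.
weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9

# What is left for the KV cache, activations and runtime overhead.
headroom_gb = GPU_MEMORY_GB - weights_gb

print(f"Active per token: {active_fraction:.1%} of parameters")
print(f"Weights at 16-bit: {weights_gb:.0f} GB")
print(f"Headroom on an {GPU_MEMORY_GB} GB GPU: {headroom_gb:.0f} GB")
```

Under that assumption the weights occupy about 56 GB, leaving roughly 24 GB of headroom, while only about 11% of the parameters are exercised per token. This is a back-of-the-envelope estimate, not a measured memory profile.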
This release adds pressure on closed source competitors. With the Apache 2.0 license, ERNIE-4.5-VL-28B-A3B-Thinking becomes a viable base model for commercial applications without licensing restrictions – something few high-performance models in this class offer.
Community Feedback and Baidu Response
Following the launch of ERNIE 5.0, developer and AI reviewer Lisan al Gaib (@scaling01) posted a mixed review:
“The ERNIE 5.0 benchmarks seemed crazy until I tested them…unfortunately it’s either brain damaged RL or they have a serious problem with their chat platform/system prompt,” Lisan wrote.
Within hours, Baidu’s developer-focused support account, @ErnieforDevs, responded:
“Thanks for your feedback! This is a known bug; certain syntaxes may trigger it consistently. We are working on a fix. You can try rewording or modifying the prompt to avoid it for now.”
This rapid turnaround reflects Baidu’s growing emphasis on communicating with developers, especially as it courts international users through proprietary and open source offerings.
Outlook for Baidu and its ERNIE Foundational LLM Family
Baidu’s ERNIE 5.0 marks a strategic escalation in the global foundation model race. With performance that puts it on par with the most advanced systems from OpenAI and Google, and a mix of premium pricing and open-access alternatives, Baidu signals its ambition to become not only a national leader in AI, but also a credible global infrastructure provider.
At a time when enterprise AI users increasingly demand multi-modal performance, flexible licensing, and deployment efficiency, Baidu’s dual-track approach (premium hosted APIs and open source versions) could broaden its appeal among the enterprise and developer communities.
It remains to be seen whether the performance announced by the company stands up to third-party testing. But in a landscape shaped by rising costs, model complexity and computational bottlenecks, ERNIE 5.0 and its supporting ecosystem give Baidu a competitive position in the next wave of AI deployment.



