The AI capacity crisis: Latency risks, rising costs, and the coming end of subsidized pricing

The biggest conversation in AI right now is not model size or multimodality, but the capacity crisis. At VentureBeat’s latest AI Impact stop in New York, Val Bercovici, Director of AI at WEKA, joined VentureBeat CEO Matt Marshall to discuss what it really takes to scale AI amid rising latency, cloud lock-in, and spiraling costs.
According to Bercovici, these forces are pushing AI toward its own version of surge pricing. Uber pioneered surge pricing, bringing real-time market rates to ridesharing for the first time. Today, Bercovici says, AI is heading toward the same economic reckoning – particularly in inference – as the focus shifts to profitability.
"We don’t have real market rates today. We have subsidized rates. This has been necessary to enable much of the innovation that is happening, but sooner or later – given the billions of dollars of investment we’re talking about right now and limited energy operating expenses – real market rates are going to show up; maybe next year, definitely by 2027," he said. "When they do, it will fundamentally change this industry and drive an even deeper focus on efficiency."
The economics of the token explosion
"The first rule is that this is an industry where more is more. More tokens equals exponentially higher trading value," » said Bercovici.
But until now, no one has figured out how to make this sustainable. The classic business triad – cost, quality and speed – translates in AI to latency, cost and accuracy (especially in output tokens). And accuracy is non-negotiable. This is true not only for consumer interactions with agents like ChatGPT, but also for high-stakes use cases like drug discovery and commercial workflows in heavily regulated industries like financial services and healthcare.
"It’s not negotiable," » said Bercovici. "You need to have a large number of tokens for high inference accuracy, especially when you add security to the mix, guardrail models, and quality models. Then you trade off latency and cost. This is where you have some flexibility. If you can tolerate high latency, and sometimes you can for consumer use cases, then you can have a lower cost, with free tiers and low cost premium tiers."
However, latency is a critical bottleneck for AI agents. “Agents no longer act on their own. You either have a swarm of agents or no agents at all,” Bercovici noted.
In a swarm, groups of agents work in parallel to achieve a larger goal. An orchestrator agent (the most intelligent model) takes center stage and determines the key subtasks and requirements: architecture choice, cloud or on-premises execution, performance constraints, and security considerations. The swarm then executes all the subtasks, effectively spinning up many concurrent inference sessions in parallel. Finally, evaluator models judge whether the overall task was successfully completed.
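To make that pattern concrete, here is a minimal sketch of one swarm round using Python's asyncio and a hypothetical `call_model` helper; it is not WEKA's or Bercovici's implementation, only the orchestrator → parallel workers → evaluator shape described above.

```python
# Sketch of one orchestrator/swarm/evaluator round. `call_model` is a hypothetical
# stand-in for a real inference call (LLM API, local server, etc.).
import asyncio

async def call_model(model: str, prompt: str) -> str:
    await asyncio.sleep(0.3)                       # stand-in for per-call latency
    return f"[{model}] response to: {prompt[:40]}"

async def run_swarm_round(goal: str) -> tuple[list[str], bool]:
    # 1. The orchestrator (the most capable model) breaks the goal into subtasks
    #    and fixes constraints: architecture, cloud vs. on-prem, latency, security.
    plan = await call_model("orchestrator", f"Plan subtasks and constraints for: {goal}")
    subtasks = plan.splitlines()

    # 2. The swarm executes every subtask at once -- effectively many concurrent
    #    inference sessions running in parallel.
    results = await asyncio.gather(*(call_model("worker", task) for task in subtasks))

    # 3. Evaluator models judge whether the overall task actually succeeded.
    verdict = await call_model("evaluator", f"Goal: {goal}\nResults: {results}\nDone?")
    return list(results), verdict.strip().lower().startswith("yes")

if __name__ == "__main__":
    asyncio.run(run_swarm_round("Refactor the billing service"))
```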
“These swarms go through what are called rounds – hundreds or even thousands of prompts and responses – until the swarm converges on an answer,” Bercovici said.
“And if you have a compound delay in those thousands of rounds, it becomes untenable. So the latency is really, really important. And that usually means having to pay a high price today that is subsidized, and that’s what will have to come down over time.”
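A back-of-the-envelope calculation shows why that compounding matters; the round counts and per-round latencies below are assumptions for illustration, not numbers from the talk.

```python
# Illustrative arithmetic: per-round latency compounds across swarm rounds.
rounds = 2_000                        # "hundreds or even thousands" of prompt/response rounds
for per_round_s in (0.3, 1.0, 3.0):   # assumed end-to-end latency per round
    total_min = rounds * per_round_s / 60
    print(f"{per_round_s:.1f} s/round -> {total_min:.0f} minutes end to end")
# 0.3 s per round keeps the whole swarm around 10 minutes; 3 s per round pushes it
# past an hour and a half, which is what makes low latency worth paying for today.
```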
Reinforcement learning as a new paradigm
Until about May of this year, agents were not performing well, Bercovici said. Then context windows became large enough, and GPUs available enough, to support agents that could perform advanced tasks, like writing reliable software. It is now estimated that in some cases, 90% of software is generated by coding agents. Now that agents have come of age, Bercovici noted, reinforcement learning is the new conversation among data scientists at some of the leading labs, like OpenAI, Anthropic, and Gemini, who see it as a critical avenue of AI innovation.
"The current season of AI is that of reinforcement learning. It blends many elements of training and inference into a unified workflow,” Bercovici said. “It’s the latest and greatest scaling law for this mythical milestone we’re all trying to reach called AGI – artificial general intelligence,” he added. "What fascinates me is that you have to apply all the best practices in model training, as well as all the best practices in model inference, to be able to iterate through these thousands of reinforcement learning loops and move the whole field forward."
The path to AI profitability
There is no single answer when it comes to building infrastructure to make AI profitable, Bercovici said, because it is still an emerging field. There is no one-size-fits-all approach. All-on-premises may be the right choice for some, especially pioneering model builders, while being cloud native or operating in a hybrid environment may be a better path for organizations looking to innovate in an agile and responsive way. Regardless of which path they initially choose, organizations will need to adapt their AI infrastructure strategy as their business needs evolve.
"Unit economics is what fundamentally matters here," Bercovici said. "We are definitely in a boom, or even a bubble, one might say, in some cases, as the underlying economics of AI are subsidized. But that doesn’t mean that if tokens become more expensive, you will stop using them. You’ll just get very specific information about how you use them."
Executives should focus less on pricing individual tokens and more on transaction-level economics, where efficiency and impact become visible, Bercovici concludes.
The crucial question businesses and AI companies should ask themselves, Bercovici said, is: “What are my true unit economics?”
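As a rough sketch, transaction-level unit economics means pricing the completed business task rather than the individual token; every number below is an illustrative assumption, not a figure from the talk.

```python
# Toy transaction-level unit economics: cost per completed business transaction,
# not per token. All numbers are illustrative assumptions.
usd_per_million_tokens = 8.0      # assumed blended inference price
tokens_per_round = 4_000          # prompt + response + guardrail/eval overhead
rounds_per_transaction = 500      # swarm rounds needed to finish one task
infra_overhead_usd = 0.05         # storage, networking, orchestration per transaction

tokens = tokens_per_round * rounds_per_transaction
cost_per_transaction = tokens / 1_000_000 * usd_per_million_tokens + infra_overhead_usd

print(f"{tokens:,} tokens -> ${cost_per_transaction:.2f} per completed transaction")
# If that transaction replaces $50 of manual work, the unit economics survive even a
# sharp rise in token prices; if it is only worth $5, efficiency has to come first.
```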
Seen in this light, the way forward is not to do less with AI, but to do it smarter and more efficiently at scale.


