The Math on AI Agents Doesn’t Add Up

The big AI companies promised us that 2025 would be “the year of the AI agent.” It turned out to be the year of talking about AI agents, and of kicking the can on that transformational moment down the road to 2026 or maybe later. But what if the answer to the question “When will our lives be fully automated by generative AI robots that perform our tasks for us and basically run the world?” is, as the New Yorker cartoon put it, “How about never?”

That was basically the message of a paper published without much fanfare several months ago, in the middle of the overhyped year of “agentic AI.” Titled “Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models,” it claims to show mathematically that “LLMs are incapable of performing computational and agentic tasks beyond a certain complexity.” Though the science is beyond me, the authors – a former SAP CTO who studied AI under John McCarthy, one of the field's founding intellects, and his teenage prodigy of a son – challenge the vision of agentic paradise with the certainty of mathematics. Even reasoning models that go beyond the pure word-prediction process of LLMs, they say, won't solve the problem.

“There is no way they can be reliable,” Vishal Sikka, the father, tells me. After a career that, in addition to SAP, included a stint as Infosys CEO and a seat on Oracle's board, he now heads an AI services startup called Vianai. “So we should forget about AI agents running nuclear power plants?” I ask. “Exactly,” he says. Maybe you can get one to summarize some papers or something to save time, he allows, but you should expect some mistakes.

The AI industry begs to differ. For one thing, agentic AI coding, which took off last year, has been a great success. Just this week at Davos, Google's Nobel-winning head of AI, Demis Hassabis, reported breakthroughs in minimizing hallucinations, and hyperscalers and startups alike keep pressing the agent story. Now they have some backup. A startup called Harmonic reports a breakthrough in AI coding that also relies on math, and it tops reliability benchmarks.

Harmonic, which was co-founded by Robinhood CEO Vlad Tenev and Tudor Achim, a Stanford-educated mathematician, claims that a recent improvement to its product, called Aristotle (no hubris there!), is an indication that there are ways to ensure the reliability of AI systems. “Are we doomed to be in a world where AI just generates slop and people can't really control it? That would be a crazy world,” says Achim. Harmonic's solution is to use formal methods of mathematical reasoning to verify the output of an LLM. Specifically, it encodes outputs in the Lean programming language, whose proof checker can formally verify what has been encoded. To be sure, Harmonic's focus thus far has been narrow: its main mission is the pursuit of “mathematical superintelligence,” and coding is a somewhat organic extension. Things like history essays, which cannot be mathematically verified, are out of bounds. For now.
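To make the idea concrete, here is a minimal sketch (my own illustration, not Harmonic's code or Aristotle's output) of what machine-checkable output looks like in Lean: a claim and its proof are just code, and the compiler either accepts them or rejects the file.

```lean
-- Toy illustration only: a claim an LLM might produce, written as a Lean
-- theorem. Lean's kernel checks the proof when the file compiles, so a false
-- statement or a bogus proof is rejected instead of slipping through.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Change the statement even slightly (say, to `a + b = b + b`) and the file
-- no longer compiles. That is the point: verification instead of trust.
```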

However, Achim does not seem to think that reliable agentic behavior is as much of a problem as some critics believe. “I would say that most models at this point have the level of pure intelligence needed to reason through a travel itinerary,” he says.

Both sides are right – or maybe they are even on the same side. On the one hand, everyone agrees that hallucinations will remain an annoying reality. In a paper published last September, OpenAI scientists wrote, “Despite significant progress, hallucinations continue to plague the field, and are still present in the latest models.” They demonstrated that unfortunate claim by asking three models, including ChatGPT, to provide the title of the lead author's thesis. All three made up false titles, and all reported the year of publication incorrectly. In a blog post about the paper, OpenAI gloomily stated that for AI models, “accuracy will never reach 100 percent.”


