AI scientist: ‘We need to think outside the large language model box’


Generative artificial intelligence (Gen AI) developers continuously push the boundaries of what’s possible. Google’s Gemini 1.5, for example, can juggle a million tokens of information at a time.

However, even this level of development is not enough to make real progress in AI, say competitors who go toe-to-toe with Google.

Also: 3 ways Meta’s Llama 3.1 is an advance for Gen AI

“We need to think outside the LLM box,” said Yoav Shoham, co-founder and co-CEO of AI21 Labs, in an interview with ZDNET.

AI21 Labs, a privately backed startup, competes with Google in LLMs, the large language models that are the bedrock of Gen AI. Shoham, who was once principal scientist at Google, is also an emeritus professor at Stanford University.

Also: AI21 and Databricks show open source can radically slim down AI

“They’re amazing at the output they put out, but they don’t really understand what they’re doing,” he said of LLMs. “I think that even the most diehard neural net guys don’t think that you can only build a larger language model, and they’ll solve everything.”

Figure: AI21 Labs researchers highlight basic errors of OpenAI’s GPT-3 as an example of how models stumble on basic questions. The answer, the company argues, is augmenting LLMs with something else, such as modules that can operate consistently. (Image: AI21 Labs)

Shoham’s company has pioneered novel approaches to Gen AI that go beyond the traditional “transformer”, the core element of most LLMs. For example, the company in April debuted a model called Jamba, an intriguing combination of transformers with a second neural network called a state space model (SSM).

The mixture has allowed Jamba to top other AI models in important metrics.

Shoham gave ZDNET an extensive explanation of one important metric: context length.

The context length is the amount of input, measured in tokens (usually words or word fragments), that a program can handle at once. Meta’s Llama 3.1 offers a context window of 128,000 tokens. AI21 Labs’s Jamba, which is also open-source software, has double that figure: a 256,000-token context window.

Photo: Yoav Shoham. “Even the most diehard neural net guys don’t think that you can only build a larger language model, and they’ll solve everything.” (Image: Roei Shor Photography)

In head-to-head tests using a benchmark constructed by Nvidia, Shoham said, the Jamba model was the only model other than Gemini that could maintain that 256K context window “in practice”. A context length can be advertised as one thing, but the claim can fall apart in practice if a model’s scores drop as the input grows longer.
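To make the idea of an “effective” context window concrete, here is a minimal Python sketch. It assumes you already have benchmark scores at several input lengths and simply reports the longest tested length at which the score has not yet dropped below a chosen threshold. The function name, the threshold, and the numbers are all illustrative assumptions; they are not Nvidia’s benchmark or measurements of any real model.

```python
from typing import Mapping


def effective_context_length(scores: Mapping[int, float], threshold: float) -> int:
    """Return the largest tested context length for which the score at that
    length, and at every shorter tested length, is at least `threshold`.
    Returns 0 if even the shortest tested length falls below the threshold."""
    effective = 0
    for length in sorted(scores):
        if scores[length] < threshold:
            break  # quality has degraded; longer windows do not count
        effective = length
    return effective


# Made-up placeholder scores, NOT measurements of any real model.
toy_scores = {4_000: 0.95, 32_000: 0.93, 128_000: 0.90, 256_000: 0.70}
print(effective_context_length(toy_scores, threshold=0.85))  # 128000
```

In this toy case the advertised window would be 256,000 tokens, but the effective window under the chosen threshold is 128,000.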


“We are the only ones with truth in advertising,” when it comes to context length, said Shoham. “All the other models degrade with increased context length.”

Google’s Gemini can’t be tested beyond 128K, said Shoham, given the limits imposed on the Gemini application programming interface by Google. “They actually have a good effective context window, at least, at 128K,” he said.

Jamba is more economical than Gemini for the same 128K window, said Shoham. “They’re about 10 times more expensive than we are,” he said, referring to the cost of inference, the practice of serving up predictions from a model.

All of that, Shoham emphasized, is a product of the “architectural” choice of doing something different, joining a transformer to an SSM. “You can show exactly how many [API] calls are made” to the model, he told ZDNET. “It’s not just the cost, and the latency, it’s inherent in the architecture.”

Shoham has described the findings in a blog post.

However, none of that progress matters unless Jamba can do something superior. The benefits of having a large context window become apparent, said Shoham, as the world moves to things such as retrieval-augmented generation (RAG), an increasingly popular approach of hooking up an LLM to an external information source, such as a database.

Also: Make room for RAG: How Gen AI’s balance of power is shifting

A large context window lets the LLM retrieve and sort through more information from the RAG source to find the answer.

“At the end of the day, retrieve as much as you can [from the database], but not too much,” is the right approach to RAG, said Shoham. “Now, you can retrieve more than you could before, if you’ve got a long context window, and now the language model has more information to work with.”
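The “as much as you can, but not too much” rule can be pictured as a packing problem: fill the context window with the most relevant retrieved passages until a token budget runs out. The Python sketch below is one simple way to do that, under stated assumptions. The relevance scores are taken as given, `count_tokens` is a crude whitespace stand-in for a real tokenizer, and none of the names come from AI21 Labs.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a rough stand-in for a real tokenizer.
    return len(text.split())


def pack_passages(ranked: List[Tuple[float, str]], budget: int) -> List[str]:
    """Greedily keep the most relevant passages that fit in `budget` tokens.

    `ranked` holds (relevance_score, passage) pairs; passages that would
    overflow the remaining budget are skipped."""
    chosen: List[str] = []
    used = 0
    for _score, passage in sorted(ranked, key=lambda p: p[0], reverse=True):
        cost = count_tokens(passage)
        if used + cost <= budget:
            chosen.append(passage)
            used += cost
    return chosen


passages = [(0.9, "a b c"), (0.5, "d e f g"), (0.7, "h i")]
print(pack_passages(passages, budget=5))    # ['a b c', 'h i']
print(pack_passages(passages, budget=100))  # ['a b c', 'h i', 'd e f g']
```

With the small budget, the least relevant passage is left out; with the large one, everything fits, which is the practical effect of a longer context window.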

Asked if there is a practical example of this effort, Shoham told ZDNET: “It’s too early to show a running system. I can tell you that we have several customers, who have been frustrated with the RAG solutions, who are working with us now. And I am quite sure we’ll be able to publicly show results, but it hasn’t been out long enough.”

Jamba, which has seen 180,000 downloads since it was put on Hugging Face, is available on Amazon Web Services’ Bedrock inference service and on Microsoft Azure, and “people are doing interesting stuff with it,” said Shoham.

However, even an improved RAG is not ultimately the salvation for the various shortcomings of Gen AI, from hallucinations to the risk of successive generations of the technology descending into gibberish.

“I think we’re going to see people demanding more, demanding systems not be ridiculous, and have something that looks like real understanding, having close to perfect answers,” said Shoham, “and that won’t be pure LLMs.”

Also: Beware of AI ‘model collapse’: How training on synthetic data pollutes the next generation

In a paper posted last month on the arXiv pre-print server, with collaborator Kevin Leyton-Brown, entitled ‘Understanding Understanding: A Pragmatic Framework Motivated by Large Language Models’, Shoham demonstrated how, across numerous operations, such as mathematics and manipulation of table data, LLMs produced “convincing-sounding explanations that aren’t worth the metaphorical paper they’re written on.”

“We showed how naively hooking [an LLM] up to a table, that table function will give success 70% or 80% of the time,” Shoham told ZDNET. “That is often very pleasing because you get something for nothing, but if it’s mission-critical work, you can’t do that.”

Such failings, said Shoham, mean that “the whole approach to creating intelligence will say that LLMs have a role to play, but they’re part of a bigger AI system that brings to the table things you can’t do with LLMs.”

Among the things required to go beyond LLMs are the various tools that have emerged in the past couple of years, Shoham said. Elements such as function calls let an LLM hand off a task to another kind of software specifically built for a particular task.

“If you want to do addition, language models do addition, but they do it terribly,” said Shoham. “Hewlett-Packard gave us a calculator in 1970, why reinvent that wheel? That’s an example of a tool.”
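The calculator example can be sketched in a few lines of Python. The idea is that the model does not compute the sum itself; it emits a structured request, and ordinary software does the arithmetic. The JSON shape, the tool names, and `run_tool_call` are hypothetical, chosen only for illustration, and do not correspond to any particular vendor’s function-calling API.

```python
import json
import operator

# Hypothetical tool registry: exact arithmetic done by ordinary code.
TOOLS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
}


def run_tool_call(call_json: str):
    """Execute a tool request of the assumed form
    {"tool": "<name>", "args": [a, b]} and return the result."""
    call = json.loads(call_json)
    name = call["tool"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    a, b = call["args"]
    return TOOLS[name](a, b)


# Pretend this string is what the language model produced.
model_output = '{"tool": "add", "args": [123456789, 987654321]}'
print(run_tool_call(model_output))  # 1111111110
```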

Using LLMs with tools is broadly grouped by Shoham and others under the rubric “compound AI systems”. With the help of data management company Databricks, Shoham recently organized a workshop on prospects for building such systems.

An example of using such tools is presenting LLMs with the “semantic structure” of table-based data, said Shoham. “Now, you get to close to a hundred percent accuracy” from the LLM, he said, “and this you wouldn’t get if you just used a language model without additional stuff.”

Beyond tools, Shoham advocates for scientific exploration of other directions outside the pure deep-learning approach that has dominated AI for over a decade.

“You won’t get robust reasoning just by back-prop and hoping for the best,” said Shoham, referring to back-propagation, the learning rule by which most of today’s AI is trained.

Also: Anthropic brings Tool Use for Claude out of beta, promising sophisticated assistants

Shoham was careful to avoid discussing the next product initiatives. However, he hinted that what may be needed is represented — at least philosophically — in a system he and colleagues introduced in 2022 called an MRKL (Modular Reasoning, Knowledge, and Language) System.

The paper describes the MRKL system as being both “Neural, including the general-purpose huge language model as well as other smaller, specialized LMs,” and also, “Symbolic, for example, a math calculator, a currency converter or an API call to a database.”

That breadth is a neuro-symbolic approach to AI. And in that way, Shoham is in accord with some prominent thinkers who have concerns about the dominance of Gen AI. Frequent AI critic Gary Marcus, for example, has said that AI will never reach human-level intelligence without a symbol-manipulation capability.

MRKL has been implemented as a program called Jurassic-X, which the company has tested with partners.

Also: OpenAI is training GPT-4’s successor. Here are 3 big upgrades to expect from GPT-5

An MRKL system should be able to use the LLM to parse problems that involve tricky phrasing, such as, “Ninety-nine bottles of beer on the wall, one fell, how many bottles of beer are on the wall?” The actual arithmetic is handled by a second neural net with access to arithmetic logic, using the arguments extracted from the text by the first model.

A “router” between the two has the difficult task of choosing which things to extract from the text parsed by the LLM and choosing which “module” to pass the results to in order to perform the logic.
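The division of labor described above can be illustrated with a toy Python sketch. It is not AI21 Labs’s MRKL or Jurassic-X code. A small rule-based parser stands in for the language model, the cue-word lists are invented for this example, and the “modules” are plain symbolic functions rather than a second neural net.

```python
import re

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * (i + 2) for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}

# Hypothetical cue words that hint at which module should run.
CUES = {"fell": "subtract", "lost": "subtract", "added": "add", "gained": "add"}

# Symbolic modules: exact arithmetic, no learning involved.
MODULES = {"add": lambda a, b: a + b, "subtract": lambda a, b: a - b}


def word_to_int(word: str):
    """Convert 'ninety-nine' or 'one' to an int; None if not a number word."""
    total = 0
    for part in word.split("-"):
        if part in TENS:
            total += TENS[part]
        elif part in UNITS:
            total += UNITS[part]
        else:
            return None
    return total


def extract_arguments(text: str):
    """Stand-in for the language model: find the numbers and an operation cue."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    numbers = [n for n in map(word_to_int, words) if n is not None]
    operation = next((CUES[w] for w in words if w in CUES), None)
    return operation, numbers


def route(text: str) -> int:
    """Router: pick a module for the extracted operation and run it."""
    operation, numbers = extract_arguments(text)
    if operation not in MODULES or len(numbers) != 2:
        raise ValueError("no module can handle this input")
    return MODULES[operation](*numbers)


question = ("Ninety-nine bottles of beer on the wall, one fell, "
            "how many bottles of beer are on the wall?")
print(route(question))  # 98
```

The hard part in a real system is the extraction and routing step, which this sketch reduces to keyword matching; the arithmetic itself is trivial once the arguments are known.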

That work means that “there is no free lunch, but that lunch is in many cases affordable,” write Shoham and team.

From a product and business standpoint, “we’d like to, on a continued basis, provide additional functionalities for people to build stuff,” said Shoham.

The important point is that a system like MRKL does not need to do everything to be practical, he said. “If you’re trying to build the universal LLM that understands math problems and how to generate pictures of donkeys on the moon, and how to write poems, and do all of that, that can be expensive,” he observed.

“But 80% of the data in the enterprise is text — you have tables, you have graphs, but donkeys on the moon aren’t that important in the enterprise.”

Given Shoham’s skepticism about LLMs on their own, is there a danger that today’s Gen AI could prompt what’s referred to as an AI winter, a sudden collapse in activity as interest and funding dry up entirely?

“It’s a valid question, and I don’t really know the answer,” he said. “I think it’s different this time around in that, back in the 1980s,” during the last AI winter, “not enough value had been created by AI to make up for the unfounded hype. There’s clearly now some unfounded hype, but my sense is that enough value has been created to see us through it.”