A new study by Apple’s artificial intelligence scientists has found that systems built on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.
The research team has proposed a new benchmark, GSM-Symbolic, to aid in measuring the reasoning capabilities of various large language models (LLMs). Their initial tests show that minor changes in query wording can result in significantly different answers, undermining model reliability.
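The core idea behind such a benchmark can be illustrated with a short sketch. The code below is a minimal illustration rather than Apple's actual GSM-Symbolic implementation; the template text, the name pool and the make_variant helper are assumptions made purely for the example. It shows how a single word problem can be turned into many variants by swapping names and numbers while leaving the underlying reasoning unchanged.

```python
# Illustrative sketch only -- not Apple's GSM-Symbolic code.
# One word-problem template is instantiated with different names and
# numbers; the reasoning needed to solve it stays identical.
import random

TEMPLATE = (
    "{name} picks {a} kiwi fruit on Friday. Then he picks {b} on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday. "
    "How many kiwi fruit does {name} have?"
)

NAMES = ["Oliver", "Sofia", "Liam", "Mei"]  # hypothetical name pool


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return a reworded problem and its ground-truth answer."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    name = rng.choice(NAMES)
    question = TEMPLATE.format(name=name, a=a, b=b)
    answer = a + b + 2 * a  # the maths is the same for every variant
    return question, answer


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = make_variant(rng)
        print(question, "->", answer)
```

If a model truly reasons about the problem, its accuracy should not depend on which variant it is shown; the study reports that in practice it does.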
The scientists examined the "fragility" of mathematical reasoning by adding contextual information to their queries, which a human would understand but which should not affect the underlying mathematics of the solution.
This led to inconsistent answers. "Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark," the team noted.
"Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases."
The study found that adding even one sentence that seems relevant to a math question can reduce the answer’s accuracy by up to 65 per cent.
"There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded.
One example in particular highlights the absence of critical thinking. The team created a set of math problems, "GSM-NoOp", similar to elementary "word problems".
The task began with the information needed to reach an answer: "Oliver picks 44 kiwi fruit on Friday. Then he picks 58 on Saturday. On Sunday, he picks double the number of kiwis he did on Friday." It then added an irrelevant clause: "Five of them were a bit smaller than average," before asking, "How many kiwi fruit does Oliver have?"
OpenAI's model and Meta's Llama3-8b incorrectly subtracted the five smaller kiwis from the total.
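Working through the arithmetic makes the failure concrete. The snippet below simply computes what the problem statement implies; it is an illustration, not part of the study's code.

```python
# Worked arithmetic for the kiwi problem above (illustrative only).
friday = 44
saturday = 58
sunday = 2 * friday                 # "double the number he did on Friday"

correct_total = friday + saturday + sunday
print(correct_total)                # 190 -- the five smaller kiwis are irrelevant

wrong_total = correct_total - 5     # the mistake the models made
print(wrong_total)                  # 185
```

The size of the kiwis has no bearing on how many Oliver has, so subtracting the five smaller ones has no mathematical justification.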
The study's findings echo a 2019 study showing that AI models could be confused into giving incorrect answers by irrelevant background information added to questions about Super Bowl quarterbacks.
"We found no formal reasoning in language models," the recent study concluded. The behaviour of LLMs "is better explained by sophisticated pattern matching," which is "so fragile that changing names can alter results."