A large language model shouldn’t even attempt to do math, imo. They made an expensive hammer that is semi-good at one thing (parroting humans) and now they’re treating every query like it’s a nail.
Why isn’t OpenAI working more modularly, whereby the LLM will call up specialized algorithms once it has identified the nature of the question? Or is it already modular and they just suck at anything that cannot be calibrated purely with brute-force computing?
Precisely because this is an LLM. It doesn’t know the difference between writing out a maths problem, a recipe for a cake, or a haiku. It transforms everything into the same domain and does fancy statistics to come up with a reply. It wouldn’t know that it needs to invoke the “Calculator” feature unless you hard-code that in, which is what ChatGPT and Gemini do, but it’s also easy to break.
Can’t it be trained to do that?
Sort of. There’s a relatively new type of LLM called a “tool-aware” LLM, which you can instruct to use tools like a calculator or some other external program. As far as I know, though, the LLM has to be told to go out and use that external thing; it can’t make that decision itself.
Can the model itself be trained to recognize mathematical input and invoke an external app, parse the result and feed that back into the reply? No.
Can you create a multi-layered system that uses some trickery to achieve this effect most of the time? Yes; that’s what OpenAI and Google are already doing: they recognize certain features of the user’s input and change the system prompt to force the model to output Python code or Markdown notation, which your browser then renders using a different tool.
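For the curious, that multi-layered trick is roughly this shape. This is a toy sketch, not OpenAI’s or Google’s actual pipeline; `call_llm` just stands in for whatever model API sits behind the chat window:

```python
import re

# Toy version of the "recognize the input, then force the model to emit code"
# pipeline described above. All names here are illustrative.
ARITHMETIC_HINT = re.compile(r"\d\s*[-+*/]\s*\d|sum|total|average", re.IGNORECASE)

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("stand-in for a real model call")

def answer(user_prompt: str) -> str:
    if ARITHMETIC_HINT.search(user_prompt):
        # Swap the system prompt so the model emits code instead of prose.
        code = call_llm(
            "Reply ONLY with a single Python expression that computes the answer.",
            user_prompt,
        )
        # The Python runtime, not the model, produces the number.
        return str(eval(code))  # a real system would sandbox this step
    return call_llm("You are a helpful assistant.", user_prompt)
```

The point is that the model never “decides” to do arithmetic; the wrapper decides for it, which is also why it’s easy to break.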
In fairness, the example I’ve seen MS give was taking a bunch of reviews and determining from the text whether each review was positive or negative.
It was never meant to mangle numbers, but we all know it’s going to be used for that anyway, because people still want to believe in a future where robots help them, rather than just take their jobs and advertise to them.
I would rather not have it attempt something that it can’t do; no result is better than a wrong result, imo. Here it correctly identifies that it’s a calculation question, and instead of suggesting a formula, it tries to hallucinate a numerical answer itself. The creators of the model seem to have a mindset that the model must try to answer no matter what, instead of training it not to answer questions that it can’t answer correctly.
As far as I can tell, the Copilot command has to be given a range of data to work with, so here it’s pulling a number out of thin air. It would be nice if the output from this was just “please tell this command which data to use”, but as always it doesn’t know how to say “I don’t know”…
Mostly because it never knew anything to start with.
Wasn’t doing this part of the reason OpenSeek was able to compete with much smaller data sets and lower hardware requirements?
I assume you mean DeepSeek? And it doesn’t look like it; according to what I could find, their biggest innovation was “reinforcement learning to teach a base language model how to reason without any human supervision”. https://huggingface.co/blog/open-r1
Some others have replied that ChatGPT and Copilot are already modular: they use Python for arithmetic questions. But that apparently isn’t enough to be useful.
I feel like the best modular AI right now is from Gemini. It can take a scan of a document and turn it into a CSV, which I was surprised by.
I figure it must have multiple steps: OCR, text interpretation, recognizing a table, then piping the text to some sort of CSV tool.
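If you built that guessed pipeline out of classical parts, it might look something like this; pytesseract is just an assumed OCR backend here, and the whitespace splitting is a stand-in for real table detection:

```python
import csv
import io

from PIL import Image
import pytesseract  # assumes the Tesseract OCR binary is installed

def scan_to_csv(image_path: str) -> str:
    """OCR a scanned page and emit naive CSV, one row per text line."""
    text = pytesseract.image_to_string(Image.open(image_path))
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for line in text.splitlines():
        if line.strip():
            writer.writerow(line.split())  # real table detection is much harder than this
    return buffer.getvalue()
```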
They have (well, Anthropic has).
It’s called MCP
Nowadays you just give the AI access to a calculator basically… or whatever other tools it needs. Including other models to help it answer something.
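Roughly the idea, though this is not the actual MCP wire format, just the shape of tool calling: the host shows the model a tool catalogue, and when the reply comes back as a tool call instead of prose, the host runs the tool and feeds the result back in:

```python
import json

# Toy dispatcher. The model's reply is either prose or a JSON tool call
# like {"tool": "calculator", "input": "1+2+3"}.
TOOLS = {
    # eval with empty builtins keeps this toy calculator contained, but you
    # still wouldn't eval untrusted input in a real system.
    "calculator": lambda expression: eval(expression, {"__builtins__": {}}),
}

def handle_model_reply(reply: str) -> str:
    try:
        call = json.loads(reply)
    except json.JSONDecodeError:
        return reply  # plain prose, nothing to dispatch
    if not isinstance(call, dict) or "tool" not in call:
        return reply
    result = TOOLS[call["tool"]](call["input"])
    return f"tool result: {result}"  # in a real loop this goes back into the model's context

print(handle_model_reply('{"tool": "calculator", "input": "1+2+3"}'))  # tool result: 6
```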
I hadn’t heard of that protocol before, thanks. It holds a lot of promise for the future, but getting the input right for the tools is probably quite the challenge. And it also scares me because I now expect more companies to release irresponsible integrations.
You have to admire the gall of naming your system after the evil AI villain of the original Tron movie.
The torment nexus strikes again.
Because we vibes-coded the OpenAI model with OpenAI and it didn’t think this was the optimal way to design itself.
Yep. Instead of focusing on humans communicating more effectively with computers, which are good at answering questions that have correct, knowable answers, we’ve invented a type of computer that can be wrong because maybe people will like the vibes more? (And we can sell vibes)
Because there is no “it” that “calls” or “identifies” anything, much less the “nature of the question”.
This requires intelligence, not a probability over the most likely next token.
You can do maths with words and you can write poems with numbers; either requires actually understanding, and that’s the linchpin, rather than parsing and producing an answer based on what has been written so far.
Sure, the model might string together tokens that sound very “appropriate” to the question, in the sense that they fit within the right vocabulary, and if the occurrence was frequent enough in its dataset it might even be correct, but that’s still not understanding even a single word (or token) of either the question or the corpus.
An LLM can, by understanding some features of the input, predict a categorisation of the input and feed it to different processors. This already works. It doesn’t require anything beyond the capabilities LLMs actually have, and it isn’t perfect. It’s a very good question why this hasn’t happened here; an LLM can very reliably give you the answer to “how do you sum some data in python”, so it only needs to be able to do that in Excel and put the result into the cell (something like the snippet below).
There are still plenty of pitfalls. This should not be one of them, so that’s interesting.
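The snippet in question really is this small, even with a text-typed “1” like in the screenshot (the cell values here are made up):

```python
# Sum a handful of "cell" values, coercing text like "1" to a number first.
cells = ["1", 2, 3]                   # made-up values mirroring the screenshot
total = sum(float(c) for c in cells)
print(total)                          # 6.0, not 15
```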
OpenAI already makes it write Python functions to do the calculations.
This is the right answer.
LLMs have already become this weird mesh of different services tied together to look more impressive. OpenAI’s models can’t do math, so they farm it out to Python for accuracy.
So it’s going to write Python functions to calculate the answer, where all the variables are stored in an Excel spreadsheet, a program that can already do the calculations? And how many forests did we burn down for that wonderful piece of MacGyvered software, I wonder.
The AI bubble cannot burst soon enough.
Too Big To Fail. Sorry. 1+2+3=15 now.
“1”+(2+3) is “15” in JavaScript.
I’m wondering if this is a sleight-of-hand trick by the poster, then.
If they typed the “1” cell as Text and left the 2 and 3 as numeric, then ran Copilot on that, it’s more an indictment of Excel than of Copilot, strictly speaking. The screenshot doesn’t make clear which cells are numbers and which are text.
I don’t think there’s an explanation that doesn’t make this Copilot’s fault. Honestly, JavaScript shouldn’t allow math between numbers and strings in the first place: “1” + 1 is not a number, and there’s already a type for that: NaN.
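For what it’s worth, Python already behaves the way that comment wants JavaScript to: mixing a string with a number is an error, not silent concatenation, and you have to convert explicitly to get a sum:

```python
# "1" + 1 is a TypeError in Python, not "11" and not NaN.
try:
    "1" + 1
except TypeError as error:
    print(error)           # can only concatenate str (not "int") to str

print(int("1") + 2 + 3)    # 6, once the text is explicitly converted
```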
Regardless, the sum should be 5 if the first cell is text, so it’s incorrect either way.
You can explicitly convert values (or lean on TypeScript) to avoid this if you really don’t want concatenation to be the default. So, again, this feels like an oversight in the integration rather than in any strict formal logic.