On LLMs saying (or failing to say) "I don't know"

Image: A group of people sitting around an electricity tower, surrounded by electronic devices and paper.
Credit: Yutong Liu & Digit / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/

When we say that "LLMs won't admit it when they don't know something", what do we actually mean? 

It does indeed seem to be the case that LLMs won't answer a prompt with the sentence "I don't know", but the reason for this has nothing to do with "overconfidence", with "lying", or with an intent to "persuade".

LLMs "fail" to produce "I don't know" as an output by design. Conceptualizing this process as LLMs not "admitting it" when they "don't know" something (due to perceived overconfidence or perceived intent to mislead or persuade) is, in essence, a category mistake

With these statements I do not aim to devalue LLMs. Rather, I am pointing to the basic mechanisms underlying their textual outputs as an essential aspect to consider if we want to understand the absence of "I don't know" answers.

Given that LLMs produce text based on a probability distribution over words (or other linguistic units) occurring together in specific sequences that follow the structural patterns of natural language [1], it stands to reason that an LLM will always find at least one string of words with a high probability of occurring in a sequence, irrespective of whether the content of that output, as read by a human, matches the reality of the world and/or is considered accurate.
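
To make this concrete, here is a minimal, purely illustrative sketch of greedy next-token generation. Nothing below comes from any real model: the vocabulary is tiny and the scores are random stand-ins for what a neural network would compute. The point is structural: for any prompt there is always a "most probable" next token, so an output always exists, and "I don't know" appears only if it happens to be the most probable continuation.

```python
import math
import random

# Toy vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["Paris", "Rome", "bananas", "is", "the", "capital", ".", "<eos>"]

def next_token_distribution(context: list[str]) -> dict[str, float]:
    """Assign a probability to every token in the vocabulary.

    The scores here are random stand-ins; a real model would compute them
    with a neural network conditioned on `context`.
    """
    scores = {tok: random.uniform(0.0, 1.0) for tok in VOCAB}
    total = sum(math.exp(s) for s in scores.values())
    return {tok: math.exp(s) / total for tok, s in scores.items()}

def generate(prompt: list[str], max_tokens: int = 10) -> list[str]:
    """Greedy decoding: always pick the single most probable next token."""
    output = list(prompt)
    for _ in range(max_tokens):
        dist = next_token_distribution(output)
        best = max(dist, key=dist.get)  # a "best" continuation always exists
        if best == "<eos>":
            break
        output.append(best)
    return output

print(" ".join(generate(["What", "is", "the", "capital", "of", "France", "?"])))
```

Whether the continuation happens to read as true, false, or nonsense is invisible to this loop; it only ever ranks candidates.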

In this sense, LLMs do not technically "make mistakes" because they have no embedded understanding of what is accurate, beyond certain outputs having been trained to occur with higher probability (that is, having been reinforced). They cannot reach the conclusion that they do not know something, because there is nothing to "know"; there are only more or less probable outputs that follow the prompt given by the user.

For this to make sense, one has to consider what knowing something means. For example, I could answer "4" when prompted with the question "2 + 2 = ?", but the reason I may conclude that I know the answer to be 4 could be that I have memorized the statement "2 + 2 = 4", or that I have understood that the process of addition (in simple terms) involves the concept of a "unit", and that if I have multiple individual "units", I have a sum total of "units" that can be quantified. In that sense, I'd know that the statement "2 + 2" is equal to the statement "1 + 1 + 1 + 1" which, given that I understand addition, equals 4.
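
To put that difference in very rough, schematic terms (this is only an illustrative sketch of the distinction, not a claim about how any system is implemented):

```python
# Memorization: "4" is just the stored completion of the string "2 + 2";
# there is no grasp of why the answer is 4.
memorized = {"2 + 2": "4"}

def answer_by_memory(question: str) -> str | None:
    return memorized.get(question)  # fails for anything not memorized

# Understanding (in the simple sense above): reduce each number to units
# ("1 + 1 + ..."), then count the units.
def answer_by_units(a: int, b: int) -> int:
    units = ["1"] * a + ["1"] * b  # 2 + 2 becomes 1 + 1 + 1 + 1
    return len(units)              # the sum is the count of units

print(answer_by_memory("2 + 2"))  # "4"  -- but "3 + 1" returns None
print(answer_by_memory("3 + 1"))  # None
print(answer_by_units(2, 2))      # 4
print(answer_by_units(3, 1))      # 4    -- the same procedure generalizes
```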

Think for a moment about how hard it is to genuinely conclude that you do not know something, as opposed to, say, outwardly saying that you don't know because of external pressures: fear of failure, a focus on how you may come across, or feeling intimidated by being in dialogue with someone who is an expert on the topic you're attempting to discuss.

Now think about how difficult it is to determine all that you may not know about everything and anything that you may not have even begun to consider (the good old "you don't know what you don't know").

The ability to determine that you do not know something is dependent upon what is termed metacognition, or metacognitive skills [2]. These are a set of skills that require training and practice.

As Data from Star Trek: The Next Generation put it, "the beginning of wisdom is: I do not know".

For you to learn about something (anything), you'd have to start from a state of not knowing.

In order to determine that one does not know something, one first needs to bring what one thinks one knows to bear on the situation at hand, and then be able to scrutinize that perceived knowledge to determine whether or not it is accurate, given the information one has, both external and internal.

For example, I could determine that I believe something to be the case, but either not know why I believe it (which should indicate a need for further scrutiny), or recognize that the belief is not rooted in any evidence I can think of.

This process of self-scrutiny and self-inquiry takes a high degree of effort and practice. It requires the ability to critically and intentionally evaluate the knowledge one thinks one has, and to maintain a degree of skepticism about what one thinks one knows.

LLMs are not operating in this capacity and, as such, simply cannot determine whether they do or do not "know" anything at all. 

So, instead of framing this seeming feature of LLMs as the models not "admitting" when they're "wrong", we should understand that, with the current algorithms, there is no way for an LLM to either know or not know something, whatever illusion of "knowledge" its outputs may evoke in us as readers.

Thus, the LLM will always produce an output: it will always have at least one most probable string of linguistic units and, as such, will always be able to output a natural-sounding answer to a prompt, irrespective of what may or may not be factual. The LLM is not "lying" to you because it somehow "can't admit" that it is wrong or doesn't know (an illusion that stems from the anthropomorphism of LLMs). In fact, always coming up with an output in response to a prompt indicates that the LLM is working exactly as it's supposed to.

Finally, I'd like to highlight the inherent power that the statement "I don't know" has both for the integrity of our dialogic environments and for our learning processes.

Instead of expecting a teacher or professor to always know (and, by extension, to sometimes get things wrong rather than say "I don't know"), we should normalize not knowing as a door that opens in front of us, instead of a door that closes shut and leaves us stuck in some perceived state of failure.

Getting things wrong is very human, as is (eventually) realizing that one doesn't know something; yet, these processes of erring and not knowing are qualitatively very different from an LLM's output not aligning with what could be considered "accurate". To reiterate, LLMs are designed to fill in a sequence of most probable items to follow a given prompt; understanding (as we understand it) is not part of the equation, but it is something we as humans are equipped to do on our own.

Importantly, the point of learning is not to simply amass information but, rather, it is to learn the tools that can bring us from "I don't know" to "I don't know but I know how to figure it out". Just as with the 2 + 2 = 4 example, encoding information (I can memorize that "4" is the last item of the "2 + 2 =" sequence) is not the same as understanding how that outcome came to be.

"I don't know", in a sense, may be one of the most human assertions of all. But, you know... I don't know for sure!


Notes

[1] Distributions which can be further fine-tuned through methods such as reinforcement learning from human feedback (RLHF), system prompts, and other modes of training that increase the likelihood of one type of output over another.
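
As a rough, hypothetical sketch of what "increasing the likelihood of one type of output over another" can look like mechanically (the continuations, scores, and bonus values below are invented purely for illustration):

```python
import math

def softmax(scores: dict[str, float]) -> dict[str, float]:
    total = sum(math.exp(v) for v in scores.values())
    return {k: math.exp(v) / total for k, v in scores.items()}

# Invented scores for three possible continuations of some prompt.
base = {"Paris": 2.0, "I don't know": 0.5, "bananas": 1.0}

# A hypothetical preference-tuning step: add a bonus to the kind of answer
# that was reinforced. The generation mechanism stays the same; only the
# relative probabilities shift.
bonus = {"I don't know": 2.0}
tuned = {tok: s + bonus.get(tok, 0.0) for tok, s in base.items()}

base_probs, tuned_probs = softmax(base), softmax(tuned)
print(max(base_probs, key=base_probs.get))    # -> Paris
print(max(tuned_probs, key=tuned_probs.get))  # -> I don't know
```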

[2] For more on metacognition and its relationship to critical thinking, see the chapter "Metacognition and Critical Thinking: Some Pedagogical Imperatives" by Peter Ellerton (from The Palgrave Handbook of Critical Thinking in Higher Education, 2015; M. Davies et al. (eds.))