Uncertainty, Confidence, and Hallucination in Large Language Models
How to Spot When Your Large Language Model is Misleading You
LLM Is Just Making Stuff Up
Ever have a conversation with a large language model that sounds super confident, spitting out facts that seem...well, a little fishy? 🐟 You're not alone. One of the biggest challenges in working with Large Language Models (LLMs) is verifying the correctness of their output. Despite their advanced capabilities, LLMs can sometimes generate information that appears accurate but is fabricated. This phenomenon, known as 👉 hallucination, can lead to misinformation and erode trust in AI systems.
Hallucination in AI is not a new phenomenon. Deep learning models in general are notorious for over-confident predictions. For instance, in classification tasks, these models can assign a very high probability to a predicted label even when the prediction is incorrect [1]. This over-confidence makes deep learning models look more capable than they truly are.
In the context of AI-generated text, large language models (LLMs) can produce content that appears fluent and coherent yet is factually wrong or unfaithful to the input. Recent work [2] categorizes LLM hallucinations into two main types:
Factuality Hallucination: This is like the LLM making stuff up entirely. It might sound convincing, but the information is just plain wrong. Think of it like telling you the capital of France is New York City (👉 factual inconsistency, i.e., simply wrong) or the Roman Empire was the first civilization to discover Antarctica (👉 factual fabrication, i.e., no evidence).
Faithfulness Hallucination: This happens when the LLM strays from your instructions or input. It might weave a good story but doesn't answer your question or follow the original idea. Imagine asking for a recipe and getting a poem about kitchens instead (👉 instruction inconsistency). Other examples: the LLM summarizes a document but adds claims that contradict or are unsupported by that document (👉 context inconsistency), or it performs a flawed mathematical derivation (👉 logical inconsistency).
👀 The dangerous thing is that the generated text reads smooth and confident, which makes it hard to tell whether the content is hallucinated or not.
Imagine the catastrophic consequences of deploying LLMs in real-world applications without addressing their hallucination issues. For example, an LLM might mistakenly diagnose a benign condition when the symptoms indicate a serious illness, putting a patient's life at risk. The real-world impact of these errors is evident: a single factual error in an early demo of Google's Bard wiped roughly $100 billion off Alphabet's market value.
A reliable AI system should be upfront about its limitations. Imagine asking a friend a question, and they spout out an answer with zero hesitation, even if they're not entirely sure. Not ideal, right? The same goes for AI. The best AI systems are those that can signal their uncertainty when they're unsure. We don't want an AI that's either arrogantly confident about everything or so timid it never takes a guess. Hallucination or not, the key issue is identifying when an LLM’s output is unreliable. Now, the real questions are:
🧠 Can we detect when LLMs are generating misleading content? Or even better, can we mitigate the hallucination or dehallucinate LLMs’ output?
Today's focus is on the first question. We'll explore the second question in the next post.
Detecting Deception: Tools and Methods for Identifying LLM Falsehoods
We all know that feeling – you ask a language model a question, and it answers with booming confidence. So, 🧠 how can we tell if an LLM is just making stuff up? There are two key approaches to detect LLM "deception":
Score-based Methods: One way to sniff out an LLM's fib is to look at how uncertain it is about its answer. Imagine a friend who gives you an answer with a shrug and a mumbled "maybe." That might raise a red flag, right? Similarly, LLMs that express high uncertainty about their output are more likely to be unreliable. By measuring this uncertainty as a score, we can get a sense of how trustworthy the information might be. In this vein, several approaches emerge, some inspired by heuristic methods in uncertainty estimation for deep learning, while others rely more on theoretical principles.
Calling in the Backup: Another approach involves using external models, referred to here as conformal predictors. Think of them as using AI to control AI. These models analyze information from the LLM and predict whether the output is real or fabricated (hallucinated). There are two flavors of conformal prediction:
LLM Evaluator: This method utilizes another LLM (or potentially the same one) to evaluate the generated text itself. This approach bypasses the need for handcrafted features but potentially introduces additional complexity.
Simple Conformal Predictor: This approach leverages well-established methods like linear or logistic regression. However, it relies heavily on extracting informative features from the LLM's output.
The effectiveness of these methods depends on what information we have access to about the LLM itself. For example, the prediction will be more accurate if we can peek "inside" the LLM and see its internal workings (like a white box). However, if the LLM is a black box (we can't see its inner workings), we might need to ask it multiple times to get a clearer picture. Crucially, all methods assume that LLMs have some awareness of their uncertainty or confidence levels, meaning they have a rough idea of how accurate their outputs are. Without this self-awareness, estimating uncertainty or predicting correctness solely by observing the LLMs is impossible.
👀 Fortunately, recent evidence has pointed out that this assumption is practical and LLMs are aware of what they know or don’t know. All we need to do is find good ways to extract or trigger this information.
To sum up, every method boils down to crafting a score that measures the uncertainty/confidence of the LLM. For methods that do not use an external conformal predictor, the score is computed directly as a scalar using one of the approaches below; it can then be calibrated on a training dataset to find a proper threshold for the detection decision. For conformal prediction approaches, the score is read off the predictor's prediction logits, and a training dataset is needed to train the predictor. Sometimes, if the logits or the score are well calibrated, we can simply use a default threshold of 0.5 without calibration. The general framework for detection is depicted below:
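To make this concrete, here is a minimal sketch of the score-then-threshold pipeline in Python. The function names and the F1-based calibration rule are my own choices for illustration; any of the scores discussed below can be plugged in as `cal_scores`.

```python
import numpy as np

def calibrate_threshold(cal_scores, cal_labels):
    """Pick the uncertainty threshold that maximizes detection F1 on a labeled
    calibration set. cal_scores: uncertainty scores (higher = more suspect);
    cal_labels: 1 = hallucinated, 0 = correct."""
    scores = np.asarray(cal_scores, dtype=float)
    labels = np.asarray(cal_labels, dtype=int)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t

def flag_hallucination(score, threshold):
    """Detection decision: True means 'treat this response as unreliable'."""
    return score >= threshold
```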
Score-based Approaches for Uncertainty Estimation in LLMs
Heuristic Uncertainty as a Clue
Just like other powerful AI models, LLMs have a built-in tool to estimate how likely their answers are to be correct. This tool is embedded in the final layer, known as the softmax layer, which calculates the probability of each token in the vocabulary appearing at the current timestep. Typically, the token with the highest probability is the one you see in the LLMs' output.
Unfortunately, as discussed earlier, the built-in probability is not reliable. It does not reflect the reasonable confidence LLMs should have. For example, an LLM might assign a high probability to a factually incorrect answer, simply because the answer aligns with the patterns it has observed in its training data. Furthermore, the built-in probability only applies to individual tokens, not the overall coherence or accuracy of the entire response. This limitation hinders our ability to gauge the trustworthiness of a complete sentence. Ideally, we need methods to assess the probability of the entire content being factually sound and logically consistent.
Fortunately, the field of deep learning offers established methodologies for uncertainty estimation. As Huang et al. (2023) highlight in their comprehensive survey, three key approaches can be directly adapted from this literature to quantify the uncertainty associated with LLM responses [2].
👉 Probability Aggregation: This approach combines individual token-level probabilities (often in the form of log probabilities) to estimate a single probability score for the entire sentence and response.
👀 This approach is simple and economical, requiring only one forward pass of the LLM to get the log probs. However, it needs access to softmax-layer information, which may be unavailable for black-box LLM services.
For example, max and average aggregation can be used to estimate the uncertainty of a sentence i:
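A plausible reconstruction of these two aggregation scores (essentially the Max and Avg negative log-probability measures used in SelfCheckGPT [3]):

$$\mathcal{U}_{\max}(i) = \max_{j}\big(-\log p_{ij}\big), \qquad \mathcal{U}_{\mathrm{avg}}(i) = -\frac{1}{J}\sum_{j=1}^{J}\log p_{ij}$$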
where $p_{ij}$ is the probability of the token at position $j$ of sentence $i$, and $J$ is the number of tokens in the sentence.
👀 Taking the average of negative log probs is essentially measuring uncertainty as 👉 Perplexity (perplexity is just the exponential of this average).
Since LLMs inherently predict probabilities for all possible tokens at each step, they produce a distribution $p(x_j)$ for the $j$-th token. Hence we can leverage entropy, a well-established measure of uncertainty, $H(X_j) = -\sum_{x_j} p(x_j)\log p(x_j)$, to estimate sentence-level uncertainty as follows:
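One plausible sentence-level aggregation, mirroring the max/average treatment above, is to take the mean or the maximum of the token-level entropies:

$$\mathcal{U}_{\mathrm{ent}}(i) = \frac{1}{J}\sum_{j=1}^{J} H(X_j) \quad \text{or} \quad \max_{j} H(X_j)$$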
👉 Uncertainty through Voting: This trick involves generating multiple responses from the LLM for the same prompt. We then analyze the variance (how different or inconsistent the responses are from each other). The idea is that if the LLM keeps spitting out similar answers, it suggests a more consistent and potentially reliable thought process. The more the responses veer off course, the higher the uncertainty. Diving deeper, we can derive two metrics: (1) variation ratio (VR) and (2) variation ratio for original prediction (VRO). In particular, if we sample T responses from the LLM and can measure the difference between two responses $p_i$ and $p_j$ via a function dist(), we have:
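One plausible reconstruction consistent with the description below, where $p_0$ denotes the original response:

$$\mathrm{VR} = \frac{1}{T(T-1)}\sum_{i=1}^{T}\sum_{j\neq i}\mathrm{dist}(p_i, p_j), \qquad \mathrm{VRO} = \frac{1}{T}\sum_{j=1}^{T}\mathrm{dist}(p_0, p_j)$$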
The main difference between the VRO and VR formulas is that VRO only considers the variance between the original response and the additionally generated responses (assigning a weight of 1 to the original response and 0 to the others). Here, the distance function can be the BLEU score, which captures lexical matching:
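For instance (my reconstruction), $\mathrm{dist}(p_i, p_j) = 1 - \mathrm{BLEU}(p_i, p_j)$, so that lexically similar responses have a small distance.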
We can also use BERTScore as the distance function to capture semantic similarity/difference between responses. The SelfCheckGPT paper [3] proposes a similar formula using BERTScore to measure the uncertainty of a response $r_i$:
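As I read [3], the score has roughly the following form, where $\mathcal{B}$ denotes the BERTScore between the original sentence $r_i$ and the $k$-th sentence $s_k^{\,n}$ of the $n$-th sampled passage:

$$\mathcal{U}_{\mathrm{BERT}}(r_i) = 1 - \frac{1}{N}\sum_{n=1}^{N}\max_{k}\,\mathcal{B}\big(r_i,\, s_k^{\,n}\big)$$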
Here, for each of the N sampling iterations, we draw several sentences, select the one most similar to the original response in terms of BERTScore, and then average over the N iterations.
👀 To sample different outputs from LLMs, we may need access to the temperature hyperparameter t. With t=0 the generation is deterministic and always ends up with the same output; t>0 enables more stochasticity in the generation process. Other than that, the voting approach works well even with black-box LLMs.
👉 Uncertainty through Perturbation: One fascinating aspect of LLMs is their inherent randomness during text generation. Like a chain reaction, a tiny change in one predicted word can ripple through the entire sequence, potentially leading to completely different meanings. This stochastic nature highlights the sensitivity of LLMs throughout the prediction process and we can measure it by:
Choose a token and replace (perturb) it with other tokens (the top-k highest-probability alternatives). This yields several perturbed responses.
Compute the variance over these responses as the uncertainty score, as in the voting mechanism above.
🧠 Which token should we mess with? The authors propose three options:
Most Uncertain Spot: This refers to the position in the generated sequence where the LLM itself seems unsure about which word to pick next (highest entropy, Max).
Most Confident Spot: This is the opposite of point 1, where the LLM seems very certain about the word it chose (lowest entropy, Min).
Biggest Shift: This focuses on the point where the LLM's confidence changes the most compared to the previous token (maximum change in entropy, MaxDiff).
👀 This approach requires heavy intervention in the LLM's computation process, and is thus more suitable for the white-box setting.
The research found that getting the LLM to vote on multiple responses is the best way to gauge uncertainty, followed by tweaking the text and looking at the changes, and lastly, simply looking at the probabilities the LLM assigns to each word.
🧪 Voting > Perturbation > Probability Aggregation
Quantifying Uncertainty with Information Theory
Continuing the line of reasoning that multiple samples help hallucination detection, we can dig deeper into the hidden states of the LLM instead of only probing its outputs. Concretely, Chen et al. (2024) sample responses multiple times, generating multiple hidden states and feature vectors that provide richer information about the LLM's confidence in its responses. The authors propose an uncertainty metric based on the eigenvalues of the covariance matrix of these feature vectors, 👉 EigenScore [7].
In particular, given K hidden-state vectors composing a matrix Z, they compute the covariance matrix:
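With the K sentence embeddings stacked as columns of $Z \in \mathbb{R}^{d\times K}$, the centered covariance reads roughly as follows (my reconstruction of [7]):

$$\Sigma = Z^{\top} J_d\, Z \;\in\; \mathbb{R}^{K\times K}$$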
where $J_d = I_d - \frac{1}{d}\mathbf{1}_d\mathbf{1}_d^{\top}$ is the centering matrix and $\mathbf{1}_d \in \mathbb{R}^d$ is the all-one column vector. Then, EigenScore can be calculated as the logarithm of the determinant (LogDet) of the covariance matrix:
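As I read [7], the score is, up to notation:

$$\mathrm{EigenScore} = \frac{1}{K}\,\log\det\big(\Sigma + \alpha I_K\big)$$

where $\alpha$ is a small regularization constant that keeps the determinant well defined.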
👀 The EigenScore corresponds to the differential entropy in the sentence embedding space under a Gaussian assumption, so it is a reasonable measure of uncertainty.
The authors also suggest clipping the features during the computation of EigenScore to reduce over-confident estimates:
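A natural form of this clipping (my reconstruction), applied elementwise to each feature value h:

$$\tilde{h} = \min\big(\max(h,\, h_{\min}),\, h_{\max}\big)$$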
where hmin and hmax are hyperparameters that can be tuned or calibrated.
More principled research [4] sheds light on LLM hallucinations by revisiting the basics of uncertainty in machine learning. It turns out there are two big types:
Epistemic Uncertainty: When the LLM just doesn't know enough (think facts or grammar rules). This can happen because it hasn't seen enough training data or just isn't powerful enough yet.
Aleatoric Uncertainty: This is when the question itself is tricky. Imagine there are multiple right answers, making it a guessing game even for the smartest LLM. Note that this kind of uncertainty is common in LLM settings because there can be many ways to generate reasonable responses.
So, the lower the epistemic uncertainty, the more likely the LLM's answer is on point. Since aleatoric uncertainty is not the fault of the model and we cannot do anything about it, it is important to differentiate the two sources of uncertainty.
❌ The problem with heuristic approaches is that they only measure LLM uncertainty as a whole, not inherent ambiguity in the problem itself (aleatoric uncertainty). This can be misleading. For example, a perfect predictor might have high aleatoric uncertainty, while a bad one might only have high epistemic uncertainty. Both would appear equally uncertain under heuristic methods.
Therefore, the authors propose to focus on identifying instances where only the epistemic uncertainty is large, which would suggest that the response is likely hallucinated. To this end, they propose 👉 epistemic uncertainty via an iterative prompting procedure.
Here's the trick: first, they ask the model to respond to a query. Then, they ask for a second response given the query plus the first response. After that, they request a third response given the query and the first two responses, and so on. If the LLM keeps changing its answer across rounds, it suggests a lack of confidence in its knowledge. In contrast, if the LLM's answers are insensitive to the concatenation of its previous responses, it indicates a stronger grasp of the topic.
👀 In other words, when the model knows the answer, its responses should be (nearly) independent: for a fixed query, their joint distribution should be close to a product distribution.
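As a minimal sketch of this iterative prompting loop (the `llm_generate` callable and the exact wording used to append previous answers are placeholders, not the prompt template from [4]):

```python
def iterative_prompting(llm_generate, query, num_rounds=4):
    """Probe epistemic uncertainty by repeatedly appending the model's own
    previous answers to the prompt and checking whether its answer drifts.

    llm_generate(prompt) -> str is whatever completion API you use."""
    responses = []
    prompt = query
    for _ in range(num_rounds):
        answer = llm_generate(prompt)
        responses.append(answer)
        # F_t: combine the query with all answers produced so far
        prompt = query + "\nPossible answers so far: " + "; ".join(responses)
    # Stable answers across rounds suggest low epistemic uncertainty
    is_consistent = len(set(responses)) == 1
    return responses, is_consistent
```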
To illustrate the point, the authors track the probability the LLM assigns to the correct answer as the iterative prompting procedure progresses:
👀 Why? Intuitively, if the question was seen during training, the attention key and query weights of the LLM are tuned to project the question to higher attention scores than other sentences. Thus, the question is attended to the most regardless of the context length, and the LLM always gets a chance to look at the question and give the right answer. On the contrary, if the question is novel, the weights cannot do much, and as the context gets longer the attention can drift anywhere, leading the LLM to attend to the wrong part of the input when answering.
In short, the iterative prompting procedure gives us a hint about the uncertain behavior of the LLM. With the right motivation in place, we can now derive a more robust uncertainty score. Formally, given a query x ∈ X and possible responses Y1, . . . , Yt, a family of prompt functions F = {Ft : X → X | t ∈ N} is defined, where Ft(x, Y1, . . . , Yt) combines the query with the responses generated so far:
Then, we can model the distribution of the sequence of responses given the query x:
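A plausible reconstruction, with $F_0(x) = x$:

$$Q(Y_1, \dots, Y_t \mid x) \;=\; \prod_{s=1}^{t} P_{\mathrm{LLM}}\big(Y_s \,\big|\, F_{s-1}(x, Y_1, \dots, Y_{s-1})\big)$$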
👀 The chain rule is only approximate because the prompt function Ft is used to combine the random variables; hence the result is a pseudo joint distribution.
Given this formulation of the joint distribution, it is intuitive to say that the LLM's responses Y1,…,Yn|x are wrong if the LLM's joint probability of Y1,…,Yn|x is unlike the ground-truth probability of Y1,…,Yn|x. Thus, a natural metric for the truthfulness of the LLM's output is the KL divergence between the LLM's joint distribution and the ground-truth joint distribution. Yet, we don't know the ground-truth distribution. Fortunately, we can replace the KL with an estimable lower bound:
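One way to write such a bound: if the ground-truth responses are independent given the query (a product distribution $P_1 \otimes \cdots \otimes P_n$), then the mutual information of the pseudo joint distribution, i.e., its KL divergence to its own product of marginals, is a lower bound, because the product of marginals is the closest product distribution in KL:

$$I(Q) \;=\; D_{\mathrm{KL}}\big(Q \,\|\, Q_1 \otimes \cdots \otimes Q_n\big) \;\le\; D_{\mathrm{KL}}\big(Q \,\|\, P_1 \otimes \cdots \otimes P_n\big)$$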
Computing the exact mutual information requires evaluating Q over its entire support, which can be infinite. Therefore, the authors propose to estimate the term via a sampling-based approximation. In particular:
Sample k sequences of responses X1, . . . , Xk from the LLM.
Construct the set of indices of unique elements: S = { i ∈ [k] : Xi ≠ Xj for all j < i }.
Construct empirical distributions: for all i ∈ S:
Finally, compute the estimated mutual information:
where 𝛾 and k are hyperparameters.
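To make the idea concrete, here is a simplified plug-in estimate for the two-response case; it is a sketch under my own simplifications (the actual estimator in [4] handles full response sequences and uses γ differently), with γ acting only as a smoothing constant:

```python
from collections import Counter
import math

def estimate_mi(pairs, gamma=1e-3):
    """Plug-in mutual-information estimate between the first and second response
    of each sampled pair (y1, y2). High MI = responses depend on each other =
    high epistemic uncertainty."""
    k = len(pairs)
    joint = Counter(pairs)
    marg1 = Counter(y1 for y1, _ in pairs)
    marg2 = Counter(y2 for _, y2 in pairs)
    mi = 0.0
    for (y1, y2), count in joint.items():
        p_joint = count / k
        p_prod = (marg1[y1] / k) * (marg2[y2] / k)
        mi += p_joint * math.log(p_joint / (p_prod + gamma))  # gamma keeps the ratio bounded
    return mi

# A model that simply echoes whatever answer appeared before it: high MI, likely unreliable
samples = [("Paris", "Paris"), ("Lyon", "Lyon")] * 5
print(estimate_mi(samples))  # roughly log 2
```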
Model-based Hallucination Detection
LLMs as Evaluators
It seems like a chicken-and-egg problem to use LLMs to detect LLMs’ falsehood – 🧠 how can an LLM identify a lie if it chose to generate the lie in the first place? Interestingly, early research has shown that they can do it [5]. It is reasonable since humans also exhibit this behavior. We often make mistakes and only realize them upon reflection. Similarly, when interacting with language models, they can acknowledge and correct their errors when shown their mistakes.
Enhancing validation reliability is possible by employing a more robust LLM to verify the outputs of a less powerful one. This approach, commonly used by the open-source community for benchmarking LLM improvements, involves leveraging a larger or more advanced model to assess the accuracy of a smaller or less-developed model. The detection framework is very simple:
👀 Simple evaluation prompts may not work well all the time, especially when the LLM Evaluator is not stronger than the main LLM.
Improving the accuracy of the LLM evaluator requires special methods [5]. One property the research uncovered is that LLMs are surprisingly well calibrated on multiple-choice and true/false questions. Put simply, the probabilities they assign to the options are somewhat reliable.
👀 This property is more evident when the multiple-choice question has a suitable format. For example, if there is a "None of the above" option, the quality of calibration may degrade. We may also need to tune the temperature t to obtain well-calibrated probabilities.
Thus, they propose a simple trick to use 👉 LLM Prompting without finetuning to make the evaluation more accurate:
Present the response to the LLM Evaluator and ask if the response is True or False.
Measure the probability P("True") that the LLM Evaluator assigns to the token "True".
An example to illustrate the evaluation prompt:
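A minimal illustration in the spirit of [5]; the prompt wording below is an approximation of the paper's format, and `token_logprobs` is a placeholder for whatever API returns the evaluator's log-probabilities over candidate next tokens:

```python
import math

def p_true(token_logprobs, question, proposed_answer):
    """Score a proposed answer by the probability the evaluator assigns to 'True'."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed_answer}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )
    logprobs = token_logprobs(prompt, candidates=[" True", " False"])
    p_t, p_f = math.exp(logprobs[" True"]), math.exp(logprobs[" False"])
    return p_t / (p_t + p_f)  # normalize over the two options
```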
To enhance accuracy, the authors suggest presenting the model with additional samples for comparison. For example, we can generate a total of five responses and then ask the model to assess the validity of one of them: the LLM's original response.
👀 The result can be further improved with few-shot prompting techniques. In short, we can say that Comparison (Few-shot) > Comparison > One Proposed Answer.
In addition to prompting, the authors also propose 👉 finetuning LLMs for the detection task. Concretely, they train LLMs to predict whether they know the answer to any given free-form question, i.e., estimating P(IK) ("I know"), using two approaches:
Value Head Integration: This approach introduces an additional "head" to the LLM architecture. This head is specifically trained to predict P(IK) as a logit value. A key advantage of this method lies in its flexibility. We can probe the value head at any point during text generation, allowing for dynamic uncertainty assessment.
Natural Language Prompt-based Training: This approach leverages natural language processing (NLP) techniques. They train the LLM to respond to the prompt: "With what confidence could you answer this question?" The model's target output is a human-readable percentage value (0% - 100%) reflecting its estimated confidence level in answering the question. This method offers a more intuitive interpretation of uncertainty for users.
❌ Unfortunately, the Natural Language Prompt-based Training approach fails, so the authors only follow the Value Head Integration approach.
When training a model, it's essential to prepare training data. Similar to training other conformal predictors, we need data in a binary classification format:
X: the input and response from the LLM
Y: whether the response is correct or not.
In practice, they generated 30 response samples per question. If 20 samples were deemed correct, they would add 20 positive-label data points to the training set, indicating the model "knew" the answer; the 10 incorrect samples would contribute 10 negative-label data points. The LLM is then finetuned so that the value head's output matches these ground-truth labels.
The results indicate that finetuning generally helps the model distinguish between correct and incorrect responses. In the in-distribution setting across datasets, the LLM's predicted P(IK) aligns somewhat with the ground truth. However, when generalized to a different dataset (from TriviaQA to Mixed-Arithmetic), this differentiation becomes less clear.
Despite the initial promising results, there are significant limitations with the approach:
❌ High Detection Cost: The cost of detection is high because it relies on LLMs. These models require substantial computational resources and energy, leading to increased expenses in terms of both hardware and operational costs.
❌ Insufficient Reliance on Textual Responses: Simply relying on the textual response to determine truthfulness is inadequate. Textual responses alone cannot comprehensively reveal the correctness of the information because LLMs are very good at making things look real.
Simple Conformal Predictor
The key to catching hallucinations might lie within the LLMs themselves. By peering deeper into their internal workings, we could extract valuable clues about their current state and what they "believe" to be true. This richer information would significantly boost the accuracy of hallucination detection. Think of it this way: with a clearer picture of the LLM's thought process, we wouldn't need such a complex detector. Even a simpler classifier could do the job if we have the right features to analyze. The workflow becomes:
This simple idea is attractive because a simple conformal predictor such as a feed-forward neural network can be used to perform the detection. However, the nature of these features can pose several challenges:
🤔 Is it easy and cost-effective to extract these features?
🤔 Are these features informative and can they generalize well to new prompts and different large language models (LLMs)?
Now, the main question is: 🧠 Which features should we extract? One candidate is 👉 the internal states of the LLMs. Recent works have found that the internal states of LLMs are a reliable source of information for detecting the truthfulness of the final response [6,7].
👀 It is important to note that this must be the internal states, not the response text or the response embedding vector.
As quoted in their paper:
We hypothesize that the truth or falsehood of a statement should be represented by, and therefore extractable from, the LLM’s internal state.
Source: [6]
Ok, let's use the internal states, which are represented by the hidden layers of the LLMs. 🧠 Which layers should we use? Intuitively, the last hidden layer seems like a good candidate: it should theoretically hold all the processed information. But there's a catch: this layer is primarily focused on predicting the next word in the sequence, not necessarily retaining long-term context. Conversely, layers closer to the input are better at extracting basic features from the data, but might not capture the bigger picture.
To find out, the authors try several hidden layers of the LLM as features. For each chosen layer, the feature vector can simply be the average of the hidden states across token timesteps or the last token's hidden state. The results reveal that the middle layers perform best:
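A sketch of probing a middle layer with a simple classifier, assuming a HuggingFace-style causal LM; the model name, layer index, and the use of logistic regression are illustrative choices, not the exact setup of [6]:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM with accessible hidden states
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def middle_layer_feature(text, layer=16):
    """Return the chosen layer's hidden state at the last token as a feature vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, hidden_dim]
    return out.hidden_states[layer][0, -1, :].float().numpy()

# X: one feature vector per (prompt + response) text, y: 1 if the response is hallucinated
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```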
Recently, a more detailed investigation into the hidden states of LLMs aims to find out if these internal states can signal the risk of hallucination based on the given queries [8]. The goal is to see if we can reliably estimate this risk even before the LLM generates a response, i.e., 👉 self-awareness.
Self-awareness is what causes humans to hesitate before responding to queries or making decisions in situations where we recognize our lack of knowledge (we know what we don't know). The authors want to verify this ability in LLMs by studying their internal states.
Concretely, they use the internal state corresponding to the last token of the query, denoted as xq. The conformal predictor, or estimator, is a variant of a multilayer perceptron (MLP) adapted from Llama's architecture. The estimator is trained to predict the hallucination risk H as:
They also prepare a dataset containing both known and unknown queries for the LLMs. The LLMs are expected to be uncertain about the unknown queries, which they have never encountered before. They train the estimator on the dataset and compare it with simple baselines such as Perplexity and Prompting to illustrate the point that internal states are really good indicators of uncertainty for unknown queries.
🧪 Internal-State Conformal Predictor > PPL > ICL Prompt > Zero-shot Prompt.
Hidden states, while powerful tools within Large Language Models (LLMs), come with inherent limitations:
❌ Architecture Dependence: Extracting hidden states is intrinsically tied to specific LLM architectures and models. This creates a roadblock when transferring the extraction process across different LLMs. Each LLM architecture might require unique approaches to access and interpret its hidden states.
❌ Sensitivity and Generalizability: Hidden states are demonstrably sensitive to the input data and to the specific LLM they are extracted from. This sensitivity poses a significant challenge to generalizability: a conformal predictor trained on hidden states from a particular dataset might not perform well when applied to hidden states derived from a different dataset or a different LLM.
In a promising new direction, researchers have proposed an alternative approach that bypasses hidden states altogether. This method, 👉 Lookback Lens, extracts features directly from the attention scores produced by the LLM. The focus on attention is motivated by the fact that it reveals how much the LLM considers the given context when generating text, which is especially valuable compared to other internal workings. Since attention provides a human-understandable measure, it becomes a powerful tool for catching and fixing made-up information (hallucinations) in the generated text.
When a Transformer-based LLM performs attention, it attends both to the context tokens and to its newly generated tokens. The authors aggregate the attention scores at attention head h and layer l separately for these two types of attention:
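Reading [9], the aggregation is roughly the average attention weight $\alpha$ that the current step t places on context tokens versus previously generated tokens:

$$A^{l,h}_{\mathrm{context}}(t) = \frac{1}{N}\sum_{i=1}^{N}\alpha^{l,h}_{t,i}, \qquad A^{l,h}_{\mathrm{new}}(t) = \frac{1}{t-1}\sum_{j=N+1}^{N+t-1}\alpha^{l,h}_{t,j}$$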
Here N is the number of tokens in the context and t indexes the newly generated tokens. The lookback ratio for head h in layer l at time step t then compares these two aggregated attentions:
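As I read [9], it is simply the fraction of aggregated attention mass that lands on the context:

$$LR^{l,h}_{t} = \frac{A^{l,h}_{\mathrm{context}}(t)}{A^{l,h}_{\mathrm{context}}(t) + A^{l,h}_{\mathrm{new}}(t)}$$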
👀 Intuitively, if the LLM focuses more on the context (the ratio is higher), it tends to be more reliable and less prone to hallucination.
Of course, we can combine different layers, heads, and timesteps to form a single feature vector for a span of generated text Y = {yt, yt+1, ..., yt+T−1}. Given this feature vector, we can employ a simple classifier (logistic regression) to detect whether the span is factual or hallucinated.
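A rough sketch of extracting these features from a HuggingFace-style model called with `output_attentions=True`; the indexing details (for example, whether the current token counts as "new") follow my reading of [9] and may differ from the official implementation:

```python
import torch

def lookback_ratio_features(attentions, num_context_tokens):
    """attentions: tuple over layers of tensors [batch, heads, seq_len, seq_len]
    (e.g. model(..., output_attentions=True).attentions for a full sequence).
    Returns a tensor [layers, heads, num_new_tokens] of lookback ratios."""
    per_layer = []
    for layer_attn in attentions:
        attn = layer_attn[0]  # [heads, seq_len, seq_len]
        seq_len = attn.shape[-1]
        per_step = []
        for t in range(num_context_tokens, seq_len):
            a_context = attn[:, t, :num_context_tokens].mean(dim=-1)
            prev_new = attn[:, t, num_context_tokens:t]
            a_new = prev_new.mean(dim=-1) if prev_new.shape[-1] > 0 else torch.zeros_like(a_context)
            per_step.append(a_context / (a_context + a_new + 1e-8))
        per_layer.append(torch.stack(per_step, dim=-1))  # [heads, num_new_tokens]
    return torch.stack(per_layer)  # [layers, heads, num_new_tokens]

# Flatten across layers/heads (and average over a text span) to get a feature
# vector for a logistic-regression hallucination classifier.
```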
The experimental results are promising, at least for summarization tasks, where attending to the context is crucial.
🧪 Lookback Lens > Hidden States > Prompt
The results also reveal that Lookback Lens might not always fit the training data perfectly, but it consistently performs better on completely different (out-of-domain) tasks. Because Lookback Lens analyzes attention maps (lookback-ratio features), which are more robust than hidden states, it is powerful and adaptable, making it useful for a wider range of problems.
Given the Lookback ratio as a score, we can measure the factuality or confidence of different generated candidates:
Final Thoughts: The Future of LLM Hallucination Detection
While the methods explored here offer promising avenues for detecting LLM hallucinations, there's still much room for exploration. Future research directions include:
Improved uncertainty estimation for LLMs: Refining techniques for LLMs to better quantify their uncertainty about generated content.
Novel methods for leveraging internal LLM states: Exploring techniques to analyze internal LLM representations to glean deeper insights into the generation process and identify potential hallucinations.
Integration with factual knowledge bases: Developing frameworks that seamlessly integrate LLM outputs with external knowledge sources to verify factual consistency and enhance detection accuracy.
Benchmarking and interpretability: Establishing standardized benchmarks for evaluating hallucination detection methods and fostering interpretable models that provide clear explanations for their decisions.
By addressing these challenges, we can move towards a future where LLMs are reliable partners in human endeavors, offering creative and informative outputs while minimizing the risk of misleading information. This will be crucial for fostering trust and wider adoption of LLM technology across various domains.
References
[1] Guo, Chuan, et al. “On calibration of modern neural networks.” International conference on machine learning. PMLR, 2017.
[2] Huang, Lei, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen et al. "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions." arXiv preprint arXiv:2311.05232 (2023).
[3] Manakul, Potsawee, Adian Liusie, and Mark JF Gales. "Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models." arXiv preprint arXiv:2303.08896 (2023).
[4] Yadkori, Yasin Abbasi, Ilja Kuzborskij, András György, and Csaba Szepesvári. "To Believe or Not to Believe Your LLM." arXiv preprint arXiv:2406.02543 (2024).
[5] Kadavath, Saurav, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer et al. "Language models (mostly) know what they know." arXiv preprint arXiv:2207.05221 (2022).
[6] Azaria, Amos, and Tom Mitchell. "The internal state of an LLM knows when it's lying." arXiv preprint arXiv:2304.13734 (2023).
[7] Chen, Chao, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. "INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection." In The Twelfth International Conference on Learning Representations.
[8] Ji, Ziwei, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, and Pascale Fung. "LLM Internal States Reveal Hallucination Risk Faced With a Query." arXiv preprint arXiv:2407.03282 (2024).
[9] Chuang, Yung-Sung, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, and James Glass. "Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps." arXiv preprint arXiv:2407.07071 (2024).