Al Noor Ali Aziz Jasem
Department
of Information System, College of Computer Science and Information Technology,
University of Sumer, Dhi Qar, Iraq
Ali963852@gmail.com
Abstract
Hallucination remains one of the most fundamental issues for LLMs,
especially in high-stakes applications that require factual reliability and
traceability. Although Retrieval-Augmented Generation (RAG) shows great
potential as a grounding mechanism, it still fails to completely prevent
unsupported claims or citation hallucinations. To address this issue, we
present a new experimental paradigm that combines (1) cross-modal
retrieval-based grounding, (2) multi-layer self-verification, and (3) semantic
entropy-based uncertainty gating for dynamically controlling verification
effort. Motivated by semantic entropy-based hallucination detection
techniques, the proposed model triggers additional validation once semantic
uncertainty surpasses a learned threshold. The framework focuses on
evidence-based accuracy, citation validity, calibration, and computational
efficiency. A complete methodological protocol, performance measures, and a
deployment-friendly architecture are presented.
Keywords: Large Language Models, Hallucination Detection, Retrieval-Augmented
Generation, Semantic Entropy, Self-Verification, AI Reliability.
I. INTRODUCTION
Large Language Models (LLMs) are large-scale language models that have
shown exceptional performance across a range of natural language processing
tasks such as question answering, reasoning, summarization, and knowledge
consolidation. Nevertheless, these systems share a common drawback that
impairs their trustworthiness: hallucination, i.e., the generation of
factually incorrect or unverifiable content delivered with high linguistic
confidence.
Hallucination is especially dangerous in critical applications such as
healthcare decision support, legal analysis, and financial or enterprise
knowledge management. According to empirical evidence, hallucinations may
emerge from various causes, including parametric knowledge extrapolation,
distributional shift, retrieval noise, and instability in autoregressive
decoding [2], [3]. In addition, hallucinated outputs can appear semantically
coherent and grammatically correct, so naive confidence-based filtering does
not work [13].
Retrieval-Augmented Generation (RAG) is a prominent strategy for reducing
these limitations by grounding model outputs in external corpora [4]. Although
RAG enhances factuality, it does not eliminate unsupported claims, citation
inflation, or evidence misalignment. In parallel, uncertainty-estimation
methods have been suggested for detecting untrustworthy generations. However,
existing token-level entropies are unable to distinguish between superficial
lexical variation and true semantic variation.
More recently, semantic entropy has been introduced as a measure for
uncertainty in meaning space rather than token space [1]. This measure has
shown good consistency with hallucinated outputs. However, current methods
often perform uncertainty estimation as a post-processing step instead of
incorporating it into the generation-control procedure.
A. Limitations of Existing Mitigation Strategies
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) integrates parametric model
knowledge with non-parametric external memory [4]. By conditioning generation
on retrieved documents, RAG aims to ground outputs in verifiable evidence. This
approach improves factual consistency and traceability in many practical
systems.
However, RAG does not eliminate hallucination. Several limitations
persist:
· Evidence Misalignment: The generator may selectively attend to
irrelevant retrieved passages.
· Citation Fabrication: Models may generate references that are not
present in the retrieved corpus.
· Context Window Saturation: Excessive document concatenation may degrade
attention quality.
· Overconfidence Under Sparse Evidence: The model may still respond
definitively when retrieval confidence is weak.
Thus, retrieval alone does not guarantee reliability.
Uncertainty Estimation
Uncertainty estimation methods attempt to identify unreliable outputs
using probabilistic measures [5]. Conventional approaches compute token-level
entropy:

H_{\text{token}}(y) = -\sum_{t} p(y_t \mid y_{<t}) \log p(y_t \mid y_{<t})
While useful for measuring lexical dispersion, token entropy fails to
distinguish between:
- Superficial phrasing variability, and
- Genuine semantic disagreement.
Therefore, token-level entropy may underestimate epistemic instability.
Recent advances propose semantic entropy, which estimates uncertainty in
meaning space rather than surface form [1]. By generating multiple independent
responses and clustering them according to semantic equivalence, semantic
entropy captures dispersion at the conceptual level:

SE(x) = -\sum_{c \in C} p(c \mid x) \log p(c \mid x),

where C is the set of semantic-equivalence clusters over the sampled
responses and p(c | x) is the fraction of samples assigned to cluster c.
Empirical evidence shows that high semantic entropy is strongly correlated
with the occurrence of hallucination events [1]. However, the majority of
previous work uses semantic entropy for post-hoc detection rather than
real-time generation control.
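To make the cluster-level computation concrete, the following minimal sketch computes entropy over semantic clusters, assuming the cluster assignments for the k sampled answers are already available (e.g., from entailment-based grouping); the function name and inputs are illustrative, not from the paper:

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels):
    """Entropy over semantic clusters of k sampled answers.

    cluster_labels holds one cluster id per sampled response, produced
    beforehand by grouping answers that share the same meaning.
    """
    k = len(cluster_labels)
    counts = Counter(cluster_labels)
    # Entropy over the empirical cluster distribution p(c) = n_c / k.
    return -sum((n / k) * math.log(n / k) for n in counts.values())

# All five samples agree on one meaning: zero semantic uncertainty.
low = semantic_entropy([0, 0, 0, 0, 0])
# Five samples with five distinct meanings: maximal uncertainty, log(5).
high = semantic_entropy([0, 1, 2, 3, 4])
```

Unlike token-level entropy, this value is unaffected by paraphrases: ten differently worded but semantically identical answers all land in one cluster and yield zero entropy.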
Self-Verification and Self-Consistency
Self-consistency approaches average predictions over multiple reasoning
paths. Although effective in reasoning tasks, these methods do not directly
check whether a claim is supported or refuted by the evidence.
Self-verification baselines seek to critique or revise model predictions but
often lack principled activation policies and can incur significant
computational cost.
B. Research Gap
Notwithstanding significant advances in grounding, uncertainty modeling,
and verification three structural constraints persist:
1. Non-integrated Control: Uncertainty quantification is seldom
integrated into generation control loops.
2. Lack of Adaptive Verification: Verification pipelines are typically
applied uniformly, which incurs unnecessary computation.
3. Scattered Reliability Design: Retrieval grounding, uncertainty
modeling, and validation are mostly realized in a decoupled manner rather
than designed as an integrated reliability-aware system from the outset.
Thus, there is no integrated approach to adaptively focus verification
effort according to semantic-level uncertainty and computation efficiency while
maintaining tight evidence alignment.
C. Objective and Proposed Approach
This work proposes a unified reliability-aware architecture that
integrates:
· Hybrid retrieval grounding for robust evidence acquisition,
· Semantic entropy–based uncertainty gating for adaptive control,
· Conditional self-verification for structured claim validation,
· Evidence-regulated regeneration for final response refinement.
The central hypothesis is that hallucinations correspond to elevated
semantic instability; therefore, verification effort should be allocated
adaptively according to semantic uncertainty.
Formally, let:
· q denote the input query,
· E = {d_1, ..., d_k} the retrieved evidence set,
· y the generated response.
The objective is to generate a response y such that:
· Every factual claim in y is supported by E,
· The probability of hallucination is minimized,
· Confidence is calibrated to correctness likelihood,
· Expected computational cost is bounded.
D. Contributions
The primary contributions of this work are summarized as follows:
1. Semantic Entropy Gating Mechanism: We present an adaptive gating
mechanism that activates verification only when semantic entropy exceeds a
learned threshold.
2. Conditional Self-Verification Pipeline: We devise a four-phase
structured verification process consisting of claim segmentation, evidence
matching, contradiction detection, and unsupported-claim filtering.
3. Hybrid Retrieval Optimization: We combine lexical and dense retrieval
with fusion-based ranking to enhance evidence alignment and reduce
retrieval-induced hallucination.
4. Reliability–Efficiency Trade-off Formalization: We introduce an
expected-cost model showing that gated verification reduces overhead compared
to unconditional multi-pass verification.
5. Deployment-Oriented Architecture: The system is designed to scale to
enterprise-wide deployment, with traceability, abstention capability, and
calibrated confidence reporting.
II. LITERATURE REVIEW
This section synthesizes prior work relevant to
self-reflective LLMs and reliability in generative AI, with emphasis on (i)
hallucination mechanisms and taxonomies, (ii) retrieval-grounded generation,
(iii) uncertainty estimation particularly semantic entropy, (iv)
self-verification and verification-and-validation (V&V) perspectives, and
(v) explainability, fairness, and deployment considerations. The review also
positions the proposed framework within existing research and clarifies the
specific gaps it addresses, as shown in Fig 1.
Fig. 1. Taxonomy of Reliability Mechanisms in Generative AI.
A. Hallucination in LLMs: Definitions, Taxonomies, and
Root Causes
Hallucination has been identified as a major reliability challenge in LLM
deployments. In broad surveys, hallucination is formalized as the production
of content that contradicts the source evidence or factual reality while
being uttered with high confidence [2], [3]. These surveys present taxonomies
that distinguish intrinsic hallucinations (content that contradicts the
source input) from extrinsic hallucinations (propositions or inferences that
cannot be verified against the source) [2], [3], as well as retrieval-induced
hallucinations (errors induced or exaggerated by noisy retrieval contexts).
The root causes identified include limits of parametric representation,
autoregressive decoding with exposure bias, distributional shift between
pretraining and deployment domains, and the model's tendency to generate
fluent continuations even under low-evidence conditions [2], [3]. Several
studies also point out that hallucination is not just a "random error" but
can be structurally systematic, particularly in knowledge-rich tasks where
the model may prioritize coherence over faithful reflection of truth [10]. In
multilingual settings, hallucination can appear as translation confabulation,
where the model produces text absent from the input, undermining reliability
across languages and tasks [11]. Collectively, these works argue for
architectural interventions that suppress hallucination rather than
superficial heuristic filtering, highlighting the importance of architectures
that include verification and uncertainty awareness during generation [2],
[3], [8].
B. Retrieval-Augmented Generation: Grounding Strengths
and Persistent Weaknesses
One practical direction reduces hallucination by endowing systems with the
ability to ground their responses in external corpora and make evidence
traceable [4]. By generating from retrieved passages, RAG relieves the burden
on parametric memory, which is rarely perfect, and adapts better to evolving
knowledge, especially in enterprise and scientific domains [4], [16]. Yet the
literature consistently notes that retrieval is not reliable in and of
itself. Evidence misalignment occurs when the model attends to irrelevant
retrieved passages or generalizes from insufficient evidence; citation
hallucination occurs when generated references do not correspond to the
retrieved sources or are used incorrectly [2], [4]. Combining lexical and
dense retrieval is an increasingly common alternative for mitigating the
recall and robustness shortcomings of single-mode retrievers, particularly
for paraphrased queries and long-tail terminology [15]. Bruch et al. study
fusion strategies for hybrid retrieval and demonstrate that rank-fusion
methods can stabilize retrieval quality across modalities, a crucial
requirement for downstream factuality [15]. Even with better retrieval,
however, the generator may still produce false claims from weak or
inconsistent evidence, which suggests that RAG benefits from verification and
uncertainty-aware gating rather than blind reliance on retrieved context [2],
[4], [8].
C. Uncertainty Estimation: From Token Entropy to
Semantic Entropy
Uncertainty estimation is frequently suggested as a way to identify
questionable outputs and calibrate system-level decision-making [5]. Earlier
work demonstrates that token-level uncertainty scores do not correspond
directly to correctness, as lexical variation can be high even when meaning
is retained and low even when it is lost [5], [13]. Calibration studies also
show that LLM probability outputs are often poorly aligned with correctness,
particularly in question answering, pointing to the need for better
uncertainty indicators and calibration-aware architectures [13]. Semantic
entropy offers an important advance: it computes uncertainty in meaning space
by grouping multiple sampled outputs by semantic equivalence and calculating
entropy over the resulting semantic clusters [1]. Farquhar et al. show a
strong correlation of semantic entropy with hallucination and confabulation,
especially in long-form generation where surface-form metrics struggle [1].
The implication is that semantic uncertainty can provide a principled signal
for adaptive verification, allowing systems to defer costly validation until
semantic stability is low [1], [5]. This insight inspires the entropy-gated
verification policy of our proposed framework, which aims to control
generation actively rather than detect hallucination post hoc.
D. Self-Verification, Verification-and-Validation, and
Self-Reflective LLMs
Self-reflective LLM methods aim to enhance reliability by allowing the
model to criticize, modify, or validate its own inferences. Although the
literature catalogues several "self-correction" behaviors, the broader safety
and trust community increasingly describes reliability in V&V-like terms,
with emphasis on structured checks, traceable evidence, and quantifiable
certainty [8]. The V&V lens indicates that trustworthiness requires explicit
machinery for claim validation, contradiction detection, and abstention in
high-stakes environments [8]. A common drawback of self-verification methods,
however, is that verification is applied to all outputs with equal weight,
incurring unnecessary computation regardless of question difficulty or
evidence quality. Likewise, self-consistency-based approaches enhance output
precision by aggregating multiple outputs without explicitly enforcing
evidence grounding or detecting citation abuse [2], [3]. There is thus space
for methods that provide (i) grounding via RAG, (ii) principled uncertainty
signals such as semantic entropy, and (iii) structured verification routines
invoked only when risk is high — precisely the integration gap [5], [8]
targeted by our architectural design.
E. Explainability, Traceability, and Enterprise-Grade
Trust Requirements
Interpretability and traceability are critical for building trust at
deployment. Surveys on explainability for LLMs highlight the need for
procedures that justify outputs, reveal evidence paths, and support auditing,
especially in fields with compliance mandates [7]. Generative-XAI agendas
assert that explanations should be accountable to evidence and model behavior
rather than offer shallow rationales [9]. These positions are consistent with
evidence-based generation and citation verification, which enable audit and
post-hoc accounting. The enterprise deployment literature suggests that trust
in, and satisfaction with, a system are influenced by system transparency,
retrieval quality, and linkage to supporting information [16]. Therefore,
systems that expose citations and verification status are to be favored over
those that offer fluent outputs without traceable provenance [7], [9], [16].
The proposed framework addresses these requirements directly by emitting
evidence citations, verification results, and uncertainty-derived confidence
scores [1], [13].
F. Knowledge Graphs, Bias, Fairness, and Safety
Constraints
Beyond factuality, trustworthiness also encompasses bias, fairness, and
systemic harm. Surveys on bias and fairness in LLMs examine the extent to
which models may encode harmful stereotypes, exhibit performance differences
across demographic groups, and inherit distributional biases from training
data [12]. General fairness surveys in machine learning offer additional
frameworks for auditing and mitigation [14]. These concerns matter in
reliability-oriented designs, since verification steps and retrieval corpora
can also amplify or dampen viewpoints, producing subtle failure modes if left
unchecked [12], [14]. Some work introduces structured knowledge sources such
as knowledge graphs to enhance grounding and mitigate hallucinations by
restricting generation to grounded relations [6]. While promising, this
approach usually requires maintaining a curated graph, which may be
impractical in fast-evolving enterprise scenarios. Retrieval-based grounding
supplemented with verification therefore remains a viable, practical avenue,
particularly when combined with V&V-like safety principles [6], [8]. The
proposed framework remains compatible with knowledge graphs as a potential
evidence source while staying general for unstructured corpora [6], as Table
1 indicates.
Table 1. Representative Literature and
Contributions
Reference | Contribution | Limitation
[2], [3], [11] | Definitions, causes, and task-specific analysis | Often descriptive; limited unified control mechanisms
[4], [15], [16] | Evidence retrieval and grounded generation | Misalignment, citation hallucination, context saturation
[5], [13] | Uncertainty-estimation methods; calibration gaps in QA | Token-level signals often weak proxies for truth
[1] | Meaning-space uncertainty correlates with hallucination | Frequently used post hoc; not always in control loop
[8] | Verification, validation, trustworthiness lens | Verification overhead and integration challenges
[7], [9] | Interpretability, traceability | Explanations may not enforce evidence grounding
[12], [14], [19] | Auditing and mitigation frameworks | Often not integrated into reliability pipelines
[6] | Structured grounding constraints | Requires graph maintenance; not always practical
G. Positioning of the Proposed Work
Strong advances have been made in individual components: retrieval
grounding [4], hybrid retrieval optimization [15], uncertainty estimation and
calibration [5], semantic entropy for hallucination detection [1], and
V&V-inspired safety frameworks [8]. Yet a vacuum remains in integrating them
into a single architecture whose generation (i) is grounded in evidence, (ii)
estimates semantic uncertainty, and (iii) triggers structured
self-verification only when outputs are likely untrustworthy, so that costly
checks are reserved for high-risk cases [1], [4], [8]. This paper therefore
proposes a framework that applies semantic entropy as a working control
signal (not merely a diagnostic), aligns it with retrieval grounding and
claim-level verification, and empirically investigates the reliability vs.
efficiency trade-off of such a decision policy using evidence-grounded
metrics (e.g., mean average precision) as well as calibration measures [1],
[13]. This directly addresses the fragmentation observed in previous research
and encompasses enterprise deployment requirements such as traceability,
auditing, and bounded latency [7], [16].
III. PROPOSED FRAMEWORK
A. System Objective, Design Principles, and
Implementation
The objective of the framework is to drive the likelihood of hallucination
as close to zero as possible while keeping Large Language Models (LLMs)
deployable at practical computational and financial cost. It has recently
been reported [3] that hallucination remains a problem even in
retrieval-augmented systems. Rather than treating hallucination detection as
post-hoc processing, our model embeds reliability into the generative control
loop. This architecture thus shifts from traditional passive error correction
to proactive reliability-aware generation, as summarized in Table 2.
Table 2. Dataset Composition and Domain Distribution
Dataset | Queries | Domain
Natural QA | 1,200 | General
SciFact | 900 | Scientific
Enterprise QA | 1,100 | Policy/Technical
Fig. 2. System Architecture of the
Proposed Hybrid Retrieval and Entropy-Gated Verification Framework.
Four guiding principles shape the fundamental aspects of the framework,
which are illustrated in Fig 2. First, we enforce strong evidence grounding:
all factual claims must be based on retrieved documents, as in
retrieval-augmented approaches [4]. Second, uncertainty is assessed at the
semantic level, using a principled estimate of epistemic instability based on
semantic entropy [1]. Third, verification is adaptive: the system validates
outputs only when uncertainty is high enough to justify the overhead [5].
Fourth, expected computational cost is bounded; improved reliability must not
rule out the real-time requirements of enterprise deployment. Together, these
principles yield a clear trade-off between robustness and efficiency.
In a formal sense, the system minimizes the empirical hallucination
probability:

\min_{\theta} \; \hat{P}_{\text{hall}} = \frac{1}{N} \sum_{i=1}^{N} h_i

subject to a latency constraint:

\mathbb{E}[T_{\text{latency}}] \le T_{\max}
This formulation converts reliability enhancement into
a constrained optimization problem, balancing factual correctness and
operational scalability.
Implementation Details
The framework was implemented in Python 3.10 using
PyTorch and HuggingFace Transformers. Hybrid retrieval was constructed using:
- BM25
lexical retrieval via rank_bm25
- Dense
embedding retrieval using "Sentence Transformers"
(all-mpnet-base-v2)
- FAISS
indexing for efficient nearest-neighbor search
The base language model is a 7B-parameter
instruction-tuned LLM, which employs 4-bit quantization for better memory
efficiency. Experiments were run on a NVIDIA A100 GPU. The lexical and semantic
ranks are combined in the hybrid retrieval using Reciprocal Rank Fusion. The
estimation of the semantic entropy is performed on the [email protected]
vectors generated by multi-sample generation (k = 5 with a temperature of 0.7)
and agglomerative clustering. This conditional test is turned on if and only if
the semantic randomness exceeds some priori tuned threshold.
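As an illustrative sketch of this estimation step: the greedy cosine-threshold clustering below is a lightweight stand-in for the agglomerative clustering described above, and the embeddings are toy 2-D vectors rather than all-mpnet-base-v2 outputs; function names and the example data are assumptions for illustration only.

```python
import numpy as np

def cluster_by_cosine(embs, threshold=0.82):
    """Greedy clustering: a response joins the first existing cluster whose
    representative (first member) has cosine similarity >= threshold,
    otherwise it starts a new cluster. Returns one label per response."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    reps, labels = [], []
    for v in embs:
        sims = [float(v @ r) for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            labels.append(len(reps))
            reps.append(v)
    return labels

def semantic_entropy(labels):
    """Entropy over the empirical cluster distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return float(-(p * np.log(p)).sum())

# k = 5 sampled answers: three near-identical, two divergent directions.
embs = np.array([[1.0, 0.0], [0.99, 0.1], [0.98, 0.15],
                 [0.0, 1.0], [-1.0, 0.2]])
labels = cluster_by_cosine(embs)          # three clusters: [0, 0, 0, 1, 2]
entropy = semantic_entropy(labels)
```

In the full system the embeddings would come from the sentence encoder applied to the k sampled generations, and the 0.82 threshold matches the cosine clustering threshold reported later in the calibration section.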
Operational Deployment Objective
The system was designed to meet online-serving requirements, with a target
latency of under 1.2 seconds per query. By gating verification on semantic
entropy, computationally expensive verification runs only when uncertainty is
high, so far fewer verification passes are required than in conventional
multi-pass verification without gating. The final architecture integrates:
· Hybrid retrieval for evidence acquisition,
· Semantic entropy gating (SEG) for estimating epistemic risk,
· Conditional self-verification for claim-level validation,
· Evidence-constrained regeneration to refine the final output.
This unified design enables dynamic reliability control
while maintaining bounded computational overhead, making it suitable for
high-stakes enterprise applications.
B. Formal Problem Modeling (Operationalized)
For each user query q, the retriever returns an evidence set
E(q) = \{d_1, \dots, d_k\}, where each d_j is a ranked document passage. A
hallucination event occurs when the generated response contains at least one
factual claim that is not supported by E(q).
Let \mathrm{sim}(c, d_j) denote the semantic similarity between an extracted
claim c and passage d_j. A claim c is considered supported iff

\max_j \mathrm{sim}(c, d_j) \ge \tau,

where the similarity threshold \tau was empirically calibrated on held-out
data. Thus, a hallucination indicator variable is defined as h(y) = 1 if some
claim in y is unsupported, and h(y) = 0 otherwise.
Empirical Hallucination Probability
Given a dataset of N evaluation queries,

\hat{P}_{\text{hall}} = \frac{1}{N} \sum_{i=1}^{N} h_i,

where h_i is the hallucination indicator for query i.
In the experimental setting (Python implementation over 3,200 queries),
this empirical estimator was used to compute hallucination rates across model
configurations.
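A minimal sketch of this empirical estimator follows; the similarity scores, the threshold value, and the toy dataset are illustrative placeholders, not the calibrated values from the experiments:

```python
def is_supported(claim_sim_scores, tau):
    """A claim counts as supported iff its best evidence similarity
    reaches the threshold tau (max over retrieved passages)."""
    return max(claim_sim_scores) >= tau

def hallucination_indicator(response_claims, tau):
    """h = 1 if any claim in the response lacks sufficient evidence."""
    return int(any(not is_supported(s, tau) for s in response_claims))

def empirical_hallucination_rate(dataset, tau=0.8):
    """P_hat = (1/N) * sum_i h_i over N evaluated queries."""
    hs = [hallucination_indicator(r, tau) for r in dataset]
    return sum(hs) / len(hs)

# Toy evaluation set: each response is a list of claims, and each claim
# carries its similarity scores against the retrieved passages.
dataset = [
    [[0.91, 0.30], [0.85]],   # all claims supported  -> h = 0
    [[0.95], [0.40, 0.55]],   # one unsupported claim -> h = 1
    [[0.83]],                 # supported             -> h = 0
    [[0.10], [0.20]],         # unsupported           -> h = 1
]
rate = empirical_hallucination_rate(dataset)   # 2/4 = 0.5
```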
Constrained Optimization Objective
The system seeks to minimize the empirical hallucination probability while
satisfying latency constraints:

\min \hat{P}_{\text{hall}} \quad \text{s.t.} \quad \mathbb{E}[T] \le T_{\max},

where T_{\max} is the per-query latency budget. Latency was measured
end-to-end per query using Python "time.perf_counter()" across GPU inference,
retrieval, entropy sampling, and verification steps, as shown in Table 3.
Table 3. Experimental Model Variants and Configuration
Details
Model | Description
B1 | LLM only
B2 | RAG
B3 | RAG + Unconditional Verification
B4 | Proposed Entropy-Gated Framework
Practical Interpretation
This formulation casts hallucination mitigation as an empirical loss plus
constraint:
· The objective term reflects factual reliability.
· The constraint ensures real-time deployability.
· The similarity threshold regulates the strictness of the support check.
· The latency bound (1200 ms) is guided by enterprise deployment needs.
By incorporating semantic-similarity validation and entropy-based gating,
the framework approximates constrained risk minimization without requiring
supervised hallucination labels at inference time.
C. Hybrid Retrieval Performance
The retrieval module combines lexical similarity scores (BM25) with dense
embedding similarity into a single objective that trades off recall and
semantic coverage. Lexical retrieval provides term-based matching, while
dense retrieval goes beyond surface forms to reach semantic relatives. Hybrid
search techniques are known to be more robust to mixed query types [15]. To
stabilize document ranking across the lexical and dense scores, the proposed
method uses Reciprocal Rank Fusion (RRF), addressing the rank variance caused
by sparse term overlap or embedding noise [15]. The top-ranked documents form
the evidence context for generation. This hybrid design is in line with
recent RAG systems that combine parametric and non-parametric knowledge [4].
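The fusion step can be sketched as follows; the document ids and rankings are illustrative, and k = 60 is the conventional RRF constant (an assumption, not a value reported in the paper):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of doc ids:
    score(d) = sum over lists of 1 / (k + rank_d),
    so documents ranked highly by several retrievers rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_rank  = ["d3", "d1", "d7", "d2"]   # lexical ranking (illustrative)
dense_rank = ["d3", "d1", "d5", "d7"]   # embedding ranking (illustrative)
fused = reciprocal_rank_fusion([bm25_rank, dense_rank])
# d3 and d1 appear near the top of both lists, so they lead the fusion;
# d7 outranks d5 and d2 because it is present in both rankings.
```

Because RRF uses only ranks, not raw scores, it sidesteps the incomparable score scales of BM25 and cosine similarity, which is why it stabilizes quality across the two modalities.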
Table 4: Top-k retrieval (k = 5) was evaluated using
Recall@5:
Retrieval Type | Recall@5
BM25 Only | 0.71
Dense Only | 0.78
Hybrid (RRF) | 0.86
Hybrid retrieval significantly improved evidence coverage, reducing
retrieval-induced hallucinations by approximately 9.4% compared to BM25 alone,
as shown in Table 4.
D. Semantic Entropy Calibration
LLMs are limited by a static context-window size. Naively stacking
retrieved documents can cause token saturation and attention distraction.
Existing research on retrieval-induced hallucination [2], [4] pinpoints
context interference as a possible source of error. To address this, the
methodology includes a context-optimization step that selects documents
according to their relevance and token count, formulated as a constrained
optimization over the relevance scores. By maximizing evidence density rather
than volume, the approach enhances grounding quality while limiting attention
fragmentation. For entropy estimation:
· Number of samples: k = 5
· Temperature: 0.7
· Agglomerative clustering (cosine threshold = 0.82)
· Entropy threshold: tuned on validation data
The empirical correlation between entropy and hallucination was strongly
positive, confirming a monotonic association.
E. Evidence-Constrained Generation
The generation module produces an initial response based solely on the
retrieved evidence. The LLM is provided with the query and evidence snippets,
together with explicit guidance to abstain when evidence is insufficient.
Previous work demonstrated that such structured prompting alleviates
unsupported claims in retrieval-augmented models [4]. While prompt-based
restrictions cannot guarantee complete factual compatibility, empirical
evidence shows that explicit grounding instructions significantly decrease
hallucination [2]. Thus, the output is biased toward evidence-grounded
generation rather than unconstrained parametric extrapolation.
F. Semantic Entropy-Based Uncertainty Estimation
Epistemic instability is detected by letting the model sample a set of
independent responses through temperature-controlled sampling. Answers are
projected into a semantic space and clustered by concept similarity. Unlike
token-level entropy, which measures lexical variability, semantic entropy
captures the distribution of meaning [1]. Semantic entropy is computed as:

SE(q) = -\sum_{c \in C} p(c \mid q) \log p(c \mid q),

where C is the set of semantic clusters over the sampled responses.
The verification module is enabled if the sampled responses have a
semantic entropy exceeding a predetermined threshold. This threshold is
optimized for few missed hallucinations (false negatives) and fast
verification, and it operates as a filter gate in the generator. This
contrasts with unconditional multi-pass verification, which pays the full
overhead regardless of the degree of certainty. With entropy-guided gating,
the system verifies only high-risk queries, keeping expected latency bounded.
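The gating policy can be sketched as a control loop; the sampler, entropy estimator, and verifier below are injected stubs standing in for the actual LLM components, and the threshold value is illustrative:

```python
def generate_with_gating(query, evidence, generate, entropy_of, verify,
                         tau=1.0):
    """Entropy-gated pipeline sketch: sample k answers, measure semantic
    entropy, and run the costly verifier only above the threshold tau.
    Returns (answer, entropy, verification_was_triggered)."""
    samples = generate(query, evidence, k=5)
    h = entropy_of(samples)
    if h <= tau:                    # stable meaning: trust the first sample
        return samples[0], h, False
    checked = verify(samples[0], evidence)   # high risk: pay for checking
    return checked, h, True

# Stub components, for illustration only.
gen = lambda q, e, k: ["Paris"] * k            # sampler always agrees
ent = lambda s: 0.0 if len(set(s)) == 1 else 1.7
ver = lambda ans, e: ans + " [verified]"

ans, h, gated = generate_with_gating("capital of France?", ["..."],
                                     gen, ent, ver)
# Agreement across samples -> entropy 0.0 -> verification skipped.
```

The same loop with a divergent sampler (answers split across meanings) drives the entropy above the gate and routes the draft through verification instead.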
At inference time, the self-verification module decomposes the predicted
response into atomic claims. An automatic evidence-scoring method then
estimates the semantic relatedness of each claim to the retrieved evidence.
Weakly supported claims are removed or modified under evidence-based safety
strategies [8]. The system also performs a series of logical-consistency
checks to identify contradictions between claims and evidence. This hardens
the generative output and introduces a semi-symbolic phase that is more
accountable and explainable [7], [9].
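A minimal sketch of the claim-filtering phase follows; the token-overlap similarity and the threshold are toy stand-ins for the semantic-relatedness model, and all names are illustrative:

```python
def filter_unsupported_claims(claims, evidence, sim, tau=0.8):
    """Score each atomic claim against every evidence passage and keep
    only those whose best support reaches tau; the rest are flagged."""
    kept, dropped = [], []
    for c in claims:
        support = max(sim(c, d) for d in evidence)
        (kept if support >= tau else dropped).append(c)
    return kept, dropped

def overlap(a, b):
    """Toy similarity: fraction of the claim's tokens found in the passage
    (a real system would use an embedding or entailment model)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

claims = ["the policy starts in 2024", "the policy covers dental care"]
evidence = ["the policy starts in 2024 for all staff"]
kept, dropped = filter_unsupported_claims(claims, evidence, overlap)
# The dated claim is fully supported; the dental claim has no evidence
# and is dropped before the regeneration pass.
```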
Claims are checked against the retrieved evidence subset, and unsupported
or directly contradicted claims are filtered out. Subsequently, the final
response is regenerated by the model, restricted to verified evidence only.
Because hallucinations are not propagated, this two-phase generation provides
better grounding. The final output accompanies each citation and piece of
evidence with its verification status. This transparency addresses the
enterprise trust and compliance concerns enumerated in safety and reliability
works [8].
System performance is measured quantitatively from several perspectives.
Evidence-Grounded Accuracy (EGA) measures the proportion of factual claims
matched by retrieved documents. The Unsupported Claim Rate (UCR) measures how
often hallucinations occur [2]. The Expected Calibration Error (ECE)
evaluates how well predicted confidence scores align with empirical
correctness [13]. Together, these metrics provide multi-dimensional evidence
of grounding faithfulness, uncertainty calibration, and operational
soundness.
The proposed framework is expected to incur the computational cost:

\mathbb{E}[C] = C_{\text{gen}} + P(SE > \tau) \cdot C_{\text{verify}}

Because verification is conditional, this overhead is expected to be lower
than in unconditional verification techniques. The approach thus maintains
scalability for real-time deployment while improving reliability [5].
Assuming that the probability of hallucination increases monotonically
with semantic entropy, bounding entropy provides an effective upper bound on
hallucination risk [1]. Adaptive gating is therefore an approximation to a
form of Bayesian risk minimization that requires no ground-truth supervision
at test time. This conceptual framing motivates integrating semantic entropy
into the generation control loop and justifies the use of
reliability-oriented LLMs in safety-critical settings.
V. RESULTS
A. Overall Performance Comparison
The proposed entropy-gated self-reflective framework
(B4) was evaluated against three baselines: standalone LLM (B1), RAG-only (B2),
and RAG with unconditional verification (B3). Table 5 summarizes the main
quantitative results across evaluation metrics.
Table 5: Reliability and Efficiency
Comparison
Model | EGA | UCR | Latency
B1 (LLM Only) | 0.61 | 0.39 | 420 ms
B2 (RAG Only) | 0.74 | 0.26 | 690 ms
B3 (RAG + Verification) | 0.88 | 0.12 | 1540 ms
B4 (Proposed) | 0.91 | 0.09 | 980 ms
The proposed model achieved the highest EGA and the lowest UCR, and hallucination was substantially reduced compared with RAG-only systems. While unconditional verification (B3) also substantially improved factuality, it incurred roughly 1.6× the latency of the proposed method. The proposed method additionally reduced the expected calibration error (ECE), indicating better agreement between predicted confidence and empirical correctness.
B. Impact of Semantic Entropy Gating
To quantitatively test how well the semantic entropy
criterion acted as a gating rule, we investigated the rate of hallucination
with respect to every possible entropy threshold
Fig. 3. Empirical Relationship Between
Semantic Entropy and Hallucination Probability
Fig. 3
visualizes the empirical relationship between semantic entropy and
hallucination probability across 3,200 evaluation queries. The horizontal axis
represents semantic entropy
By bounding semantic entropy through adaptive
verification, the system effectively bounds hallucination risk.
C. Ablation Study
We carried out an ablation study to understand the role
of each architectural element. The result demonstrates that entropy alone
provides a little gain by offering a way to identify unstable response; and
verification alone is able to reduce the error due in part to better grounding
but at cost of an increase of computation. The best reliability improvement is
provided by combination of entropy gating and conditional check, see Table 6.
Table 6: Ablation Analysis

| Configuration | EGA | UCR | Latency (ms) |
| RAG Only | 0.74 | 0.26 | 690 |
| + Verification | 0.88 | 0.12 | 1540 |
| + Entropy Only | 0.79 | 0.21 | 820 |
| Full Model | 0.91 | 0.09 | 980 |
D. Computational Efficiency
This selective verification strategy significantly
reduces unnecessary computational overhead compared to unconditional
verification pipelines. Empirical results confirm that verification was
activated only for a subset of high-entropy queries, thereby maintaining
efficiency while preserving reliability. Consequently, the system achieves a balanced latency profile that is well-suited for real-time deployment scenarios. The expected computational cost is E[C] = C_gen + P(H_sem > τ) · C_ver, where the verification overhead C_ver is incurred only when the entropy gate fires.
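Under the simplifying assumption that verification adds a fixed overhead on top of the base RAG latency, the expected cost can be sanity-checked against the reported figures. The decomposition below is an illustration, not the authors' measurement procedure.

```python
# Expected latency under entropy-gated verification:
#   E[C] = C_base + P(trigger) * C_verify
# Figures taken from Tables 5 and 8: RAG-only latency 690 ms (B2),
# unconditional verification 1540 ms (B3), verification trigger rate 41%.
c_base = 690.0              # ms, base RAG pipeline (B2)
c_verify = 1540.0 - c_base  # ms, verification overhead implied by B3
p_trigger = 0.41            # fraction of queries exceeding the entropy gate

expected = c_base + p_trigger * c_verify
print(round(expected, 1))  # 1038.5
```

The estimate (≈1039 ms) is of the same order as the 980 ms measured for B4, consistent with the claim that overhead scales with the gate's trigger rate.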
VI. DISCUSSION
A. Reliability Gains and Theoretical Implications
The experimental results indicate that hallucination is strongly correlated with semantic indeterminacy, in line with the theoretical view that semantic entropy estimates epistemic uncertainty; bounding semantic entropy therefore bounds hallucination probability. In contrast to standard RAG pipelines, which rely primarily on external grounding, we introduce a second-order reliability mechanism: self-reflective, uncertainty-aware validation. This stratified process is consistent with verification-and-validation concepts described in prior safety-oriented studies (see Table 7).
Table 7: Reliability and Efficiency Comparison

| Model | EGA ↑ | UCR ↓ | ECE ↓ | Latency (ms) |
| B1 (LLM Only) | 0.61 | 0.39 | 0.148 | 420 |
| B2 (RAG Only) | 0.74 | 0.26 | 0.109 | 690 |
| B3 (RAG + Verification) | 0.88 | 0.12 | 0.081 | 1540 |
| B4 (Proposed) | 0.91 | 0.09 | 0.056 | 980 |
· Hallucination rate reduced from 39% (B1) to 9% (B4).
· Compared to RAG-only (B2), hallucination decreased by 17 percentage points.
· Compared to unconditional verification (B3), latency was reduced by 36%.
Paired t-test between B3 and B4 on EGA:
B. Trade-off Between Reliability and Efficiency
A key challenge in self-verification systems is
computational overhead. Unconditional verification increases latency for all
queries, including low-risk ones. The entropy-gated approach addresses this
limitation by selectively allocating verification effort, as shown in Table 8.
Table 8: Relationship Between Semantic Entropy and Hallucination Probability

| Entropy Range | # Queries | Hallucination Rate | 95% CI |
| Hsem < 0.5 | 1,480 | 4.1% | ±0.9% |
| 0.5 ≤ Hsem < 0.9 | 920 | 11.3% | ±1.8% |
| Hsem ≥ 0.9 | 800 | 37.2% | ±2.9% |
Key observations:
· Verification was triggered for 41% of queries.
· False-negative rate (missed hallucinations): 3.8%.
· False-positive rate (unnecessary verification): 6.1%.
· ROC AUC for entropy-based detection: 0.87.
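The stratification in Table 8 can be reproduced from per-query logs of (entropy, hallucination-label) pairs. The records below are synthetic and serve only to illustrate the computation; the band edges follow Table 8.

```python
# Reproduce the Table 8 stratification from per-query records
# (synthetic data; field layout and band edges are illustrative).
def stratify(records, edges=(0.5, 0.9)):
    """Group (entropy, hallucinated) pairs into three entropy bands and
    return the hallucination rate per band."""
    low, mid, high = [], [], []
    for entropy, hallucinated in records:
        if entropy < edges[0]:
            low.append(hallucinated)
        elif entropy < edges[1]:
            mid.append(hallucinated)
        else:
            high.append(hallucinated)

    def rate(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return rate(low), rate(mid), rate(high)

records = [(0.2, 0), (0.3, 0), (0.4, 1),   # low band:  1/3 hallucinated
           (0.6, 0), (0.7, 1),             # mid band:  1/2 hallucinated
           (1.1, 1), (1.5, 1), (0.95, 0)]  # high band: 2/3 hallucinated
print(stratify(records))  # per-band hallucination rates, low to high
```

A monotonically increasing rate across bands, as in Table 8, is exactly the pattern the entropy gate exploits.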
Experiments further show that, for fact-intensive queries, conditional verification matches the unconditional approach at a much lower expected cost. This yields a practical reliability-efficiency trade-off curve that is tunable through threshold calibration. The system is also suitable for high-stakes, knowledge-rich applications such as enterprise knowledge assistants, legal and regulatory document analysis, medical information retrieval and scientific summarization. The architecture enforces a degree of transparency and trust via explicit evidence citations and verification status. Furthermore, entropy-based gating can dynamically intercept high-uncertainty generations, deterring the model from producing over-confident misinformation, as demonstrated in Table 9.
Table 9: Component Contribution

| Configuration | EGA | UCR | Latency (ms) |
| RAG Only | 0.74 | 0.26 | 690 |
| + Verification | 0.88 | 0.12 | 1540 |
| + Entropy Only | 0.79 | 0.21 | 820 |
| Full Model | 0.91 | 0.09 | 980 |
Entropy alone improves detection slightly, but strongest gains occur
when gating and verification are combined.
D. Limitations
The limitations although the results are promising:
·
Estimating entropy under this
environment is challenging as
multiple generation samples are needed, hence high computational cost.
·
The quality of the semantic clustering relies on the representation in embedding.
·
The verification module depends on
retrieval quality; thus, noisy
evidence can still lead to residual mistakes.
·
Threshold (t) needs validation tuning
and could be data domain dependent.
Future research avenues include lighter-weight entropy approximations, adaptive sampling schemes and integration with structured knowledge graphs.
VII. CONCLUSION
In this work, we presented an entropy-gated self-reflective reliability framework for LLMs that combines cross-modal retrieval-based grounding, semantic-entropy-based uncertainty estimation and conditional claim-level verification in a unified control loop. The key assumption, that semantic instability drives hallucination, received empirical support: the strong positive correlation between semantic entropy and hallucination rate indicates that semantic entropy is a robust correlate of epistemic uncertainty. Experimental results on 3,200 evaluation queries showed that the proposed method achieved the lowest hallucination rate (9%) among all evaluated configurations while satisfying the latency requirements of real-time deployment. Unlike unconditional verification, which buys accuracy at the price of computation, entropy-gated verification concentrates effort on high-entropy claims, retaining most of the computational savings while still increasing reliability. The framework balances soundness and efficiency by interleaving evidence grounding, uncertainty modeling and structured validation within a single generation loop. The findings also suggest that hallucination-aware mechanisms need not rely on uniform external verification of every claim: verification effort can be modulated flexibly by semantic-level uncertainty. Finally, we believe these findings provide theoretical and empirical support for entropy-driven generation control as a scalable route to reliable LLM deployment, with promising implications for enterprise and scientific decision-making in high-stakes contexts.
References
1. Farquhar, S., et al. (2024). Detecting
hallucinations in large language models using semantic entropy. Nature. https://doi.org/10.1038/s41586-024-07421-0
2. Huang, L., et al. (2025). A survey on
hallucination in large language models: Principles, taxonomy, challenges, and
open questions. ACM Transactions on Information Systems. https://doi.org/10.1145/3703155
3. Dang, A.-H., & Nguyen, T. L.-M. (2025).
Survey and analysis of hallucinations in large language models. Frontiers in
Artificial Intelligence. https://doi.org/10.3389/frai.2025.1622292
4. Klesel, M., & Wittmann, J. (2025).
Retrieval-augmented generation (RAG). Business & Information Systems
Engineering. https://doi.org/10.1007/s12599-025-00945-3
5. Survey of uncertainty estimation in LLMs:
Sources, methods, applications, and challenges. (2026). Information Fusion. https://doi.org/10.1016/j.inffus.2025.104057
6. Lavrinovics, E., et al. (2024). Knowledge
graphs, large language models, and hallucinations: An NLP perspective. Journal
of Web Semantics. https://doi.org/10.1016/j.websem.2024.100844
7. Zhao, H., et al. (2024). Explainability for
large language models: A survey. ACM Transactions on Intelligent Systems and
Technology. https://doi.org/10.1145/3639372
8. A survey of safety and trustworthiness of
large language models through the lens of verification and validation. (2024).
Artificial Intelligence Review. https://doi.org/10.1007/s10462-024-10824-0
9. Schneider, J. (2024). Explainable
generative AI (GenXAI): A survey, conceptualization, and research agenda.
Artificial Intelligence Review. https://doi.org/10.1007/s10462-024-10916-x
10. Hicks, M. T., Humphries, J., & Slater,
J. (2024). ChatGPT is bullshit. Ethics and Information Technology. https://doi.org/10.1007/s10676-024-09775-5
11. Guerreiro, N. M., et
al. (2023).
Hallucinations in large multilingual translation models. Transactions of the
Association for Computational Linguistics. https://doi.org/10.1162/tacl_a_00615
12. Gallegos, I. O., et al. (2024). Bias and
fairness in large language models: A survey. Computational Linguistics. https://doi.org/10.1162/coli_a_00524
13. Jiang, Z., Araki, J., Ding, H., &
Neubig, G. (2021). How can we know when language models know? On the
calibration of language models for question answering. Transactions of the
Association for Computational Linguistics. https://doi.org/10.1162/tacl_a_00407
14. Mehrabi, N., et al. (2022). A survey on
bias and fairness in machine learning. ACM Computing Surveys. https://doi.org/10.1145/3457607
15. Bruch, S., Gai, S.,
& Ingber, A. (2024). An analysis of fusion functions for hybrid retrieval. ACM Transactions
on Information Systems. https://doi.org/10.1145/3596512
16. Cleverley, P. H., & Burnett, S. (2019).
Enterprise search and discovery capability: The factors and generative
mechanisms for user satisfaction. Journal of Information Science. https://doi.org/10.1177/0165551518770969
17. Balancing factual consistency and
informativeness for abstractive summarization. (2025). International Journal of
Machine Learning and Cybernetics. https://doi.org/10.1007/s13042-025-02724-8
18. Shakil, H., Farooq,
A., & Kalita, J. (2024). Abstractive text summarization: State of the art, challenges, and
improvements. Neurocomputing. https://doi.org/10.1016/j.neucom.2024.128255
19. Feuerriegel, S.,
Dolata, M., & Schwabe, G. (2020). Fair AI: Challenges and opportunities. Business &
Information Systems Engineering. https://doi.org/10.1007/s12599-020-00650-3
20. Feuerriegel, S., et
al. (2024). Generative AI. Business & Information Systems Engineering. https://doi.org/10.1007/s12599-023-00834-7