Teaching RAG to Say 'I Don't Know'
The quality of a RAG system isn't defined by the moments it answers correctly. It's defined by whether it can recognize the moment it shouldn't answer at all and stay quiet. Most RAG implementations lack this second ability: ask anything, and they hand you something. This post walks through a retrieval layer that bakes "how much can we trust this answer" into the pipeline itself, and when needed says "I don't have information on this" without ever calling the LLM.
The examples come from a self-hosted RAG platform running in production; the language is Python and the database is PostgreSQL + pgvector.
The problem: high similarity, wrong answer
The classic RAG flow is simple:
# Naive RAG: always produces an answer
async def naive_answer(db, query: str) -> str:
    embedding = await embed(query)
    chunks = await vector_search(db, embedding, top_k=5)  # always returns 5
    context = "\n\n".join(c.content for c in chunks)
    return await llm.chat(
        system=f"Answer using the following context:\n{context}",
        user=query,
    )
Embed the query, pull the nearest k chunks, stuff them into the prompt, ask the LLM. This flow has a silent flaw. vector_search always returns "the nearest k results," but it never asks "is anything actually close enough?" Even when the knowledge base contains nothing about the topic, the nearest k chunks still come back and land in the prompt. Notice that naive_answer can never say "I don't know" under any condition; its return type doesn't even make that possible.
The conceptual mistake here is treating cosine similarity as a quality signal. Similarity is a ranking signal: is result A more relevant than result B? It answers that well. But "is result A relevant enough to answer this question?" is a different question, and similarity alone doesn't answer it. The "best" chunk at 0.65 cosine distance can still be entirely irrelevant in absolute terms.
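To make the ranking-versus-relevance distinction concrete, here is a minimal, self-contained sketch. The 2-D vectors, chunk names, and the helper `top_k` are made up for illustration and are not part of the platform's code; the point is only that a top-k search happily returns k results even when every one of them is far away.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query: list[float], corpus: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Rank every chunk by distance and return the k nearest, however far they are."""
    ranked = sorted(
        ((chunk_id, cosine_distance(query, vec)) for chunk_id, vec in corpus.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]

# Toy "embeddings" (hypothetical): none of these chunks points toward the query.
corpus = {
    "shipping": [0.0, 1.0],
    "billing": [-1.0, 0.2],
    "careers": [-0.5, -1.0],
}
query = [1.0, 0.0]

results = top_k(query, corpus, k=2)
for chunk_id, distance in results:
    print(f"{chunk_id}: distance={distance:.3f}")
```

The ranking is well defined ("shipping" is nearer than "careers"), yet both distances are far beyond any sensible cutoff. A ranked list alone cannot tell you that.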
The result is RAG's most dangerous mode: the model builds extremely confident sentences on top of weak context. The user never sees the source, only a fluent and wrong answer.
The fix is not to delegate "confidence" to the prompt (for example, by telling the model to "answer only based on the context"). That instruction helps, but it's a weak line of defense: it relies on the model complying. Far more robust is to measure confidence in the retrieval layer and skip the LLM entirely when it's low.
A two-tier threshold
A single threshold doesn't solve this, because we're making two distinct decisions:
- Is this chunk worth putting into the context? (noise cleanup)
- Should we even attempt an answer with the context we have? (the answer decision)
Different jobs, different thresholds:
# Minimum confidence for a chunk to enter the context (filters low-quality matches)
MIN_CHUNK_CONFIDENCE = 25.0
# If average confidence is below this, skip the LLM and reject directly
MIN_CONFIDENCE_GATE = 40.0
First we convert cosine distance into a 0-100 chunk score:
def _distance_to_confidence(distance: float, max_dist: float = 0.65) -> float:
    if distance <= 0:
        return 100.0
    if distance >= max_dist:
        return 0.0
    return round(max(0, (1 - distance) * 100), 1)
A linear transform; not a sophisticated calibration, and deliberately so. max_dist is a ceiling: anything beyond that distance counts as zero and is already dropped during retrieval. The goal here isn't to produce a true probability, but a consistent scale that's comparable against thresholds. On that scale, 25 ("don't put it in the context") and 40 ("don't answer at all") become meaningful boundaries.
Hybrid retrieval: vector + BM25, fused with RRF
Vector search captures meaning but is weak on exact matches: a product code, a proper name, or an SKU has no semantic neighbors; it either appears verbatim in a chunk or it doesn't. Keyword search is the opposite: strong on exact matches, weak on meaning. (Strictly speaking, the PostgreSQL ts_rank used below is not BM25, but it fills the same lexical role.) Using both is natural, but there's a problem: their scores aren't comparable. A cosine distance and a PostgreSQL ts_rank are not on the same scale and can't be summed.
Each search produces its own ranked list. The vector side, via pgvector and cosine distance:
distance_expr = Chunk.embedding.cosine_distance(query_vector)
stmt = (
    select(Chunk, distance_expr.label("distance"))
    .where(Chunk.embed_id == embed_id)
    .where(Chunk.chunk_type == "child")
    .order_by(distance_expr)  # smaller distance = more relevant
    .limit(top_k)
)
The keyword side, via PostgreSQL full-text search over a prefix-matching tsquery:
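# "return policy" -> "return:* & policy:*" (prefix matching: "return:*" also matches "returns", "returned")
# Note: with the "simple" config there is no stemming, so "policy:*" does not match "policies"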
ts_query = func.to_tsquery("simple", func.unaccent(prefix_expr))
rank_expr = func.ts_rank(Chunk.search_vector, ts_query)
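The snippet above takes prefix_expr as given. One plausible way to build it is sketched below; the helper name `to_prefix_tsquery` and its tokenization rule are assumptions for illustration, not the platform's actual code.

```python
import re

def to_prefix_tsquery(text: str) -> str:
    """Build an AND-ed prefix query string suitable for to_tsquery.

    Hypothetical helper: keeps only word characters, so tsquery operators
    typed by a user (&, |, !, :, parentheses) never reach PostgreSQL.
    """
    tokens = re.findall(r"\w+", text.lower())
    return " & ".join(f"{token}:*" for token in tokens)

print(to_prefix_tsquery("return policy"))
print(to_prefix_tsquery("Return   POLICY?!"))
```

An empty or punctuation-only query yields an empty string, in which case it is probably simplest to skip the keyword search and rely on the vector side alone.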
distance (near 0 = good) and ts_rank (large = good) neither point in the same direction nor share a scale. The fix is to fuse ranks, not scores: Reciprocal Rank Fusion.
def _reciprocal_rank_fusion(
    vector_results: list[tuple[str, float]],
    keyword_results: list[tuple[str, float]],
    k: int = 60,
) -> list[str]:
    scores: dict[str, float] = {}
    for rank, (chunk_id, _) in enumerate(vector_results):
        scores[chunk_id] = scores.get(chunk_id, 0) + 1.0 / (k + rank + 1)
    for rank, (chunk_id, _) in enumerate(keyword_results):
        scores[chunk_id] = scores.get(chunk_id, 0) + 1.0 / (k + rank + 1)
    return sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
The elegance of RRF is that it ignores the raw score value entirely. It uses only the rank: a chunk near the top of both lists gets a high total, one that appears in only a single list gets less. A concrete example (k=60):
Vector order: [A, B, C] Keyword order: [C, A, D]
A: 1/(60+0+1) + 1/(60+1+1) = 0.01639 + 0.01613 = 0.03252
C: 1/(60+2+1) + 1/(60+0+1) = 0.01587 + 0.01639 = 0.03226
B: 1/(60+1+1) = 0.01613
D: 1/(60+2+1) = 0.01587
Fused order: [A, C, B, D]
A and C each top only one of the two lists, yet they finish ahead of B and D because they appear in both. The constant k=60 (from the original RRF paper) softens the influence of the leading ranks; as k grows, the gaps between ranks shrink. The approach is scale-independent and resistant to outlier scores, which makes it a practical way to combine two heterogeneous search systems.
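The worked example is easy to reproduce. The sketch below is a simplified variant that takes plain lists of IDs (no raw scores, since RRF ignores them anyway) and returns the score map so the individual values can be inspected; the function name is illustrative.

```python
def rrf_scores(ranked_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank + 1) over every list in which an ID appears (rank is 0-based)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return scores

vector_order = ["A", "B", "C"]
keyword_order = ["C", "A", "D"]

scores = rrf_scores([vector_order, keyword_order])
fused = sorted(scores, key=scores.get, reverse=True)

for chunk_id in fused:
    print(f"{chunk_id}: {scores[chunk_id]:.5f}")
print("Fused order:", fused)
```

Because the function accepts any number of ranked lists, adding a third retriever later would not require changing the fusion step.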
After the fused ranking we filter out low-quality chunks, with one exception:
# Hand-written Q&A always ranks first, exempt from the minimum confidence filter
training_chunks = [c for c in chunks if c.chunk_metadata.get("source") == "training_data"]
doc_chunks = [c for c in chunks if c.chunk_metadata.get("source") != "training_data"]
# Drop document chunks below the threshold
doc_chunks = [c for c in doc_chunks if score_map.get(c.id, 0) >= MIN_CHUNK_CONFIDENCE]
Curated (hand-prepared) answers are exempt from this filter, and from the rejection gate we'll see next. The reasoning: if someone wrote the correct answer to this question by hand, the system's statistical confidence calculation shouldn't override that human decision.
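As a runnable illustration of that exemption, here is a small sketch. The `Chunk` dataclass is a stand-in for the real ORM model, and the IDs, sources, and scores are invented; only the filtering logic mirrors the snippet above.

```python
from dataclasses import dataclass, field

MIN_CHUNK_CONFIDENCE = 25.0

@dataclass
class Chunk:
    # Stand-in for the real model: just an ID and a metadata dict.
    id: str
    chunk_metadata: dict = field(default_factory=dict)

def filter_chunks(chunks: list[Chunk], score_map: dict[str, float]) -> list[Chunk]:
    """Curated chunks first and unfiltered; document chunks only above the threshold."""
    training = [c for c in chunks if c.chunk_metadata.get("source") == "training_data"]
    docs = [c for c in chunks if c.chunk_metadata.get("source") != "training_data"]
    docs = [c for c in docs if score_map.get(c.id, 0) >= MIN_CHUNK_CONFIDENCE]
    return training + docs

chunks = [
    Chunk("d1", {"source": "pdf"}),
    Chunk("t1", {"source": "training_data"}),
    Chunk("d2", {"source": "pdf"}),
]
score_map = {"d1": 62.0, "t1": 12.0, "d2": 0.0}  # invented scores

kept = filter_chunks(chunks, score_map)
print([c.id for c in kept])
```

The curated chunk t1 survives with a score of 12 and moves to the front, while the document chunk d2 is dropped.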
The rejection gate: never calling the LLM
This is the real decision. If average confidence is below the threshold, we don't build a prompt and go to the LLM; we reject directly:
avg_confidence = _calc_final_confidence(chunks)
has_training_data = any(c.chunk_metadata.get("source") == "training_data" for c in chunks)
# If confidence is too low, skip the LLM. Never reject when training data matches.
if not has_training_data and avg_confidence is not None and avg_confidence < MIN_CONFIDENCE_GATE:
    yield f"{json.dumps(meta, ensure_ascii=False)}\n"
    yield "[REJECTION]"
    yield await _get_rejection_message(embed, lang)
    return
This has three concrete benefits:
- No hallucination on rejected queries. Weak context never reaches the model, so the model can't build on it.
- Token savings. No chat completion call is made for a query that will be rejected. At high volume this is a serious cost line.
- Faster response. Rejection skips the slowest step (LLM generation), so it returns in tens of milliseconds rather than seconds.
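The gate decision itself can be isolated into a pure function, which makes it trivial to unit-test. The sketch below is an assumption about how the average might be computed: it takes a plain mean of the raw 0-100 chunk scores. Whatever aggregation you use, the value compared against MIN_CONFIDENCE_GATE has to live on that raw scale, otherwise the threshold of 40 loses its meaning. The function names are illustrative.

```python
from statistics import mean
from typing import Optional

MIN_CONFIDENCE_GATE = 40.0

def average_confidence(raw_scores: list[float]) -> Optional[float]:
    """Mean of the raw 0-100 chunk scores, or None when there is nothing to average."""
    return round(mean(raw_scores), 1) if raw_scores else None

def should_reject(raw_scores: list[float], has_training_data: bool) -> bool:
    """Mirror of the gate condition: curated matches are never rejected."""
    avg = average_confidence(raw_scores)
    return (not has_training_data) and avg is not None and avg < MIN_CONFIDENCE_GATE

print(should_reject([36.0, 38.5], has_training_data=False))  # weak context
print(should_reject([36.0, 38.5], has_training_data=True))   # curated answer present
print(should_reject([55.0, 45.0], has_training_data=False))  # solid context
```

Note that, like the original condition, this does not reject when there are no scores at all; that case needs its own handling upstream.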
Since the stream is SSE, we mark the rejection with a [REJECTION] sentinel; when the frontend sees it, it renders the response as a "no information" state instead of a normal answer. From the client's side, a rejected query looks like this:
{"sources": [], "response_time_ms": 48, "avg_confidence": 22.0}
[REJECTION]
I don't have enough information about this topic in my knowledge base.
Note response_time_ms: 48: because the LLM was never called, the response returned in under 50 ms. For comparison, the first token of an accepted query usually takes 800-2000 ms. The rejection message itself is chosen by query language: first a fixed fallback dictionary, then LLM translation if needed.
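On the consuming side, handling the sentinel takes only a few lines. The real frontend is presumably JavaScript; the sketch below uses Python for consistency with the rest of the post, and it assumes the three yielded pieces arrive as separate items in order (metadata line, optional sentinel, text). The function name and return shape are made up.

```python
import json

REJECTION_SENTINEL = "[REJECTION]"

def parse_stream(parts: list[str]) -> dict:
    """Split a finished stream into metadata, a rejected flag, and the message text."""
    meta = json.loads(parts[0])  # first item is the JSON metadata line
    rest = parts[1:]
    rejected = bool(rest) and rest[0].strip() == REJECTION_SENTINEL
    if rejected:
        rest = rest[1:]  # drop the sentinel, keep the human-readable message
    return {"meta": meta, "rejected": rejected, "text": "".join(rest).strip()}

parts = [
    '{"sources": [], "response_time_ms": 48, "avg_confidence": 22.0}\n',
    "[REJECTION]",
    "I don't have enough information about this topic in my knowledge base.",
]
result = parse_stream(parts)
print(result["rejected"], result["meta"]["response_time_ms"])
```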
_FALLBACK_REJECTION = {
    "tr": "Bu konuda bilgi tabanımda yeterli bilgi bulunmamaktadır.",
    "en": "I don't have enough information about this topic in my knowledge base.",
    "de": "Zu diesem Thema habe ich nicht genügend Informationen in meiner Wissensdatenbank.",
}
Final confidence is not the retrieval score
The confidence score we show to the user and to analytics is not the raw cosine score. This is a deliberate separation: the raw retrieval score is an internal metric, while the confidence shown to the user is a UX decision. Telling a user "42% confidence" is misleading, because that number isn't confidence in the sense they understand.
def _calc_final_confidence(chunks, is_rejected=False):
    scores = [getattr(c, "_confidence", None) for c in chunks]
    scores = [s for s in scores if s is not None]
    if not scores:
        return round(10.0 if not is_rejected else 5.0, 1)
    best_score = max(scores)
    if is_rejected:
        # Chunks found but couldn't answer -> 20-50 range
        return round(20 + (best_score / 100) * 30, 1)
    # Chunks found and answer given -> 70-100 range
    normalized = min(1.0, max(0.0, (best_score - 20) / 60))
    return round(70 + normalized * 30, 1)
The score first lands in a tier based on the outcome: answer given (70-100), chunks found but rejected (20-50), no chunks at all (5-10). The raw score of the best chunk only fine-tunes within that tier. This way the user interprets the number correctly: 80 really is "good," 35 really is "questionable."
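A few sample inputs show how the tiers behave. The variant below takes a bare list of scores instead of chunk objects so it can run standalone; the arithmetic is the same as in the function above, and the input values are arbitrary.

```python
def calc_final_confidence(scores: list[float], is_rejected: bool = False) -> float:
    # Standalone variant of the function above: scores are passed in directly.
    if not scores:
        return 5.0 if is_rejected else 10.0
    best = max(scores)
    if is_rejected:
        return round(20 + (best / 100) * 30, 1)  # rejected tier: 20-50
    normalized = min(1.0, max(0.0, (best - 20) / 60))
    return round(70 + normalized * 30, 1)  # answered tier: 70-100

cases = [
    ([50.0], False),        # answered, mid-strength best chunk
    ([80.0, 40.0], False),  # answered, strong best chunk
    ([95.0], False),        # answered, even stronger best chunk
    ([50.0], True),         # rejected despite a mid-strength chunk
    ([], True),             # rejected, nothing retrieved
]
for scores, rejected in cases:
    print(scores, "rejected" if rejected else "answered", "->", calc_final_confidence(scores, rejected))
```

Two properties fall out of the arithmetic: the rejected tier tops out at 50 and the answered tier starts at 70, so the two can never overlap, and the answered tier saturates at 100 once the best raw score reaches 80.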
How to tune these numbers
25, 40, 0.65, k=60 are not universal truths; they're values that settled in for our domain. The practical way to calibrate them for your own system:
- Log avg_confidence on real queries.
- Hand-label correct and incorrect answers, and plot the confidence histogram of each group.
- Put the gate where the two distributions intersect.
In practice, a simple sweep over the logged data shows the real cost of the threshold:
# Evaluate candidate gate values over labeled historical queries
def evaluate_gate(samples, gate):
    # samples: [(avg_confidence, was_correct), ...]
    rejected = [s for s in samples if s[0] < gate]
    answered = [s for s in samples if s[0] >= gate]
    false_rejects = sum(1 for c, ok in rejected if ok)  # rejected, but could have answered correctly
    hallucinations = sum(1 for c, ok in answered if not ok)  # answered, but wrong
    return false_rejects, hallucinations

for gate in (30, 35, 40, 45, 50):
    fr, hl = evaluate_gate(samples, gate)
    print(f"gate={gate}: missed={fr}, hallucinations={hl}")
The two columns in the output are exactly the trade-off you want to manage: as gate rises, hallucinations drop and missed (needlessly rejected) queries climb. The domain's tolerance for error sets this balance. A medical or legal assistant should reject aggressively; a general FAQ bot can be more generous.
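One way to turn "tolerance for error" into a number is to put a price on each kind of mistake and pick the gate with the lowest total. The sketch below is one possible approach, not something the platform is known to do; the cost weights and the nine labeled samples are invented purely to show the mechanics.

```python
def evaluate_gate(samples, gate):
    # samples: [(avg_confidence, was_correct), ...]
    false_rejects = sum(1 for conf, ok in samples if conf < gate and ok)
    hallucinations = sum(1 for conf, ok in samples if conf >= gate and not ok)
    return false_rejects, hallucinations

def pick_gate(samples, candidates, hallucination_cost=5.0, miss_cost=1.0):
    """Return the candidate gate with the lowest weighted error cost."""
    def cost(gate):
        false_rejects, hallucinations = evaluate_gate(samples, gate)
        return hallucination_cost * hallucinations + miss_cost * false_rejects
    return min(candidates, key=cost)

# Invented labeled data: (avg_confidence, was_correct)
samples = [
    (22.0, False), (31.0, False), (37.0, False), (38.0, True), (44.0, False),
    (47.0, True), (55.0, True), (63.0, True), (71.0, True),
]
candidates = (30, 35, 40, 45, 50)

strict = pick_gate(samples, candidates, hallucination_cost=5.0, miss_cost=1.0)
lenient = pick_gate(samples, candidates, hallucination_cost=1.0, miss_cost=5.0)
print(f"strict domain -> gate={strict}, lenient domain -> gate={lenient}")
```

With a wrong answer priced at five times a missed one, the gate lands higher; flip the weights and it moves down. The same data produces different thresholds depending on what a mistake costs in your domain.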
Closing
The path to better confidence in RAG is usually sought in "better embeddings," "a bigger model," "a smarter prompt." Yet the highest-leverage intervention is often simpler: teach the system to stay quiet. Instead of leaving confidence to the LLM's inner world, put a deterministic gate in the retrieval layer; that gate is cheap, testable, and predictable.
A system that can say "I can't answer this" earns more trust than one that answers everything. Because once a user gets a wrong-but-confident answer, they lose trust in the correct answers too.