Your RAG bot is confidently wrong and you will not notice until a customer calls
The chatbot had been running in production for six weeks before anyone noticed it was quoting a price list from nine months ago. Not occasionally. Every time someone asked about the premium tier, it would cheerfully produce a number that was wrong by forty percent and cite the source document with full confidence. The source document was real. It was just from before the November repricing. Somewhere in the vector store it sat next to the updated sheet, and the retriever kept picking it because the query "how much does the premium plan cost" matched the old document's phrasing better than the new one. Cosine similarity does not care about recency. The model did not know it was being lied to. Neither did we, for six weeks.
That is the specific failure mode of RAG with dirty data: not silence, not errors, not hallucinations in the traditional sense. Confident, sourced, plausible wrong answers. The LLM is doing exactly what it is supposed to do. The problem lives two steps upstream, in the ingestion pipeline, and you will not find it by looking at model outputs.
The assumption that breaks everything
Most teams treat RAG ingestion as a one-way door. Documents go in, chunks go into the vector store, the bot answers questions. The mental model is something between a search engine and a filing cabinet. You add files. The retriever finds relevant ones. Done.
The problem with this model is that it has no notion of time, no notion of conflict, and no notion of authority. A vector store full of chunks from different document versions is not a knowledge base. It is an archaeological site. Every layer is present simultaneously and the retriever excavates whichever layer has the highest similarity to the query, regardless of whether that layer describes a world that still exists.
In a well-curated corpus this is manageable. In a corpus assembled by a nightly n8n workflow that reads a shared Google Drive folder and appends everything it finds, it becomes a landmine. Shared drives accumulate documents the way email inboxes accumulate newsletters. Nobody deletes the Q3 2022 pricing deck because it might be useful someday. Nobody marks it as superseded. It just sits there, waiting to be embedded.
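To make the "no notion of time" point concrete, here is a toy sketch of similarity-only ranking. The vectors, filenames, and dates are invented for illustration and are not real embeddings; the only point is that the modification date never enters the score.

```javascript
// Toy 3-dimensional vectors stand in for real embeddings (invented numbers).
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical chunks: the old sheet happens to phrase things like the query does.
const chunks = [
  { source: 'pricing-old.pdf', modifiedAt: '2024-02-01', vector: [0.9, 0.4, 0.1] },
  { source: 'pricing-new.pdf', modifiedAt: '2024-11-15', vector: [0.7, 0.6, 0.3] },
];
const queryVector = [0.9, 0.35, 0.1];

// Rank purely by similarity. modifiedAt is carried along but never consulted.
const ranked = chunks
  .map(c => ({ source: c.source, modifiedAt: c.modifiedAt, score: cosine(queryVector, c.vector) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked.map(r => `${r.source} ${r.score.toFixed(3)}`));
// The older document ranks first because nothing in the scoring knows it is older.
```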
What the n8n workflow looked like before we fixed it
The original ingestion flow was three nodes: a Google Drive trigger that fired whenever a file changed, a document loader that split the content into chunks, and a Pinecone upsert that wrote everything to the vector store keyed by an ID built from the page number and the first 32 characters of each chunk.
// n8n Function node: naive chunk preparation
// Runs after the document loader splits text into chunks
const chunks = $input.all();

return chunks.map(chunk => ({
  json: {
    id: chunk.json.metadata.loc.pageNumber + '_' + chunk.json.pageContent.slice(0, 32),
    content: chunk.json.pageContent,
    metadata: {
      source: chunk.json.metadata.source,
    }
  }
}));
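Three things are wrong here. The chunk ID is built from the page number and the first 32 characters of the chunk, and nothing else. If a revision leaves those 32 characters untouched, the new chunk gets the same ID and Pinecone's upsert overwrites the old one, which is what you want. If the revision changes the opening words or pushes the text onto a different page, the new chunk gets a new ID and the old chunk survives next to it. Because the ID ignores the source file entirely, a copy re-uploaded under a different filename adds another set of chunks wherever its text differs, and unrelated documents overwrite each other whenever their chunks happen to open the same way. The metadata carries only the source filename, which tells the retriever nothing about when the document was valid. And there is no deduplication pass: every run ingests everything the trigger fires on, including files that were last modified two years ago and have not changed, and nothing removes a document's earlier chunks first.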
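A quick way to see both failure directions is to run the naive ID scheme on a few strings. This is a standalone sketch; the chunk texts and prices are hypothetical and only the ID construction is taken from the node above.

```javascript
// Same construction as the naive node: page number + first 32 characters.
function naiveChunkId(pageNumber, pageContent) {
  return pageNumber + '_' + pageContent.slice(0, 32);
}

// Hypothetical chunk texts, invented for illustration.
const oldPricing = naiveChunkId(1, 'The premium plan costs $50 per month, billed annually.');
const newPricing = naiveChunkId(1, 'Premium is now $70 per month, billed annually.');
const otherDoc = naiveChunkId(1, 'The premium plan costs $50 per month in the partner programme.');

// The revised chunk opens differently, so it gets a new ID and the old one is never overwritten.
const oldSurvives = oldPricing !== newPricing;

// A chunk from an unrelated document shares the first 32 characters, so it would overwrite the old one.
const unrelatedCollide = oldPricing === otherDoc;

console.log({ oldSurvives, unrelatedCollide });
```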
We ran this for three weeks before noticing that the bot's answers about integration limits were citing a document from before we raised the limits. The retriever was not broken. It was retrieving exactly what it found.
The correct model: ingestion is a pipeline with validation gates
The fix is not to be more careful about which documents you add. That is a policy problem and policies fail. The fix is to make the ingestion pipeline structurally incapable of producing ambiguous state in the vector store.
That means three things: a canonical document identifier that is stable across updates, freshness metadata attached to every chunk at write time, and a deduplication step that removes prior versions of a document before inserting the new one.
// n8n Function node: validated chunk preparation
// Requires one item per chunk, in document order, with:
//   item.json.fileId (Drive ID), item.json.modifiedTime, item.json.content
// On self-hosted n8n, require('crypto') only works if built-in modules are
// allowed (NODE_FUNCTION_ALLOW_BUILTIN=crypto).
const crypto = require('crypto');
const items = $input.all();
const results = [];
// Per-document chunk counters, so indexes restart at 0 for every file
const chunkCounters = new Map();

for (const item of items) {
  const { fileId, modifiedTime, content, mimeType } = item.json;

  // Reject near-empty chunks and non-text files that slipped through
  if (!fileId || !content || typeof content !== 'string' || content.trim().length < 50) {
    continue;
  }

  // Canonical doc ID is the Drive fileId, not the filename.
  // Filenames change. Drive IDs do not.
  const docId = fileId;

  // Chunk-level content hash: lets you spot identical text that was
  // ingested under a different file ID or filename
  const contentHash = crypto
    .createHash('sha256')
    .update(content.trim())
    .digest('hex')
    .slice(0, 16);

  // Chunk ID = docId + per-document chunk index, so a new version of the
  // same doc produces the same chunk IDs and the Pinecone upsert overwrites
  // the old ones. A shorter new version leaves trailing chunks behind, so
  // delete the document's existing chunks before upserting.
  const chunkIndex = chunkCounters.get(docId) ?? 0;
  chunkCounters.set(docId, chunkIndex + 1);
  const chunkId = `${docId}_${chunkIndex}`;

  results.push({
    json: {
      id: chunkId,
      content: content,
      metadata: {
        docId,
        contentHash,
        source: item.json.name,
        // When this chunk was written to the vector store
        ingestedAt: new Date().toISOString(),
        // Drive's own modified time is more reliable than ingestion time
        documentModifiedAt: modifiedTime,
        // Numeric copy for freshness filters: Pinecone's range operators
        // ($gte, $lte) compare numbers, not ISO strings
        documentModifiedAtMs: Date.parse(modifiedTime),
        // Keep the mime type; useful for debugging why a PDF chunked badly
        mimeType,
      }
    }
  });
}

return results;
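The critical piece is the docId derived from the Drive file ID, not the filename. The filename is what humans write on files. The file ID is what the system assigns at creation, and it survives renames, moves, and new revisions. When you key your chunks off the file ID, a revised "pricing-2025.pdf" uploaded as a new version of the same Drive file generates the same chunk IDs even if someone later renames it "pricing-2026-v2.pdf", and Pinecone's upsert semantics replace the old chunks rather than accumulating them. Two caveats. A file uploaded separately, rather than as a new version of the existing file, gets a new file ID and therefore a second set of chunks that persists until the old file is removed and the store is reconciled. And upsert replaces only the IDs it is given: if the new version splits into fewer chunks than the old one, the leftover high-numbered chunks stay in the index unless you delete the document's chunks before writing the new ones.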
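Upsert on its own does not cover the case where a new version produces fewer chunks than the old one. A minimal sketch of the missing piece, assuming chunk IDs follow the `docId_index` pattern with indexes restarting at 0 per document, and assuming you record each document's previous chunk count somewhere (a hypothetical side table; nothing in the workflow above stores it yet):

```javascript
// Returns the chunk IDs that an upsert of the new version would NOT overwrite.
// previousChunkCount is assumed to come from your own bookkeeping.
function staleChunkIds(docId, previousChunkCount, newChunkCount) {
  const ids = [];
  for (let i = newChunkCount; i < previousChunkCount; i++) {
    ids.push(`${docId}_${i}`);
  }
  return ids;
}

// 'driveFile123' is a made-up file ID. The old version had 12 chunks, the new one has 9.
const toDelete = staleChunkIds('driveFile123', 12, 9);
console.log(toDelete);
// [ 'driveFile123_9', 'driveFile123_10', 'driveFile123_11' ]
```

Feed the result to a delete-by-ID call before or after the upsert. If you do not track chunk counts, deleting everything for the docId first and then inserting is the simpler, slightly more expensive alternative.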
You also need a cleanup step for documents that have been deleted or archived from the source. Upsert handles updates; deletion requires an explicit delete call. Add a reconciliation node that runs weekly, lists all doc IDs currently in the vector store, compares them against the live Drive folder, and deletes the orphans.
// n8n Function node: orphan detection
// Runs after fetching the live Drive file list (Google Drive List Files)
// and the stored doc IDs (from Pinecone list or a separate metadata store)
const driveIds = new Set($('Fetch Drive Files').all().map(i => i.json.id));

// The store may return one entry per chunk, so collapse to unique doc IDs
const storedIds = [...new Set($('Fetch Stored Doc IDs').all().map(i => i.json.docId))];
const orphans = storedIds.filter(id => !driveIds.has(id));

// Pass orphan IDs to a Pinecone delete node
return orphans.map(docId => ({ json: { docId } }));
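One thing I would put in front of any automated delete: a guard for the case where the Drive listing itself is wrong (an expired credential, a moved folder), because an empty listing makes every stored document look like an orphan. The sketch below is a standalone function, not part of the original workflow, and the 20 percent threshold is an arbitrary assumption to tune.

```javascript
// Decide whether a reconciliation run is safe to execute automatically.
// maxDeleteRatio is a made-up default; pick one that matches your archive rate.
function planOrphanDeletion(driveIds, storedDocIds, maxDeleteRatio = 0.2) {
  const live = new Set(driveIds);
  const stored = [...new Set(storedDocIds)];
  const orphans = stored.filter(id => !live.has(id));

  // An empty Drive listing almost certainly means the fetch failed.
  if (stored.length > 0 && live.size === 0) {
    return { action: 'abort', reason: 'Drive listing came back empty', orphans: [] };
  }
  // Deleting a large share of the index in one run deserves a human look.
  if (stored.length > 0 && orphans.length / stored.length > maxDeleteRatio) {
    return { action: 'review', reason: 'orphan share exceeds threshold', orphans };
  }
  return { action: 'delete', reason: 'within limits', orphans };
}

console.log(planOrphanDeletion(['a', 'b', 'c', 'd', 'e', 'f'], ['a', 'b', 'c', 'd', 'e', 'f', 'x', 'x']));
```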
Staleness filtering at retrieval time
Even with clean ingestion, you want the retrieval step to be able to reason about document age. Some questions are time-sensitive. Asking "what are the current API rate limits" should prefer documents modified in the last 90 days over ones from two years ago, even if the older document scores a slightly higher similarity.
Pinecone and most other vector databases support metadata filtering. Use it.
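// n8n HTTP Request node body: Pinecone query with freshness filter
// Set as expression so the cutoff computes at runtime.
// Pinecone's range operators ($gt, $gte, $lt, $lte) compare numbers only, so
// the filter targets a numeric epoch-milliseconds field, documentModifiedAtMs,
// which the ingestion step has to write as Date.parse(modifiedTime). An ISO
// string such as documentModifiedAt cannot be range-filtered.
{
  "vector": {{ JSON.stringify($json.embedding) }},
  "topK": 6,
  "includeMetadata": true,
  "filter": {
    "documentModifiedAtMs": {
      "$gte": {{ Date.now() - 90 * 24 * 60 * 60 * 1000 }}
    }
  }
}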
This is a blunt instrument and you should know it. A 90-day cutoff will exclude a perfectly valid architecture document that has not been touched in two years because it is still correct. You need to decide which document categories are time-sensitive and apply the filter selectively. Pricing sheets: filter aggressively. API changelogs: filter to recent. Architectural decision records: do not filter at all. That decision belongs in the document metadata, set at ingestion time.
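One way to move that decision into the metadata is to compute an expiry per chunk at ingestion and filter on the expiry rather than on raw age. This is a sketch under assumptions: the category names, the day counts, and the `category` and `staleAfterMs` fields are all hypothetical, and the numeric fields exist because Pinecone's range operators only work on numbers.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Far-future sentinel for documents that should never age out.
const NEVER_STALE = Date.UTC(9999, 0, 1);

// Hypothetical policy: days a document stays trustworthy after its last edit.
const FRESHNESS_DAYS = {
  pricing: 90,
  changelog: 180,
  architecture: null, // never filter by age
};

// Ingestion side: attach this to every chunk's metadata.
function freshnessMetadata(category, modifiedTimeIso) {
  const modifiedMs = Date.parse(modifiedTimeIso);
  if (Number.isNaN(modifiedMs)) {
    throw new Error(`Unparseable modifiedTime: ${modifiedTimeIso}`);
  }
  const known = Object.prototype.hasOwnProperty.call(FRESHNESS_DAYS, category);
  const days = known ? FRESHNESS_DAYS[category] : null; // unknown categories never expire
  return {
    category,
    documentModifiedAtMs: modifiedMs,
    staleAfterMs: days === null ? NEVER_STALE : modifiedMs + days * DAY_MS,
  };
}

// Retrieval side: one filter works for every category.
function freshnessFilter(nowMs = Date.now()) {
  return { staleAfterMs: { $gte: nowMs } };
}

console.log(freshnessMetadata('pricing', '2024-01-01T00:00:00.000Z'));
console.log(freshnessFilter());
```

The query no longer needs to know which category it is asking about. A pricing chunk drops out 90 days after its last edit; an architecture record stays retrievable indefinitely.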
What you need in production observability
The first thing I added after fixing the ingestion pipeline was chunk logging. Every time the retriever returns results, the full list of retrieved chunks, their source documents, their similarity scores, and their documentModifiedAt timestamps goes to a structured log. Not a sample. Every query.
This is how you catch the next version of this problem before a customer does. A query whose best chunk consistently scores below 0.75 is a query your knowledge base does not actually answer; treat 0.75 as a starting point, because the right threshold depends on your embedding model and similarity metric. A query that retrieves chunks from three different document versions is a sign your deduplication failed somewhere. A chunk with a documentModifiedAt from eighteen months ago appearing in answers about current pricing is a signal that needs an alert, not a retrospective.
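// n8n Function node: structured retrieval log
// Assumes the Pinecone matches have been split into one item per match
const query = $('User Query').first().json.text;
const retrieved = $('Pinecone Query').all();

const scores = retrieved.map(r => r.json.score);
// Parse modification times, dropping chunks whose timestamp is missing or invalid
const modifiedTimes = retrieved
  .map(r => new Date(r.json.metadata?.documentModifiedAt).getTime())
  .filter(t => !Number.isNaN(t));

const logEntry = {
  timestamp: new Date().toISOString(),
  query,
  retrievedChunks: retrieved.map(r => ({
    id: r.json.id,
    score: r.json.score,
    source: r.json.metadata?.source,
    documentModifiedAt: r.json.metadata?.documentModifiedAt,
    // Only populated if the chunk text was stored in the metadata at upsert
    contentPreview: r.json.metadata?.content?.slice(0, 120),
  })),
  // null when nothing was retrieved (Math.min of an empty list is Infinity)
  minScore: scores.length ? Math.min(...scores) : null,
  // Oldest documentModifiedAt among the retrieved chunks, in epoch milliseconds
  maxAge: modifiedTimes.length ? Math.min(...modifiedTimes) : null,
};

// Send to your logging endpoint; we used a simple HTTP Request to Loki
return [{ json: logEntry }];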
The maxAge field in that log, which holds the oldest documentModifiedAt among the retrieved chunks, will tell you, within one production day, whether your deduplication is working. If you see chunks from before your last ingestion run for documents you know you updated, your chunk IDs are wrong.
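The three signals described above can be checked mechanically on every log entry. The sketch below is one possible shape for that check, not what we ran in production: the function name and option names are made up, the 0.75 and 90-day defaults are the figures from this post, and grouping by source filename is a simplification (a docId in the log would be more reliable).

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns human-readable warnings for one structured retrieval log entry.
function retrievalWarnings(entry, { minTopScore = 0.75, maxAgeDays = 90, nowMs = Date.now() } = {}) {
  const chunks = entry.retrievedChunks;
  if (chunks.length === 0) return ['no chunks retrieved'];

  const warnings = [];

  // Signal 1: even the best match is weak.
  const topScore = Math.max(...chunks.map(c => c.score));
  if (topScore < minTopScore) {
    warnings.push(`best score ${topScore.toFixed(2)} is below ${minTopScore}`);
  }

  // Signal 2: one source shows up with several modification times.
  const stampsBySource = new Map();
  for (const c of chunks) {
    if (!stampsBySource.has(c.source)) stampsBySource.set(c.source, new Set());
    stampsBySource.get(c.source).add(c.documentModifiedAt);
  }
  for (const [source, stamps] of stampsBySource) {
    if (stamps.size > 1) warnings.push(`${source} appears in ${stamps.size} versions`);
  }

  // Signal 3: a retrieved chunk is older than the allowed window.
  for (const c of chunks) {
    const ageDays = (nowMs - Date.parse(c.documentModifiedAt)) / DAY_MS;
    if (ageDays > maxAgeDays) warnings.push(`${c.id} is ${Math.floor(ageDays)} days old`);
  }
  return warnings;
}

// Hypothetical log entry, evaluated at a fixed "now" so the result is repeatable.
const sampleEntry = {
  retrievedChunks: [
    { id: 'a_0', score: 0.6, source: 'pricing.pdf', documentModifiedAt: '2024-01-01T00:00:00.000Z' },
    { id: 'a_1', score: 0.5, source: 'pricing.pdf', documentModifiedAt: '2024-06-01T00:00:00.000Z' },
  ],
};
const sampleWarnings = retrievalWarnings(sampleEntry, { nowMs: Date.parse('2024-07-01T00:00:00.000Z') });
console.log(sampleWarnings);
```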
What I look for when reviewing RAG ingestion code
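A chunk ID derived from content hash alone is a yellow flag. Content hash deduplication within a single document version is fine. Content hash as the primary key across versions means an edited chunk hashes to a new ID, so the upsert never overwrites the old chunk and both versions stay retrievable. A key built from only a content prefix fails the other way: two different chunks that open with the same words collide, and one silently replaces the other.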
Metadata that contains only a filename is a red flag. Filenames are human-assigned and humans are inconsistent. You want a system-assigned stable identifier plus a modification timestamp as separate fields, not concatenated into a string.
An ingestion workflow with no deletion or reconciliation step is a slow-motion incident. The vector store will drift from the source of truth at a rate proportional to how often your team archives or updates documents. Six weeks of drift gave us a bot confidently quoting wrong prices. Twelve months of drift in a regulatory context would be worse.
The question I ask in every review: if someone updates a document right now, what happens to all the chunks from the old version? If the answer is "it depends on the filename" or "they get an extra copy in there somewhere," the ingestion pipeline is not finished.
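That question can be turned into a test. The sketch below uses an in-memory stand-in for the vector store (the class and function names are hypothetical, and it models only IDs and metadata, not embeddings) to check that re-ingesting a shorter version of a document leaves nothing from the old version behind.

```javascript
// In-memory stand-in for a vector store. Only IDs and metadata matter here.
class FakeVectorStore {
  constructor() {
    this.records = new Map();
  }
  upsert(records) {
    for (const r of records) this.records.set(r.id, r);
  }
  deleteByDocId(docId) {
    for (const [id, r] of this.records) {
      if (r.metadata.docId === docId) this.records.delete(id);
    }
  }
  chunksFor(docId) {
    return [...this.records.values()].filter(r => r.metadata.docId === docId);
  }
}

// Chunk IDs follow the docId_index pattern used in the validated node.
function toRecords(docId, version, chunks) {
  return chunks.map((content, i) => ({
    id: `${docId}_${i}`,
    content,
    metadata: { docId, version },
  }));
}

// Delete-then-insert: the previous version is removed before the new one lands.
function ingest(store, docId, version, chunks) {
  store.deleteByDocId(docId);
  store.upsert(toRecords(docId, version, chunks));
}

const store = new FakeVectorStore();
ingest(store, 'doc-1', 1, ['old price', 'old limits', 'old appendix']);
ingest(store, 'doc-1', 2, ['new price', 'new limits']); // one chunk shorter

const leftovers = store.chunksFor('doc-1').filter(r => r.metadata.version !== 2);
console.log(leftovers.length); // 0
```

Remove the deleteByDocId call and the same test reports one leftover, the third chunk of version 1, which is exactly the chunk a plain upsert never touches.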