RAG citations: how Perplexity and ChatGPT do it (and how to do it in your app)

When your LLM answers from documents you provided, the user has to trust that the answer actually came from the documents and not from the model’s general knowledge or imagination. Citations are how you give them that trust. Done well, they let the user click through to the source paragraph in the original PDF. Done badly, the LLM hallucinates a citation that doesn’t exist or points at the wrong chunk.

This guide covers the two UI patterns the big consumer products use (Perplexity and ChatGPT), the two ways the citation data can flow through your code, and the failure modes to design against. It also includes working code for the pattern that holds up in production.

Why citations matter in RAG

Three reasons, in order of how often they come up:

The first is trust and verifiability. RAG answers look authoritative because they’re grounded in your documents. Without citations, the user has no way to tell which sentences in the answer came from the documents and which the model paraphrased or invented. Showing the source lets them check.

The second is debugging. When your RAG app gives a wrong answer, the first question is always “did the retriever find the right chunk?” Citations make that visible in the UI without you having to dig through logs.

The third is compliance. For legal, medical, financial, or enterprise use cases, “where did this answer come from” isn’t a nice-to-have. Many internal RAG deployments treat citations as non-negotiable.

The two UI patterns

Inline footnotes (Perplexity-style)

The answer text has numbered markers interleaved with the prose, and the sources appear in a panel or list at the bottom. Hovering a marker shows the source preview.

The governing law of this contract is Delaware [1], with arbitration
in San Francisco [2]. EU customers have an override that applies
Irish law instead [1].

Sources
─────────
[1] contract.pdf p.12, "9.1 Governing Law"
[2] contract.pdf p.13, "9.2 Arbitration"

This is what Perplexity, Claude (web search), and most modern AI search products do. The benefit is granular attribution: the user can see which specific claim came from which source. The cost is UX complexity, since the markers have to map to real chunks reliably.

Bibliography at the bottom (ChatGPT-style)

The answer is plain prose with no inline markers, and the sources are listed in a “Sources” section at the end of the response. ChatGPT’s browsing mode and earlier RAG demos default to this.

The governing law of this contract is Delaware, with arbitration in
San Francisco. EU customers have an override that applies Irish law
instead.

Sources
─────────
- contract.pdf p.12, "9.1 Governing Law"
- contract.pdf p.13, "9.2 Arbitration"

Simpler to render, less precise about which claim came from where. Fine for short answers, less useful as the answer length grows.

Both, in practice

Most production RAG apps end up with a hybrid: inline markers on specific claims plus a bibliography panel. Perplexity does this explicitly. The UI work is the same as inline footnotes. You just also render the source list.

Where the citation data comes from

Two strategies, and the choice between them is the most important decision in RAG citations.

Strategy A: from the retriever (recommended)

Your retriever already knows which chunks it returned. You attach a number to each chunk before sending them to the LLM, then re-map any [N] the LLM produces back to the chunk you sent.

You sent the LLM:
  Source [1]: "...Delaware..."  (contract.pdf p.12)
  Source [2]: "...JAMS in San Francisco..." (contract.pdf p.13)
  Source [3]: "...Ireland (for EU customers)..." (contract.pdf p.42)

The LLM returned:
  "Governing law is Delaware [1], with arbitration in
  San Francisco [2]..."

You re-map: [1] → contract.pdf p.12, [2] → contract.pdf p.13

This is reliable because the citation numbers are your numbers, indexed against a real list of retrieved chunks. If the LLM invents a [5] you didn’t send, validation catches it, because 5 isn’t in your list.

Strategy B: from the LLM (less reliable)

You tell the LLM “answer the question and cite your sources” without providing numbered chunks. The model invents its own citations based on its understanding of what’s in the prompt.

This works for big models (GPT-4 class and up) most of the time, but the failure mode is hard to detect: the model can hallucinate page numbers, invent paragraph references, or attribute a claim to the wrong source. You have no ground truth to validate against, so wrong citations look the same as right ones.

Use Strategy A. The reliability difference is large enough that even mid-sized models with explicit chunk numbering will outperform big models that are guessing.

The working pattern

Here’s the Strategy A pattern end to end, with RedHop providing the retrieval and citation metadata.

import re
import redhop
from openai import OpenAI

QUERY = "What law governs this contract, and where is arbitration held?"

doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context(QUERY)

# 1. Number the chunks RedHop returned, so we can re-map citation
#    markers back to them.
numbered = [(i + 1, c) for i, c in enumerate(ctx.citations)]
sources_block = "\n".join(
    f"[{n}] {c['text']}" for n, c in numbered
)

# 2. Build a prompt that tells the LLM to cite using the numbers.
prompt = f"""Answer the question using only the numbered sources below.
Cite each claim using [N] markers, where N is the source number.
Do not invent sources; only cite numbers that appear in the list.

Sources:
{sources_block}

Question: {QUERY}"""

response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
# content can be None (for example on a refusal), so fall back to "".
answer = response.choices[0].message.content or ""

# 3. Validate: any [N] the model produced must map to a real source.
valid_ids = {n for n, _ in numbered}
cited = {int(m.group(1)) for m in re.finditer(r"\[(\d+)\]", answer)}
hallucinated = cited - valid_ids
if hallucinated:
    # Strip hallucinated markers or flag them in the UI.
    for n in hallucinated:
        answer = answer.replace(f"[{n}]", "")
    print(f"warning: dropped hallucinated citation(s): {hallucinated}")

# 4. Render: answer + sources used.
print(answer)
print("\nSources")
for n, c in numbered:
    if n in cited - hallucinated:
        page = f" p.{c['page']}" if c.get("page") else ""
        heading = f", {c['heading']}" if c.get("heading") else ""
        print(f"[{n}] {c['source']}{page}{heading}")

import { Document } from "redhop";
import OpenAI from "openai";

const QUERY = "What law governs this contract, and where is arbitration held?";

const doc = Document.fromFile("contract.pdf");
const ctx = doc.context(QUERY);

const numbered = ctx.citations.map((c, i) => ({ n: i + 1, ...c }));
const sourcesBlock = numbered.map(c => `[${c.n}] ${c.text}`).join("\n");

const prompt = `Answer the question using only the numbered sources below.
Cite each claim using [N] markers, where N is the source number.
Do not invent sources; only cite numbers that appear in the list.

Sources:
${sourcesBlock}

Question: ${QUERY}`;

const response = await new OpenAI().chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: prompt }],
});
// content can be null (for example on a refusal), so fall back to "".
let answer = response.choices[0].message.content ?? "";

const validIds = new Set(numbered.map(c => c.n));
const cited = new Set(
  Array.from(answer.matchAll(/\[(\d+)\]/g)).map(m => parseInt(m[1], 10))
);
const hallucinated = [...cited].filter(n => !validIds.has(n));
for (const n of hallucinated) {
  answer = answer.replaceAll(`[${n}]`, "");
}
if (hallucinated.length) {
  console.warn(`dropped hallucinated citations: ${hallucinated}`);
}

console.log(answer);
console.log("\nSources");
for (const c of numbered) {
  if (cited.has(c.n) && !hallucinated.includes(c.n)) {
    const page = c.page ? ` p.${c.page}` : "";
    const heading = c.heading ? `, ${c.heading}` : "";
    console.log(`[${c.n}] ${c.source}${page}${heading}`);
  }
}

use redhop::read_file;
use regex::Regex;
use std::collections::HashSet;
use async_openai::{Client, types::{
    CreateChatCompletionRequestArgs,
    ChatCompletionRequestUserMessageArgs,
}};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let query = "What law governs this contract, and where is arbitration held?";

    let mut doc = read_file("contract.pdf")?;
    let ctx = doc.context(query)?;

    let numbered: Vec<_> = ctx.citations.iter().enumerate()
        .map(|(i, c)| (i + 1, c))
        .collect();
    let sources_block = numbered.iter()
        .map(|(n, c)| format!("[{}] {}", n, c.text))
        .collect::<Vec<_>>().join("\n");

    let prompt = format!(
        "Answer the question using only the numbered sources below.\n\
        Cite each claim using [N] markers, where N is the source number.\n\
        Do not invent sources; only cite numbers that appear in the list.\n\n\
        Sources:\n{sources_block}\n\nQuestion: {query}"
    );

    let req = CreateChatCompletionRequestArgs::default()
        .model("gpt-4o-mini")
        .messages([ChatCompletionRequestUserMessageArgs::default()
            .content(prompt).build()?.into()])
        .build()?;
    let mut answer = Client::new().chat().create(req).await?
        .choices[0].message.content.clone().unwrap_or_default();

    let valid_ids: HashSet<usize> = numbered.iter().map(|(n, _)| *n).collect();
    let re = Regex::new(r"\[(\d+)\]")?;
    let cited: HashSet<usize> = re.captures_iter(&answer)
        .filter_map(|c| c[1].parse().ok()).collect();
    let hallucinated: Vec<_> = cited.difference(&valid_ids).copied().collect();
    for n in &hallucinated {
        answer = answer.replace(&format!("[{n}]"), "");
    }

    println!("{}", answer);
    println!("\nSources");
    for (n, c) in &numbered {
        if cited.contains(n) && !hallucinated.contains(n) {
            print!("[{n}] {}", c.source);
            if let Some(p) = c.page { print!(" p.{p}"); }
            if let Some(h) = &c.heading { print!(", {h}"); }
            println!();
        }
    }
    Ok(())
}

Four steps:

1. Number the chunks the retriever returned. RedHop’s ctx.citations is the right list to number: it’s exactly what made it into the prompt context.
2. Tell the LLM, in the prompt, to cite using these numbers. Be explicit about not inventing new ones.
3. Validate the output: any [N] the model wrote that doesn’t appear in your numbered list is a hallucination. Strip it or flag it for the user.
4. Render: the answer with surviving markers, plus a sources list of just the chunks the LLM actually cited.

The failure modes to design against

Hallucinated citation numbers. The LLM cites [5] when you sent it three sources. Caught by the validation step in the pattern above.

Citation drift. The LLM cites a number, but the claim it attaches the citation to doesn’t actually appear in that chunk. Harder to catch automatically. The only real defense is to keep chunks small and specific so the chunk content is closely related to the citation context. If you suspect this is happening, log ctx.report along with the LLM output and audit a sample of answers.
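The audit suggested above is easier if each cited sentence sits next to the chunk it points at. The following is a minimal sketch, not part of RedHop: `audit_rows` is a hypothetical helper, `sources` is assumed to be a plain dict from citation number to chunk text (the same numbering you sent to the LLM), and the sentence splitter is deliberately naive. It also assumes the model puts markers before the sentence's final punctuation.

```python
import re

MARKER = re.compile(r"\[(\d+)\]")


def audit_rows(answer, sources):
    """Pair each cited sentence with the text of the chunk it cites.

    `sources` maps citation number -> chunk text (assumed shape).
    """
    rows = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        for match in MARKER.finditer(sentence):
            n = int(match.group(1))
            if n in sources:  # unknown numbers are left to the validation step
                rows.append({"sentence": sentence, "n": n, "chunk": sources[n]})
    return rows
```

Writing these rows to a log or spreadsheet gives a reviewer the claim and its supposed evidence side by side, which is usually enough to spot drift in a sample of answers.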

Wrong page numbers. The user clicks a citation expecting the quoted phrase on page 12, but it is actually on page 13: the chunk straddled a page boundary, and the citation carries only the page where the chunk starts. RedHop reports the chunk’s starting page. Chunks have an n_tokens field if you need to estimate the actual span. For high-stakes use cases, render a page range (p.12–13) rather than a single page.
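One way to turn a starting page and a token count into a range is a worst-case estimate. This is a rough sketch under stated assumptions: `page_label` is a hypothetical helper, and `tokens_per_page` is an invented average that you would need to measure for your own documents. It takes the starting page and token count as plain values rather than reading them from a RedHop object.

```python
import math


def page_label(start_page, n_tokens, tokens_per_page=500):
    """Render a conservative page label for a chunk.

    Worst case: the chunk begins with the last token of its starting page,
    so the remaining n_tokens - 1 tokens spill onto the following pages.
    `tokens_per_page` is an assumed average, not a measured value.
    """
    if start_page is None:
        return ""
    extra = math.ceil((n_tokens - 1) / tokens_per_page) if n_tokens > 1 else 0
    if extra == 0:
        return f"p.{start_page}"
    return f"p.{start_page}–{start_page + extra}"
```

Because this is an upper bound, almost every chunk renders as a range. That suits high-stakes use cases where a too-wide range is better than a wrong page, but it is noisier than a single page for casual ones.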

Over-citing. Every clause gets a marker, and the answer turns into a field of [1][2][3]. Tell the LLM to cite at the end of each sentence, not each clause. Or render the markers in a more compact way in the UI (one marker per sentence, even if the model used more).
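The compact rendering can be done as a post-processing pass over the answer text. A minimal sketch, assuming markers always look like `[N]` and that a naive sentence split is good enough; `compact_markers` is a hypothetical helper, and it flattens line breaks, so apply it per paragraph if the answer has several.

```python
import re

# Match a marker plus any whitespace right before it, so removal leaves no gap.
MARKER = re.compile(r"\s*\[(\d+)\]")


def compact_markers(answer):
    """Collapse citation markers to one deduplicated group per sentence."""
    out = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        seen = []  # citation numbers in order of first appearance
        for match in MARKER.finditer(sentence):
            n = int(match.group(1))
            if n not in seen:
                seen.append(n)
        if not seen:
            out.append(sentence)
            continue
        body = MARKER.sub("", sentence)
        # Split off terminal punctuation so the group sits just before it.
        end = ""
        if body and body[-1] in ".!?":
            body, end = body[:-1], body[-1]
        group = "".join(f"[{n}]" for n in seen)
        out.append(f"{body.rstrip()} {group}{end}".lstrip())
    return " ".join(out)
```

This trades precision for readability: the reader still sees which sources back a sentence, but no longer which clause each one backs.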

Under-citing. The LLM produces a long answer with no markers at all. Two common causes: the prompt didn’t instruct it strongly enough, or the LLM judged that the question was simple enough that citations felt redundant. Strengthen the prompt (“you must cite each claim”), or add a post-step that asks the LLM to retroactively add citations to its own output.
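Before deciding whether to run that post-step, it helps to measure how much of the answer is uncited. The sketch below is one possible check, not a standard metric: `uncited_sentences` and `needs_citation_retry` are hypothetical helpers, and the 0.5 threshold is an arbitrary assumption to tune against your own answers.

```python
import re

MARKER = re.compile(r"\[\d+\]")


def split_sentences(answer):
    """Naive sentence split: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]


def uncited_sentences(answer):
    """Return the sentences that carry no [N] marker at all."""
    return [s for s in split_sentences(answer) if not MARKER.search(s)]


def needs_citation_retry(answer, max_uncited_ratio=0.5):
    """True when more than max_uncited_ratio of the sentences lack a marker.

    The default threshold is an assumption, not a recommendation.
    """
    sentences = split_sentences(answer)
    if not sentences:
        return False
    return len(uncited_sentences(answer)) / len(sentences) > max_uncited_ratio
```

When the check fires, you can either re-ask with the stronger prompt or pass the uncited sentences to the retroactive-citation step.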

Citation latency. If you stream tokens to the UI, the citation markers stream too, and you have to render the source panel incrementally as each new [N] appears. Most streaming-RAG apps render a placeholder (“[loading]”) for the first appearance of each [N] and fill in the real source link once the stream completes.
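One detail in streaming is that a marker can be split across two deltas ("[" in one, "1]" in the next). The sketch below shows one way to detect the first appearance of each [N] as text arrives. `MarkerStream` is a hypothetical helper, it is independent of any particular streaming client, and it assumes you call `feed` with each text delta in order.

```python
import re


class MarkerStream:
    """Track citation markers as streamed text deltas arrive."""

    _MARKER = re.compile(r"\[(\d+)\]")
    # An unfinished marker at the very end of the text, e.g. "[" or "[12".
    _PARTIAL = re.compile(r"\[\d*$")

    def __init__(self):
        self._tail = ""  # unfinished marker carried over from the last delta
        self.seen = []   # citation numbers in order of first appearance

    def feed(self, delta):
        """Consume one delta; return citation numbers seen for the first time."""
        text = self._tail + delta
        partial = self._PARTIAL.search(text)
        self._tail = partial.group(0) if partial else ""
        new = []
        for match in self._MARKER.finditer(text):
            n = int(match.group(1))
            if n not in self.seen:
                self.seen.append(n)
                new.append(n)
        return new
```

Each number returned by `feed` is a cue to add a placeholder row to the source panel. Once the stream ends, `seen` holds the full list to validate and resolve into real links.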

What Perplexity, ChatGPT, and Claude do

A rough survey of the production patterns you can observe in their UIs:

Perplexity uses the inline-footnote pattern aggressively. Numbered markers in the answer. Hovering one shows a card with the source title, URL, and a snippet, and clicking it opens the source. They also render a “Sources” rail with all the sources used. Their attribution is sentence-level when the model cooperates.

ChatGPT (browsing mode) and Atlas put the sources at the bottom of the answer in a card layout, with a title, URL preview, and small favicon for each. When the model includes inline numbered markers, they render them as small superscript pills that link to the relevant source card. Less aggressive about inline attribution than Perplexity, but the pattern is the same.

Claude (web search) does inline numbered citations linked to a sources panel, similar to Perplexity. The numbers appear as compact pills inline, and hovering shows a preview. Claude’s prompt instruction to itself is visible if you read carefully: “cite each claim from the search results using [N] format.”

All three appear to follow Strategy A: the model is given numbered sources, and the markers it emits are mapped back to them. The UI rendering varies. The data flow looks the same.

When to skip citations entirely

A few cases where citations add cost without value:

Single-source corpora where the user already knows where the answer came from. “Q: what’s the refund window? A: 30 days” doesn’t need [1] if the user knows they uploaded one contract.
Conversational follow-ups that paraphrase or summarize a previous cited answer. Citing again on the summary is noise.
Internal-use deployments where users trust the data source and citations are decoration.

Default to including them. Remove them only when you’ve measured that the user doesn’t need them.

Build a RAG app covers the basic three-call RedHop pattern this guide builds on.

Retrieval & context tips covers the retrieval side: making sure the right chunks reach the LLM in the first place, which is the precondition for citations being useful.

Examples: Custom citations rendering has the basic footnote-renderer snippet without the LLM-citation validation flow this guide adds.