Stop Rereading Your PDFs: a plain-English guide to Token-Direct Visual RAG

TL;DR: Instead of converting your whole document library to text and searching that text, we search each page’s visual tokens (smart “patches” of the image). We find the right pages fast, then decode those exact tokens directly with DeepSeek-OCR to get the text and answer the question. No training needed. No full-document OCR passes. Just search → decode tokens → answer.


Why “text-first” RAG keeps letting you down

Classic RAG does this:

  1. OCR every page to text
  2. Split that text into chunks
  3. Embed & search those chunks
  4. Ask an LLM to answer
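For readers who want to see it in code, here is a toy sketch of that text-first flow. Word overlap stands in for real embeddings, the OCR output is a made-up string, and step 4 is left as a comment; none of the names come from any particular library.

```python
# Toy text-first RAG: chunk OCR'd text, score chunks by word overlap
# (a crude stand-in for embeddings), and keep the best chunk for the LLM.

def chunk(text: str, size: int = 40) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def overlap(query: str, passage: str) -> int:
    return len(set(query.lower().split()) & set(passage.lower().split()))

ocr_text = "INVOICE March 2024 bill to ACME total due 1,240.00 EUR"  # step 1: OCR output (made up)
chunks = chunk(ocr_text)                                             # step 2: split into chunks
query = "total due on the March invoice"
best = max(chunks, key=lambda c: overlap(query, c))                  # step 3: search the chunks
print(best)                                                          # step 4: hand `best` to an LLM
```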

It’s okay for clean docs, but it breaks on:

  • multi-column layouts, tables, stamps, math, receipts
  • big OCR bills up front (or repeatedly)
  • brittle retrieval (if OCR misses a word, you never find it)

The flip: search the page itself, then decode

Our idea is simple:

  1. Turn every page image into compact visual tokens once.
  2. Turn your question into a tiny image (plus 2–5 short variants) and make tokens for that too.
  3. Use ColBERT-style matching to find the pages whose tokens best match your question tokens.
  4. Directly decode those winning page tokens with DeepSeek-OCR to get faithful text.
  5. Let a lightweight LLM read the snippets and reply with citations.

Key point: we don’t run OCR across the corpus. We decode directly from the tokens we just retrieved. Nothing else.
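For the technically curious, here is a rough sketch of that flow in Python. Everything in it is illustrative: the encoders return random arrays, `decode_tokens` is a hypothetical stand-in for calling DeepSeek-OCR on stored page tokens, and the scoring is the ColBERT-style sum of best matches explained below.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 128  # embedding width, illustrative only

# Hypothetical stand-ins, NOT a real API: a real system would run a vision
# encoder in encode_* and call DeepSeek-OCR inside decode_tokens.
def encode_page(image=None) -> np.ndarray:
    return rng.standard_normal((256, DIM))      # (n_page_tokens, dim)

def encode_query(text: str) -> np.ndarray:
    return rng.standard_normal((16, DIM))       # (n_query_tokens, dim)

def decode_tokens(page_tokens: np.ndarray) -> str:
    return "<text DeepSeek-OCR would decode from these tokens>"

def late_interaction_score(q: np.ndarray, p: np.ndarray) -> float:
    # ColBERT-style MaxSim: each query token keeps its best page token, then sum.
    # (Real systems normalize token vectors first so this is a cosine score.)
    return float((q @ p.T).max(axis=1).sum())

# 1) Index once: every page becomes cached visual tokens.
page_index = {f"page_{i}": encode_page() for i in range(5)}

# 2-3) Encode the question and rank pages by late interaction.
q = encode_query("total due on the March invoice")
ranked = sorted(page_index,
                key=lambda pid: late_interaction_score(q, page_index[pid]),
                reverse=True)

# 4) Decode only the winners' tokens; there is no corpus-wide OCR pass.
snippets = {pid: decode_tokens(page_index[pid]) for pid in ranked[:3]}
print(snippets)  # 5) a small LLM turns these snippets into a cited answer
```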


Quick analogy

Each page is a mosaic of little magnetic tiles (visual tokens).
Your question becomes a mini mosaic too.
We bring them together; the tiles that “snap” hardest reveal the right pages.
Then we read those snapped tiles—not the whole wall.


Where ColBERT and DeepSeek-OCR fit (no jargon)

  • ColBERT: a retrieval trick that compares your question in small pieces to a page in small pieces, then adds up the best matches. It’s precise and great for spotting details.
  • DeepSeek-OCR: a modern OCR model that can take those visual tokens directly and output text. No re-encoding of pixels. No full-page OCR needed at question time.

Together: ColBERT finds the right tokens; DeepSeek-OCR reads those tokens.
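To make "adds up the best matches" concrete, here is the scoring rule on tiny made-up vectors (two query pieces, three page pieces; the numbers are chosen only for illustration):

```python
import numpy as np

Q = np.array([[1.0, 0.0],    # query token 1
              [0.0, 1.0]])   # query token 2
P = np.array([[0.9, 0.1],    # page token A
              [0.2, 0.8],    # page token B
              [0.5, 0.5]])   # page token C

sims = Q @ P.T            # how well each query piece matches each page piece
best = sims.max(axis=1)   # each query piece keeps only its best match: [0.9, 0.8]
score = best.sum()        # the page's score: ~1.7
print(sims, best, score, sep="\n")
```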


How it works (for non-devs)

  1. Index once — We convert each page into visual tokens and store them.
  2. Ask anything — Your question becomes a tiny text image (plus a few synonyms), then we make tokens for it.
  3. Match by parts — We compare little pieces of your question to little pieces of every page and rank the best pages.
  4. Decode tokens — We hand the winning page tokens straight to DeepSeek-OCR and get back the exact text.
  5. Answer + cite — A small LLM assembles the final answer and cites the pages it used.
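Step 2 is the most unusual part, so here is one plausible way to do it with Pillow: render the question (and a couple of hand-written variants) onto small white canvases, which then go through the same visual encoder as the pages. The font, canvas size, and variant list are all assumptions.

```python
from PIL import Image, ImageDraw

def render_query(text: str, width: int = 640, height: int = 64) -> Image.Image:
    # A white canvas with the question drawn as black text, like a one-line page.
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).text((8, 8), text, fill="black")
    return img

query = "total due on the March invoice"
variants = [query, "amount payable, March invoice", "invoice total (March)"]
query_images = [render_query(v) for v in variants]
# Each image is then encoded into visual tokens, exactly like a page image.
```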

Why this is different from text-first RAG

| Topic | Text-first RAG | Token-Direct Visual RAG |
| --- | --- | --- |
| Where search happens | Over OCR’d text chunks | Over visual tokens of each page |
| OCR at query time | Often heavy or repeated | Direct token decoding (no full-doc OCR) |
| Layout fidelity | Tables/columns can get mangled | Preserved until decoding |
| Compute | OCR + chunking + embeddings first | Search first, then decode the matched tokens |
| Traceability | “Which chunk produced this?” | The same tokens that matched are decoded |

What you get in practice

  • Speed & lower cost: We don’t re-OCR or re-embed everything each time.
  • Faithful answers: We decode precisely the tokens that matched the query.
  • Great on messy layouts: Invoices, forms, multi-column reports, tables, stamps.
  • Zero training: Works out-of-the-box with standard ColBERT-style matching and DeepSeek-OCR.

Example: “What’s the total due on the March invoice?”

Old way: OCR the whole invoice, hope the table survived, hope the right chunk exists, then search the chunks.
Our way: Match your query-image (“total due March invoice”) against page tokens, jump straight to the bottom-right box that matched, decode those tokens directly, and answer—with a link to that page.


FAQ

Do we still “do OCR”?
We decode tokens directly with DeepSeek-OCR. That’s different from running OCR over every page. We decode only the tokens we retrieved, not entire documents.

Is there any training?
No. This is a zero-train pipeline. You can ship it as is.

What if I want summaries instead of verbatim text?
Today, we decode the matched tokens verbatim (fast and faithful). Later, we can drop in a specialized decoder (a small model head) that directly outputs the summary or a structured table—still from tokens—so you get exactly the format you want.

How do you handle synonyms or phrasing differences?
The query step creates a few short variants (synonyms/aliases) and turns them into images. That makes matching robust, even without training.


Roadmap (non-dev)

  • Now: Search by visual tokens → decode matched tokens → answer.
  • Soon:
    • Two-stage search for big libraries (quick coarse pass, then exact pass).
    • Token masks so we decode an even smaller set of tokens when pages are huge.
  • Later:
    • Task-specific decoders (e.g., “decode to summary”, “decode tables to CSV”, “decode only figures & captions”).
    • These decoders drop in with no changes to the search stage.
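The two-stage search mentioned under "Soon" is a standard coarse-then-exact pattern. A minimal sketch, assuming one mean-pooled vector per page for the cheap first pass and random arrays standing in for real cached tokens:

```python
import numpy as np

rng = np.random.default_rng(1)
# Cached visual tokens for 200 pages (random stand-ins, 64 tokens x 128 dims each).
pages = {f"page_{i}": rng.standard_normal((64, 128)) for i in range(200)}
q = rng.standard_normal((16, 128))  # query tokens (stand-in)

# Stage 1 (coarse): one pooled vector per page, cheap dot-product ranking.
pooled = {pid: toks.mean(axis=0) for pid, toks in pages.items()}
q_pooled = q.mean(axis=0)
shortlist = sorted(pooled, key=lambda pid: float(pooled[pid] @ q_pooled), reverse=True)[:20]

# Stage 2 (exact): full late-interaction scoring, but only on the shortlist.
def maxsim(q_toks, p_toks):
    return float((q_toks @ p_toks.T).max(axis=1).sum())

top = sorted(shortlist, key=lambda pid: maxsim(q, pages[pid]), reverse=True)[:5]
print(top)
```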

Why this matters

Documents are visual. Forcing them into plain text first is fragile and expensive. Token-Direct Visual RAG respects the page as a page: we find the answer visually, then read exactly what we found. That’s why it’s faster, cheaper, and more trustworthy—especially on the messy docs that break ordinary RAG.

Why this will feel different in production

  • Search happens before any heavy decoding: late-interaction over cached visual tokens is precise on small page regions (tables, stamps, math).
  • Decoding is targeted: you decode only the tokens that won retrieval, not whole pages. With DeepSeek’s compression, that slashes compute while keeping fidelity high.
  • Option to go “blazing”: if/when scale grows, drop in PLAID/FastPLAID (no training) for big retrieval-latency cuts, then rerank on full tokens.

https://github.com/bacoco/DeepSynth