XMRetriever: When Smarter Retrieval Beats Larger Prompts
Large Vision-Language Models (VLMs) like GPT-4o have transformed document understanding. Feed them an image of a complex table alongside a few annotated examples, and they can extract structured data with impressive accuracy. But there is a catch: the cost of those "few-shot" examples adds up fast. For documents with dense, heterogeneous tables -- the kind found in healthcare benefit summaries, legal contracts, and financial filings -- naive static prompting can balloon to ~80,000 tokens per page. That is expensive, slow, and increasingly prone to hallucination as context windows fill up.
XMRetriever is a cross-modal retrieval system born from a practical question: What if, instead of throwing more examples at the model, we found the right examples for each page? The result is a lightweight retrieval layer that sits in front of any VLM, dynamically selecting the most relevant few-shot exemplars while respecting a strict token budget. It cuts tokens by 75%, latency by 53%, and actually improves extraction accuracy.
The best prompt is not the longest one -- it is the most relevant one.
The Problem: Static Prompting Does Not Scale
Consider the task of extracting structured data from complex document tables -- benefit schedules, coverage matrices, formulary grids. These documents present a unique challenge:
- Visual heterogeneity -- Tables vary wildly in layout: merged cells, nested headers, footnote references, spanning columns, color-coded rows.
- Semantic density -- A single cell may encode conditional logic ("$30 copay after deductible, in-network only; 40% coinsurance out-of-network").
- Page-count variation -- A 2-page summary document looks nothing like a 40-page comprehensive schedule, yet both need accurate extraction.
The standard approach is static few-shot prompting: attach 20-30 annotated examples to every VLM call, regardless of what the input page actually looks like. This has three problems:
- Token bloat -- 20-30 examples at ~3,000 tokens each yields ~80k tokens/page. At GPT-4o pricing, this becomes prohibitively expensive at scale.
- Signal dilution -- Most of those examples are irrelevant to the current page. The model wastes attention on dissimilar layouts, reducing effective context for the actual task.
- Hallucination risk -- Overcrowded contexts increase the probability of the model "blending" information from unrelated examples into its output. With static prompting, hallucination rates of 8-9% are common.
The Insight: Cross-Modal Retrieval as Context Curation
XMRetriever reframes the problem: instead of engineering a better prompt, engineer a better retrieval system that automatically selects the right prompt for each page. The key insight is that document pages have both textual and visual identity, and effective retrieval must capture both modalities.
A page with a 3-column coverage table looks visually similar to other 3-column tables, but may be semantically about dental benefits vs. pharmacy benefits. Conversely, two pages about pharmacy benefits may use completely different visual layouts. XMRetriever learns a shared embedding space that captures both dimensions simultaneously.
Architecture: Dual-Head Projection into Shared Space
The architecture is deliberately simple. Rather than training an end-to-end cross-modal model from scratch (expensive, fragile, data-hungry), XMRetriever builds lightweight projection heads on top of pre-computed embeddings from existing foundation models.
Figure 1: XMRetriever architecture -- dual-head projection maps text and vision embeddings into a shared 512-d space, concatenated into a 1024-d vector for FAISS retrieval, feeding a token-budgeted prompt to GPT-4o.
Here is how each component works:
1. Embedding Extraction (Frozen)
For each page in the exemplar library, we compute two embeddings using off-the-shelf models -- no fine-tuning required:
- Text embedding -- The page text is extracted (OCR or native PDF text) and encoded with OpenAI's text-embedding-3-small, producing a 1536-dimensional vector.
- Vision embedding -- The page is rasterized to an image and encoded with ViT-B/16 (the CLS token), producing a 768-dimensional vector.
2. Dual-Head Projection (Trainable)
The core trainable component is remarkably small: two 3-layer MLP projection heads that map heterogeneous embeddings into a shared 512-dimensional space. Each head uses BatchNorm, GELU activation, and dropout for regularization.
```python
import torch
import torch.nn as nn


class ProjectionHead(nn.Module):
    """Lightweight 3-layer MLP projection head.

    Maps pre-computed embeddings from a foundation model
    into a shared cross-modal space.

    Args:
        input_dim: Dimension of source embeddings
            (1536 for text, 768 for vision).
        hidden_dim: Internal hidden layer width.
        output_dim: Shared projection space (512-d).
        dropout: Dropout probability for regularization.
    """

    def __init__(
        self,
        input_dim: int,
        hidden_dim: int = 1024,
        output_dim: int = 512,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        projected = self.net(x)
        # L2-normalize for cosine similarity in FAISS
        return nn.functional.normalize(projected, p=2, dim=-1)


class XMRetriever(nn.Module):
    """Cross-Modal Retriever with dual projection heads."""

    def __init__(self):
        super().__init__()
        self.text_head = ProjectionHead(input_dim=1536)
        self.vision_head = ProjectionHead(input_dim=768)

    def forward(self, text_emb, vision_emb):
        text_proj = self.text_head(text_emb)     # (B, 512)
        vis_proj = self.vision_head(vision_emb)  # (B, 512)
        # Concatenate for joint representation
        return torch.cat([text_proj, vis_proj], dim=-1)  # (B, 1024)
```
The total parameter count is under 5 million -- small enough to train on a laptop. The entire training pipeline runs in under 3 hours on consumer-grade hardware (an Apple M4 laptop). This is a deliberate design choice: simplicity as a feature.
3. Concatenation and Indexing
The two 512-d projections are concatenated into a single 1024-dimensional vector and indexed in a FAISS IVF-PQ index. At query time, the same dual-head projection is applied to the incoming page, and FAISS returns the top-K nearest neighbors from the exemplar library.
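The query path can be sketched in a few lines. Here is a minimal stand-in that uses exact cosine search with NumPy in place of the FAISS IVF-PQ index (the production index trades a little recall for speed, but the contract is the same); since the projections are L2-normalized, an inner product is exactly cosine similarity:

```python
import numpy as np


def build_index(library_vecs: np.ndarray) -> np.ndarray:
    """Exact-search stand-in for the FAISS IVF-PQ index.

    Rows are the 1024-d concatenated (text + vision) projections
    of each exemplar page, assumed L2-normalized.
    """
    return np.ascontiguousarray(library_vecs, dtype=np.float32)


def top_k(index: np.ndarray, query_vec: np.ndarray, k: int = 5):
    """Return indices and similarities of the k nearest exemplars."""
    sims = index @ query_vec       # inner product == cosine for unit vectors
    order = np.argsort(-sims)[:k]  # top-k by descending similarity
    return order, sims[order]
```

Swapping in the real FAISS index only changes `build_index` and the search call; the projection and normalization steps stay identical.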
Neighbor-Aware Sampling: The Secret Sauce
The most impactful contribution is the neighbor-aware negative sampling strategy. Standard contrastive learning samples random negatives from the full dataset. But document retrieval has a unique structure: pages from documents with similar page counts tend to share structural characteristics. A 4-page summary looks fundamentally different from a 35-page comprehensive schedule.
Figure 2: Neighbor-aware negative sampling -- negatives are drawn primarily from the same page-count group (alpha=0.70), with 15% each from adjacent groups g-1 and g+1 to build robustness against distribution shifts.
The sampling strategy works as follows:
- Anchor -- The text embedding of a given page.
- Positive -- The vision embedding of the same page (cross-modal pairing).
- Negatives -- Drawn with a structured distribution:
- 70% (alpha = 0.7) from the same page-count group g -- these are the hardest negatives, forcing the model to learn fine-grained distinctions between similar documents.
- 15% each from adjacent groups g-1 and g+1 -- these teach the model to handle the fuzzy boundaries between document types.
- In-batch hard-negative mining (beta = 0.3) -- Within each mini-batch, the hardest negatives (highest similarity to anchor) are upweighted with probability beta.
This strategy is critical for real-world robustness. Without it, the model overfits to clean group boundaries and fails when it encounters documents that fall between categories -- a 6-page document that looks like a summary but has the density of a comprehensive schedule.
```python
def neighbor_aware_triplet_loss(
    anchor,        # (B, D) text projections
    positive,      # (B, D) vision projections (same page)
    negatives,     # (N, D) candidate negatives
    neg_groups,    # group label for each negative
    anchor_group,  # group label for anchor
    margin=0.3,
    alpha=0.7,     # same-group sampling weight
    beta=0.3,      # hard-negative mining probability
):
    """Triplet loss with neighbor-aware negative sampling.

    Samples negatives preferentially from:
      - Same group g (probability alpha)
      - Adjacent groups g+-1 (probability (1-alpha)/2 each)
    Then applies in-batch hard-negative mining.
    """
    # --- Step 1: Neighbor-aware negative selection ---
    same_group = (neg_groups == anchor_group)
    adj_group = (
        (neg_groups == anchor_group - 1) |
        (neg_groups == anchor_group + 1)
    )

    # Build sampling weights
    weights = torch.zeros(len(negatives))
    weights[same_group] = alpha
    weights[adj_group] = (1 - alpha) / 2
    if weights.sum() == 0:
        # No same- or adjacent-group candidates: fall back to uniform
        weights = torch.ones(len(negatives))
    weights = weights / weights.sum()  # normalize

    # Sample K negatives per anchor, capped by the number of
    # nonzero-weight candidates (required when replacement=False)
    K = min(64, int((weights > 0).sum()))
    indices = torch.multinomial(weights, K, replacement=False)
    sampled_neg = negatives[indices]  # (K, D)

    # --- Step 2: Hard-negative mining within batch ---
    # Compute similarity to anchor
    sim = torch.mm(anchor, sampled_neg.T)  # (B, K)

    # With probability beta, select hardest negative;
    # otherwise, use a random one from the sampled set
    if torch.rand(1).item() < beta:
        hard_idx = sim.argmax(dim=-1)     # hardest per anchor
        negative = sampled_neg[hard_idx]  # (B, D)
    else:
        rand_idx = torch.randint(0, K, (anchor.size(0),))
        negative = sampled_neg[rand_idx]

    # --- Step 3: Triplet margin loss ---
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    loss = torch.clamp(d_pos - d_neg + margin, min=0.0)
    return loss.mean()
```
Token-Budgeted Prompt Assembly
Retrieval alone is not enough. Once XMRetriever returns the top-K most relevant exemplars, they need to be assembled into a prompt that fits within a strict token budget. This is where the token-budgeted prompt assembly pipeline takes over:
- MMR diversity filtering -- Maximal Marginal Relevance ensures the selected exemplars are not just relevant but also diverse. If the top 3 results are all nearly identical pharmacy tables, MMR swaps in a dental or vision table that is still relevant but adds coverage.
- Greedy budget allocation -- Each exemplar is assigned a token cost (the length of its annotated JSON + the base64-encoded image). Exemplars are greedily added to the prompt until the 20k token budget is reached.
- Result -- Instead of 20-30 static examples, the system selects 4-8 highly relevant, diverse exemplars that fit within budget.
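The two steps above compose into a single greedy loop. The sketch below is a minimal illustration under stated assumptions (not the production implementation): `sims` holds each exemplar's relevance to the query page, `pairwise` the exemplar-to-exemplar similarity matrix, and `lam` a hypothetical MMR trade-off weight:

```python
import numpy as np


def mmr_budgeted_select(sims, pairwise, token_costs, budget, lam=0.7):
    """Greedy MMR selection under a token budget.

    sims:        (N,) relevance of each exemplar to the query page
    pairwise:    (N, N) similarity between exemplars
    token_costs: (N,) token cost of each exemplar (annotated JSON + image)
    budget:      maximum total tokens spent on exemplars
    lam:         relevance/diversity trade-off (1.0 = pure relevance)
    """
    selected, spent = [], 0
    candidates = set(range(len(sims)))
    while candidates:
        best, best_score = None, -np.inf
        for i in sorted(candidates):
            if spent + token_costs[i] > budget:
                continue  # exemplar would blow the budget
            # Penalize similarity to already-selected exemplars
            redundancy = max((pairwise[i, j] for j in selected), default=0.0)
            score = lam * sims[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break  # nothing affordable remains
        selected.append(best)
        spent += token_costs[best]
        candidates.remove(best)
    return selected, spent
```

The redundancy penalty is what produces the behavior described above: a near-duplicate of an already-selected pharmacy table scores lower than a slightly less relevant but structurally different one.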
Why 20k tokens per page?
This budget was empirically determined through ablation. Below 15k, accuracy drops as too few exemplars can be included. Above 25k, diminishing returns set in with marginal accuracy gains. 20k hits the sweet spot: enough room for 4-8 exemplars with their annotations, while keeping costs manageable at scale.
Results: Less Is More
The evaluation was conducted on a dataset of complex document tables spanning multiple domains -- healthcare benefit summaries, coverage schedules, and formulary documents -- with heterogeneous layouts, nested headers, and conditional logic in cells. All numbers compare XMRetriever against the Static-All baseline (the standard approach of including all available few-shot examples).
Figure 3: Head-to-head comparison -- XMRetriever achieves higher extraction accuracy with 75% fewer tokens by selecting relevant exemplars instead of including all of them.
| Metric | Static-All | XMRetriever | Change |
|---|---|---|---|
| Tokens / page | ~80,000 | ~18,000 | -75% |
| Latency / document | 210 s | 98 s | -53% |
| Exact Match (EM) | 72.0% | 79.4% | +7.4 pp |
| Hallucination rate | 8.6% | 1.2% | -7.4 pp |
| Throughput (docs/hour) | 1x (baseline) | 4.2x | +320% |
| Exemplars / prompt | 20-30 | 4-8 | Dynamic |
The most surprising result is that accuracy improved alongside token reduction. This is counterintuitive -- you might expect fewer examples to mean less information for the model. But the opposite is true: irrelevant examples actively hurt performance by confusing the model's attention. Removing them is not just cheaper; it is better.
The hallucination reduction from 8.6% to 1.2% is equally significant. In domains like healthcare and finance, a hallucinated value in an extracted table can have downstream consequences -- incorrect benefit limits, wrong copay amounts, misquoted coverage terms. Reducing hallucinations by 7x has direct business impact.
Ablation Studies: What Actually Matters
To understand which components contribute most to performance, we conducted systematic ablation studies across three dimensions.
1. Neighbor-Aware Sampling
Removing the g+/-1 neighbor sampling and using only random negatives drops Precision@5 from 0.82 to 0.77 -- a significant degradation. The adjacent-group negatives are what teach the model to handle the fuzzy boundaries between document categories. Without them, the model struggles with edge cases: a 7-page document that could belong to either the "short" or "medium" group.
2. Loss Function
We compared three contrastive loss functions:
- Triplet loss with margin (used in XMRetriever) -- Best overall performance. The explicit margin parameter provides direct control over the desired separation between positives and negatives.
- InfoNCE -- Competitive but slightly worse (-0.02 P@5). InfoNCE treats all negatives equally within a batch, missing the structured sampling that triplet loss leverages.
- MSE regression -- Significantly worse (-0.08 P@5). Regressing to a similarity score loses the ranking structure entirely.
3. Projection Head Depth
The 3-layer architecture was empirically optimal:
- 2-layer heads -- Underperform by -0.03 P@5. Insufficient capacity to learn the non-linear mapping between heterogeneous embedding spaces.
- 3-layer heads -- Optimal. Enough depth to learn complex mappings without overfitting.
- 6-layer heads -- No improvement over 3-layer, but 2x the parameters and training time. Classic diminishing returns.
Key takeaway from ablations
The neighbor-aware sampling strategy contributes more to performance than architectural complexity. A simple 3-layer MLP with smart training beats a deeper network with naive training. How you train matters more than what you train.
Why Simplicity Is a Feature
XMRetriever deliberately avoids end-to-end training of cross-modal encoders. This is not a limitation -- it is a design principle with several practical advantages:
- Foundation model agnosticism -- When a better text or vision encoder is released, swap it in and retrain just the projection heads in a few hours. No need to rebuild the entire pipeline.
- Training efficiency -- Under 5M trainable parameters means the entire system trains on consumer hardware (Apple M4) in under 3 hours. No GPU clusters required.
- Deployment simplicity -- The projection heads are tiny PyTorch models that can be serialized and loaded anywhere. FAISS indexes are standard and well-supported. No custom infrastructure needed.
- Debuggability -- When retrieval quality drops, you can inspect the projected embeddings, visualize the FAISS neighborhood, and diagnose issues. The system is not a black box.
Lightweight heads on pre-computed embeddings beat complex end-to-end models when the task is retrieval, not generation. Let foundation models do what they are good at -- encoding -- and invest your trainable parameters where they matter most: learning the cross-modal alignment.
Beyond Document Tables: Domain-Agnostic Applications
While XMRetriever was developed and evaluated on healthcare document table extraction, the architecture is domain-agnostic. The same dual-head projection + neighbor-aware sampling approach applies wherever you have:
- Legal documents -- Contract clause extraction, where table layouts vary between law firms and document types.
- Financial reports -- Balance sheets, income statements, and regulatory filings with heterogeneous table formats across institutions.
- Scientific papers -- Results tables with varying structures across journals, disciplines, and formatting conventions.
- Government forms -- Tax forms, permit applications, and compliance documents with agency-specific layouts.
The only requirements are: (1) a labeled exemplar library, (2) pre-computed text and vision embeddings, and (3) a downstream VLM for extraction. The projection heads can be retrained for a new domain in hours.
Implementation Notes
A few practical considerations for anyone looking to implement a similar system:
- Page-count groups -- We define groups by page-count ranges (e.g., 1-3, 4-7, 8-15, 16-30, 31+). The exact boundaries should be tuned per domain based on where natural structural breaks occur in your document corpus.
- FAISS index type -- IVF-PQ with 256 centroids and 64 sub-quantizers works well for exemplar libraries up to ~50k pages. For larger libraries, consider IVF-HNSW.
- Token counting -- Use tiktoken (cl100k_base encoding) for accurate token estimation during budget allocation. Approximate token counts lead to budget overruns or underutilization.
- Embedding caching -- Pre-compute and cache all embeddings. The OpenAI API call for text embeddings and the ViT forward pass for vision embeddings are the most expensive parts of indexing -- do them once.
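As a concrete example of the first note, group assignment reduces to a lookup against the range boundaries. A minimal sketch using the illustrative ranges above (tune `GROUP_UPPER_BOUNDS` to your corpus):

```python
import bisect

# Inclusive upper bound of each page-count group (1-3, 4-7, 8-15, 16-30);
# anything above the last bound falls into the open-ended 31+ group.
GROUP_UPPER_BOUNDS = [3, 7, 15, 30]


def page_count_group(n_pages: int) -> int:
    """Map a document's page count to its group index g (0-based)."""
    return bisect.bisect_left(GROUP_UPPER_BOUNDS, n_pages)
```

These group indices are exactly the labels consumed by the neighbor-aware sampling (`neg_groups`, `anchor_group`), so the g-1/g+1 adjacency falls out of the integer ordering for free.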
Conclusion
XMRetriever demonstrates a principle that applies broadly across AI engineering: smarter retrieval beats larger prompts. By investing in a lightweight cross-modal retrieval layer -- just two 3-layer MLPs trained with a carefully designed sampling strategy -- we achieved simultaneous improvements across every metric that matters: cost, latency, accuracy, and reliability.
The key insight is that in a few-shot VLM pipeline, the quality of examples matters far more than the quantity. A well-chosen set of 4-8 exemplars, dynamically selected based on both textual and visual similarity, outperforms a brute-force approach of 20-30 static examples. And it does so at a fraction of the cost.
As document AI systems are deployed at scale across healthcare, legal, financial, and government domains, the economics of VLM inference will increasingly favor systems like XMRetriever -- systems that are intelligent about what they put in the prompt, not just how they process what comes out.
The most expensive token is the one that confuses the model. The cheapest optimization is removing it.