Generative Embeddings with Test-Time Scaling, Toward Efficient Visual Document Retrievers, and More!
Vol. 124 for Sep 29 - Oct 05, 2025
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
ModernVBERT: Efficient Visual Document Retrieval via Bidirectional Encoders and Late Interaction, from Teiletche et al.
Test-Time Scalable Text Embeddings via Iterative Contrastive Refinement, from National Taiwan University
Understanding Scaling Laws in Generative Recommendation Systems, from Snap Inc.
Fine-tuning with RAG for Improving LLM Learning of New Skills, from Imperial College London
Achieving State-of-the-Art Text Embeddings with 6 Million Open-Source Training Examples, from CodeFuse AI
Multi-Agent Deep Search Over Structured and Unstructured Graph Knowledge, from Yang et al.
Optimizing LLMs for Reasoning-Intensive Document Retrieval Through Rubric-Based Scoring, from Lan et al.
Scalable Deep Research with 4B Parameter Models, from Fractal AI Research
Adaptive Patch-Level Embedding Pruning for Storage-Efficient Visual Document Retrieval, from Yan et al.
Multilingual Learned Sparse Retrieval Through Lexical Space Alignment, from Nguyen et al.
[1] ModernVBERT: Towards Smaller Visual Document Retrievers
This paper from Teiletche et al. investigates optimal design choices for training efficient visual document retrieval models and introduces ModernVBERT, a compact 250M-parameter vision-language encoder. Through controlled experiments, the authors demonstrate that bidirectional attention masks significantly outperform causal attention for late interaction retrieval (+10.6 nDCG@5), though causal models remain sufficient for single-vector embeddings. They show that language modeling objectives during modality alignment substantially improve document retrieval by enabling fine-grained token-level interactions between image and text, while higher image resolutions (2048px) and extended alignment training further boost performance. The paper reveals that incorporating text-only query-document pairs alongside image-text pairs during contrastive training improves document retrieval through cross-modal transfer, addressing data scarcity issues. The resulting ColModernVBERT model matches the performance of models 10× larger on ViDoRe benchmarks while achieving 7× faster CPU inference.
📚 https://arxiv.org/abs/2510.01149
👨🏽💻 https://huggingface.co/ModernVBERT
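To make the late-interaction scoring behind ColModernVBERT concrete, here is a minimal NumPy sketch of ColBERT-style MaxSim scoring between multi-vector query and document embeddings. The shapes and variable names are illustrative assumptions, not taken from the ModernVBERT codebase.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token vector,
    take its maximum similarity over all document (patch) vectors,
    then sum over query tokens.

    query_vecs: (num_query_tokens, dim), L2-normalized
    doc_vecs:   (num_doc_patches, dim), L2-normalized
    """
    sim = query_vecs @ doc_vecs.T          # (num_query_tokens, num_doc_patches)
    return float(sim.max(axis=1).sum())    # max over patches, sum over tokens

# Illustrative usage with random unit vectors
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(700, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))
```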
[2] Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement
This paper from National Taiwan University presents GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a text embedding framework that leverages LLMs’ generative capabilities through autoregressive token generation rather than traditional single-pass encoding. The system generates sequences of “soft tokens”, i.e., probability distributions over the vocabulary rather than discrete tokens, which are iteratively refined through a specialized Iterative Contrastive Refinement (ICR) objective that applies contrastive supervision at each generation step while enforcing progressive quality improvements. Trained on only 200K examples with Mistral-7B and Qwen2.5-7B backbones, GIRCSE achieves top 5-6 rankings on the MTEB benchmark and top 2-3 on instruction-following tasks, avoiding the typical trade-off between general embedding quality and instruction-following capability. Critically, the framework exhibits test-time scaling behavior: allocating more refinement steps at inference yields progressively better embeddings. This work establishes generative iterative refinement as a viable paradigm for representation learning, demonstrating that embeddings can benefit from multi-step reasoning analogous to advances in generative language models.
📚 https://arxiv.org/abs/2509.24291
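The paper’s core mechanism, emitting “soft tokens” (probability distributions over the vocabulary) whose expected embeddings are refined step by step, can be loosely sketched as follows. This is a toy illustration under my own assumptions (tiny vocabulary, random weights, a crude stand-in for re-encoding); it is not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def soft_token_step(hidden: torch.Tensor, lm_head: torch.nn.Linear,
                    token_embeddings: torch.Tensor) -> torch.Tensor:
    """One generation step with a 'soft token': keep the full probability
    distribution over the vocabulary instead of sampling a discrete token,
    and return its expected embedding."""
    probs = F.softmax(lm_head(hidden), dim=-1)   # (batch, vocab)
    return probs @ token_embeddings              # (batch, dim) soft token

# Tiny illustration with random weights (vocab=100, dim=32)
vocab, dim = 100, 32
lm_head = torch.nn.Linear(dim, vocab, bias=False)
tok_emb = torch.randn(vocab, dim)
hidden = torch.randn(4, dim)

step_embeddings = []
for step in range(3):                            # refinement steps
    soft = soft_token_step(hidden, lm_head, tok_emb)
    step_embeddings.append(soft)                 # candidate embedding at this step
    hidden = hidden + soft                       # crude stand-in for re-encoding
# ICR applies contrastive supervision at every step, pushing each
# successive embedding to score higher for positives than the last.
print([tuple(e.shape) for e in step_embeddings])
```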
[3] Understanding Generative Recommendation with Semantic IDs from a Model-scaling View
This paper from Snap Inc. investigates scaling behaviors in generative recommendation (GR) systems by comparing two paradigms: semantic ID-based GR (SID-based GR) and direct LLM-as-recommender systems (LLM-as-RS). The authors formulate a scaling law framework decomposing recommendation performance into collaborative filtering (CF) and semantic information (SI) components, then empirically analyze models ranging from 44M to 14B parameters across Amazon Review datasets. Their key finding is that SID-based GR, which quantizes item embeddings from modality encoders into discrete codes for autoregressive sequence modeling, exhibits severe scaling limitations: performance saturates rapidly as the recommender module is scaled (at roughly 13M parameters), and scaling neither the LLM encoder (up to 11B) nor the quantization tokenizer yields meaningful improvements. Through ablation studies injecting external LLM embeddings, they identify the fundamental bottleneck as SIDs’ limited capacity to encode semantic information, which prevents effective knowledge transfer from powerful foundation models. In contrast, LLM-as-RS, which directly generates item text descriptions from prompts without quantization, demonstrates superior scaling properties, with consistent performance gains up to 14B parameters and up to 20% improvement over SID-based GR’s best performance. Challenging conventional wisdom, they also show that LLMs’ ability to capture CF signals improves with scale.
📚 https://arxiv.org/abs/2509.25522
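To make the scaling-law analysis concrete, here is a small curve-fitting sketch that fits a saturating power law, performance(N) = a - b * N^(-c), to hypothetical (model size, metric) pairs. The functional form and all numbers are my assumptions for illustration, not the paper’s exact formulation or data.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(n_params, a, b, c):
    """Performance approaches ceiling `a` as model size grows."""
    return a - b * np.power(n_params, -c)

# Hypothetical (model size, recall-style metric) observations
sizes  = np.array([44e6, 110e6, 350e6, 1e9, 3e9, 14e9])
metric = np.array([0.041, 0.046, 0.050, 0.053, 0.055, 0.056])

(a, b, c), _ = curve_fit(saturating_power_law, sizes, metric,
                         p0=[0.06, 1.0, 0.3], maxfev=10000)
print(f"ceiling={a:.4f}, exponent={c:.3f}")
# A near-zero exponent would reflect the saturation the paper reports
# for SID-based GR; a larger exponent would indicate healthier scaling.
```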
[4] Fine-tuning with RAG for Improving LLM Learning of New Skills
This paper from Imperial College London presents an approach to improving LLM-based agents by converting RAG from a runtime dependency into internalized knowledge through distillation. The authors propose a four-stage pipeline: (1) collecting failures from base agents on interactive tasks, (2) automatically extracting generalizable hints from these failures using GPT-4o without expert supervision, (3) generating improved teacher trajectories by providing hints once at episode start via one-shot retrieval, and (4) training student models on these trajectories with hints removed to force internalization rather than memorization. Experiments show that distilled students consistently outperform baseline agents, while using 10-60% fewer tokens than retrieval-augmented teachers depending on the environment. The approach generalizes across model scales (7B/14B parameters) and agent architectures (ReAct/StateAct), demonstrating that retrieval benefits can be effectively internalized through targeted fine-tuning without permanent runtime dependencies, thereby eliminating deployment complexity and computational overhead.
📚 https://arxiv.org/abs/2510.01375
👨🏽💻 https://anonymous.4open.science/r/anonymized-submission-iclr/
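The four-stage pipeline is easy to summarize in code. The sketch below uses toy stand-ins (a deliberately weak agent, a fake hint extractor) purely to make the data flow explicit; none of the helpers correspond to the paper’s actual components.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    task: str
    hint: str          # "" once hints are stripped for student training
    actions: List[str]
    success: bool

# --- Toy stand-ins for the paper's components -----------------------------
def agent(task: str, hint: str = "") -> Episode:
    # A deliberately weak agent: succeeds only when given a hint.
    return Episode(task, hint, ["act"], success=bool(hint))

def extract_hint(failure: Episode) -> str:
    # The paper uses GPT-4o here; we fake a generalizable hint.
    return f"check preconditions before acting on '{failure.task}'"

def strip_hint(ep: Episode) -> Episode:
    return Episode(ep.task, "", ep.actions, ep.success)
# ---------------------------------------------------------------------------

tasks = ["open drawer", "heat egg", "find keys"]

# Stage 1: collect failures from the base agent on interactive tasks
failures = [ep for ep in (agent(t) for t in tasks) if not ep.success]

# Stage 2: extract generalizable hints from those failures
hints = {ep.task: extract_hint(ep) for ep in failures}

# Stage 3: teacher trajectories, hint provided once at episode start
teacher_eps = [agent(t, hint=hints.get(t, "")) for t in tasks]

# Stage 4: the student trains on the same trajectories with hints removed,
# internalizing the behavior instead of depending on retrieval at runtime
student_data = [strip_hint(ep) for ep in teacher_eps]
print(f"{len(student_data)} hint-free training episodes")
```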
[5] F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
This paper from CodeFuse AI introduces F2LLM (Foundation to Feature Large Language Models), a family of text embedding models available in three sizes (0.6B, 1.7B, and 4B parameters) that achieves state-of-the-art performance through a remarkably efficient training approach. Unlike competing models that rely on multi-stage training pipelines with billions of weakly-supervised pretraining samples or expensive synthetic data generation, F2LLM is directly fine-tuned from Qwen3 foundation models using only 6 million query-document-negative tuples curated exclusively from open-source, non-synthetic datasets. The training data encompasses 4.9M retrieval samples, 0.2M classification samples, and 0.8M clustering samples, unified into a consistent format with task-specific instructions and margin-based adaptive hard negative mining. The models are trained for 2 epochs using contrastive learning with both hard negative loss and in-batch loss (for retrieval tasks only), without any architectural modifications to the base LLM. On the MTEB English benchmark, F2LLM-4B ranks 2nd among similar-sized models and 7th overall, with particularly strong performance on clustering tasks, while F2LLM-1.7B ranks 1st in the 1B-2B parameter range. The authors release all model checkpoints, training data, and code.
📚 https://arxiv.org/abs/2510.02294
👨🏽💻 https://huggingface.co/collections/codefuse-ai/codefuse-embeddings-68d4b32da791bbba993f8d14
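The training objective, contrastive learning with both mined hard negatives and in-batch negatives, is standard enough to sketch directly in PyTorch. The temperature and batch shapes below are illustrative, not the paper’s hyperparameters.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, d_pos: torch.Tensor,
                     d_neg: torch.Tensor, temperature: float = 0.05):
    """InfoNCE over positives, mined hard negatives, and in-batch negatives.

    q:     (B, dim) query embeddings
    d_pos: (B, dim) positive document embeddings
    d_neg: (B, K, dim) mined hard negatives per query
    """
    q, d_pos = F.normalize(q, dim=-1), F.normalize(d_pos, dim=-1)
    d_neg = F.normalize(d_neg, dim=-1)

    # In-batch scores: every other query's positive acts as a negative (B, B)
    in_batch = q @ d_pos.T
    # Hard-negative scores (B, K)
    hard = torch.einsum("bd,bkd->bk", q, d_neg)

    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(q.size(0))      # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Tiny illustrative batch
B, K, dim = 8, 4, 64
loss = contrastive_loss(torch.randn(B, dim), torch.randn(B, dim),
                        torch.randn(B, K, dim))
print(loss.item())
```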
[6] GraphSearch: An Agentic Deep Searching Workflow for Graph Retrieval-Augmented Generation
This paper from Yang et al. presents GraphSearch, an agentic framework designed to address critical limitations in GraphRAG systems, specifically shallow retrieval and inefficient utilization of structural graph data. The framework introduces a modular deep searching pipeline consisting of six components: Query Decomposition, Context Refinement, Query Grounding, Logic Drafting, Evidence Verification, and Query Expansion, enabling multi-turn iterative interactions with graph knowledge bases. A key innovation is the dual-channel retrieval strategy that simultaneously issues semantic queries over chunk-based text data and relational queries over structural graph representations, allowing comprehensive exploitation of both modalities. The system decomposes complex multi-hop questions into atomic sub-queries, progressively retrieves fine-grained evidence, and performs reflective reasoning to identify and fill remaining information gaps. Experimental validation across six multi-hop question-answering benchmarks (including HotpotQA, MuSiQue, 2WikiMultiHopQA, and domain-specific datasets in medicine, agriculture, and legal fields) demonstrates that GraphSearch consistently outperforms traditional single-round GraphRAG approaches in answer accuracy and generation quality. The framework exhibits strong plug-and-play capability with various existing GraphRAG methods (LightRAG, PathRAG, HyperGraphRAG), maintains effectiveness with smaller language models, and shows particularly pronounced advantages under constrained retrieval budgets.
📚 https://arxiv.org/abs/2509.22009
👨🏽💻 https://github.com/DataArcTech/GraphSearch
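Here is a toy sketch of the dual-channel retrieval idea: the same atomic sub-query is issued to a text channel and a graph channel, and the evidence is merged. The two scorers below are trivial stand-ins, not GraphSearch’s actual retrievers.

```python
from typing import Dict, List, Tuple

# --- Toy stand-ins for the two retrieval channels -------------------------
def semantic_search(query: str, chunks: List[str], k: int = 2) -> List[str]:
    # Lexical-overlap scorer standing in for a dense chunk retriever.
    overlap = lambda c: len(set(query.split()) & set(c.split()))
    return sorted(chunks, key=overlap, reverse=True)[:k]

def graph_search(query: str, triples: List[Tuple[str, str, str]]):
    # Token-match lookup standing in for a relational graph query engine.
    return [t for t in triples if any(tok in t for tok in query.split())]
# ---------------------------------------------------------------------------

def dual_channel_retrieve(sub_query: str, chunks, triples) -> Dict[str, list]:
    """Issue one atomic sub-query to both channels and merge the evidence."""
    return {
        "text_evidence": semantic_search(sub_query, chunks),
        "graph_evidence": graph_search(sub_query, triples),
    }

chunks = ["Marie Curie won two Nobel Prizes", "Paris is the capital of France"]
triples = [("Marie", "born_in", "Warsaw"), ("Paris", "capital_of", "France")]
print(dual_channel_retrieve("Marie Curie Nobel", chunks, triples))
```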
[7] Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval
This paper from Lan et al. introduces Retro*, a method for reasoning-intensive document retrieval that addresses limitations in existing information retrieval systems, particularly those supporting LLM agents and RAG applications. The method employs a rubric-based relevance scoring mechanism that enables models to perform fine-grained reasoning about query-document relationships according to explicitly defined criteria, producing interpretable relevance scores from 0 to 100 across five levels. Unlike existing listwise or setwise approaches, Retro* operates pointwise, allowing massive parallelism and significantly reduced inference latency as candidate document sets grow. The system supports test-time scaling through score integration across multiple reasoning trajectories, yielding more reliable relevance estimates than single-pass methods. To optimize the model’s reasoning capabilities, the authors develop a two-stage training strategy: supervised fine-tuning using filtered trajectories from a powerful teacher model (Qwen3-235B), followed by reinforcement learning with composite rewards that jointly optimize both intra-document scoring accuracy and inter-document ranking performance.
📚 https://arxiv.org/abs/2509.24869
👨🏽💻 https://github.com/FlagOpen/FlagEmbedding
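A minimal sketch of pointwise, rubric-based scoring with test-time scaling: sample several reasoning trajectories per (query, document) pair and integrate their scores, here by simple averaging. The prompt template and score parsing are my assumptions, not Retro*’s exact format. Because each pair is scored independently, these calls parallelize trivially as the candidate set grows.

```python
import random
import re
import statistics
from typing import Callable, List

def pointwise_relevance(query: str, doc: str,
                        generate: Callable[[str], str],
                        num_trajectories: int = 4) -> float:
    """Score one (query, document) pair by sampling several reasoning
    trajectories and averaging their 0-100 rubric scores."""
    prompt = (f"Judge relevance of the document to the query on a 0-100 "
              f"rubric.\nQuery: {query}\nDocument: {doc}\nScore:")
    scores: List[float] = []
    for _ in range(num_trajectories):
        output = generate(prompt)                     # one sampled trajectory
        match = re.search(r"\b(\d{1,3})\b", output)   # extract the final score
        if match:
            scores.append(min(float(match.group(1)), 100.0))
    return statistics.mean(scores) if scores else 0.0

# Fake generator standing in for the trained LLM, for illustration only
fake_llm = lambda prompt: f"reasoning... Score: {random.randint(60, 80)}"
print(pointwise_relevance("what is BM25", "BM25 is a ranking function", fake_llm))
```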
[8] Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
This paper from Fractal AI Research introduces Fathom-DeepResearch, an open-source agentic system comprising two specialized 4B-parameter models. The first component, Fathom-Search-4B (built on Qwen3-4B), performs evidence-based web investigation through three key innovations: (i) DuetQA, a ~5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependency by requiring at least one reasoning hop to contain post-2024 information, thereby preventing models from relying solely on parametric knowledge; (ii) RAPO (Reward-Aware Policy Optimization), a zero-overhead extension of GRPO that stabilizes multi-turn reinforcement learning through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers to address the training instabilities that arise when tool interactions induce distribution shifts; and (iii) a steerable step-level reward function that classifies each tool call by cognitive behavior (exploration vs. verification) and marginal utility, enabling explicit control over search trajectory characteristics while mitigating reward hacking that typically causes models to spam redundant tool calls. The second component, Fathom-Synthesizer-4B, converts multi-turn search traces into structured, citation-dense reports using DeepResearch-SFT, a synthetic dataset distilled from GPT-5 that provides supervision through question decomposition, section-level evidence mapping, and insight generation strategies following a plan-then-write protocol. Evaluated across nine benchmarks including SimpleQA, FRAMES, WebWalker, and DeepResearch-Bench, the system achieves state-of-the-art performance among open-weight models, reliably extending tool-calling beyond 20 calls when necessary and rivaling closed-source systems like Claude, GPT-4o, and Perplexity DeepResearch in comprehensive research tasks.
📚 https://arxiv.org/abs/2509.24107
👨🏽💻 https://github.com/FractalAIResearchLabs/Fathom-DeepResearch
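To illustrate the flavor of a steerable step-level reward, the toy function below rewards novel, useful tool calls and penalizes redundant ones. The exploration/verification taxonomy, weights, and redundancy test here are my assumptions, not the paper’s reward function.

```python
from typing import Set

TRUSTED_SOURCES = {"https://en.wikipedia.org/wiki/BM25"}

def step_reward(url: str, seen_urls: Set[str], new_evidence: bool) -> float:
    """Classify a tool call as exploration or verification and pay a
    marginal-utility bonus, while penalizing redundant calls."""
    if url in seen_urls:
        return -0.5                                  # redundant call: discourage tool spam
    kind = "verification" if url in TRUSTED_SOURCES else "exploration"
    base = 0.2 if kind == "verification" else 0.3    # small reward for the call itself
    return base + (0.5 if new_evidence else 0.0)     # bonus only if evidence is new

seen: Set[str] = set()
trajectory = [  # (url, did this call surface new evidence?)
    ("https://en.wikipedia.org/wiki/BM25", True),
    ("https://example.com/blog", False),
    ("https://en.wikipedia.org/wiki/BM25", False),   # repeated call
]
total = 0.0
for url, useful in trajectory:
    total += step_reward(url, seen, useful)
    seen.add(url)
print(f"trajectory reward: {total:.2f}")   # 0.7 + 0.3 - 0.5 = 0.50
```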
[9] DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval via Adaptive Patch-Level Embedding Pruning
This paper from Yan et al. introduces DocPruner, a storage-efficient framework that addresses the prohibitive storage overhead of multi-vector Visual Document Retrieval (VDR) systems by implementing adaptive patch-level embedding pruning. While state-of-the-art VDR models like ColPali, ColNomic, and Jina Embeddings v4 leverage Large Vision-Language Models (LVLMs) to represent documents as hundreds of patch-level embeddings for fine-grained retrieval, this approach incurs massive storage costs that hinder large-scale deployment. DocPruner extracts attention scores from a global token (typically the [EOS] token) in the final transformer layer to quantify each patch’s importance, then applies document-specific adaptive thresholding based on the mean and standard deviation of these importance scores to determine which patches to prune. This adaptive mechanism automatically adjusts pruning intensity based on each document’s information density, aggressively pruning sparse pages while conservatively handling dense ones. The framework is theoretically grounded, and experiments demonstrate that DocPruner achieves 50-60% storage reduction with negligible performance degradation (often <1% drop in nDCG@5), outperforming both merging-based methods (semantic clustering, 1D/2D pooling) and non-adaptive pruning baselines (random, fixed-ratio, static threshold) across multiple models and multilingual scenarios, with only a 60-66% increase in offline encoding latency.
📚 https://arxiv.org/abs/2509.23883
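The adaptive thresholding step is simple to sketch: compute a document-specific threshold from the mean and standard deviation of the [EOS]-attention importance scores and keep only the patches above it. The exact threshold form (mu + lambda * sigma) is an assumption based on the summary above, not the paper’s precise rule.

```python
import numpy as np

def prune_patches(patch_embs: np.ndarray, attn_scores: np.ndarray,
                  lam: float = 0.0) -> np.ndarray:
    """Adaptive patch pruning: keep patches whose [EOS]-attention
    importance exceeds a document-specific threshold mu + lam * sigma,
    so the cutoff adapts to each page's score distribution.

    patch_embs:  (num_patches, dim) multi-vector page representation
    attn_scores: (num_patches,) attention from the global token
    """
    mu, sigma = attn_scores.mean(), attn_scores.std()
    keep = attn_scores >= mu + lam * sigma
    return patch_embs[keep]

rng = np.random.default_rng(0)
embs, scores = rng.normal(size=(1024, 128)), rng.gamma(2.0, size=1024)
pruned = prune_patches(embs, scores)
print(f"kept {len(pruned)}/{len(embs)} patches "
      f"({100 * (1 - len(pruned) / len(embs)):.0f}% storage saved)")
```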
[10] MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector
This paper from Nguyen et al. introduces MILCO, a multilingual learned sparse retrieval architecture that maps queries and documents from 39 languages into a shared English lexical space through a multilingual connector. The core innovation addresses a critical challenge in sparse retrieval: extending beyond English while avoiding semantic collapse, where models produce uninterpretable representations. MILCO employs a two-stage training regime combining Sparse Alignment Pretraining (SAP), which leverages 594M bitext pairs to ground multilingual inputs in English lexical targets, with contrastive knowledge distillation training. The proposed LexEcho head generates dual-view representations: an English view supporting cross-lingual retrieval through semantic matching, and a source-language view that preserves uncommon entities (particularly from non-Latin scripts) often lost in translation.
📚 https://arxiv.org/abs/2510.00671
👨🏽💻 https://anonymous.4open.science/r/milco-831D
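The dual-view output can be sketched with a SPLADE-style activation (log-saturated ReLU, max-pooled over the sequence) applied to two vocabulary projections, one for the English view and one for the source-language view. Both projections and all shapes below are hypothetical stand-ins, not the LexEcho head’s actual parameterization.

```python
import torch
import torch.nn.functional as F

def splade_activations(logits: torch.Tensor) -> torch.Tensor:
    """SPLADE-style sparse lexical vector: log-saturated ReLU over the
    vocabulary logits, max-pooled over the sequence.

    logits: (seq_len, vocab) token-to-vocabulary projection scores
    """
    return torch.log1p(F.relu(logits)).max(dim=0).values   # (vocab,)

def dual_view_head(hidden: torch.Tensor,
                   english_proj: torch.nn.Linear,
                   source_proj: torch.nn.Linear):
    """Rough sketch in the spirit of LexEcho: one sparse vector over an
    English lexical space, one over the source-language space."""
    return (splade_activations(english_proj(hidden)),
            splade_activations(source_proj(hidden)))

# Tiny illustration: seq_len=12, hidden dim=64, two 1000-word vocabularies
hidden = torch.randn(12, 64)
en_view, src_view = dual_view_head(hidden,
                                   torch.nn.Linear(64, 1000),
                                   torch.nn.Linear(64, 1000))
print(en_view.count_nonzero().item(), src_view.count_nonzero().item())
```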
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to dig deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering. I explore diverse topics such as Natural Language Processing, Large Language Models, and Recommendation Systems, and conduct in-depth analyses drawing on insights from the latest research papers.