Why Embedding Models Cannot Scale to All Retrieval Tasks, A Comprehensive Analysis of LLM-based Reranking Methods, and More!
Vol.119 for Aug 25 - Aug 31, 2025
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
Theoretical Limits of Single-Vector Embedding Models in Information Retrieval, from Google DeepMind
Investigating Why Randomly Truncating Text Embeddings Barely Hurts Performance, from Takeshita et al.
Vector Quantization Attention for Ultra-Long User Behavior Modeling in Recommender Systems, from Kuaishou
Conditional Two-Tower Models for Bootstrapping User-to-Item Retrieval Systems, from Pinterest
Computational Scaling Laws for Zero-Shot Information Retrieval with Decoder Models, from Databricks
A Comprehensive Analysis of LLM-based Reranking Methods, from the University of Innsbruck
Lazy Decoder-Only Architecture for Industrial-Scale Generative Recommendation, from Kuaishou
Dynamic Multi-Task Learning for Scalable Recommendation Systems, from Kuaishou
Enabling Compact Language Models for Agentic RAG Through Distillation-Guided Reinforcement Learning, from Kotoge et al.
Combining ID and Content Embeddings Without Architectural Complexity, from Albatross AI
[1] On the Theoretical Limitations of Embedding-Based Retrieval
This paper from Google DeepMind demonstrates fundamental theoretical limitations of embedding-based retrieval by connecting communication complexity theory with modern neural information retrieval. The authors prove that the number of top-k document combinations an embedding model can represent is bounded by its embedding dimension, via the sign-rank of the query-relevance matrix; consequently, for any fixed dimension d there exist retrieval tasks that cannot be solved regardless of training data or model improvements. They empirically validate this theory using "free embedding" optimization (directly optimizing vectors on test data) and show that even under these ideal conditions, models hit a critical point beyond which they cannot represent all document combinations. To demonstrate the real-world implications, they create the LIMIT dataset, a simple natural-language task in which queries ask "who likes X?" about people with various preferences, testing all possible top-2 combinations of 46 documents across 1000 queries. Despite the task's apparent simplicity, state-of-the-art embedding models (including Gemini, GritLM, and E5-Mistral) perform poorly, achieving less than 20% recall@100, while alternative architectures such as BM25 (whose sparse representations give a much higher effective dimensionality) and multi-vector models perform significantly better. The work argues that as instruction-following retrieval pushes embedding models toward representing increasingly diverse document combinations, they will inevitably hit these theoretical limits, and suggests the field consider cross-encoders, multi-vector models, or sparse representations for tasks requiring comprehensive combinatorial coverage.
📚 https://arxiv.org/abs/2508.21038
👨🏽💻 https://github.com/google-deepmind/limit
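To make the "free embedding" probe concrete, below is a minimal toy sketch (my own illustration in PyTorch, not the authors' code; the sizes and optimizer settings are made up): query and document vectors of dimension d are optimized directly against a binary relevance matrix, and we then count how many top-2 document sets the resulting geometry can actually realize. Sweeping d in a setup like this exposes the critical dimension below which some combinations stay unreachable.

```python
# Toy "free embedding" probe: optimize vectors directly on the qrels and check
# how many top-2 combinations a d-dimensional embedding space can represent.
import torch

n_docs, n_queries, d = 46, 200, 8             # hypothetical toy sizes
qrels = torch.zeros(n_queries, n_docs)
for q in range(n_queries):                    # each query has exactly two relevant docs
    qrels[q, torch.randperm(n_docs)[:2]] = 1.0

Q = torch.randn(n_queries, d, requires_grad=True)
D = torch.randn(n_docs, d, requires_grad=True)
opt = torch.optim.Adam([Q, D], lr=0.05)

for step in range(2000):
    scores = Q @ D.T
    # softmax (InfoNCE-style) loss pushing relevant docs above irrelevant ones
    loss = -(torch.log_softmax(scores, dim=1) * qrels).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    top2 = (Q @ D.T).topk(2, dim=1).indices
    hits = qrels.gather(1, top2).sum(dim=1)   # relevant docs recovered in the top-2
    print(f"queries solved exactly: {int((hits == 2).sum())}/{n_queries}")
```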
[2] Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks
This paper from Takeshita et al. presents a counterintuitive finding: randomly removing up to 50% of the dimensions from text embeddings degrades performance only minimally (less than 10%) across retrieval and classification tasks, challenging assumptions about how effectively these models use their representational capacity. Through extensive experiments with six state-of-the-art text encoders across 26 downstream tasks, the authors demonstrate that this phenomenon cannot be explained by common theories such as anisotropy, dimensional collapse, or outlier dimensions in the embedding space. Instead, they employ dimension attribution analysis to reveal that text embeddings contain a substantial number of "degrading dimensions", i.e., features that actually harm performance when present, and that these dimensions are distributed uniformly throughout the embedding space. When dimensions are removed at random, both performance-enhancing and performance-degrading features are eliminated, resulting in only marginal overall performance loss. The study extends these findings to LLMs in generative tasks, though with more variable results, and shows that random truncation performs surprisingly well compared to Principal Component Analysis for dimensionality reduction. The authors conclude that current text embedding models use their representational space inefficiently, suggesting potential for architectural improvements that reduce these degrading dimensions while maintaining or improving performance.
📚 https://arxiv.org/abs/2508.17744
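The operation under study is easy to picture; here is a minimal sketch (random vectors stand in for real encoder outputs, so it only shows the mechanics, not the paper's measured effect): drop the same random half of the dimensions from queries and documents, then retrieve by cosine similarity as usual.

```python
# Random 50% dimension truncation followed by cosine-similarity retrieval.
import numpy as np

rng = np.random.default_rng(0)
d = 768
doc_emb = rng.standard_normal((1000, d))      # placeholder document embeddings
query_emb = rng.standard_normal((32, d))      # placeholder query embeddings
keep = rng.permutation(d)[: d // 2]           # keep a random half of the dimensions

def cosine_topk(q, docs, k=10):
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return np.argsort(-q @ docs.T, axis=1)[:, :k]

full = cosine_topk(query_emb, doc_emb)
truncated = cosine_topk(query_emb[:, keep], doc_emb[:, keep])
overlap = np.mean([len(set(a) & set(b)) / 10 for a, b in zip(full, truncated)])
print(f"mean top-10 overlap after dropping 50% of dims: {overlap:.2f}")
```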
[3] VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling
This paper from Kuaishou presents VQL (Vector Quantization attention for ultra-Long sequences), a framework designed to handle ultra-long user behavior sequences in recommender systems while balancing computational efficiency with predictive accuracy. The core innovation lies in applying vector quantization only to attention keys while preserving original values, which enables offline precomputation of key-value aggregates into sequence-length-independent caches, eliminating the typical O(L) online inference cost. The authors prove that the resulting attention-weight error is independent of sequence length L and directly supervised by codebook loss, providing theoretical guarantees for the approach. VQL incorporates grouped vector quantization (inspired by grouped query attention) to enhance representational capacity without increasing cache size, and supports context-aware features through separable temporal kernels that maintain cache compatibility. The framework offers flexible caching strategies (light, medium, heavy) to accommodate different deployment constraints, making it practically viable for industrial recommender systems where ultra-long sequences (up to 5000 items) are common but direct modeling is computationally prohibitive.
📚 https://arxiv.org/abs/2508.17125
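The core trick is easiest to see in code. Below is my own heavily simplified sketch (not the released implementation; the codebook is random rather than learned, and grouped quantization and the context-aware temporal kernels are omitted): keys are replaced by their nearest codeword, so per-codeword value sums can be cached offline and online attention touches only the C codewords instead of the L behavior items.

```python
# Quantized-key attention with a sequence-length-independent cache.
import numpy as np

rng = np.random.default_rng(0)
L, d, C = 5000, 64, 256                        # behaviors, dim, codebook size

keys = rng.standard_normal((L, d))
values = rng.standard_normal((L, d))
codebook = rng.standard_normal((C, d))         # learned end-to-end in the paper

# ---- offline: assign each key to its nearest codeword, cache per-code aggregates ----
dists = -2 * keys @ codebook.T + (codebook ** 2).sum(1)   # distances up to a per-key constant
codes = np.argmin(dists, axis=1)
value_sum = np.zeros((C, d)); count = np.zeros(C)
np.add.at(value_sum, codes, values)            # sum of values assigned to each codeword
np.add.at(count, codes, 1.0)                   # number of items per codeword

# ---- online: attention over the cache, cost independent of L ----
def attend(query):
    logits = codebook @ query / np.sqrt(d)     # one score per codeword
    w = np.exp(logits - logits.max()) * count  # each codeword stands for `count` items
    w /= w.sum()
    # items sharing a codeword get equal weight, so the output is
    # (w_c / count_c) * value_sum_c summed over codewords
    return (w[:, None] / np.maximum(count, 1)[:, None] * value_sum).sum(0)

print(attend(rng.standard_normal(d)).shape)    # (64,) without touching the 5000-item sequence
```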
[4] Bootstrapping Conditional Retrieval for User-to-Item Recommendations
This paper from Pinterest introduces a method for conditional retrieval in recommendation systems, where the goal is to retrieve items that are both personalized to the user and relevant to a specific condition (such as a topic). The authors address the challenge of bootstrapping new conditional retrieval use cases when little training data exists for specific conditions by modifying the standard two-tower architecture. Their approach includes a condition extraction module that generates artificial conditions from item metadata and a conditional user tower that incorporates condition embeddings to enable feature interactions between user preferences and conditions. Using the same generic user-item engagement data as standard two-tower models, they train the system with contrastive learning via sampled softmax. The method was deployed at Pinterest for topic-based notification feeds, delivering a +0.26% lift in weekly active users. Online A/B experiments showed that the conditional retrieval model significantly outperformed both non-personalized indexing and standard learned retrieval with post-filtering, particularly in condition relevance (an 82.8% vs. 20.3% topic matching rate) and user engagement metrics.
📚 https://arxiv.org/abs/2508.16793
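For intuition, here is a minimal sketch of the conditional-user-tower idea (hypothetical layer sizes and features; in-batch softmax stands in for the paper's sampled softmax): the user tower consumes a topic embedding alongside user features, so user-condition interactions are learned before retrieval rather than applied as a post-filter.

```python
# Conditional two-tower retrieval: the condition enters the user tower.
import torch, torch.nn as nn, torch.nn.functional as F

class ConditionalTwoTower(nn.Module):
    def __init__(self, n_topics=1000, user_dim=128, item_dim=128, d=64):
        super().__init__()
        self.topic_emb = nn.Embedding(n_topics, 32)                 # condition embedding
        self.user_tower = nn.Sequential(nn.Linear(user_dim + 32, 256), nn.ReLU(),
                                        nn.Linear(256, d))
        self.item_tower = nn.Sequential(nn.Linear(item_dim, 256), nn.ReLU(),
                                        nn.Linear(256, d))

    def forward(self, user_feats, topic_ids, item_feats):
        u = self.user_tower(torch.cat([user_feats, self.topic_emb(topic_ids)], dim=-1))
        v = self.item_tower(item_feats)
        return F.normalize(u, dim=-1), F.normalize(v, dim=-1)

model = ConditionalTwoTower()
user_feats, item_feats = torch.randn(32, 128), torch.randn(32, 128)
topic_ids = torch.randint(0, 1000, (32,))       # conditions extracted from item metadata
u, v = model(user_feats, topic_ids, item_feats)
loss = F.cross_entropy(u @ v.T / 0.07, torch.arange(32))   # in-batch contrastive loss
print(loss.item())
```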
[5] Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs
This paper from Databricks investigates how retrieval performance in LLMs scales with pretraining computational resources (FLOPs). The researchers evaluated MPT decoder models ranging from 125 million to 7 billion parameters, trained on datasets from 1 billion to over 2 trillion tokens, then fine-tuned them minimally on MS MARCO and tested on the BEIR retrieval benchmark. Their key findings demonstrate that retrieval performance predictably scales with model size, training duration, and total FLOPs, following similar patterns to other LLM capabilities. Notably, smaller models trained on more data can match the retrieval performance of larger models trained on fewer tokens up to a performance ceiling, suggesting that isoFLOP curves overlap significantly across different model sizes. The study also reveals a strong correlation between in-context learning (ICL) scores and retrieval performance across tasks, indicating that models with better ICL capabilities tend to excel at retrieval. While the authors acknowledge their results represent lower bounds on performance due to simplified fine-tuning approaches (using only MS MARCO with 128-token sequences), the research provides important evidence for why modern 1B-8B parameter decoder models serve as strong foundations for embedding systems, potentially explaining the recent trend away from BERT-style encoders toward larger decoder-based retrieval models in the field.
📚 https://arxiv.org/abs/2508.17400
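Scaling trends of this kind are typically summarized with a simple log-linear fit of quality against compute; the sketch below shows the idea with made-up numbers (the actual values come from MPT decoders of 125M-7B parameters, briefly fine-tuned on MS MARCO and evaluated on BEIR).

```python
# Illustrative scaling fit: retrieval quality vs. log10(pretraining FLOPs).
import numpy as np

flops = np.array([1e19, 1e20, 1e21, 1e22, 1e23])   # hypothetical pretraining budgets
ndcg  = np.array([0.28, 0.33, 0.38, 0.42, 0.45])   # hypothetical BEIR nDCG@10

slope, intercept = np.polyfit(np.log10(flops), ndcg, deg=1)
print(f"~{slope:.3f} nDCG@10 gained per decade of pretraining FLOPs")
print(f"extrapolated nDCG@10 at 1e24 FLOPs: {slope * 24 + intercept:.3f}")
```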
[6] How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models
This comprehensive empirical study from the University of Innsbruck evaluates 22 state-of-the-art reranking methods across 40 variants, systematically comparing pointwise, pairwise, and listwise approaches on both established benchmarks (TREC DL19/20, BEIR) and a new dataset called FutureQueryEval, which contains queries unseen by LLMs until May 2025. The researchers find that while LLM-based rerankers perform strongly on familiar queries from standard benchmarks, their generalization to truly novel queries reveals significant limitations, with performance drops of 5-15% across all method categories. Notably, listwise methods like Zephyr-7B and RankGPT show the smallest degradation on novel content (8% average drop) compared to pointwise (12%) and pairwise (15%) approaches, suggesting that inter-document modeling provides better robustness. The study also shows that lightweight models fine-tuned on information retrieval data, such as MonoT5-3B and FlashRank variants, often achieve efficiency-accuracy trade-offs comparable to those of much larger LLMs, while specialized domains remain challenging for all approaches.
📚 https://arxiv.org/abs/2508.16757
👨🏽💻 https://github.com/DataScienceUIBK/llm-reranking-generalization-study
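To make the method families concrete, here is a rough sketch of pointwise vs. listwise reranking (prompts are illustrative, not the exact templates of the benchmarked systems; `llm` is a placeholder for any text-generation call). Pairwise methods, the third family, instead ask the model to compare two documents at a time and aggregate the preferences.

```python
# Pointwise vs. listwise LLM reranking, with a placeholder `llm(prompt) -> str` callable.
def pointwise_rerank(llm, query, docs):
    # score each document independently, then sort by score
    scores = []
    for doc in docs:
        prompt = (f"Query: {query}\nDocument: {doc}\n"
                  "On a scale of 0-3, how relevant is the document? Answer with a number only.")
        scores.append(float(llm(prompt)))
    return [d for _, d in sorted(zip(scores, docs), key=lambda x: -x[0])]

def listwise_rerank(llm, query, docs):
    # show all candidates at once so the model can reason across documents
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    prompt = (f"Query: {query}\nDocuments:\n{numbered}\n"
              "Rank the documents from most to least relevant. "
              "Answer with identifiers only, e.g. [2] > [1] > [3].")
    order = [int(tok.strip().strip("[]")) - 1 for tok in llm(prompt).split(">")]
    return [docs[i] for i in order]
```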
[7] OneRec-V2 Technical Report
This paper from Kuaishou presents OneRec-V2, a significant evolution of Kuaishou's industrial-scale generative recommendation system, addressing two critical limitations of its predecessor through architectural and training innovations. The paper introduces a "Lazy Decoder-Only Architecture" that eliminates the computational bottleneck of encoder-decoder designs by removing the encoder component and simplifying cross-attention mechanisms, achieving a 94% reduction in computational requirements while enabling scaling to 8B parameters. This architectural change addresses the inefficiency where 97.66% of resources in OneRec-V1 were consumed by context encoding rather than the actual recommendation generation process. Additionally, OneRec-V2 implements preference alignment using real-world user feedback signals rather than relying solely on reward models, incorporating duration-aware reward shaping to mitigate video length bias and a novel reinforcement learning method called GBPO (Gradient-Bounded Policy Optimization) for stable training. The system demonstrates substantial improvements in extensive A/B testing on Kuaishou platforms serving 400 million daily active users.
📚 https://arxiv.org/abs/2508.20900
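My reading of the "lazy" design, sketched below as a single decoder block (hypothetical sizes, not the production architecture): user-context features are never re-encoded by an encoder stack; they enter only as static keys/values for cross-attention, while self-attention runs over the short sequence of item tokens being generated.

```python
# A lazy decoder block: causal self-attention over item tokens, cross-attention to raw context.
import torch, torch.nn as nn

class LazyDecoderBlock(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, tokens, context):
        t = tokens.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        x = tokens + self.self_attn(self.n1(tokens), self.n1(tokens), self.n1(tokens),
                                    attn_mask=causal)[0]
        # context is consumed as-is: no encoder spends compute re-encoding it
        x = x + self.cross_attn(self.n2(x), context, context)[0]
        return x + self.ffn(self.n3(x))

block = LazyDecoderBlock()
out = block(torch.randn(2, 16, 256), torch.randn(2, 512, 256))
print(out.shape)   # torch.Size([2, 16, 256])
```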
[8] MPFormer: Adaptive Framework for Industrial Multi-Task Personalized Sequential Retriever
This paper from Kuaishou presents MPFormer, which addresses the critical misalignment between multi-objective optimization in the ranking stage and single-objective modeling in the retrieval stage of industrial recommendation systems. The paper proposes a dynamic multi-task Transformer framework with three key innovations: an objective-conditioned transformer that jointly encodes user behavior sequences with multi-task semantics through learnable attention modulation, personalized target weights for dynamic retrieval adjustment, and the integration of user personalization into both token representations and the transformer architecture. Deployed on Kuaishou's short-video platform, MPFormer demonstrates substantial improvements over traditional parallel multi-path approaches, achieving a 63% reduction in multi-task training overhead, a 21.8% improvement in multi-objective exposure rates, and significant gains in user engagement metrics, including total watch time (+0.426%) and app usage time (+0.195%). The framework maintains 99.99% service availability under peak loads while reducing GPU memory consumption by 31% compared to multi-model baselines.
📚 https://arxiv.org/abs/2508.20400
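A speculative sketch of the objective-conditioned retrieval idea (my own reading; the paper's exact attention-modulation and weighting mechanisms may differ): learnable per-objective tokens are encoded together with the behavior sequence, and the resulting per-objective user vectors are mixed with personalized weights into a single query vector for ANN retrieval.

```python
# Objective-conditioned user encoding with personalized objective weights.
import torch, torch.nn as nn

class ObjectiveConditionedRetriever(nn.Module):
    def __init__(self, n_objectives=4, d=128, heads=4):
        super().__init__()
        self.obj_tokens = nn.Parameter(torch.randn(n_objectives, d))   # one token per objective
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.weight_head = nn.Linear(d, n_objectives)                  # personalized objective weights

    def forward(self, behavior_seq, user_profile):
        b = behavior_seq.size(0)
        obj = self.obj_tokens.unsqueeze(0).expand(b, -1, -1)
        enc = self.encoder(torch.cat([obj, behavior_seq], dim=1))
        obj_vecs = enc[:, : self.obj_tokens.size(0)]                   # per-objective user vectors
        w = torch.softmax(self.weight_head(user_profile), dim=-1)
        return (w.unsqueeze(-1) * obj_vecs).sum(dim=1)                 # one query vector for retrieval

model = ObjectiveConditionedRetriever()
q = model(torch.randn(8, 50, 128), torch.randn(8, 128))
print(q.shape)   # torch.Size([8, 128])
```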
[9] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
This paper from Kotoge et al. introduces Distillation-Guided Policy Optimization (DGPO), a reinforcement learning framework designed to enable compact language models (0.5B parameters) to perform sophisticated agentic RAG tasks that typically require much larger models. The authors identify that standard RL approaches fail with compact models due to sparse rewards from poor-quality student-generated outputs, while traditional knowledge distillation suffers from exposure bias. DGPO addresses these limitations through a two-phase approach: cold-start initialization using knowledge distillation from high-quality teacher demonstrations, followed by reinforcement learning with selective KL penalties that guide incorrect student predictions toward teacher behavior while allowing autonomous exploration for correct predictions. The researchers also introduce Agentic RAG Capabilities (ARC), a fine-grained evaluation framework that decomposes agentic behavior into three dimensions: reasoning (thinking), search coordination (query rewriting), and response synthesis (source referencing). Experiments across seven QA benchmarks demonstrate that DGPO enables a 0.5B student model to achieve a 55x performance improvement, reaching 0.329 average performance compared to the 3B teacher's 0.353, and even surpassing the teacher on several datasets.
📚 https://arxiv.org/abs/2508.20324
👨🏽💻 https://anonymous.4open.science/r/DGPO/
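The selective-KL idea can be written down compactly; below is an illustrative loss (not the released implementation; the advantage estimate and correctness signal are placeholders): rollouts the compact student answers incorrectly are pulled toward the teacher's distribution, while correct rollouts keep only the policy-gradient term so the student can explore on its own.

```python
# DGPO-style loss: policy gradient plus a KL-to-teacher penalty on incorrect rollouts only.
import torch

def dgpo_style_loss(student_lp, teacher_lp, advantages, is_correct, beta=0.1):
    """student_lp, teacher_lp: (batch, seq) log-probs of the sampled tokens;
    advantages: (batch,) per-rollout advantage; is_correct: (batch,) bool."""
    pg = -(advantages.unsqueeze(1) * student_lp).mean()            # REINFORCE-style term
    kl_proxy = student_lp - teacher_lp                             # per-token KL estimate under student samples
    selective_kl = (kl_proxy * (~is_correct).float().unsqueeze(1)).mean()
    return pg + beta * selective_kl

student_lp, teacher_lp = -torch.rand(4, 32), -torch.rand(4, 32)    # placeholder log-probs
loss = dgpo_style_loss(student_lp, teacher_lp, torch.randn(4),
                       torch.tensor([True, False, False, True]))
print(loss.item())
```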
[10] DenseRec: Revisiting Dense Content Embeddings for Sequential Transformer-based Recommendation
This paper from Albatross AI introduces DenseRec, a method for integrating dense content embeddings into transformer-based sequential recommenders like SASRec to address the item cold-start problem. While previous attempts to directly use pre-trained content embeddings in place of ID embeddings have consistently underperformed, DenseRec employs a dual-path architecture that maintains both traditional ID embeddings and content embeddings projected into the ID space through a learned linear transformation. During training, the model probabilistically selects between these pathways (controlled by parameter p_dense), forcing it to learn both meaningful ID representations and an effective projection mechanism. At inference, the model can seamlessly handle known items using ID embeddings and cold-start items using projected content embeddings without requiring model retraining or complex infrastructure changes. Experiments on three Amazon Reviews 2023 datasets demonstrate that DenseRec consistently outperforms ID-only SASRec baselines by 11-34% in Hit Rate@100, with improvements primarily stemming from better sequence representations when cold-start items are present. The method requires minimal architectural modifications, introduces only one hyperparameter, and proves robust across different p_dense values, making it practically deployable in real-world recommendation systems with dynamic item catalogs.
📚 https://arxiv.org/abs/2508.18442
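The dual-path mechanism is simple enough to sketch directly (illustrative dimensions; the real model plugs this into SASRec's input layer): during training each item is represented either by its ID embedding or by a learned linear projection of its frozen content embedding, chosen with probability p_dense, and at inference cold-start items fall back to the projection.

```python
# DenseRec-style dual-path item embedding with probabilistic path selection.
import torch, torch.nn as nn

class DualPathItemEmbedding(nn.Module):
    def __init__(self, n_items, content_dim=768, d=64, p_dense=0.5):
        super().__init__()
        self.id_emb = nn.Embedding(n_items, d)
        self.proj = nn.Linear(content_dim, d)      # maps content embeddings into the ID space
        self.p_dense = p_dense

    def forward(self, item_ids, content_emb, is_cold_start=None):
        id_path = self.id_emb(item_ids)
        dense_path = self.proj(content_emb)
        if self.training:                          # randomly pick a path per item
            use_dense = torch.rand(item_ids.shape, device=item_ids.device) < self.p_dense
        else:                                      # unseen items have no trained ID embedding
            use_dense = is_cold_start
        return torch.where(use_dense.unsqueeze(-1), dense_path, id_path)

emb = DualPathItemEmbedding(n_items=10_000)
ids = torch.randint(0, 10_000, (8, 20))            # a batch of item-ID sequences
content = torch.randn(8, 20, 768)                  # frozen pre-trained content embeddings
print(emb(ids, content).shape)                     # torch.Size([8, 20, 64])
```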
Extras: Datasets
💾 MSRS: Evaluating Multi-Source Retrieval-Augmented Generation
MSRS is a benchmark framework for evaluating multi-source RAG systems. Unlike prior evaluations that assume all necessary information is contained in a single source or require only short answers, MSRS targets scenarios where information must be retrieved and synthesized across multiple documents. Using this framework, the authors construct two benchmarks: MSRS-STORY, derived from narrative story datasets, and MSRS-MEET, based on meeting transcripts. Both require retrieval from large collections and the generation of long-form responses. The benchmarks include gold-standard queries, documents, and summaries, enabling systematic evaluation of both retrieval and generation.
📝 https://arxiv.org/abs/2508.20867
👨🏽💻 https://github.com/yale-nlp/MSRS
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering domains. I explore diverse topics, such as Natural Language Processing, Large Language Models, Recommendation Systems, etc., and conduct in-depth analyses, drawing insights from the latest research papers.