Understanding Embedding Scaling in Collaborative Filtering, Domain Knowledge Acquisition for LLMs via RL, and More!
Vol.123 for Sep 22 - Sep 28, 2025
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
Scalable Multimodal Retrieval Through Flexible Late Interaction and Test-Time Budget Control, from Meta
How Interaction Noise Shapes Embedding Scalability in Collaborative Filtering Models, from He et al.
Embedding Domain Expertise in LLMs through Reinforcement Learning from Augmented Generation, from Nie et al.
EmbeddingGemma: Powerful and Lightweight Text Representations, from Google DeepMind
Bringing Context Engineering and Reasoning to Industrial Cascade Ranking Systems, from Shopee
The Role of Vocabularies in Learning Sparse Representations for Ranking, from Naver
Improving Dual Encoders for Multi-Level Document Retrieval, from Google DeepMind
A Production-Ready Generative Framework for Unified Search and Recommendation, from Alibaba
A Unified PyTorch Framework for Large-Scale Sparse-Dense Recommendation Training, from Alibaba
Treating Reranking as Noise Reduction in Multi-Stage Recommender Systems, from Kuaishou
[1] MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
This paper from Meta introduces MetaEmbed, a flexible multimodal retrieval framework that uses learnable “Meta Tokens” appended to input sequences during training. Unlike traditional methods that either compress queries and candidates into single vectors (losing fine-grained information) or use hundreds of token-level embeddings (computationally expensive), MetaEmbed extracts a small number of contextualized representations from these Meta Tokens’ final hidden states. The key innovation is the Matryoshka Multi-Vector Retrieval (MMR) training approach, which organizes information hierarchically across multiple vectors in nested groups, enabling test-time scaling where users can dynamically adjust the number of vectors used for retrieval based on accuracy-efficiency trade-offs. Extensive evaluation on the Massive Multimodal Embedding Benchmark (MMEB) and Visual Document Retrieval Benchmark (ViDoRe) demonstrates that MetaEmbed achieves state-of-the-art performance across various vision-language model architectures (Qwen2.5-VL, PaliGemma, Llama-3.2-Vision) and scales effectively up to 32B parameters.
📚 https://arxiv.org/abs/2509.18095
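To make the budget-controlled late interaction concrete, here is a minimal sketch of MaxSim-style scoring over a nested set of Meta Token embeddings, where a `budget` argument picks how many vectors are used at test time. The shapes, the shared budget for query and candidate, and the function name are illustrative assumptions, not the paper's implementation.

```python
import torch

def late_interaction_score(query_vecs, cand_vecs, budget):
    """MaxSim-style late-interaction score using only the first `budget`
    Meta Token embeddings on each side (illustrative, not the paper's code;
    in practice the query and candidate budgets could differ)."""
    q = torch.nn.functional.normalize(query_vecs[:budget], dim=-1)  # (budget, d)
    c = torch.nn.functional.normalize(cand_vecs[:budget], dim=-1)   # (budget, d)
    sim = q @ c.T                        # pairwise cosine similarities
    return sim.max(dim=1).values.sum()   # best candidate match per query vector

# Hypothetical setup: 16 Meta Token embeddings of dimension 512 per side,
# scored with a small budget of 4 vectors for a cheaper retrieval pass.
query_vecs = torch.randn(16, 512)
cand_vecs = torch.randn(16, 512)
print(late_interaction_score(query_vecs, cand_vecs, budget=4))
```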
[2] Understanding Embedding Scaling in Collaborative Filtering
This paper from He et al. investigates the counterintuitive effects of scaling embedding dimensions in collaborative filtering recommendation systems through comprehensive experiments across 10 datasets and 4 representative models (BPR, NeuMF, LightGCN, and SGL). Contrary to the widely accepted belief that increasing embedding dimensions leads to performance degradation, the authors discover two phenomena: a “double-peak” pattern where performance initially rises, falls, rises again, and finally declines, and a “logarithmic” pattern showing continuous performance improvement with diminishing returns. Through theoretical analysis and empirical validation, they attribute the double-peak phenomenon to noisy interactions in datasets, demonstrating that models like BPR and NeuMF are particularly vulnerable to noise due to their unbounded gradient sensitivity, while graph-based methods like LightGCN and SGL exhibit better noise resistance through mechanisms analogous to Mixup data augmentation and contrastive learning regularization. This research shows that proper embedding scaling can achieve performance improvements of 25-34%, far exceeding gains from elaborate algorithmic modifications.
📚 https://arxiv.org/abs/2509.15709
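For readers who want to see how such scaling curves are produced, the sketch below sweeps the embedding dimension of a minimal BPR matrix-factorization model; the model, the random data, and the dimension grid are generic illustrations rather than the authors' experimental setup.

```python
import torch
import torch.nn as nn

class BPRMF(nn.Module):
    """Minimal BPR matrix factorization, used here only to illustrate sweeping
    the embedding dimension (not the paper's implementation)."""
    def __init__(self, n_users, n_items, dim):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, u, i_pos, i_neg):
        pu = self.user_emb(u)
        score_pos = (pu * self.item_emb(i_pos)).sum(-1)
        score_neg = (pu * self.item_emb(i_neg)).sum(-1)
        # BPR: push observed items above sampled negatives
        return -torch.log(torch.sigmoid(score_pos - score_neg)).mean()

# Hypothetical sweep over embedding sizes; in the paper's setting each model
# would be trained to convergence and Recall/NDCG plotted against dim to see
# the "double-peak" or "logarithmic" shape.
for dim in [16, 64, 256, 1024, 4096]:
    model = BPRMF(n_users=1000, n_items=5000, dim=dim)
    u = torch.randint(0, 1000, (32,))
    i_pos = torch.randint(0, 5000, (32,))
    i_neg = torch.randint(0, 5000, (32,))
    print(dim, float(model(u, i_pos, i_neg)))
```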
[3] Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation
This paper from Nie et al. introduces Reinforcement Learning from Augmented Generation (RLAG), a method for embedding domain-specific knowledge into LLMs to overcome their limitations in specialized tasks due to knowledge gaps and temporal lag in training data. Unlike existing approaches such as Continual Pre-Training (CPT) which treats all tokens equally, or supervised fine-tuning (SFT) which struggles with complex reasoning, RLAG iteratively cycles between sampling generations (both with and without retrieved knowledge snippets) and optimizing the model using three tailored reward functions: knowledge reward, augmented generation reward, and naive generation reward. The method employs a Bradley-Terry preference model with reward clipping to prevent overfitting, effectively guiding the model to internalize domain knowledge while maintaining robust reasoning capabilities. Experimental results across biomedicine (USMLE), law (BarExamQA), astronomy (MMLU), and current events datasets demonstrate that RLAG significantly outperforms baseline approaches, though it requires approximately one order of magnitude more computational resources than traditional methods due to its iterative sampling and optimization process.
📚 https://arxiv.org/abs/2509.20162
👨🏽💻 https://github.com/ChaojunNie/RLAG
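The sketch below illustrates only the general shape of a Bradley-Terry preference loss with reward clipping, using placeholder reward values; the paper's actual knowledge, augmented-generation, and naive-generation rewards are computed from model log-probabilities and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def clipped_preference_loss(reward_aug, reward_naive, clip=5.0):
    """Bradley-Terry-style preference loss with reward clipping, sketching the
    general shape described above (a rough illustration, not RLAG's exact
    formulation)."""
    margin = torch.clamp(reward_aug - reward_naive, -clip, clip)
    return -F.logsigmoid(margin).mean()

# Hypothetical reward values for a batch of three sampled generations.
r_aug = torch.tensor([2.3, 0.7, 4.1])     # reward with retrieved snippets
r_naive = torch.tensor([1.1, 0.9, 0.5])   # reward without retrieval
print(float(clipped_preference_loss(r_aug, r_naive)))
```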
[4] EmbeddingGemma: Powerful and Lightweight Text Representations
This paper from Google DeepMind introduces EmbeddingGemma, a lightweight 308-million-parameter text embedding model based on the Gemma 3 language model family that achieves state-of-the-art performance despite its compact size. The authors developed a training recipe that strategically captures knowledge from larger models through encoder-decoder initialization (adapting Gemma 3 into an encoder-decoder model using the UL2 objective before extracting the encoder), geometric embedding distillation from the powerful Gemini Embedding model, and a spread-out regularizer to improve robustness and expressiveness. The training process includes pre-finetuning on large-scale unsupervised data followed by finetuning on high-quality task-specific datasets, with model souping used to combine checkpoints from different optimized mixtures rather than different hyperparameters. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma ranks first among models under 500M parameters and achieves performance comparable to models nearly twice its size. This advantage persists even when weights are quantized to 4-bit precision or embeddings are truncated to 128 dimensions, making the model particularly suitable for low-latency, high-throughput, and on-device applications.
📚 https://arxiv.org/abs/2509.20354
👨🏽💻 https://ai.google.dev/gemma/docs/embeddinggemma
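A simple way to picture the 128-dimension result is Matryoshka-style truncation: keep the leading coordinates of a full embedding and re-normalize before computing cosine similarity. The sketch below uses random vectors and makes no assumptions about how the model itself is loaded or quantized.

```python
import numpy as np

def truncate_and_normalize(emb, dim=128):
    """Matryoshka-style truncation: keep the leading `dim` coordinates and
    re-normalize so cosine similarity remains well defined. Illustrative only;
    how the embeddings are produced is outside this sketch."""
    out = emb[..., :dim]
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

# Hypothetical full-size embeddings (e.g., 768-d) truncated to 128-d.
full = np.random.randn(2, 768)
small = truncate_and_normalize(full, dim=128)
print(small.shape, float(small[0] @ small[1]))  # cosine similarity at 128-d
```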
[5] OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
This paper from Shopee introduces OnePiece, a unified framework that adapts LLM mechanisms of context engineering and multi-step reasoning to industrial recommendation systems. Unlike previous approaches that merely transplant Transformer architectures with incremental gains, OnePiece addresses the challenge of bringing LLM-style capabilities to ranking systems through three innovations: structured context engineering that unifies interaction history, preference anchors, and situational descriptors into tokenized sequences; block-wise latent reasoning that progressively refines representations across multiple reasoning steps; and progressive multi-task training that leverages natural user feedback chains (click → add-to-cart → order) as staged supervision signals. The authors demonstrate OnePiece’s effectiveness through extensive experiments on Shopee’s billion-scale e-commerce platform, showing superior data efficiency and continued improvement with larger training spans compared to traditional Deep Learning Recommendation Models (DLRMs). In production deployment, OnePiece achieves significant business improvements, including over 2% GMV/UU increase and 2.90% advertising revenue growth while maintaining computational efficiency through optimized hardware utilization.
📚 https://arxiv.org/abs/2509.18091
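As a toy illustration of structured context engineering, the sketch below packs interaction history, preference anchors, and situational descriptors into a single marked-up token sequence; the segment markers and field names are hypothetical, not OnePiece's actual tokenization.

```python
def build_context_tokens(history, anchors, situation):
    """Toy illustration of structured context engineering: unify interaction
    history, preference anchors, and situational descriptors into one token
    sequence with segment markers. Markers and field names are hypothetical."""
    seq = ["[HIST]"] + [f"item_{i}" for i in history]
    seq += ["[ANCHOR]"] + [f"item_{i}" for i in anchors]
    seq += ["[SCENE]"] + [f"{k}={v}" for k, v in situation.items()]
    return seq

print(build_context_tokens(
    history=[101, 202, 303],                  # recent interactions
    anchors=[404],                            # preference anchor items
    situation={"hour": 21, "page": "search"}  # situational descriptors
))
```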
[6] The Role of Vocabularies in Learning Sparse Representations for Ranking
This paper from Naver investigates the role of vocabulary size and initialization in Learned Sparse Retrieval (LSR) models, specifically SPLADE variants, for improving both retrieval effectiveness and computational efficiency. The authors construct BERT models with 100K-sized output vocabularies, one initialized using the ESPLADE pretraining method and another randomly initialized, and compare them against a standard 32K-sized SPLADE model after finetuning on real-world search click logs and applying logit score-based pruning. Their experiments reveal that larger vocabulary sizes alone provide discriminative power for sparse representations by offering greater representational capacity, with the randomly initialized 100K model outperforming the smaller pretrained model in pruned settings, while the pretrained ESPLADE model achieves the best balance of effectiveness and efficiency. The study demonstrates that vocabulary size functions as a hyperparameter defining representational specification rather than merely providing semantic meaning, and that the ranking and FLOPS regularization losses gradually convert Wordpiece vocabularies into latent representations optimized for query-document similarity rather than linguistic accuracy. These findings suggest that vocabularies in LSR serve to configure the representational specification for queries, documents, and their interactions in retrieval engines, opening new avenues for improving sparse retrieval through vocabulary engineering.
📚 https://arxiv.org/abs/2509.16621
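For context, a SPLADE-style model maps each text to a sparse vector over the output vocabulary via a log-saturated ReLU of the MLM logits, while the FLOPS regularizer penalizes dense activations. The sketch below shows these two pieces with random logits and a 100K-sized vocabulary; it is independent of the paper's specific training setup.

```python
import torch

def splade_representation(mlm_logits):
    """SPLADE-style sparse vector: log-saturated ReLU of the MLM logits,
    max-pooled over token positions, giving one weight per vocabulary entry.
    mlm_logits: (seq_len, vocab_size)."""
    return torch.log1p(torch.relu(mlm_logits)).max(dim=0).values

def flops_regularizer(batch_reps):
    """FLOPS regularizer: squared mean activation per vocabulary dimension,
    summed over the vocabulary, which pushes representations toward sparsity."""
    return (batch_reps.mean(dim=0) ** 2).sum()

# Hypothetical batch of two sequences with a 100K-sized output vocabulary.
logits = torch.randn(2, 32, 100_000)  # (batch, seq_len, vocab_size)
reps = torch.stack([splade_representation(l) for l in logits])
print(reps.shape, float(flops_regularizer(reps)))
```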
[7] Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe
This paper from Google DeepMind investigates hierarchical retrieval using dual encoder models, where documents are organized in a hierarchical structure and the goal is to retrieve not just exact matches but also all ancestor documents in the hierarchy. The authors first prove theoretically that dual encoders can solve hierarchical retrieval problems when the embedding dimension scales logarithmically with the number of documents and linearly with hierarchy depth. However, their experiments reveal a “lost-in-the-long-distance” phenomenon where retrieval accuracy degrades significantly for documents that are farther away in the hierarchical structure. To address this limitation, they propose a pretrain-finetune recipe that first trains the model on standard query-document pairs, then finetunes specifically on long-distance pairs. This approach dramatically improves long-distance retrieval performance while maintaining quality on closer documents.
📚 https://arxiv.org/abs/2509.16411
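The pretrain-finetune recipe hinges on having explicit (document, ancestor) pairs at different distances. The sketch below enumerates such pairs by walking up a toy hierarchy; the tree, the distance threshold, and the helper name are assumptions for illustration, not the paper's data pipeline.

```python
def ancestor_pairs(doc_id, parent, max_dist=None):
    """Enumerate (document, ancestor, distance) pairs by walking up a toy
    hierarchy; `parent` maps child -> parent."""
    pairs, dist, node = [], 1, parent.get(doc_id)
    while node is not None and (max_dist is None or dist <= max_dist):
        pairs.append((doc_id, node, dist))
        node, dist = parent.get(node), dist + 1
    return pairs

# Hypothetical 4-level hierarchy: leaf -> section -> chapter -> root.
parent = {"leaf": "section", "section": "chapter", "chapter": "root"}
all_pairs = ancestor_pairs("leaf", parent)
long_distance = [p for p in all_pairs if p[2] >= 2]  # candidate finetuning pairs
print(all_pairs)
print(long_distance)
```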
[8] IntSR: An Integrated Generative Framework for Search and Recommendation
This paper from Alibaba presents IntSR, an integrated generative framework that unifies search and recommendation tasks along with their retrieval and ranking sub-components within a single autoregressive model. The key insight is that search and recommendation differ primarily in how user intent is conveyed (explicitly through queries in search, implicitly through interaction history in recommendation), a difference the authors handle by conditioning the same underlying generative process on different query modalities. The framework incorporates a Query-Driven Block (QDB) with customized attention mechanisms to reduce computational complexity, and introduces a temporal candidate alignment strategy to address the “time-varying vocabulary misalignment” problem where negative sampling inappropriately includes items that weren’t available when user interactions occurred. The system has been successfully deployed in production at Amap, serving hundreds of millions of users and achieving significant improvements in GMV (+3.02%), CTR (+2.76%), and accuracy (+5.13%) across different scenarios.
📚 https://arxiv.org/abs/2509.21179
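The temporal candidate alignment idea can be pictured as filtering negative candidates by item availability at interaction time, as in the hypothetical sketch below; the item launch dates and the helper function are invented for illustration, not taken from IntSR.

```python
from datetime import date

def temporally_aligned_negatives(candidates, item_launch, interaction_time):
    """Keep only items that already existed when the interaction happened,
    avoiding the "time-varying vocabulary misalignment" described above.
    `item_launch` maps item id -> launch date (hypothetical data)."""
    return [i for i in candidates if item_launch[i] <= interaction_time]

item_launch = {"a": date(2024, 1, 1), "b": date(2025, 6, 1), "c": date(2023, 5, 20)}
print(temporally_aligned_negatives(["a", "b", "c"], item_launch, date(2024, 3, 1)))
# -> ['a', 'c']; item "b" launched later and is excluded from negative sampling
```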
[9] RecIS: Sparse to Dense, A Unified Training Framework for Recommendation Models
This paper from Alibaba presents RecIS, a unified sparse-dense training framework for recommendation systems built on PyTorch that addresses the challenge of integrating large-scale sparse embedding tables (handling categorical features like user/item IDs) with dense neural networks (processing embedded features through transformer-like architectures). The authors identify that while TensorFlow has traditionally dominated industrial recommendation systems due to its mature sparse modeling support, PyTorch has become essential for leveraging cutting-edge large language models and multimodal capabilities, creating a critical gap in unified framework support. RecIS bridges this divide by porting sparse components from TensorFlow while maintaining backward compatibility, then optimizing performance through three key strategies: breaking through IO bottlenecks via columnar storage and GPU batching, overcoming memory bandwidth limitations through dynamic GPU-based embeddings and load balancing across multiple GPUs, and leveraging PyTorch’s ecosystem for dense component optimization using techniques like mixed precision training and kernel fusion. Experimental results demonstrate up to 2x performance improvements over existing frameworks, with successful deployment across multiple Alibaba production systems including search ranking and advertising scenarios, achieving significant improvements in click-through rates.
📚 https://arxiv.org/abs/2509.20883
👨🏽💻 https://github.com/alibaba/recis
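To picture the two halves such a framework must unify, the sketch below wires a sparse embedding lookup for categorical IDs into a dense tower using plain PyTorch modules; this is a generic sparse-dense pattern, not the RecIS API.

```python
import torch
import torch.nn as nn

class SparseDenseModel(nn.Module):
    """Generic sparse-dense recommendation model in plain PyTorch (not the
    RecIS API): a sparse embedding lookup for categorical IDs plus a dense
    tower on top."""
    def __init__(self, n_ids=100_000, dim=32):
        super().__init__()
        self.sparse = nn.EmbeddingBag(n_ids, dim, mode="sum")  # sparse half
        self.dense = nn.Sequential(                            # dense half
            nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feature_ids, offsets):
        return self.dense(self.sparse(feature_ids, offsets)).squeeze(-1)

model = SparseDenseModel()
ids = torch.tensor([3, 42, 7, 42])   # flattened categorical feature IDs for 2 samples
offsets = torch.tensor([0, 2])       # where each sample's IDs start
print(model(ids, offsets).shape)     # torch.Size([2])
```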
[10] Robust Denoising Neural Reranker for Recommender Systems
This paper from Kuaishou introduces DNR (Denoising Neural Reranker), a framework that addresses a limitation in multi-stage recommendation systems where retriever scores from the first stage contain valuable but noisy information that current rerankers fail to optimally utilize. The authors theoretically demonstrate that traditional reranking approaches suffer from misalignment between their optimization objectives and actual user feedback likelihood maximization due to irregular cases from retrievers and noisy retriever scores, which creates robustness concerns. To address this, DNR frames reranking as a noise reduction problem and employs an adversarial framework that pairs a denoising reranker with a carefully designed noise generation module, optimizing three key objectives: a denoising objective that reconstructs user feedback from noisy retriever scores, an adversarial retriever score generation objective that improves exploration in the retriever score space by generating hard-to-denoise samples, and a distribution regularization term that aligns generated noisy scores with real retriever score distributions.
📚 https://arxiv.org/abs/2509.18736
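The three objectives can be sketched in a few lines of PyTorch with toy linear modules standing in for the reranker and the noise generator; the shapes, the architectures, and the MSE stand-in for the distribution regularization term are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the reranker and noise generator (hypothetical shapes and
# architectures): each candidate list carries 10 retriever scores.
reranker = nn.Linear(10, 10)    # maps noisy retriever scores -> feedback logits
noise_gen = nn.Linear(10, 10)   # produces score perturbations to be denoised

retriever_scores = torch.randn(4, 10)            # batch of retriever score lists
feedback = torch.randint(0, 2, (4, 10)).float()  # observed user feedback
real_scores = torch.randn(4, 10)                 # reference retriever scores

# 1) Denoising objective: reconstruct user feedback from perturbed scores.
noisy = retriever_scores + noise_gen(retriever_scores)
denoise_loss = F.binary_cross_entropy_with_logits(reranker(noisy), feedback)
# 2) Adversarial objective: the noise generator tries to make denoising hard,
#    so its loss is the negation of the denoising loss.
adv_loss = -denoise_loss
# 3) Distribution regularization: keep generated noisy scores close to real
#    retriever scores (plain MSE used as a simple stand-in here).
reg_loss = F.mse_loss(noisy, real_scores)
print(float(denoise_loss), float(adv_loss), float(reg_loss))
```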
Extras: Datasets
💾 Simplified Longitudinal Retrieval Experiments: A Case Study on Query Expansion and Document Boosting
ir-datasets-longeval is an extension of the ir_datasets framework to support longitudinal retrieval experiments, accompanied by two dynamic datasets: LongEval Sci and LongEval Web. These datasets capture evolving search scenarios through snapshots that include changing document corpora, queries, and relevance judgments over time. By modeling temporal dynamics, they allow researchers to evaluate retrieval methods under conditions closer to real-world information environments.
📝 https://arxiv.org/abs/2509.17440
👨🏽💻 https://github.com/clef-longeval/ir-datasets-longeval
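Usage presumably follows the standard ir_datasets loading pattern, sketched below with a hypothetical snapshot identifier; consult the repository for the dataset IDs the extension actually registers.

```python
import ir_datasets

# The snapshot ID below is hypothetical and shown only to illustrate the usual
# ir_datasets loading pattern; see the ir-datasets-longeval repository for the
# identifiers it actually registers.
dataset = ir_datasets.load("longeval-sci/2024-11")

for query in dataset.queries_iter():     # queries for this snapshot
    print(query.query_id, query.text)    # field names can vary by dataset
    break
for qrel in dataset.qrels_iter():        # relevance judgments for this snapshot
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break
```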
Extras: Benchmarks
⏱️ BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback
BESPOKE is a benchmark for evaluating personalization in search-augmented large language models. It is built from 2,870 real user chat and search history sessions collected over three weeks, contributed by annotators with diverse backgrounds. It provides 150 user-authored queries paired with detailed information needs, along with human-annotated responses that include fine-grained preference scores and explanatory feedback.
📝 https://arxiv.org/abs/2509.21106
👨🏽💻 https://augustinlib.github.io/BESPOKE/
⏱️ FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets
FORGE is a benchmark for forming semantic identifiers (SIDs) in generative retrieval using industrial-scale data. FORGE provides a dataset from Taobao with 14 billion user interactions and multimodal features of 250 million items, addressing limitations of existing public datasets that are smaller and lack multimodal richness.
📝 https://arxiv.org/abs/2509.20904
👨🏽💻 https://github.com/selous123/al_sid
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering domains. I explore diverse topics, such as Natural Language Processing, Large Language Models, Recommendation Systems, etc., and conduct in-depth analyses, drawing insights from the latest research papers.