Solving the RAG Inference Bottleneck, An End-to-End Generative Framework for E-commerce Search, and More!
Vol.120 for Sep 01 - Sep 07, 2025
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
An End-to-End Generative Framework for Industrial E-commerce Search, from Kuaishou
A Model-Agnostic System for Customizable Deep Research Agent Deployment, from NVIDIA
A Family of ModernBERT-Based Embedding Models for Enterprise Information Retrieval, from IBM
Leveraging Uncertainty in Unlabeled Data for Improved Recommendations, from Zhao et al.
Multi-Task Learning and Data Synthesis for State-of-the-Art Text Embeddings, from Kingsoft AI
Fine-Grained Scoring and Reinforcement Learning for Efficient Text Reranking, from Alibaba
Addressing Algorithm Adaptation Bias in Recommender System Evaluation, from Roblox
Training LLMs to be Better Text Embedders through Bidirectional Reconstruction, from Su et al.
Solving the RAG Inference Bottleneck with Adaptive Chunk Embeddings, from Meta
Leveraging Code Generation Models for High-Performance Code Embeddings, from Jina AI
[1] OneSearch: A Preliminary Exploration of the Unified End-to-End Generative Framework for E-commerce Search
This paper from Kuaishou introduces OneSearch, the first industrially deployed end-to-end generative framework for e-commerce search. Their framework incorporates three key innovations: a Keyword-enhanced Hierarchical Quantization Encoding (KHQE) module that preserves hierarchical semantics while maintaining query-item relevance constraints, a multi-view user behavior sequence injection strategy that constructs behavior-driven user IDs and incorporates both short-term and long-term user preferences, and a Preference-Aware Reward System (PARS) featuring multi-stage supervised fine-tuning and adaptive reward-weighted ranking. The system addresses the fragmented computation and conflicting optimization objectives inherent in traditional Multi-stage Cascading Architecture (MCA) systems by unifying recall, pre-ranking, and ranking into a single generative model, and has been deployed across multiple search scenarios serving millions of users with tens of millions of daily page views.
📚 https://arxiv.org/abs/2509.03236
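To make the idea of hierarchical quantization in OneSearch more concrete, here is a minimal sketch of plain residual quantization, where an item vector becomes a coarse-to-fine tuple of codeword indices. This is not the paper's KHQE module: the keyword enhancement and relevance constraints are omitted, and the codebooks, values, and function names are hypothetical toy choices.

```python
def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Encode vec as one codeword index per level, from coarse to fine."""
    codes = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword; the next level encodes what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Hand-written toy codebooks (hypothetical, not learned from data).
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0]],               # level 1: coarse grouping
    [[0.2, 0.0], [0.0, 0.2], [-0.2, 0.0]],  # level 2: refinement of the residual
]

for item in ([1.25, 0.0], [0.9, 0.0], [0.1, 1.2]):
    codes, _ = residual_quantize(item, CODEBOOKS)
    print(item, "->", codes)
```

The first two toy items share the level-1 code and differ only at level 2, which is the property that lets a generative model decode an item ID prefix by prefix.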
[2] Universal Deep Research: Bring Your Own Model and Strategy
This paper from NVIDIA introduces Universal Deep Research (UDR), a generalist agentic system that allows users to define and customize their own research strategies rather than being constrained to fixed, hard-coded approaches. Unlike current deep research tools that employ rigid research strategies with limited user control beyond the research prompt and rely on single model choices, UDR wraps around any language model without requiring additional training and converts user-defined natural language research strategies into executable code within a controlled framework. The system operates through a two-phase process: first converting user-specified research strategies into executable code with enforced structure to prevent shortcuts and ensure step-by-step compliance, then executing this code in an isolated environment while maintaining state through named variables rather than growing context windows. UDR achieves high computational efficiency by separating control logic from language model reasoning, delegating orchestration to CPU-executable code while limiting LLM calls to focused reasoning tasks, and includes security measures through sandboxed execution environments.
📚 https://arxiv.org/abs/2509.00244
[3] Granite Embedding R2 Models
This paper from IBM introduces Granite Embedding R2 models, a family of English encoder-based embedding models designed for enterprise-scale dense retrieval applications. The release includes three models: granite-embedding-english-r2 (149M parameters with 768-dimensional embeddings), granite-embedding-small-english-r2 (47M parameters with 384-dimensional embeddings), and granite-embedding-reranker-english-r2 (149M parameters for cross-encoder reranking). Built on the ModernBERT architecture with 8192-token context length support, these models demonstrate substantial improvements over their first-generation predecessors, achieving 19-44% faster processing speeds while maintaining superior accuracy across diverse retrieval domains, including text, code, long-document search, multi-turn conversational data, and tabular retrieval. The models were trained exclusively on enterprise-appropriate data, utilizing a multi-stage training pipeline that includes retrieval-oriented pretraining, tabular pretraining, contrastive finetuning, and knowledge distillation from a Mistral-7B teacher model. Experimental evaluation shows the models outperform comparable alternatives on standard benchmarks like BEIR, MTEB-v2, COIR, MLDR, and LongEmbed while providing enterprise-ready licensing under Apache 2.0, enabling unrestricted research and commercial deployment for mission-critical applications.
📚 https://arxiv.org/abs/2508.21085
👨🏽💻 https://huggingface.co/collections/ibm-granite/granite-docling-682b8c766a565487bcb3ca00
[4] Unlocking the Unlabeled Data: Enhancing Recommendations with Neutral Samples and Uncertainty
This paper from Zhao et al. introduces PNNP (Positive-Neutral-Negative Learning Paradigm), a collaborative filtering approach that addresses the underutilization of vast unlabeled data in recommender systems by introducing a third "neutral" class alongside the traditional positive and negative categories. Rather than treating all unlabeled data as negative samples, PNNP recognizes that some items reflect complex user attitudes that are neither clearly positive nor clearly negative, and models them with Elliptical Gaussian Distributions that capture inherent uncertainty instead of fixed-point representations. The authors develop a framework including semi-supervised learning with user-aware attention mechanisms to classify unlabeled data, uncertainty modeling at both user personality and item definition levels, and a two-step centroid ranking approach with adaptive margin control to handle set-level triple-wise ranking relationships (positive > neutral > negative). Experiments on four real-world datasets show that integrating PNNP with various collaborative filtering models (BPR-MF, LightGCN, NGCF, SGL) yields consistent performance improvements, with even simple matrix factorization achieving performance comparable to sophisticated graph neural networks.
📚 https://dl.acm.org/doi/10.1145/3766070
👨🏽💻 https://github.com/Asa9aoTK/PNN-RecBole
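The set-level ordering in PNNP (positive > neutral > negative) can be illustrated with a small hinge loss over set means. This is only a sketch under simplifying assumptions: it uses fixed margins and scalar mean scores as the "centroids", whereas the paper describes adaptive margins and Gaussian representations. The function and parameter names are hypothetical.

```python
def mean(scores):
    return sum(scores) / len(scores)


def hinge(x):
    return max(0.0, x)


def triple_rank_loss(pos, neu, neg, margin_pn=0.5, margin_nn=0.5):
    """Set-level hinge loss that is zero once
    mean(pos) >= mean(neu) + margin_pn and mean(neu) >= mean(neg) + margin_nn."""
    p, u, n = mean(pos), mean(neu), mean(neg)
    return hinge(margin_pn - (p - u)) + hinge(margin_nn - (u - n))


# Correctly ordered sets incur no loss.
print(triple_rank_loss([2.0, 3.0], [1.0, 1.0], [0.0, -1.0]))  # 0.0
# Reversed ordering is penalized on both comparisons.
print(triple_rank_loss([0.0], [1.0], [2.0]))                  # 3.0
```

Treating neutral items as their own band means they push against both neighbors instead of being lumped in with the negatives.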
[5] QZhou-Embedding Technical Report
This paper from Kingsoft AI presents QZhou-Embedding, a text embedding model built on Qwen2.5-7B-Instruct that achieved first place on both MTEB and CMTEB benchmarks as of August 2025. The researchers developed a unified multi-task learning framework that transforms diverse datasets (retrieval, natural language inference, and classification) into compatible training formats using specialized data processing and loss functions, while implementing a two-stage training approach that begins with retrieval-focused pretraining followed by multi-task fine-tuning. To enhance data quality and diversity, they employed LLM-powered data synthesis techniques including paraphrasing for structural variety, augmentation for semantic diversity, and hard negative generation to improve model discriminability. The model incorporates several technical innovations: bidirectional attention modification to overcome decoder-only architecture limitations, a data grouping strategy that samples from single datasets per batch to increase training difficulty, and controlled sampling ratios to maintain retrieval performance while adding non-retrieval capabilities.
📚 https://arxiv.org/abs/2508.21632
👨🏽💻 https://huggingface.co/Kingsoft-LLM/QZhou-Embedding
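The data grouping strategy mentioned for QZhou-Embedding (every batch drawn from a single dataset) can be sketched as a simple batch builder. This is an illustrative assumption about how such grouping might be implemented, not the authors' code; the function name and toy datasets are hypothetical.

```python
import random


def single_dataset_batches(datasets, batch_size, seed=0):
    """Return a shuffled list of (dataset_name, batch) pairs in which
    every batch contains examples from exactly one dataset."""
    rng = random.Random(seed)
    batches = []
    for name, examples in datasets.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batches.append((name, shuffled[start:start + batch_size]))
    # Interleave datasets across training steps while keeping each batch homogeneous.
    rng.shuffle(batches)
    return batches


toy = {
    "retrieval": ["r1", "r2", "r3", "r4", "r5"],
    "nli": ["n1", "n2", "n3"],
}
for name, batch in single_dataset_batches(toy, batch_size=2):
    print(name, batch)
```

With contrastive training, a homogeneous batch means any in-batch negatives come from the same dataset as the positive, which is one plausible reason the grouping makes training harder.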
[6] ERank: Fusing Supervised Fine-Tuning and Reinforcement Learning for Effective and Efficient Text Reranking
This paper from Alibaba introduces ERank, a two-stage training framework that combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to create efficient pointwise text rerankers capable of handling both semantic relevance and reasoning-intensive tasks. The researchers address a limitation in current LLM-based rerankers: binary classification approaches produce poor score discrimination (especially problematic with reasoning LLMs that generate overconfident predictions), while listwise methods achieve better ranking quality but suffer from prohibitive latency due to sequential document processing. ERank's first stage trains the model using generative fine-grained integer scoring (0-10) rather than binary classification, significantly improving relevance discrimination, while the second stage employs Group Relative Policy Optimization (GRPO) with a listwise-derived reward function that enables the pointwise model to learn global ranking awareness without sacrificing efficiency. The authors synthesize high-quality training data using QwQ-32B as a teacher model, applying techniques like paraphrasing, augmentation, and hard negative generation to enhance dataset diversity and quality. Extensive evaluation on BRIGHT, FollowIR, TREC DL, and BEIR benchmarks demonstrates that ERank-4B outperforms larger 7-8B parameter models while maintaining six times faster inference than listwise methods.
📚 https://arxiv.org/abs/2509.00520
👨🏽💻 https://huggingface.co/collections/Alibaba-NLP/erank-68b302be7d1f62b0b7cb19a2
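For readers unfamiliar with GRPO, the core step is normalizing rewards within a group of sampled outputs for the same prompt, so each sample is judged relative to its siblings. The sketch below shows only that normalization. The reward values are made up, the use of population standard deviation is an assumption, and ERank's listwise-derived reward function itself is not reproduced here.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = sum(rewards) / len(rewards)
    variance = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mu) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled scoring outputs on the same query-document pair.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Samples that beat the group average get a positive advantage and the rest a negative one; if every sample earns the same reward, all advantages are zero and the group contributes no learning signal.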
[7] Algorithm Adaptation Bias in Recommendation System Online Experiments
This paper from Roblox identifies a critical but underexplored bias in recommender system A/B testing called "algorithm adaptation bias." The bias arises because new recommendation models tested on only a small fraction of users (typical in A/B tests) cannot trigger the full ecosystem-level dynamics that would emerge under full deployment, such as content virality, creator behavioral adaptation, social spillovers, and sufficient data accumulation. This creates a systematic disadvantage for the treatment variant, which is evaluated in an environment still dominated by the production system's feedback loops and data distributions. The authors provide a mathematical formalization showing how the experimental estimand diverges from the true policy-level causal effect, present empirical evidence from real-world experiments at Roblox where pre-launch A/B tests showed neutral results but post-launch analyses revealed positive impacts, and propose several mitigation strategies including model-data separation, adaptive ramp-up designs, and confirmation analyses.
📚 https://arxiv.org/abs/2509.00199
[8] Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
This paper from Su et al. introduces "Anchor Embeddings," a two-stage training framework for improving LLMs as text embedders by addressing a fundamental mismatch between how the [EOS] token functions during pre-training versus embedding tasks. The authors identify that during standard LLM pre-training, the [EOS] token serves merely as a sequence delimiter without learning to encode semantic information, yet embedding approaches rely on this token's representation to capture text meaning. Their solution involves a first training stage using bidirectional reconstruction tasks, EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), where the [EOS] embedding from queries or documents is used to generate their counterparts, forcing the model to inject semantic alignment into this representation. This is followed by standard contrastive learning in the second stage. Experiments across multiple LLM architectures (LLaMA, Qwen, Mistral) ranging from 1B to 8B parameters demonstrate consistent improvements on the MTEB benchmark.
📚 https://arxiv.org/abs/2509.03020
👨🏽💻 https://github.com/LUMIA-Group/Anchor-Embedding
[9] REFRAG: Rethinking RAG based Decoding
This paper from Meta introduces REFRAG, a decoding framework designed to accelerate inference in RAG systems by exploiting the unique attention patterns found in RAG contexts. The authors observe that RAG contexts typically consist of concatenated passages with low semantic similarity, resulting in block-diagonal attention patterns where most cross-passage attention is close to zero. Instead of processing all retrieved tokens individually, REFRAG compresses chunks of context into lightweight embeddings using a pre-trained encoder, then feeds these compressed representations directly to the decoder alongside the original query tokens. This approach includes a reinforcement learning-based policy that selectively determines which chunks require full token expansion versus compressed representation. The system achieves dramatic performance improvements, delivering up to 30.85× acceleration in time-to-first-token latency while maintaining equivalent perplexity scores across multiple benchmarks including RAG, multi-turn conversations, and document summarization tasks.
📚 https://arxiv.org/abs/2509.01092
👨🏽💻 https://github.com/facebookresearch/refrag [not public as of 09/05]
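A quick back-of-the-envelope sketch shows why chunk compression in REFRAG shortens the decoder input: each compressed chunk occupies one position, while chunks selected for expansion keep all their tokens. The numbers, chunk size, and function names below are hypothetical illustrations, not settings from the paper, and the sketch counts positions only; it does not model the encoder or the selection policy.

```python
def chunk_lengths(context_tokens, chunk_size):
    """Split a context of context_tokens tokens into chunk lengths (last chunk may be short)."""
    full, rest = divmod(context_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def decoder_input_length(context_tokens, query_tokens, chunk_size, expanded=()):
    """Count decoder input positions when each chunk is either compressed to a single
    embedding or, if its index is in `expanded`, kept as raw tokens."""
    expanded = set(expanded)
    lengths = chunk_lengths(context_tokens, chunk_size)
    context_positions = sum(n if i in expanded else 1 for i, n in enumerate(lengths))
    return context_positions + query_tokens


baseline = 2048 + 32                                             # every token fed to the decoder
compressed = decoder_input_length(2048, 32, chunk_size=16)       # 128 chunk embeddings + query
selective = decoder_input_length(2048, 32, 16, expanded={0, 5})  # two chunks kept in full
print(baseline, compressed, selective)  # 2080 160 190
```

Because attention cost grows with sequence length, a much shorter decoder input is where latency savings of this kind would come from; the actual speedups reported in the paper depend on the model and setup.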
[10] Efficient Code Embeddings from Code Generation Models
This paper from Jina AI introduces jina-code-embeddings, a family of specialized code embedding models (0.5B and 1.5B parameters) that leverage autoregressive decoder architectures pre-trained on both text and code to generate high-quality embeddings for code retrieval tasks. The researchers built upon Qwen2.5-Coder backbones and employed last-token pooling rather than traditional mean or CLS pooling methods, while implementing task-specific instruction prefixes for five distinct categories: natural language to code retrieval, technical question answering, code-to-code retrieval, code to natural language retrieval, and code completion retrieval. The models were trained using contrastive learning with InfoNCE loss and Matryoshka representation learning on a diverse dataset combining MTEB code tasks, adapted public datasets, and synthetically generated data from GPT-4o. Both models achieved state-of-the-art performance on the MTEB-CoIR benchmark and other code-related tasks, outperforming larger general-purpose embedding models like Qwen3-Embedding-0.6B and demonstrating competitive results against substantially larger alternatives.
📚 https://arxiv.org/abs/2508.21290
👨🏽💻 https://huggingface.co/collections/jinaai/jina-code-embeddings-68b0fbfbb0d639e515f82acd
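Two of the training ingredients named for jina-code-embeddings, InfoNCE loss and Matryoshka-style embeddings, can be sketched in a few lines of plain Python. This is a generic illustration of both techniques, not the authors' training code: the temperature value, toy vectors, and function names are assumptions.

```python
import math


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def truncate(vec, dim):
    """Matryoshka-style use at inference: keep the first `dim` values, then re-normalize."""
    return normalize(vec[:dim])


def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def info_nce(query, docs, positive_index, temperature=0.05):
    """InfoNCE: negative log-softmax of the positive's similarity among all candidates."""
    logits = [cosine(query, d) / temperature for d in docs]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[positive_index]


query = [1.0, 0.0, 0.0, 0.0]
docs = [[1.0, 0.1, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(info_nce(query, docs, positive_index=0))  # small loss: the positive is the closest doc
print(info_nce(query, docs, positive_index=1))  # large loss: the positive is far away
print(truncate([3.0, 4.0, 5.0], dim=2))
```

Matryoshka training applies a loss like this at several prefix lengths so that truncated embeddings remain useful, which lets users trade accuracy for storage at query time.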
Extras: Datasets
💾 Evaluating Recabilities of Foundation Models: A Multi-Domain, Multi-Dataset Benchmark
RECBENCH-MD is a benchmark for evaluating the recommendation capabilities (“recabilities”) of foundation models across multiple datasets and domains. It covers eight evaluation settings, ranging from zero-shot single-dataset to multi-domain multi-dataset scenarios, and supports both prompt-based and embedding-based recommendation approaches.
📝 https://arxiv.org/abs/2508.21354
👨🏽💻 https://github.com/Jyonn/RecBench-MD
💾 Document haystack: A long context multimodal image/document understanding vision LLM benchmark
Document Haystack from Amazon is a benchmark for evaluating vision-language models (VLMs) on long, multimodal documents. It is built from 25 financial 10-K reports, trimmed to lengths between 5 and 200 pages, and produces 400 document variants with 8,250 associated questions. The benchmark inserts “needles”, i.e. key-value statements expressed either as text or as a combination of text and images, at varying depths in the documents to test retrieval performance under different context lengths and modalities. Document Haystack is released in multiple formats (PDF, images, and text) to support diverse model inputs and provides an automated, objective evaluation framework for comparing VLMs’ ability to locate and extract information in long, visually complex contexts.
👨🏽💻 https://huggingface.co/datasets/AmazonScience/document-haystack
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering. I explore topics such as Natural Language Processing, Large Language Models, and Recommendation Systems, and write in-depth analyses that draw on the latest research papers.