Towards AI Search Paradigm, Zero-Shot Mixture of Retrievers, and More!
Vol.110 for Jun 23 - Jun 29, 2025
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
Building Next-Generation Search with Multi-Agent LLM Systems, from Baidu Search
Multi-Granularity Retriever Mixing for Robust Information Retrieval, from Kalra et al.
Contrastive Item Tokenization for Multi-Modal Generative Recommendation, from Alibaba
Self-Supervised Dense Retrieval through Language Modeling, from Cai et al.
Nested Embeddings for Scalable E-commerce Information Retrieval, from Qian et al.
Entity-Centric Inverted Indexing for Scalable Retrieval-Augmented Generation, from Zhang et al.
jina-embeddings-v4: Unified Vision-Language Embeddings with Shared Semantic Space, from Jina AI
End-to-End Learning of Search-Augmented Multimodal Reasoning, from ByteDance
Hierarchical MLP-Mixer for Multi-Scale User Interest Modeling in Sequential Recommendation, from ByteDance
Efficient and Scalable Retrieval Augmented Generation for Dynamic Corpora, from Zhang et al.
[1] Towards AI Search Paradigm
This paper from Baidu Search introduces the "AI Search Paradigm", a comprehensive framework that employs a multi-agent architecture powered by LLMs. The paradigm features four specialized agents: a Master agent that analyzes query complexity and orchestrates team formation; a Planner agent that decomposes complex queries into structured sub-tasks using Directed Acyclic Graphs (DAGs) and dynamically selects tools from a Model-Context Protocol (MCP) platform; an Executor agent that carries out sub-tasks through coordinated tool invocation; and a Writer agent that synthesizes results into comprehensive answers. The system dynamically adapts to query complexity through three configurations (Writer-Only, Executor-Inclusive, and Planner-Enhanced), enabling it to handle everything from simple factual queries to complex multi-step reasoning tasks that require tool coordination and evidence synthesis across multiple sources. The paper details key methodologies including task planning and tool integration, execution strategies, robust RAG techniques, preference alignment methods, and efficient LLM inference optimizations at both algorithmic and infrastructure levels.
📚 https://arxiv.org/abs/2506.17188
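To make the Planner/Executor split concrete, here is a minimal sketch of executing a planner-produced DAG of sub-tasks in dependency order. The task graph, tool names, and the `run_tool` stub are illustrative stand-ins, not Baidu's actual MCP interface.

```python
# Minimal sketch: executing a planner-produced DAG of sub-tasks in dependency order.
# The task graph, tool names, and run_tool stub are hypothetical, not Baidu's actual API.
from graphlib import TopologicalSorter

def run_tool(tool: str, query: str, context: dict) -> str:
    # Hypothetical tool invocation (web search, calculator, ...); replace with real MCP calls.
    return f"[{tool} result for: {query}]"

# Each sub-task lists the sub-tasks whose outputs it depends on.
dag = {
    "find_population": {"tool": "web_search", "query": "population of France 2024", "deps": []},
    "find_area":       {"tool": "web_search", "query": "area of France in km^2",    "deps": []},
    "compute_density": {"tool": "calculator", "query": "population / area",         "deps": ["find_population", "find_area"]},
}

results = {}
order = TopologicalSorter({k: set(v["deps"]) for k, v in dag.items()}).static_order()
for task_id in order:                      # execute sub-tasks in dependency order
    spec = dag[task_id]
    context = {d: results[d] for d in spec["deps"]}
    results[task_id] = run_tool(spec["tool"], spec["query"], context)

print(results["compute_density"])          # the Writer agent would synthesize from these outputs
```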
[2] MoR: Better Handling Diverse Queries with a Mixture of Sparse, Dense, and Human Retrievers
This paper from Kalra et al. introduces MoR (Mixture of Retrievers), a framework that dynamically combines multiple heterogeneous retrievers to improve RAG performance across diverse query types. Sparse retrievers such as BM25 capture lexical matches, while dense retrievers capture semantic similarity; rather than relying on a single retriever selected through heuristics, MoR leverages the complementary strengths of different retrieval methods by computing query-specific weights for each retriever using both pre-retrieval signals (based on query embedding proximity to document clusters) and post-retrieval signals (including query performance prediction metrics such as the Moran coefficient). The framework also employs multi-granularity retrieval, decomposing queries and documents into atomic units (sub-questions and propositions) and combining retrievers that operate at different semantic levels.
📚 https://arxiv.org/abs/2506.15862
👨🏽💻 https://github.com/Josh1108/MixtureRetrievers
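As a rough illustration of the fusion step, the sketch below combines ranked lists from heterogeneous retrievers with query-specific weights using reciprocal-rank-style scoring. In MoR the weights would come from the pre- and post-retrieval signals described above; here they are hard-coded placeholders.

```python
# Minimal sketch: fusing ranked lists from heterogeneous retrievers with query-specific weights.
# In MoR the weights come from pre-retrieval (query/cluster proximity) and post-retrieval
# (query performance prediction) signals; here they are hard-coded placeholders.
from collections import defaultdict

def weighted_fusion(ranked_lists: dict[str, list[str]], weights: dict[str, float], k: int = 60):
    """Reciprocal-rank-style fusion where each retriever's contribution is scaled by its weight."""
    scores = defaultdict(float)
    for retriever, docs in ranked_lists.items():
        w = weights.get(retriever, 0.0)
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += w / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = {
    "bm25":  ["d3", "d1", "d7"],    # lexical matches
    "dense": ["d1", "d5", "d3"],    # semantic matches
}
weights = {"bm25": 0.35, "dense": 0.65}    # would be predicted per query in MoR
print(weighted_fusion(ranked, weights))
```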
[3] A Simple Contrastive Framework Of Item Tokenization For Generative Recommendation
This paper from Alibaba introduces SimCIT (Simple Contrastive Item Tokenization), a framework for item tokenization in generative recommendation. Unlike conventional approaches that rely on reconstruction-based quantization (such as RQ-VAE) and aim to precisely reconstruct each item embedding independently, SimCIT employs a fully contrastive learning-based approach that better aligns with the discriminative nature of recommendation tasks. The framework utilizes a learnable residual quantization module combined with multi-modal information fusion, treating different item modalities (text, images, collaborative signals, and spatial relationships for POI recommendation) as different "views" in a contrastive learning setup. Through soft residual quantization with Gumbel-Softmax and NT-Xent contrastive loss, SimCIT learns semantic tokens that serve as bridges between modalities while promoting diversity and reducing collision in the token space.
📚 https://arxiv.org/abs/2506.16683
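The sketch below shows the NT-Xent objective between two modality "views" of the same items (one direction only; such losses are typically symmetrized). Random embeddings stand in for SimCIT's quantized codes, and the Gumbel-Softmax residual quantization step is omitted for brevity.

```python
# Minimal sketch: NT-Xent contrastive loss between two modality "views" of the same items
# (e.g. text vs. image embeddings). Random arrays stand in for SimCIT's quantized codes.
import numpy as np

def nt_xent(view_a: np.ndarray, view_b: np.ndarray, tau: float = 0.1) -> float:
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / tau                       # (N, N) similarity matrix
    # Matching rows/columns (the diagonal) are positives; everything else is a negative.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
text_emb  = rng.normal(size=(8, 64))                       # one "view" per item
image_emb = text_emb + 0.1 * rng.normal(size=(8, 64))      # a correlated second view
print(nt_xent(text_emb, image_emb))                        # lower loss when views agree
```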
[4] Revela: Dense Retriever Learning via Language Modeling
This paper from Cai et al. introduces Revela, a self-supervised framework for training dense retrievers through language modeling that addresses the costly annotation requirements of traditional retriever training. The key innovation is an "in-batch attention" mechanism that extends next-token prediction to condition on both local context and cross-document context within the same batch, with attention weights determined by retriever-computed similarity scores. This approach allows joint optimization of both the retriever and language model during training on raw, unannotated text. Revela treats retrieval as learning dependencies among chunks of tokens, analogous to how language models learn token-level dependencies.
📚 https://arxiv.org/abs/2506.16552
👨🏽💻 https://github.com/TRUMANCFY/Revela
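A very rough sketch of the in-batch attention idea follows: each chunk's language-modeling context is augmented with the other chunks in the batch, weighted by retriever-computed similarity. The hidden states and retriever embeddings are random stand-ins; in Revela the LM loss back-propagates through these weights to train the retriever.

```python
# Minimal sketch of the idea behind Revela's "in-batch attention": each chunk's LM context is
# augmented with other chunks in the batch, weighted by retriever-computed similarity.
# Hidden states and the retriever are random stand-ins for the real model.
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 32                                    # batch of 4 chunks, hidden size 32
retriever_emb = rng.normal(size=(B, D))         # retriever embeddings of each chunk
chunk_hidden  = rng.normal(size=(B, D))         # LM hidden summaries of each chunk

q = retriever_emb / np.linalg.norm(retriever_emb, axis=1, keepdims=True)
sim = q @ q.T                                   # retriever similarity between chunks
np.fill_diagonal(sim, -np.inf)                  # a chunk should attend to *other* chunks
weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

cross_context = weights @ chunk_hidden          # similarity-weighted cross-document context
conditioned = chunk_hidden + cross_context      # conditions next-token prediction on the batch
print(conditioned.shape)                        # (4, 32)
```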
[5] NEAR²: A Nested Embedding Approach to Efficient Product Retrieval and Ranking
This paper from Qian et al. introduces NEAR², a Nested Embedding Approach to product Retrieval and Ranking that addresses the dual challenge of accuracy and efficiency in e-commerce information retrieval systems. The authors leverage Matryoshka Representation Learning (MRL) combined with a multiple negative ranking loss (MNRL) to train nested embeddings of different sizes within encoder-based Transformer models like BERT and eBERT, enabling the use of significantly smaller embedding dimensions without sacrificing performance. NEAR² achieves up to a 12× smaller embedding size and a 100× reduction in memory usage during inference while introducing no additional training cost; evaluation on four challenging test sets (including queries with ambiguous, repetitive, and alphanumeric characteristics) shows improved performance even at the smallest 64-dimensional embeddings compared to full-size models.
📚 https://arxiv.org/abs/2506.19743
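The sketch below illustrates a Matryoshka-style objective: a multiple-negatives ranking loss applied at several nested prefix dimensions of the same embedding and summed. Query/product embeddings are random stand-ins for encoder outputs, and the chosen dimensions are illustrative.

```python
# Minimal sketch: a Matryoshka-style objective that applies a multiple-negatives ranking loss
# at several nested embedding sizes. Random arrays stand in for the encoder outputs; in NEAR²
# the summed loss would be back-propagated through BERT/eBERT.
import numpy as np

def mnr_loss(q: np.ndarray, d: np.ndarray, tau: float = 0.05) -> float:
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / tau                       # in-batch negatives on the off-diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
queries  = rng.normal(size=(16, 768))
products = queries + 0.2 * rng.normal(size=(16, 768))

nested_dims = [64, 128, 256, 768]                # nested prefixes of the full embedding
total = sum(mnr_loss(queries[:, :k], products[:, :k]) for k in nested_dims)
print(total)
```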
[6] SlimRAG: Retrieval without Graphs via Entity-Aware Context Selection
This paper from Zhang et al. presents SlimRAG, a lightweight RAG framework that addresses the fundamental inefficiency in existing RAG systems where semantic similarity is mistakenly treated as semantic relevance. The authors argue that graph-based RAG methods suffer from structural overhead, requiring costly entity linking and relation extraction pipelines while often retrieving subgraphs filled with tangential content. SlimRAG eliminates graph construction entirely, instead using a simple entity-to-chunk inverted index during indexing and performing entity-aware context selection during retrieval through query decomposition, semantic entity matching, and dual-factor scoring based on both embedding similarity and entity overlap. The framework introduces Relative Index Token Utilization (RITU) as a novel metric measuring index compactness by calculating the proportion of corpus tokens retained in the index.
📚 https://arxiv.org/abs/2506.17288
👨🏽💻 https://github.com/continue-ai-company/SlimRAG
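Here is a toy sketch of an entity-to-chunk inverted index with dual-factor scoring (entity overlap plus embedding similarity), in the spirit of SlimRAG. Entity extraction and the embeddings are stubbed out, and the weighting constant `alpha` is an illustrative choice, not the paper's.

```python
# Minimal sketch: entity-to-chunk inverted index with dual-factor scoring (entity overlap
# plus embedding similarity). Entities, embeddings, and alpha are illustrative stand-ins.
import numpy as np
from collections import defaultdict

chunks = {
    "c1": "Marie Curie won the Nobel Prize in Physics in 1903.",
    "c2": "The Nobel Prize ceremony is held in Stockholm.",
    "c3": "Marie Curie also received the Nobel Prize in Chemistry in 1911.",
}
chunk_entities = {"c1": {"Marie Curie", "Nobel Prize"},
                  "c2": {"Nobel Prize", "Stockholm"},
                  "c3": {"Marie Curie", "Nobel Prize"}}

# Indexing: entity -> chunks that mention it (no graph construction).
index = defaultdict(set)
for cid, ents in chunk_entities.items():
    for e in ents:
        index[e].add(cid)

rng = np.random.default_rng(0)
emb = {cid: rng.normal(size=32) for cid in chunks}           # stand-in chunk embeddings
query_entities = {"Marie Curie", "Nobel Prize"}
query_emb = emb["c1"] + 0.1 * rng.normal(size=32)            # stand-in query embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieval: candidates come from the inverted index, scored by overlap + similarity.
candidates = set().union(*(index[e] for e in query_entities if e in index))
alpha = 0.5
scored = sorted(
    ((alpha * len(query_entities & chunk_entities[c]) / len(query_entities)
      + (1 - alpha) * cosine(query_emb, emb[c]), c) for c in candidates),
    reverse=True)
print(scored)
```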
[7] jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
This paper from Jina AI introduces jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations in a single semantic space, supporting both single-vector and multi-vector embeddings for diverse retrieval tasks. Built on the Qwen2.5-VL-3B-Instruct backbone, the model employs a unified architecture that processes images through a vision encoder before joint processing with text via language model decoders, eliminating the modality gap present in dual-encoder CLIP-style models. The framework incorporates three task-specific LoRA adapters (60M parameters each) for asymmetric query-document retrieval, semantic text similarity, and code search, while supporting Matryoshka Representation Learning for truncatable embeddings from 2048 to 128 dimensions. Training occurs in two phases: initial contrastive learning on text and multimodal pairs using InfoNCE loss with Kullback-Leibler divergence alignment, followed by task-specific fine-tuning with hard negatives and specialized loss functions. To evaluate performance on visually rich documents, the authors introduce Jina-VDR, a comprehensive benchmark extending ViDoRe with 30 additional multilingual tasks spanning diverse domains, document types, and query formats.
📚 https://arxiv.org/abs/2506.18902
👨🏽💻 https://huggingface.co/jinaai/jina-embeddings-v4
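As a rough illustration of how the model's two output modes are typically consumed downstream, the sketch below truncates Matryoshka single vectors to a smaller dimension and scores multi-vector output with late-interaction MaxSim. Random arrays stand in for actual model outputs; this is not the model's API.

```python
# Minimal sketch of consuming the two output modes described for jina-embeddings-v4:
# truncating Matryoshka single vectors, and late-interaction (MaxSim) scoring for
# multi-vector output. Random arrays stand in for real model outputs.
import numpy as np

rng = np.random.default_rng(0)

# Single-vector mode: truncate 2048-dim embeddings to 128 dims and re-normalize.
query_vec = rng.normal(size=2048)
doc_vec   = rng.normal(size=2048)
q128 = query_vec[:128] / np.linalg.norm(query_vec[:128])
d128 = doc_vec[:128] / np.linalg.norm(doc_vec[:128])
print("truncated cosine:", float(q128 @ d128))

# Multi-vector mode: one vector per token/patch, scored with MaxSim
# (sum over query tokens of the maximum similarity to any document vector).
query_tokens = rng.normal(size=(12, 128))      # 12 query token vectors
doc_tokens   = rng.normal(size=(200, 128))     # 200 document token/patch vectors
sim = query_tokens @ doc_tokens.T
print("MaxSim score:", float(sim.max(axis=1).sum()))
```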
[8] MMSearch-R1: Incentivizing LMMs to Search
This paper from ByteDance introduces MMSearch-R1, the first end-to-end reinforcement learning framework that trains large multimodal models (LMMs) to perform on-demand search in real-world internet environments. The framework addresses limitations of existing RAG approaches, which rely on rigid pipelines and often result in excessive search behavior, by teaching models three key abilities: when to search, what to search for, and how to reason over search results. The researchers developed a multimodal search VQA dataset called FactualVQA (FVQA) through a semi-automated pipeline that balances search-required and search-free samples, which proves essential for shaping efficient search behavior. Using Group Relative Policy Optimization (GRPO) with an outcome-based reward system that includes a search penalty, MMSearch-R1 integrates both image and text search tools and learns to recognize knowledge boundaries through multi-turn interactions with real internet content.
📚 https://arxiv.org/abs/2506.20670
👨🏽💻 https://github.com/EvolvingLMMs-Lab/multimodal-search-r1
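The sketch below illustrates the reward shaping described above: an outcome reward with a penalty when a rollout invoked search, turned into GRPO-style group-relative advantages. The rollouts, penalty value, and accuracy check are illustrative placeholders, not the paper's exact formulation.

```python
# Minimal sketch: outcome reward with a search penalty, converted into GRPO-style
# group-relative advantages. Rollouts, penalty value, and accuracy check are placeholders.
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantage: normalize each rollout's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# A group of rollouts for the same question: (answer_correct, used_search).
group = [(True, False), (True, True), (False, True), (False, False)]
search_penalty = 0.1

rewards = np.array([
    (1.0 if correct else 0.0) - (search_penalty if used_search else 0.0)
    for correct, used_search in group
])
print(rewards)                 # correct-without-search scores highest: [1.0, 0.9, -0.1, 0.0]
print(grpo_advantages(rewards))
```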
[9] Pyramid Mixer: Multi-dimensional Multi-period Interest Modeling for Sequential Recommendation
This paper from ByteDance introduces Pyramid Mixer, a sequential recommendation model that leverages MLP-Mixer architecture to efficiently model comprehensive user interests across multiple dimensions and time periods. The approach implements cross-behavior and cross-feature mixer modules that capture interactions between different user behaviors and item features, while employing a pyramid structure to learn temporal interests across various time scales from short-term to long-term patterns. The model incorporates low-rank decomposition to enhance computational efficiency and uses an adaptive fusion module to balance cross-behavior and cross-feature representations.
📚 https://arxiv.org/abs/2506.16942
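Here is a bare-bones sketch of the MLP-Mixer idea behind Pyramid Mixer: one MLP mixes information across the sequence (behavior) axis and another across the feature axis, each with a residual connection. Weights are random, there is no training loop, and the pyramid, low-rank, and adaptive-fusion pieces are omitted.

```python
# Minimal sketch of MLP-Mixer-style cross-behavior and cross-feature mixing.
# Random weights, forward pass only; the pyramid/low-rank components are omitted.
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 16, 32                       # 16 user behaviors, 32-dim item features
x = rng.normal(size=(seq_len, dim))

def mlp(n_in, n_hidden, n_out):
    w1 = rng.normal(size=(n_in, n_hidden)) * 0.1
    w2 = rng.normal(size=(n_hidden, n_out)) * 0.1
    return lambda h: np.maximum(h @ w1, 0) @ w2    # Linear -> ReLU -> Linear

behavior_mix = mlp(seq_len, 64, seq_len)    # mixes information across behaviors (per feature)
feature_mix  = mlp(dim, 64, dim)            # mixes information across features (per behavior)

h = x + behavior_mix(x.T).T                 # cross-behavior mixing with a residual connection
h = h + feature_mix(h)                      # cross-feature mixing with a residual connection
print(h.shape)                              # (16, 32)
```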
[10] EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora
This paper from Zhang et al. presents EraRAG, a multi-layered graph-based RAG framework designed to efficiently handle growing text corpora without requiring expensive full-graph reconstruction for each update. The key innovation lies in its use of hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize corpus content into hierarchical graph structures, enabling localized insertions of new documents while preserving the existing topology. Unlike existing Graph-RAG approaches that assume static corpora and necessitate complete rebuilding when new content arrives, EraRAG employs a selective re-segmenting and re-summarization mechanism that confines structural modifications to affected regions only. The framework constructs multi-layered graphs through recursive LSH-based segmentation with controllable size bounds, ensuring consistent granularity across segments, and supports both detailed and summarized retrieval strategies depending on query requirements.
📚 https://arxiv.org/abs/2506.20963
👨🏽💻 https://github.com/EverM0re/EraRAG-Official
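To ground the incremental-indexing idea, the sketch below shows hyperplane-based LSH bucketing of chunk embeddings: a new chunk hashes into exactly one existing bucket, so only that segment needs re-segmenting and re-summarizing. Embeddings are random stand-ins and the multi-layer recursion is omitted.

```python
# Minimal sketch of hyperplane-based LSH bucketing for incremental indexing: new chunk
# embeddings hash into existing buckets, so only the affected segment changes.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 64, 8
hyperplanes = rng.normal(size=(n_planes, dim))      # fixed random hyperplanes define the hash

def lsh_bucket(embedding: np.ndarray) -> str:
    """Sign pattern of projections onto the hyperplanes, e.g. '01101001'."""
    return "".join("1" if s > 0 else "0" for s in hyperplanes @ embedding)

buckets = defaultdict(list)
for chunk_id in range(100):                         # initial corpus of stand-in embeddings
    buckets[lsh_bucket(rng.normal(size=dim))].append(chunk_id)

# Incremental update: a new chunk lands in exactly one bucket; only that segment changes.
new_emb = rng.normal(size=dim)
affected = lsh_bucket(new_emb)
buckets[affected].append("new_chunk")
print("bucket to re-segment/re-summarize:", affected, "size:", len(buckets[affected]))
```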
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering domains. I explore diverse topics, such as Natural Language Processing, Large Language Models, Recommendation Systems, etc., and conduct in-depth analyses, drawing insights from the latest research papers.