Scaling Multilingual Encoders to 1800+ Languages, Bridging the Gap Between Academic and Industrial Recommenders, and More!
Vol.121 for Sep 08 - Sep 14, 2025
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
A Modern Multilingual Encoder for 1833 Languages with Rapid Low-Resource Learning, from Johns Hopkins University
A Comprehensive Survey of Query Expansion in the Era of Large Language Models, from Li et al.
A Comprehensive Survey of Long-Document Retrieval, from Li et al.
A Three-Stage Pipeline for Complex Query Understanding in Information Retrieval, from Zhong et al.
Understanding the Divide Between Academic Research and Industrial Recommender Systems, from NTU
Dynamic Knowledge Boundary Detection for Efficient Retrieval-Augmented Generation, from Xidian University
Evaluating Numerical Understanding in Text Embedding Models, from HKUST
Why Recommender Systems Research Remains Fundamentally Flawed, from Said et al.
Comparative Analysis of One-Shot vs. Iterative Retrieval Strategies for RAG Systems, from Lin et al.
A Framework for Immersion-Aware Short Video Recommendations, from Tsinghua University
[1] mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
This paper from Johns Hopkins University presents mmBERT, a significant advancement in multilingual encoder-only language models, trained on 3 trillion tokens across over 1800 languages using a novel "cascading annealed language learning" approach that progressively introduces languages throughout training: starting with 60 high-resource languages, expanding to 110 languages by adding mid-resource ones, and finally incorporating all 1833 languages during the decay phase. The model employs several innovative training techniques, including an inverse masking schedule (progressively reducing the mask rate from 30% to 5%) and inverse temperature sampling that shifts from a high-resource language bias toward a more uniform distribution across languages. Built on the ModernBERT architecture with the Gemma 2 tokenizer, mmBERT demonstrates substantial improvements over previous multilingual encoders such as XLM-R across classification, retrieval, and natural language understanding tasks, with the base model achieving an average of 86.3 on GLUE and 72.8 on the XTREME benchmark. Remarkably, including low-resource languages only during the final 100-billion-token decay phase enables rapid learning that roughly doubles performance on those languages, and mmBERT even outperforms large language models such as OpenAI's o3 and Google's Gemini 2.5 Pro on low-resource language tasks while delivering 2-4x faster inference than previous multilingual encoders.
📚 https://arxiv.org/abs/2509.06888
👨🏽💻 https://github.com/jhu-clsp/mmBERT
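The two annealing schedules described above are easy to picture in code. Below is a minimal, illustrative sketch (not the authors' implementation) of a masking rate that decays from 30% to 5% and a temperature-based language sampler that flattens toward a uniform distribution as training progresses; the phase boundaries, temperature values, and token counts are hypothetical.

```python
def mask_rate(progress: float, start: float = 0.30, end: float = 0.05) -> float:
    """Linearly decay the masking rate from `start` to `end` over training.
    `progress` is the fraction of training completed, in [0.0, 1.0]."""
    return start + (end - start) * progress

def language_sampling_probs(token_counts: dict[str, float], tau: float) -> dict[str, float]:
    """Temperature-based sampling: p_i is proportional to count_i ** tau.
    tau = 1.0 reproduces the raw corpus distribution (high-resource bias);
    tau -> 0 approaches a uniform distribution over languages."""
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: round(w / total, 3) for lang, w in weights.items()}

# Hypothetical token counts (in billions) for three languages.
counts = {"en": 1000.0, "sw": 10.0, "fo": 0.1}

for progress, tau in [(0.0, 0.7), (0.5, 0.5), (1.0, 0.3)]:  # illustrative phases
    print(f"progress={progress:.1f}  mask_rate={mask_rate(progress):.2f}  "
          f"sampling={language_sampling_probs(counts, tau)}")
```

As the temperature drops across phases, the sampler shifts probability mass from English toward the low-resource languages, which is the flavor of annealing the paper describes.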
[2] Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey
This comprehensive survey from Li et al. examines the evolution of query expansion (QE) from traditional corpus-based methods to modern approaches leveraging pre-trained and large language models (PLMs/LLMs). The authors organize PLM/LLM-driven QE techniques along four key dimensions: point of injection (implicit embedding-based versus explicit selection-based expansion), grounding and interaction (ranging from zero-grounding non-interactive methods to corpus-evidence anchored and multi-round retrieve-expand loops), learning and alignment (including supervised fine-tuning, preference optimization, and distillation techniques), and knowledge graph augmentation. The survey demonstrates that while traditional QE methods rely on static resources and lightweight heuristics with limited contextual sensitivity, PLM/LLM-based approaches offer superior contextual disambiguation, cross-domain generalization, and simultaneous recall-precision improvements, albeit at higher computational costs. The authors systematically review applications across web search, biomedical retrieval, e-commerce, open-domain question answering, retrieval-augmented generation, conversational search, and code search, showing consistent gains from neural QE methods while identifying key challenges, including reliability on unfamiliar queries, knowledge leakage, quality control, efficiency constraints, and the need for better evaluation metrics beyond traditional IR measures.
📚 https://arxiv.org/abs/2509.07794
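As a concrete illustration of the "explicit selection-based" end of the design space the survey maps out, here is a minimal sketch of LLM-driven query expansion feeding a BM25 retriever. The corpus, the `expand_query` stub, and the appended terms are placeholders, not from the survey; in practice the expansion terms would come from prompting an LLM.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "The 2008 financial crisis was triggered by subprime mortgage defaults.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Central banks lowered interest rates in response to the credit crunch.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def expand_query(query: str) -> list[str]:
    """Placeholder for an LLM call that proposes expansion terms.
    A real system would prompt a model (e.g. 'List 5 terms related to this
    query') and parse its output; here the terms are hard-coded."""
    return ["credit", "crunch", "subprime", "mortgage"]

query = "causes of the 2008 financial crisis"
expanded = query.lower().split() + expand_query(query)  # explicit expansion: append terms
scores = bm25.get_scores(expanded)
print(sorted(zip(scores, corpus), reverse=True)[0])      # best-scoring document
```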
[3] A Survey of Long-Document Retrieval in the PLM and LLM Era
This survey from Li et al. provides a comprehensive examination of long-document retrieval (LDR) in the era of PLMs and LLMs, addressing the fundamental challenge of accurately retrieving relevant information from documents spanning thousands to tens of thousands of tokens. The authors systematically categorize LDR approaches into four main paradigms: holistic methods that process entire documents (including naive truncation, long-sequence Transformers like Longformer and BigBird, and LLM-based rerankers like RankGPT), divide-and-conquer strategies that segment documents into smaller units before aggregation (such as BERT-MaxP/SumP, hierarchical models like PARADE, and key block selection methods like KeyB), indexing-structure-oriented approaches that optimize document segmentation and representation (including MC-indexing, HELD, and RAPTOR), and specialized long-query retrieval techniques for scenarios where queries themselves are lengthy documents. The survey traces the evolution from classical lexical methods (TF-IDF, BM25) through early neural approaches (DSSM, DRMM) to modern PLM/LLM-based systems, highlighting how each generation addresses core challenges including evidence dispersion across long texts, computational scalability, and semantic fragmentation from document segmentation.
📚 https://arxiv.org/abs/2509.07759
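To make the divide-and-conquer paradigm concrete, here is a minimal sketch of MaxP-style scoring: split the long document into overlapping passages, score each passage against the query, and take the maximum as the document score. The `score_passage` function below is a word-overlap placeholder standing in for a cross-encoder or embedding similarity; the window and stride sizes are arbitrary.

```python
def split_into_passages(doc: str, window: int = 100, stride: int = 50) -> list[str]:
    """Split a long document into overlapping word windows that cover the full text."""
    words = doc.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), stride)]

def score_passage(query: str, passage: str) -> float:
    """Placeholder relevance scorer (word overlap). In practice this would be a
    cross-encoder such as BERT or a dense-embedding similarity."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / (len(q) or 1)

def maxp_score(query: str, doc: str) -> float:
    """MaxP aggregation: the document score is its best passage score."""
    return max(score_passage(query, p) for p in split_into_passages(doc))

doc = " ".join(["filler sentence about unrelated topics."] * 40) + \
      " the relevant evidence about long document retrieval appears here."
print(maxp_score("long document retrieval evidence", doc))
```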
[4] Reasoning-enhanced Query Understanding through Decomposition and Interpretation
This paper from Zhong et al. introduces ReDI (Reasoning-enhanced query understanding through Decomposition and Interpretation), a framework designed to improve document retrieval for complex, multi-faceted queries that have become increasingly common in AI-driven search systems. The authors argue that existing query understanding methods, while effective for simple keyword-based queries, struggle with longer queries requiring complex reasoning and multi-hop information retrieval. ReDI operates through a three-stage pipeline: first decomposing complex queries into focused sub-queries to capture diverse user intents, then enriching each sub-query with semantic interpretations tailored for either sparse (BM25) or dense retrieval methods, and finally using a fusion strategy to aggregate retrieval results across all sub-queries. The researchers created the Coin dataset containing 3,403 complex queries from real search logs and used knowledge distillation to train smaller, production-ready models from a DeepSeek-R1 teacher model. Experiments on the BRIGHT and BEIR benchmarks demonstrate that ReDI consistently outperforms strong baselines in both sparse and dense retrieval settings, achieving particularly notable improvements on reasoning-intensive tasks that challenge traditional single-query expansion methods, with the approach showing strong generalization to long documents and out-of-domain retrieval scenarios.
📚 https://arxiv.org/abs/2509.06544
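The final fusion stage can be illustrated with a standard reciprocal rank fusion (RRF) over the per-sub-query rankings. This is a generic sketch of result aggregation, not necessarily the exact strategy used in the paper; the sub-queries and document IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each document earns 1 / (k + rank) from every
    list it appears in, and documents are re-sorted by the summed score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings retrieved for three sub-queries decomposed from one complex query.
sub_query_rankings = [
    ["d3", "d1", "d7"],   # sub-query 1
    ["d1", "d9", "d3"],   # sub-query 2
    ["d2", "d3", "d1"],   # sub-query 3
]
print(reciprocal_rank_fusion(sub_query_rankings))  # d3 and d1 rise to the top
```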
[5] A Survey of Real-World Recommender Systems: Challenges, Constraints, and Industrial Perspectives
This survey from NTU examines the significant gap between academic recommender systems research and industrial applications by analyzing 228 papers from major conferences (2020-2024) that underwent real-world A/B testing. The authors classify industrial recommender systems into two main categories: Transaction-Oriented RecSys (optimizing for conversion, revenue, and purchase likelihood in e-commerce settings) and Content-Oriented RecSys (optimizing for user engagement, dwell time, and satisfaction in video, news, and audio platforms). While academic research typically focuses on algorithmic performance using metrics like precision and recall on public datasets, industrial systems must balance multiple business objectives including cost, latency, real-time processing, and long-term user value under strict operational constraints. The paper identifies key industrial challenges absent from academic work: massive data scale requiring multi-stage recommendation pipelines, real-time interest modeling responding to rapidly changing user preferences, cold-start problems with frequent new content, and multi-objective optimization balancing competing metrics like click-through rates versus conversion rates. The authors trace the evolution from heuristic methods through discriminative approaches to emerging generative paradigms that treat recommendation as a sequence generation task, following scaling laws similar to language models. They conclude by advocating for more practically-oriented academic research that incorporates user psychology, theory-guided optimization, and realistic constraints, while encouraging industry to release production-tested models and datasets to bridge the growing divide between theoretical advances and real-world deployment.
📚 https://arxiv.org/abs/2509.06002
[6] Rethinking LLM Parametric Knowledge as Post-retrieval Confidence for Dynamic Retrieval and Reranking
This paper from Xidian University introduces an approach to improve RAG systems by leveraging LLMs' internal hidden states to measure confidence changes when processing retrieved contexts. The authors propose using confidence shifts, quantified by comparing LLM hidden states before and after introducing external knowledge, as a preference signal to fine-tune rerankers. This enables rerankers to prioritize contexts that genuinely enhance the model's answering capability rather than relying solely on semantic similarity. They construct a preference dataset (NQ_Rerank) based on these confidence dynamics and introduce Confidence-Based Dynamic Retrieval (CBDR), which adaptively triggers retrieval only when the LLM exhibits low initial confidence in answering a query. Experimental results demonstrate significant improvements: 5.19% better context screening accuracy, 4.70% higher end-to-end RAG performance, and 7.10% reduction in retrieval costs while maintaining competitive accuracy.
📚 https://arxiv.org/abs/2509.06472
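A minimal sketch of the dynamic-retrieval idea: estimate the model's confidence on the bare question and retrieve only when it falls below a threshold. Note that the confidence proxy below is the mean max-token probability from a small causal LM, not the hidden-state comparison used in the paper; the threshold and the `retriever` callable are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def answer_confidence(prompt: str) -> float:
    """Crude confidence proxy: average probability the model assigns to its own
    greedy next-token choices over the prompt. The paper instead compares hidden
    states before and after injecting retrieved context."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]            # predictions for tokens 1..n
    probs = logits.softmax(-1).max(-1).values     # max probability at each position
    return probs.mean().item()

def maybe_retrieve(question: str, retriever, threshold: float = 0.4) -> list[str]:
    """Confidence-based dynamic retrieval (sketch): skip retrieval when the model
    already looks confident about the bare question."""
    if answer_confidence(question) >= threshold:
        return []                                 # answer from parametric knowledge
    return retriever(question)                    # otherwise fetch external context
```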
[7] Revealing the Numeracy Gap: An Empirical Investigation of Text Embedding Models
This paper from HKUST investigates a significant gap in text embedding evaluation by examining whether current embedding models can accurately preserve numerical information in text. The authors introduce EmbedNum-1K, a financial domain dataset containing 1,000 synthetic question-answer pairs where correctness depends entirely on numerical values (e.g., distinguishing between "20% stake" and "5% stake" when asked "Who owns over 15% of the company?"). Testing 13 widely-used embedding models including BERT-based and LLM-based variants, they find that models generally struggle with numerical details, achieving only slightly above random performance (54% accuracy). Key findings reveal that LLM-based models outperform encoder-based ones by about 5 percentage points, models interpret different numeric formats (like 8% vs 0.08) distinctly, and performance degrades with increasing significant figures, mirroring human cognitive patterns. Additional analyses show that domain-specific fine-tuning doesn't necessarily improve numeracy, out-of-vocabulary numbers pose particular challenges, models favor "greater-than" over "less-than" comparisons, and contextual information can actually weaken numerical signal preservation in embeddings. The study highlights fundamental limitations in how current embedding models handle fine-grained numerical content, suggesting that simply scaling model size or training data is insufficient and that targeted approaches specifically designed for numerical understanding are needed for applications in finance, healthcare, and other number-intensive domains.
📚 https://arxiv.org/abs/2509.05691
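The evaluation setup can be reproduced in miniature: embed a numeric question and two candidate sentences that differ only in their numbers, and check whether cosine similarity ranks the numerically correct one higher. The sketch below uses sentence-transformers with an arbitrary public model; the example mirrors the paper's illustration but is otherwise made up.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model to probe

question = "Who owns over 15% of the company?"
candidates = [
    "Alice holds a 20% stake in the company.",    # numerically correct answer
    "Bob holds a 5% stake in the company.",       # numerically incorrect answer
]

q_emb = model.encode(question, convert_to_tensor=True)
c_embs = model.encode(candidates, convert_to_tensor=True)
sims = util.cos_sim(q_emb, c_embs)[0]

for cand, sim in zip(candidates, sims):
    print(f"{sim.item():.3f}  {cand}")
# If the model ignores the numbers, the two similarities will be nearly identical.
```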
[8] We're Still Doing It (All) Wrong: Recommender Systems, Fifteen Years Later
This paper from Said et al. argues that recommender systems research continues to suffer from fundamental methodological flaws first identified by Xavier Amatriain in 2011, despite fifteen years of technological advancement. The authors contend that the field still treats ordinal ratings as interval data, prioritizes narrow performance metrics like RMSE and nDCG over meaningful user outcomes, and relies heavily on offline evaluation using historical data that may not reflect actual user preferences or experiences. Beyond these persistent issues, they identify new problems including environmental neglect from resource-intensive models, uncritical adoption of Large Language Models without demonstrating necessity, post-hoc approaches to fairness, and user disempowerment through "push" rather than negotiation-based systems. The paper critiques the field's "cult of evaluation" where researchers chase marginal metric improvements on standard datasets while ignoring broader questions of human impact, social responsibility, and sustainable practice. The authors call for a fundamental paradigm shift toward epistemic humility, human-centered evaluation methods, transparent reporting of computational costs, and participatory design approaches that treat users as co-designers rather than optimization targets.
📚 https://arxiv.org/abs/2509.09414
[9] Fishing for Answers: Exploring One-shot vs. Iterative Retrieval Strategies for Retrieval Augmented Generation
This paper from Lin et al. investigates two strategies to improve RAG systems when handling complex government documents, addressing the critical finding that 48% of traditional RAG failures stem from missing relevant "golden chunks" in top-k retrieval. The researchers propose two complementary approaches using a fishing metaphor: the "One-SHOT" strategy casts a larger net by implementing token-constrained retrieval that dynamically selects as many relevant chunks as possible within a predefined context window (rather than a fixed top-k), enhanced with rule-based chunk filtering and LLM-based chunk cropping modules. The "Iterative" strategy employs multiple smaller nets through an agentic RAG framework where a reasoning language model conducts multi-turn retrieval, dynamically issuing search queries and refining context over several iterations while addressing two key challenges: query drift (where autonomous reformulation deviates from original intent) and retrieval laziness (where models prematurely terminate searches due to cognitive burden from heavy context). Experiments demonstrate that both strategies achieve over 10% performance improvements compared to basic RAG baselines. However, combining the strategies proved counterproductive due to context management conflicts, with the token-constrained approach creating contexts too large for effective iterative processing, leading the authors to recommend choosing between strategies based on specific use case constraints rather than attempting integration.
📚 https://arxiv.org/abs/2509.04820
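The One-SHOT token-constrained retrieval is straightforward to sketch: instead of a fixed top-k, walk down the ranked chunk list and keep adding chunks until a context budget is exhausted. The tokenizer, budget, and example chunks below are illustrative placeholders, not the paper's configuration, and the rule-based filtering and LLM cropping modules are omitted.

```python
def token_constrained_select(ranked_chunks: list[str], budget: int,
                             count_tokens=lambda s: len(s.split())) -> list[str]:
    """Greedily take chunks in relevance order until the token budget is spent.
    `count_tokens` defaults to whitespace tokens; swap in a real tokenizer
    (e.g. tiktoken) when targeting an actual LLM context window."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue          # skip chunks that do not fit; smaller ones may still
        selected.append(chunk)
        used += cost
    return selected

# Hypothetical chunks already sorted by retrieval score.
chunks = [
    "Section 3.2 of the regulation describes the licensing requirements in detail...",
    "Appendix A lists the fee schedule for each permit category...",
    "An unrelated preamble paragraph about the history of the agency...",
]
print(token_constrained_select(chunks, budget=25))
```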
[10] User Immersion-aware Short Video Recommendation
This paper from Tsinghua University presents ImmersRec, a framework that integrates user immersion, a psychological state of deep engagement, into short video recommendation systems. The researchers conducted a user study with 30 participants to collect immersion annotations, revealing that immersion correlates more strongly with user satisfaction than traditional behavioral metrics such as likes and view time. They developed a framework that predicts user immersion from historical behaviors and video features, then incorporates these predictions as contextual information for various recommendation backbones. The system addresses the challenge of scaling from limited annotated data to large-scale datasets through adversarial learning techniques that align immersion representations across different environments. Beyond immediate performance gains, the research shows that predicted immersion affects long-term engagement, with users who receive higher immersion scores being more likely to return to the platform sooner. The work bridges psychological research with practical recommendation systems, demonstrating that even a small amount of carefully collected psychological annotations can enhance large-scale recommendation performance.
📚 https://dl.acm.org/doi/10.1145/3748303
👨🏽💻 https://github.com/hezy18/ImmersRec
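The core modeling idea, using a predicted immersion score as an extra contextual feature for an existing recommendation backbone, can be sketched as follows. The feature dimensions, the toy backbone, and the immersion head are hypothetical, and the adversarial alignment step is omitted; the repository above holds the authoritative implementation.

```python
import torch
import torch.nn as nn

class ImmersionAwareScorer(nn.Module):
    """Toy model that scores (user, video) pairs, conditioned on a predicted
    immersion score appended to the input features (dimensions are made up)."""
    def __init__(self, user_dim: int = 32, video_dim: int = 32):
        super().__init__()
        self.immersion_head = nn.Sequential(      # predicts immersion from history + video
            nn.Linear(user_dim + video_dim, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid()
        )
        self.scorer = nn.Sequential(              # backbone consumes features + immersion
            nn.Linear(user_dim + video_dim + 1, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, user_feats: torch.Tensor, video_feats: torch.Tensor):
        x = torch.cat([user_feats, video_feats], dim=-1)
        immersion = self.immersion_head(x)        # predicted immersion in [0, 1]
        return self.scorer(torch.cat([x, immersion], dim=-1)), immersion

model = ImmersionAwareScorer()
scores, immersion = model(torch.randn(4, 32), torch.randn(4, 32))
print(scores.shape, immersion.shape)              # torch.Size([4, 1]) torch.Size([4, 1])
```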
Extras: Benchmarks
⏱️ Benchmarking Information Retrieval Models on Complex Retrieval Tasks
Complex Retrieval Unified Multi-task Benchmark (CRUMB) is a collection of eight datasets designed to evaluate information retrieval models on queries with multiple aspects or constraints. CRUMB includes diverse tasks such as multi-aspect paper retrieval, tip-of-the-tongue queries for movies, set-based logical entity queries, legal question answering with state-specific requirements, theorem retrieval, StackExchange question answering, clinical trial retrieval from patient histories, and code retrieval from problem descriptions. The benchmark standardizes document formatting, provides both full-document and chunked versions, and includes validation sets to support training and evaluation. Using CRUMB, the authors assess a range of state-of-the-art retrieval models and find that even the strongest systems perform poorly on complex tasks.
📝 https://arxiv.org/abs/2509.07253
👨🏽💻 https://github.com/jfkback/crumb
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering domains. I explore diverse topics, such as Natural Language Processing, Large Language Models, Recommendation Systems, etc., and conduct in-depth analyses, drawing insights from the latest research papers.