Extending GraphRAG to Millions of Documents, Scaling Recommender Models to One Billion Parameters, and More!
Vol.114 for Jul 21 - Jul 27, 2025
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
Scaling Transformer-Based Recommender Systems to Billion-Parameter Models, from Yandex
Efficient Knowledge Graph Integration for Web-Scale RAG Systems, from Huawei
A Comprehensive Study of Data Splitting Strategies for Sequential Recommender System Evaluation, from Gusak et al.
Scaling Deep Learning Recommendation Models to One Billion Parameters with RankMixer Architecture, from ByteDance
A Generic Framework for Multi-Behavior Data Augmentation in Sequential Recommendation, from Shenzhen University
A Deep Dive into Retrieval-Augmented Generation for Code Completion on WeChat's Enterprise Codebase, from Tencent
A Large-Scale Evaluation of Prompt Engineering Techniques for LLM-based Recommendation Systems, from NEC
How Cross-Encoders Process Query-Document Relevance Signals, from Vast et al.
Detecting Similarity-Disrupting Tokens in Text Embedding Models, from Chen et al.
Retrieval-Augmented Generation for Missing Attribute Prediction, from Amazon
[1] Scaling Recommender Transformers to One Billion Parameters
This paper from Yandex presents ARGUS (AutoRegressive Generative User Sequential framework), an approach for scaling transformer-based recommender systems to one billion parameters. The authors address the challenge that while transformers have been successfully scaled in NLP and computer vision, recommender systems have been limited to much smaller models (typically under 176M parameters). Their key innovation is a dual-objective pre-training task that combines next-item prediction (learning to imitate existing recommendation behavior) with feedback prediction (learning actual user preferences), followed by a computationally efficient fine-tuning stage that converts the model into a two-tower architecture for offline inference. Through experiments on a proprietary music streaming dataset with over 300B user-item interactions, they demonstrate that their approach scales effectively across model sizes from 3.2M to 1B parameters. The authors successfully deployed a 126M-parameter version on a large-scale music platform serving millions of users, achieving significant improvements in key metrics: +2.26% in total listening time and +6.37% in like likelihood, representing the largest quality improvements from any neural network-based system in the platform's history. The work validates that recommendation models can benefit from scaling laws similar to those observed in language models, suggesting that user interaction sequences can be as rich and learnable as natural language.
📚 https://arxiv.org/abs/2507.15994
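To make the dual-objective pre-training concrete, here is a minimal PyTorch sketch of a causal transformer trained with a next-item head and a feedback head. The module names, shapes, and the simple summed loss are illustrative assumptions, not the ARGUS implementation; the actual system also includes the two-tower fine-tuning stage described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualObjectiveRecTransformer(nn.Module):
    """Illustrative sketch: a causal transformer over user-item interaction
    sequences with two heads, next-item prediction and feedback prediction."""

    def __init__(self, num_items: int, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.next_item_head = nn.Linear(d_model, num_items)  # imitate shown recommendations
        self.feedback_head = nn.Linear(d_model, 1)           # predict positive feedback (e.g. a like)

    def forward(self, item_ids: torch.Tensor):
        seq_len = item_ids.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.encoder(self.item_emb(item_ids), mask=causal_mask)
        return self.next_item_head(h), self.feedback_head(h).squeeze(-1)

def pretraining_loss(model, item_ids, next_items, feedback):
    """Combined objective: next-item cross-entropy plus feedback BCE."""
    item_logits, fb_logits = model(item_ids)
    loss_next = F.cross_entropy(item_logits.transpose(1, 2), next_items)
    loss_fb = F.binary_cross_entropy_with_logits(fb_logits, feedback.float())
    return loss_next + loss_fb
```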
[2] Millions of GeAR-s: Extending GraphRAG to Millions of Documents
This paper from Huawei presents an adaptation of GeAR (a graph-based retrieval-augmented generation system) to handle millions of documents for the SIGIR 2025 LiveRAG Challenge. The main challenge addressed is that traditional GraphRAG methods rely on expensive LLM-based triple extraction from passages, which becomes prohibitively costly at web scale. Instead of extracting triples directly from the FineWeb-10BT corpus, the authors propose an online alignment approach that pseudo-aligns passages retrieved through baseline methods (like BM25) with existing triples from Wikidata using sparse retrieval. Their system maintains GeAR's multi-step agentic framework, where queries are iteratively decomposed and reasoning chains are formed through graph expansion, but replaces the expensive offline triple extraction with a "knowledge synchronizer" using Falcon-3B-Instruct. The approach achieved correctness and faithfulness scores of 0.875714 and 0.529335 respectively on preliminary evaluations.
📚 https://arxiv.org/abs/2507.17399
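A rough sketch of the online alignment idea, assuming toy stand-ins for the corpus, the Wikidata triples, and the retrieval setup (rank_bm25 for sparse retrieval). The real system additionally runs the multi-step agentic loop and the Falcon-3B-Instruct knowledge synchronizer.

```python
from rank_bm25 import BM25Okapi

# Toy stand-ins for the web corpus and the Wikidata triple store.
passages = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "The Eiffel Tower is located in Paris, France.",
]
triples = [
    ("Marie Curie", "award received", "Nobel Prize in Physics"),
    ("Eiffel Tower", "located in", "Paris"),
]

def tokenize(text: str) -> list[str]:
    return text.lower().split()

# Index both passages and triple surface forms with sparse (BM25) retrieval.
passage_index = BM25Okapi([tokenize(p) for p in passages])
triple_index = BM25Okapi([tokenize(" ".join(t)) for t in triples])

def retrieve_and_align(query: str, k_passages: int = 1, k_triples: int = 2):
    """Retrieve passages for the query, then pseudo-align each passage with
    existing triples instead of extracting new triples with an LLM."""
    p_scores = passage_index.get_scores(tokenize(query))
    top_passages = sorted(range(len(passages)), key=lambda i: -p_scores[i])[:k_passages]
    aligned = []
    for i in top_passages:
        t_scores = triple_index.get_scores(tokenize(passages[i]))
        top_triples = sorted(range(len(triples)), key=lambda j: -t_scores[j])[:k_triples]
        aligned.append((passages[i], [triples[j] for j in top_triples]))
    return aligned

print(retrieve_and_align("Which prize did Marie Curie receive?"))
```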
[3] Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders
This paper from Gusak et al. systematically examines data splitting strategies for offline evaluation of sequential recommender systems, addressing issues with the widely-used leave-one-out (LOO) split that violates global timeline preservation and creates temporal leakage. The authors compare LOO against various global temporal split (GTS) variants with different target selection strategies (Last, First, Random, Successive, All) and validation schemes (Global Temporal, User-Based, Last Training Item). Through experiments across eight datasets and three sequential models (SASRec+, BERT4Rec, GRU4Rec), they demonstrate that evaluation outcomes vary significantly across splitting strategies, with LOO showing lower correlation with realistic evaluation scenarios and potentially distorting model rankings. Their findings reveal that GTS with Last, Random, or Successive targets provides better alignment with real-world usage patterns, while the First target suffers from session-boundary artifacts and the All target exhibits task mismatch with next-item prediction. The study shows that 77.3% of recent sequential recommendation papers use problematic LOO splits, and only 6.7% apply appropriate GTS strategies.
📚 https://arxiv.org/abs/2507.16289
👨🏽💻 https://github.com/monkey0head/time-to-split
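For intuition, here is a minimal pandas sketch contrasting leave-one-out with a global temporal split. Column names, the boundary quantile, and the "Last"-target interpretation are assumptions; the authors' repository implements the full protocol.

```python
import pandas as pd

def leave_one_out_split(df: pd.DataFrame):
    """Hold out each user's final interaction regardless of when it happened.
    This ignores the global timeline, so test targets can precede other users'
    training interactions (temporal leakage)."""
    df = df.sort_values(["user_id", "timestamp"])
    holdout = df.groupby("user_id").tail(1)
    train = df.drop(holdout.index)
    return train, holdout

def global_temporal_split(df: pd.DataFrame, quantile: float = 0.9):
    """Split on one global timestamp boundary; everything before it is training.
    Target selection (Last / First / Random / ...) then picks which post-boundary
    interaction becomes the evaluation target; tail(1) below is a 'Last'-style choice."""
    boundary = df["timestamp"].quantile(quantile)
    train = df[df["timestamp"] <= boundary]
    future = df[df["timestamp"] > boundary].sort_values(["user_id", "timestamp"])
    targets = future.groupby("user_id").tail(1)
    return train, targets
```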
[4] RankMixer: Scaling Up Ranking Models in Industrial Recommenders
This paper from ByteDance introduces RankMixer, a hardware-aware architecture designed to scale Deep Learning Recommendation Models (DLRMs) to billion-parameter scale while maintaining strict latency constraints for industrial deployment. The key innovation lies in replacing traditional CPU-era feature-crossing modules with two scalable components: multi-head token mixing (which achieves cross-token feature interactions through parameter-free operations, outperforming self-attention in recommendation tasks) and per-token feed-forward networks (which allocate independent parameters for different feature subspaces to prevent high-frequency fields from dominating). RankMixer is further extended with a Sparse Mixture-of-Experts (MoE) variant using dynamic routing strategies to address expert training imbalances. Through experiments on ByteDance's trillion-scale production dataset, the authors demonstrate that RankMixer achieves superior scaling laws compared to existing methods, boosting Model FLOPs Utilization (MFU) from 4.5% to 45% and scaling parameters by 70× while maintaining similar inference latency.
📚 https://arxiv.org/abs/2507.15551
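One plausible reading of the two components, sketched in PyTorch: each token's embedding is split into head slices that are regrouped across tokens (a parameter-free mixing step), and every resulting token then gets its own feed-forward network. The exact RankMixer operators may differ; this is only an illustration under simplifying assumptions (here, the number of heads equals the number of tokens).

```python
import torch
import torch.nn as nn

class RankMixerBlockSketch(nn.Module):
    """Illustrative sketch of a RankMixer-style block: parameter-free multi-head
    token mixing followed by per-token feed-forward networks."""

    def __init__(self, num_tokens: int, d_model: int, num_heads: int = 4, d_hidden: int = 256):
        super().__init__()
        assert d_model % num_heads == 0 and num_tokens == num_heads, \
            "simplified sketch: one head slice is routed to each mixed token"
        self.num_heads = num_heads
        # Independent FFN parameters for every token (feature subspace), so
        # high-frequency fields cannot dominate shared weights.
        self.per_token_ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_tokens)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape  # (batch, num_tokens, d_model)
        # Parameter-free token mixing: split each token into head slices and
        # regroup slice h from every token into mixed token h.
        heads = x.view(b, t, self.num_heads, d // self.num_heads)
        mixed = heads.transpose(1, 2).reshape(b, self.num_heads, -1)
        # Per-token FFNs with a residual connection.
        out = torch.stack(
            [ffn(self.norm(mixed[:, i])) for i, ffn in enumerate(self.per_token_ffn)], dim=1
        )
        return out + mixed

block = RankMixerBlockSketch(num_tokens=4, d_model=64, num_heads=4)
y = block(torch.randn(2, 4, 64))  # -> (batch, tokens, d_model)
```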
[5] MBASR: A Generic Framework for Multi-Behavior Data Augmentation in Sequential Recommendation
This paper from Shenzhen University presents MBASR (Multi-Behavior data Augmentation for Sequential Recommendation), a framework designed to address data sparsity challenges in multi-behavior sequential recommendation systems. The authors propose five behavior-aware data augmentation operations that work both within and across subsequences: order perturbation (OP), redundancy reduction (RR), behavior transition (BT), pairwise swapping (PS), and similar insertion (SI), along with a combined SI-PS method. The framework partitions user interaction sequences using target behaviors (e.g., purchases) as boundaries to create meaningful subsequences, then applies tailored augmentation operations that mimic real-world user behavior patterns. To mitigate noise introduction, the authors implement two position-based sampling strategies that prioritize interactions based on temporal relevance: Forward Decay Sampling and Reverse Recency Sampling. The model-agnostic design allows seamless integration with various existing MBSR architectures without modifying their underlying structures.
📚 https://dl.acm.org/doi/10.1145/3749998
👨🏽💻 https://github.com/XiaoQi-C/MBASR
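A small illustrative sketch of two of the operations (order perturbation and pairwise swapping) together with a position-based sampling weight. The operation details, the decay direction, and the hyperparameters here are assumptions rather than the paper's specification.

```python
import random

def split_by_target_behavior(seq, target="purchase"):
    """Partition an interaction sequence into subsequences that end at a
    target behavior (e.g. a purchase)."""
    subseqs, current = [], []
    for item, behavior in seq:
        current.append((item, behavior))
        if behavior == target:
            subseqs.append(current)
            current = []
    if current:
        subseqs.append(current)
    return subseqs

def position_weighted_sample(length: int, decay: float = 0.8) -> int:
    """Sample a position with geometrically decaying weights; an illustrative
    stand-in for the paper's position-based sampling strategies."""
    weights = [decay ** i for i in range(length)]
    return random.choices(range(length), weights=weights, k=1)[0]

def order_perturbation(subseq, window: int = 2):
    """Shuffle a small window of interactions around a sampled position."""
    if len(subseq) <= window:
        return subseq
    start = position_weighted_sample(len(subseq) - window)
    segment = subseq[start:start + window]
    random.shuffle(segment)
    return subseq[:start] + segment + subseq[start + window:]

def pairwise_swap(subseq):
    """Swap two sampled interactions within the subsequence."""
    if len(subseq) < 2:
        return subseq
    i, j = random.sample(range(len(subseq)), 2)
    subseq = list(subseq)
    subseq[i], subseq[j] = subseq[j], subseq[i]
    return subseq
```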
[6] A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat
This paper from Tencent presents a comprehensive empirical study of RAG for code completion in closed-source enterprise environments, specifically using WeChat's proprietary codebase containing 1,669 internal repositories. The researchers evaluate two main RAG approaches: identifier-based RAG (which retrieves relevant definitions of identifiers) and similarity-based RAG (which retrieves similar code implementations using lexical and semantic techniques), across 26 open-source LLMs ranging from 0.5B to 671B parameters on a manually curated benchmark of 100 examples across seven domains. Their key findings demonstrate that both RAG methods consistently improve code completion performance in closed-source repositories, with similarity-based RAG showing superior results, particularly when using BM25 (lexical retrieval) and GTE-Qwen (semantic retrieval) techniques. Notably, the study reveals that lexical and semantic retrieval methods capture fundamentally different aspects of code similarities with minimal overlap in retrieved candidates, leading to optimal performance when BM25 and GTE-Qwen are combined. The researchers also developed a fine-grained preprocessing algorithm to handle unique challenges in C++ codebases, including file segmentation, recursive dependencies, auto-generated code, and macro specificity.
📚 https://arxiv.org/abs/2507.18515
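A hedged sketch of hybrid retrieval over code chunks, using rank_bm25 for the lexical side and a hashed bag-of-tokens vector as a stand-in for a semantic embedder such as GTE-Qwen; the fusion rule and all snippets are illustrative, not the WeChat pipeline.

```python
import numpy as np
from rank_bm25 import BM25Okapi

code_chunks = [
    "int add(int a, int b) { return a + b; }",
    "std::string join(const std::vector<std::string>& parts);",
    "float mean(const std::vector<float>& xs);",
]

def tokenize(code: str) -> list[str]:
    return code.replace("(", " ").replace(")", " ").split()

bm25 = BM25Okapi([tokenize(c) for c in code_chunks])

def embed(text: str) -> np.ndarray:
    """Placeholder for a semantic code embedding model (e.g. GTE-Qwen):
    a hashed bag-of-tokens vector, purely for illustration."""
    vec = np.zeros(64)
    for tok in tokenize(text):
        vec[hash(tok) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

chunk_vecs = np.stack([embed(c) for c in code_chunks])

def hybrid_retrieve(query: str, k: int = 2, alpha: float = 0.5):
    """Blend normalized BM25 scores with cosine similarity; the paper combines
    the two retrievers, but this particular fusion rule is an assumption."""
    lexical = np.array(bm25.get_scores(tokenize(query)))
    lexical = lexical / (lexical.max() + 1e-9)
    semantic = chunk_vecs @ embed(query)
    scores = alpha * lexical + (1 - alpha) * semantic
    top = np.argsort(-scores)[:k]
    return [code_chunks[i] for i in top]

print(hybrid_retrieve("function that adds two integers"))
```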
[7] Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation
This paper from NEC presents a comprehensive evaluation of prompt engineering techniques for LLM-based personalized recommendation systems, comparing 23 different prompt types across 8 datasets and 12 LLMs in a single-user setting where no information from other users is utilized. The researchers found that for cost-efficient models like GPT-4.1-mini and Llama3.3-70b, three prompt categories proved especially effective: those that rephrase instructions for better understanding (Rephrase), consider background knowledge (Step-Back), and follow structured reasoning processes (ReAct). Interestingly, commonly used NLP prompting techniques such as step-by-step reasoning, role-playing, and emotional appeals did not improve recommendation accuracy and sometimes reduced performance. For high-performance models like Claude-3.7-Sonnet, simple baseline prompts achieved comparable accuracy to complex techniques while being significantly more cost-effective, and reasoning models like O3 showed strong performance but did not exceed Claude-3.7-Sonnet's accuracy while incurring higher costs. The study provides practical guidelines suggesting that cost-conscious applications should use GPT-4.1-mini or Llama3.3-70b with effective prompts (achieving ~90% of Claude-3.7-Sonnet's accuracy at one-fifth the cost), while accuracy-focused applications should use Claude-3.7-Sonnet with simple baseline prompts to avoid unnecessary complexity and expense.
📚 https://arxiv.org/abs/2507.13525
👨🏽💻 https://github.com/nec-research-labs/recsys2025_reproducibility_prompts
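As a rough illustration of the prompt categories that worked well, here are toy templates for a baseline, a Step-Back-style, and a Rephrase-style recommendation prompt. The wording is invented for illustration and does not reproduce the paper's prompts.

```python
def baseline_prompt(history: list[str], candidates: list[str]) -> str:
    """Simple baseline: interaction history plus candidates, rank the candidates."""
    return (
        "The user has previously interacted with: " + ", ".join(history) + ".\n"
        "Rank the following candidate items from most to least relevant: "
        + ", ".join(candidates) + "."
    )

def step_back_prompt(history: list[str], candidates: list[str]) -> str:
    """Step-Back style: first elicit background preferences, then recommend."""
    return (
        "Step 1: Based on the items below, summarize the user's general tastes.\n"
        "Step 2: Using that summary, rank the candidate items.\n"
        f"History: {', '.join(history)}\nCandidates: {', '.join(candidates)}"
    )

def rephrase_prompt(history: list[str], candidates: list[str]) -> str:
    """Rephrase style: ask the model to restate the task before solving it."""
    return "Rephrase the task in your own words, then solve it.\n" + baseline_prompt(history, candidates)
```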
[8] Understanding Matching Mechanisms in Cross-Encoders
This paper from Vast et al. investigates the internal mechanisms of cross-encoder neural information retrieval models, specifically focusing on how they detect and process matching signals between queries and documents. Using ablation studies and attention matrix analysis on MonoBERT with TREC Deep Learning track data (2019-2022), the researchers decompose the input into five components (CLS token, query, SEP1, document, SEP2) and trace information flow across layers. Their findings reveal a sophisticated three-stage relevance prediction process: early to middle layers perform separate contextualization of queries and documents while specialized attention heads conduct lexical matching, middle layers transition to semantic matching between contextualized tokens, and finally relevance signals are aggregated at the CLS token. The study challenges previous conclusions about the importance of document-to-query information transfer and provides a mathematical interpretation of matching mechanisms through Query-Key matrix analysis, showing that specific attention heads contain subspaces dedicated to query-document matching detection. The research also confirms the "no-op" roles of SEP and CLS tokens, where attention heads fall back to these tokens when unable to perform their specialized matching functions.
📚 https://arxiv.org/abs/2507.14604
👨🏽💻 https://git.isir.upmc.fr/mat_vast/sigir25-matching-signals
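A small sketch of the kind of analysis involved: segmenting a cross-encoder input into (CLS, query, SEP1, document, SEP2) blocks and aggregating attention between blocks with Hugging Face transformers. The checkpoint below is a generic MS MARCO cross-encoder used as a stand-in for MonoBERT, and the aggregation is a simplification of the paper's methodology.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any BERT-style cross-encoder works for this sketch; the paper analyzes MonoBERT.
name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, output_attentions=True)

query = "what causes tides"
doc = "Tides are caused by the gravitational pull of the moon and the sun."
inputs = tokenizer(query, doc, return_tensors="pt")

with torch.no_grad():
    attentions = model(**inputs).attentions  # one (1, heads, seq, seq) tensor per layer

# Segment token positions into CLS / query / SEP1 / document / SEP2 blocks.
ids = inputs["input_ids"][0].tolist()
sep_positions = [i for i, t in enumerate(ids) if t == tokenizer.sep_token_id]
segments = {
    "CLS": [0],
    "query": list(range(1, sep_positions[0])),
    "SEP1": [sep_positions[0]],
    "document": list(range(sep_positions[0] + 1, sep_positions[1])),
    "SEP2": [sep_positions[1]],
}

# Example aggregate: how much attention CLS pays to the document block, per layer.
for layer, att in enumerate(attentions):
    cls_to_doc = att[0, :, 0, segments["document"]].mean().item()
    print(f"layer {layer:02d}: CLS -> document attention = {cls_to_doc:.3f}")
```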
[9] Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models
This paper from Chen et al. introduces and systematically investigates "sticky tokens". These are anomalous tokens in text embedding models that, when repeatedly inserted into sentences, pull pairwise sentence similarity toward the mean of the model's token embedding distribution, thereby disrupting normal similarity patterns and degrading downstream task performance. The authors formally define sticky tokens and propose an efficient detection method called Sticky Token Detector (STD) that uses sentence pair filtering, token filtering, sticky scoring, and validation steps. Applying STD to 40 checkpoints across 14 model families, they discovered 868 sticky tokens that frequently originate from special tokens, unused vocabulary entries, and fragmented subwords from multilingual corpora, with no clear correlation to model or vocabulary size. Their evaluation demonstrates that sticky tokens cause substantial performance degradation in downstream tasks like clustering and retrieval (up to 50% in some cases), with attention analysis revealing that these tokens disproportionately dominate the model's internal representations.
📚 https://arxiv.org/abs/2507.18171
👨🏽💻 https://github.com/March-7/StickyToken
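A toy version of the sticky-score idea: repeatedly insert a candidate token into one sentence of each pair and measure how far the pairwise similarity moves toward the corpus-mean similarity. The embedding model, the score definition, and the example token here are assumptions for illustration, not the STD pipeline.

```python
import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in; any text embedding model works

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Quarterly revenue grew by ten percent.",
    "Stock prices fell sharply on Monday.",
]

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_pairwise_similarity(embs):
    return float(np.mean([cos(a, b) for a, b in combinations(embs, 2)]))

def sticky_score(token: str, repeats: int = 5) -> float:
    """How far does inserting `token` pull pair similarities toward the corpus mean?
    Larger scores suggest 'stickier' tokens (this score definition is illustrative)."""
    embs = model.encode(sentences)
    mean_sim = mean_pairwise_similarity(embs)
    shifts = []
    for i, j in combinations(range(len(sentences)), 2):
        base = cos(embs[i], embs[j])
        corrupted = model.encode(sentences[i] + (" " + token) * repeats)
        shifted = cos(corrupted, embs[j])
        # Positive when the corrupted similarity moved toward the corpus mean.
        shifts.append(abs(base - mean_sim) - abs(shifted - mean_sim))
    return float(np.mean(shifts))

print(sticky_score("lucrarea"))  # arbitrary candidate token, not a known sticky token
```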
[10] CatalogRAG: Retrieval-guided LLM prediction for multilingual e-commerce product attributes
This paper from Amazon introduces CatalogRAG, a RAG system designed to predict missing structured attributes in multilingual e-commerce product catalogs using LLMs. The authors address the significant challenge that nearly half of relevant structured attribute values are missing in typical e-commerce catalogs, with LLM performance varying substantially across different languages. Rather than relying on external knowledge sources, CatalogRAG strategically leverages existing product catalog entries as contextual examples, implementing a multi-stage retrieval framework that progressively refines search results through product type filtering, BM25 text-based similarity matching, and heuristic-based reranking using glance views and brand relationships. The system selects up to three highly relevant few-shot examples per missing attribute from similar products within the same language store, incorporating these examples into attribute-specific prompts to guide LLM predictions. Experimental evaluation across three major e-commerce stores in different languages (US, DE, FR) demonstrates catalog entry-level improvements of up to 43.32% in completeness and 2.83% in correctness, particularly showing strong performance gains for non-English catalogs where external knowledge sources are typically less effective.
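A minimal sketch of the multi-stage retrieval and prompt construction, assuming invented catalog fields (title, product_type, glance_views) and prompt wording; the production system's filters, reranking heuristics, and prompts are more elaborate.

```python
from rank_bm25 import BM25Okapi

catalog = [
    {"title": "Acme running shoes men blue", "product_type": "SHOES",
     "glance_views": 950, "color": "blue"},
    {"title": "Acme trail running shoes", "product_type": "SHOES",
     "glance_views": 400, "color": "black"},
    {"title": "Acme cotton t-shirt", "product_type": "SHIRT",
     "glance_views": 1200, "color": "white"},
]

def retrieve_examples(query_title: str, product_type: str, attribute: str, k: int = 3):
    """Stage 1: product-type filter. Stage 2: BM25 over titles.
    Stage 3: heuristic rerank by glance views. Keep only items that have the attribute."""
    pool = [p for p in catalog if p["product_type"] == product_type and attribute in p]
    if not pool:
        return []
    bm25 = BM25Okapi([p["title"].lower().split() for p in pool])
    scores = bm25.get_scores(query_title.lower().split())
    ranked = sorted(zip(pool, scores), key=lambda x: (-x[1], -x[0]["glance_views"]))
    return [p for p, _ in ranked[:k]]

def build_prompt(query_title: str, product_type: str, attribute: str) -> str:
    """Attribute-specific prompt with up to k few-shot examples from similar products."""
    examples = retrieve_examples(query_title, product_type, attribute)
    shots = "\n".join(f"Title: {p['title']} -> {attribute}: {p[attribute]}" for p in examples)
    return (
        f"Predict the missing '{attribute}' value for the product below.\n"
        f"Similar catalog entries:\n{shots}\n"
        f"Title: {query_title} -> {attribute}:"
    )

print(build_prompt("Acme running shoes women red", "SHOES", "color"))
```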
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering domains. I explore diverse topics, such as Natural Language Processing, Large Language Models, Recommendation Systems, etc., and conduct in-depth analyses, drawing insights from the latest research papers.