Large Foundation Model for Ads Recommendation, Feedback-Driven Approaches in RAG, and More!
Vol.118 for Aug 18 - Aug 24, 2025
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
Large Foundation Model for Ads Recommendation, from Tencent
A Survey of Feedback-Driven Approaches in Retrieval-Augmented Generation, from Rathee et al.
Scaling Laws for Click-Through Rate Prediction with Knowledge Distillation, from Meituan
Closing the Performance Gap in Generative Recommenders with Collaborative Tokenization and Efficient Modeling, from Lepage et al.
Can LLM-Generated QA Data Replace Human Benchmarks for RAG Systems?, from van Elburg et al.
Reducing False Positives in Sequential Recommendation through Explicit Negative Feedback Modeling, from Ivanova et al.
An Industry Study of RAG Implementation Challenges and Practices, from Brehme et al.
A Framework for Ultra-Long Behavioral Modeling in Candidate Retrieval, from ByteDance
Representation Quantization for Collaborative Filtering Augmentation, from Kuaishou
Task-Specialized Co-training for Information Retrieval and Semantic Similarity, from Tencent
[1] Large Foundation Model for Ads Recommendation
This paper from Tencent presents LFM4Ads, a transfer learning framework that leverages large foundation models for advertisement recommendation, addressing the limitation of existing approaches that transfer only user representations. The authors propose an "All-Representation Multi-Granularity" framework that extracts and transfers not only user representations (URs) but also item representations (IRs) and user-item cross representations (CRs) from pre-trained foundation models. To handle the transferability challenges of CRs, they develop a time-interval decaying aggregation strategy that transforms fine-grained sample-level cross representations into coarser user-level or item-level forms. The framework implements transfer mechanisms at three granularities: feature-level transfer using non-linear adapters, module-level transfer through an Isomorphic Interaction Module, and model-level transfer via Standalone Retrieval. LFM4Ads employs a triple-tower architecture with dual branches for content and advertisement data, comprising roughly 4TB of parameters and handling tens of billions of daily samples across approximately 2,000 features.
📚 https://arxiv.org/abs/2508.14948
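To give a concrete feel for the cross-representation transfer in LFM4Ads, here is a minimal sketch (not the authors' code) of a time-interval decaying aggregation: sample-level user-item cross representations are collapsed into a single user-level vector, with older interactions down-weighted exponentially. The half-life, dimensions, and data are illustrative assumptions.

```python
import numpy as np

def aggregate_cross_reps(cross_reps, timestamps, now, half_life_days=7.0):
    """Collapse sample-level cross representations (CRs) into one user-level vector.

    cross_reps: (n_samples, dim) array of user-item cross representations
    timestamps: (n_samples,) unix timestamps of the interactions
    now:        current unix timestamp
    half_life_days: interactions this old contribute half the weight (assumed)
    """
    age_days = (now - np.asarray(timestamps)) / 86400.0
    weights = 0.5 ** (age_days / half_life_days)   # time-interval decay
    weights /= weights.sum() + 1e-12               # normalize weights
    return (weights[:, None] * np.asarray(cross_reps)).sum(axis=0)

# toy usage: three past interactions, the most recent one dominates
reps = np.random.randn(3, 64)
ts = [1_700_000_000, 1_700_300_000, 1_700_600_000]
user_level_cr = aggregate_cross_reps(reps, ts, now=1_700_700_000)
print(user_level_cr.shape)  # (64,)
```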
[2] Test-time Corpus Feedback: From Retrieval to RAG
This survey paper from Rathee et al. provides a comprehensive overview of test-time corpus feedback mechanisms in RAG systems, examining how feedback signals derived from retrieved documents, re-rankers, and corpus structure can dynamically improve retrieval performance. The authors categorize feedback into three key stages: query-level feedback (including pseudo-relevance feedback using top-k documents and generative relevance feedback using LLMs for query reformulation), retrieval-level feedback (employing neighborhood-based corpus expansion and query vector adaptation based on ranker signals), and generation-time feedback (utilizing rule-based approaches, external signals like uncertainty detection, and self-triggered retrieval through reasoning). While these adaptive approaches show promise for complex tasks requiring iterative evidence gathering, such as multi-hop question answering, fact verification, and complex reasoning, they still face significant challenges: computational overhead from multiple retrieval rounds, noisy feedback from irrelevant documents, a lack of decision policies for determining when feedback is sufficient, and evaluation metrics that fail to assess feedback effectiveness across retrieval iterations. The paper synthesizes recent developments across the information retrieval and NLP communities, highlighting how feedback-driven RAG systems can treat retrieval as a dynamic, learnable component rather than a static preprocessing step.
📚 https://arxiv.org/abs/2508.15437
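As a concrete example of the query-level feedback the survey covers, here is a minimal pseudo-relevance feedback loop in the Rocchio spirit: the query embedding is nudged toward the centroid of the top-k retrieved documents before a second retrieval pass. The corpus, mixing weight, and similarity function below are placeholder assumptions.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=5):
    """Return indices of the top-k documents by dot-product similarity."""
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:k]

def prf_expand(query_vec, doc_vecs, k=5, alpha=0.7):
    """Rocchio-style pseudo-relevance feedback: mix the original query vector
    with the centroid of the top-k retrieved documents (weights are assumed)."""
    top = retrieve(query_vec, doc_vecs, k)
    centroid = doc_vecs[top].mean(axis=0)
    new_q = alpha * query_vec + (1 - alpha) * centroid
    return new_q / (np.linalg.norm(new_q) + 1e-12)

# toy corpus of 1,000 embedded documents
docs = np.random.randn(1000, 128)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
q = docs[0] + 0.1 * np.random.randn(128)   # a query near document 0
q2 = prf_expand(q, docs)                   # feedback-refined query
print(retrieve(q2, docs, k=3))             # second-pass retrieval
```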
[3] Exploring Scaling Laws of CTR Model for Online Performance Improvement
This paper from Meituan introduces SUAN (Stacked Unified Attention Network), a CTR prediction model inspired by the scaling laws observed in LLMs. The authors propose a two-stage paradigm: first constructing a high-grade SUAN model that demonstrates scalable performance with increased model size and data volume, then using knowledge distillation to transfer this performance to a lightweight LightSUAN variant suitable for online deployment. SUAN employs stacked Unified Attention Blocks (UABs) that combine self-attention, cross-attention, and dual alignment attention mechanisms to model both sequential and non-sequential features, while incorporating LLM-inspired components like RMSNorm and SwiGLU for training stability. To address deployment constraints, the authors develop LightSUAN with sparse self-attention and parallel inference strategies, then apply online distillation to train it using a high-grade SUAN as teacher. Experiments on three datasets demonstrate that SUAN exhibits scaling laws spanning three orders of magnitude in model grade and data size, with the distilled LightSUAN outperforming the original SUAN configured one grade higher. In online A/B testing, the distilled LightSUAN achieved significant improvements of 2.81% in CTR and 1.69% in CPM while maintaining acceptable inference times.
📚 https://arxiv.org/abs/2508.15326
👨🏽💻 https://github.com/laiweijiang/SUAN
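A minimal PyTorch sketch of the online-distillation idea behind LightSUAN (illustrative only, not the paper's exact recipe): the lightweight student is trained on the click labels plus a soft-label term from the frozen high-grade teacher's predicted CTR. The model classes and loss weight are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, features, clicks, lam=0.5):
    """One training step of online distillation for a CTR student model.

    student, teacher: modules mapping features -> logits (teacher is frozen)
    features: (batch, feature_dim) tensor; clicks: (batch,) 0/1 labels
    lam: weight of the soft-label distillation term (assumed)
    """
    with torch.no_grad():
        teacher_prob = torch.sigmoid(teacher(features)).squeeze(-1)
    student_logit = student(features).squeeze(-1)

    hard_loss = F.binary_cross_entropy_with_logits(student_logit, clicks.float())
    soft_loss = F.binary_cross_entropy_with_logits(student_logit, teacher_prob)
    return hard_loss + lam * soft_loss

# toy usage with linear stand-ins for SUAN (teacher) and LightSUAN (student)
teacher = torch.nn.Linear(32, 1)
student = torch.nn.Linear(32, 1)
x, y = torch.randn(256, 32), torch.randint(0, 2, (256,))
loss = distillation_step(student, teacher, x, y)
loss.backward()
```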
[4] Closing the Performance Gap in Generative Recommenders with Collaborative Tokenization and Efficient Modeling
This paper from Lepage et al. addresses the performance gap between generative recommender systems and traditional ID-based methods. The authors introduce two key innovations: COSETTE (Collaborative and Semantic Tokenization of Text Embeddings) and MARIUS (Multi-scale Attention as Recommendation Index with fUSion). COSETTE tackles the limitation that existing generative recommenders rely on content-based tokenization without collaborative signals by incorporating a contrastive learning objective that integrates user interaction patterns directly into the item quantization process, resulting in semantic identifiers that capture both content and collaborative information. MARIUS addresses architectural inefficiencies in current encoder-decoder models by decoupling temporal sequence modeling from item decoding. Their temporal transformer processes one token per item (similar to SASRec) while the depth transformer handles autoregressive generation of discrete item codes, significantly reducing computational costs and enabling KV-caching. Experiments demonstrate that their approach substantially narrows or eliminates the performance gap with strong ID-based baselines like SASRec++, with MARIUS achieving 33× faster training and 3× faster inference than TIGER-based methods.
📚 https://arxiv.org/abs/2508.14910
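To illustrate how a collaborative signal can be injected into item tokenization, here is a toy contrastive objective in the spirit of COSETTE: items that co-occur in the same user history are pulled together in the tokenizer's latent space via InfoNCE, alongside the usual reconstruction/quantization loss. The function names, temperature, and surrounding quantizer are assumptions.

```python
import torch
import torch.nn.functional as F

def collaborative_infonce(z_anchor, z_positive, temperature=0.1):
    """InfoNCE over item latents: each anchor's positive is an item that
    co-occurred with it in some user history; other in-batch items act as negatives."""
    a = F.normalize(z_anchor, dim=-1)
    p = F.normalize(z_positive, dim=-1)
    logits = a @ p.t() / temperature                  # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

# toy usage: latents of co-interacted item pairs from the tokenizer's encoder
z_i = torch.randn(64, 128)   # anchor items
z_j = torch.randn(64, 128)   # items co-occurring with the anchors
loss = collaborative_infonce(z_i, z_j)   # added to the reconstruction/quantization loss
```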
[5] Can we Evaluate RAGs with Synthetic Data?
This study from van Elburg et al. investigates whether synthetic question-answer datasets generated by LLMs can effectively substitute for human-labeled benchmarks when evaluating RAG systems. The researchers conducted two controlled experiments across four datasets (two open-domain: SQuAD and ASQA, and two proprietary domain-specific datasets) to assess ranking consistency between synthetic and human benchmarks. In Experiment A, which varied retriever parameters while keeping the generator fixed, they found moderate to high ranking consistency (average Kendall's τ = 0.44-0.75), suggesting synthetic benchmarks can reliably evaluate retrieval configurations like the number of context documents retrieved. However, Experiment B, which compared different generator architectures (GPT-3.5, GPT-4o, Llama-7b-instruct, Claude-3-Haiku), revealed substantial inconsistencies between synthetic and human benchmark rankings, with several metric-dataset combinations yielding low or negative correlations. The authors attribute this breakdown to a task mismatch between synthetic and human benchmarks: synthetic questions tend to be more specific and technical, while human questions are often more general or ambiguous. They also point to a potential stylistic bias favoring certain generators, since the synthetic data was itself generated with GPT-4o. The findings indicate that while synthetic benchmarks offer a scalable solution for RAG evaluation, their reliability depends heavily on the alignment between task design and evaluation target, making them suitable for retriever parameter tuning but unreliable for comparing generator architectures.
📚 https://arxiv.org/abs/2508.11758
👨🏽💻 https://github.com/JonasElburgUVA/Can-we-evaluate-RAGs-with-synthetic-data/
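The ranking-consistency analysis boils down to comparing how two benchmarks order the same set of RAG configurations. A minimal version with SciPy looks like this (the scores below are made up for illustration).

```python
from scipy.stats import kendalltau

# hypothetical mean scores of five RAG configurations on each benchmark
human_scores     = [0.62, 0.58, 0.71, 0.49, 0.66]
synthetic_scores = [0.55, 0.57, 0.69, 0.41, 0.60]

tau, p_value = kendalltau(human_scores, synthetic_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
# tau near 1 -> the synthetic benchmark ranks configurations like humans do;
# tau near 0 or negative -> rankings disagree, as observed when swapping generators
```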
[6] Benefiting from Negative yet Informative Feedback by Contrasting Opposing Sequential Patterns
This paper from Ivanova et al. presents PNFRec, a sequential recommendation model that leverages both positive and negative user feedback to improve recommendation quality. The authors propose using two separate transformer encoders to process positive and negative interaction sequences independently, trained with a composite loss function that combines positive cross-entropy, negative cross-entropy, and a contrastive term designed to better distinguish opposing user preference patterns. Unlike traditional sequential recommenders that focus solely on positive interactions or treat all interactions as positive, this approach explicitly models negative feedback (such as skipped songs or low ratings) to reduce false positive recommendations while maintaining or improving true positive metrics.
📚 https://arxiv.org/abs/2508.14786
👨🏽💻 https://github.com/Veronika-Ivanova/pnfrec
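A compressed sketch of the composite objective (not the authors' implementation): separate encoders summarize the positive and negative sequences, each gets its own next-item cross-entropy, and a contrastive term keeps the two preference representations apart. The loss weights and margin below are assumptions.

```python
import torch
import torch.nn.functional as F

def pnf_loss(pos_repr, neg_repr, item_emb, pos_target, neg_target,
             alpha=1.0, beta=0.5, margin=0.5):
    """Composite loss over positive/negative sequence representations.

    pos_repr, neg_repr: (B, d) outputs of the two sequence encoders
    item_emb:           (num_items, d) shared item embedding table
    pos_target, neg_target: (B,) next positively / negatively interacted item ids
    alpha, beta, margin: assumed weights for the negative CE and contrastive terms
    """
    pos_ce = F.cross_entropy(pos_repr @ item_emb.t(), pos_target)
    neg_ce = F.cross_entropy(neg_repr @ item_emb.t(), neg_target)
    # contrastive term: penalize the two representations becoming too similar
    sim = F.cosine_similarity(pos_repr, neg_repr, dim=-1)
    contrast = F.relu(sim - margin).mean()
    return pos_ce + alpha * neg_ce + beta * contrast

# toy usage
B, d, n_items = 32, 64, 10_000
loss = pnf_loss(torch.randn(B, d), torch.randn(B, d),
                torch.randn(n_items, d),
                torch.randint(0, n_items, (B,)),
                torch.randint(0, n_items, (B,)))
```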
[7] Retrieval-Augmented Generation in Industry: An Interview Study on Use Cases, Requirements, Challenges, and Evaluation
This paper from Brehme et al. presents a comprehensive interview study with 13 industry practitioners examining the current state of RAG adoption in real-world corporate environments. Through semi-structured interviews conducted between April and June 2025, the researchers investigated four key aspects: use cases, requirements, challenges, and evaluation methods. The findings reveal that current RAG applications are predominantly limited to domain-specific question-answering tasks, with most systems still in prototype stages rather than full deployment. Industry requirements prioritize data protection, security, and quality (with average ratings of 8.5-8.9 out of 10), while ethical considerations, bias mitigation, and scalability receive significantly less attention (averaging only 5.6). The study identifies data preprocessing as the primary implementation challenge, encompassing issues with data variety, access management, identity recognition, and chunking strategies, while also highlighting difficulties with retrieval optimization and generator hallucination problems. Notably, RAG evaluation in industry relies predominantly on manual human assessment rather than the automated evaluation methods proposed in academic research, indicating a substantial gap between research advancements and practical implementation.
📚 https://arxiv.org/abs/2508.14066
[8] LongRetriever: Towards Ultra-Long Sequence based Candidate Retrieval for Recommendation
This paper from ByteDance presents LongRetriever, a practical framework that incorporates ultra-long user behavioral sequences into the candidate retrieval stage of industrial recommender systems. The framework tackles key challenges including the computational intractability of interactions between ultra-long sequences and billions of candidate items, and the need for diverse yet precise interest modeling. LongRetriever employs two core innovations: in-context training, which restructures mini-batches to ensure user subsequences and candidate items share the same interest category (mitigating data leakage), and multi-context retrieval, which partitions the item repository into category-based sub-repositories and generates multiple user representations for different interests. The system uses a search-based mechanism to filter relevant subsequences from users' lifelong behavioral data (averaging over 20,000 interactions) and represents users as multiple dense vectors across different interest domains without increasing training overhead.
📚 https://arxiv.org/abs/2508.15486
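For intuition, here is a highly simplified sketch of multi-context retrieval: the item corpus is partitioned by category, one user vector is built per category from the matching subsequence of the long history, and each vector retrieves only within its own sub-repository. The encoder and similarity search are placeholders, not the production system.

```python
import numpy as np
from collections import defaultdict

def encode(vectors):
    """Placeholder encoder: mean-pool behavior embeddings into one user vector."""
    v = np.mean(vectors, axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def multi_context_retrieve(history, item_repo, k=10):
    """history:  list of (category, behavior_embedding) from the user's lifelong log
       item_repo: dict mapping category -> (item_ids, item_matrix) sub-repository"""
    # search-like step: group the ultra-long history by interest category
    by_cat = defaultdict(list)
    for cat, emb in history:
        by_cat[cat].append(emb)

    results = {}
    for cat, embs in by_cat.items():
        if cat not in item_repo:
            continue
        user_vec = encode(embs)               # one user representation per interest
        item_ids, item_mat = item_repo[cat]
        scores = item_mat @ user_vec          # retrieve inside the sub-repository only
        top = np.argsort(-scores)[:k]
        results[cat] = [item_ids[i] for i in top]
    return results
```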
[9] Representation Quantization for Collaborative Filtering Augmentation
This paper from Kuaishou presents DQRec, a two-stage collaborative filtering framework that addresses data sparsity in recommender systems by jointly extracting behavioral patterns from user-item interaction sequences and attributes through a newly proposed Decomposition-based Quantized Variational AutoEncoder (DQ-VAE). Unlike existing approaches that rely on coarse-grained attributes or overlapping interactions to establish user-user and item-item linkages, DQRec employs singular value decomposition (SVD) to decompose pre-trained representation embeddings into distinct, decorrelated dimensions and quantizes them to generate multi-dimensional semantic ID sequences that capture user multi-aspect interests and item characteristics. The framework enhances collaborative filtering through two mechanisms: feature augmentation, where semantic IDs serve as dynamic attributes to enrich sparse user/item features, and linkage augmentation, where quantized representations identify pattern-similar neighbors to strengthen homogeneous connections and improve information diffusion.
📚 https://arxiv.org/abs/2508.11194
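A rough sketch of the decompose-then-quantize idea (not the DQ-VAE itself): SVD rotates pre-trained embeddings into decorrelated dimensions, groups of dimensions are then quantized independently with small k-means codebooks, and the resulting code indices serve as multi-dimensional semantic IDs. Group count and codebook size are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_ids(emb, n_groups=4, codebook_size=256, seed=0):
    """emb: (n_items, d) pre-trained item (or user) embeddings.
    Returns an (n_items, n_groups) array of discrete semantic IDs."""
    emb = emb - emb.mean(axis=0)
    # SVD -> decorrelated dimensions, ordered by explained variance
    _, _, Vt = np.linalg.svd(emb, full_matrices=False)
    decorrelated = emb @ Vt.T                         # rotate into the SVD basis
    groups = np.array_split(decorrelated, n_groups, axis=1)

    ids = np.zeros((emb.shape[0], n_groups), dtype=np.int64)
    for g, block in enumerate(groups):
        km = KMeans(n_clusters=codebook_size, n_init=4, random_state=seed)
        ids[:, g] = km.fit_predict(block)             # one codeword per group
    return ids

# toy usage: 5k items with 64-dim pre-trained embeddings -> 4 semantic-ID dimensions
codes = semantic_ids(np.random.randn(5000, 64), codebook_size=64)
print(codes.shape)  # (5000, 4)
```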
[10] CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity
This paper from Tencent introduces CoDiEmb, a framework that addresses the persistent challenge of negative transfer when jointly training text embedding models for Information Retrieval (IR) and Semantic Textual Similarity (STS) tasks. The authors argue that naive co-training of these fundamentally different tasks typically results in steep performance trade-offs due to their distinct characteristics: IR requires asymmetric query-document matching with sparse relevance signals, while STS demands fine-grained symmetric similarity assessment between short text pairs. CoDiEmb resolves this conflict through three key innovations: (1) task-specialized loss functions paired with a dynamic sampler that ensures single-task batches and prevents gradient interference; (2) a delta-guided model fusion strategy that computes fine-grained parameter-level merging weights by analyzing deviations from pre-trained initialization; and (3) an efficient single-stage training pipeline with cross-device negative sampling for IR and task-specific batch sizing. Experiments across 15 benchmarks using three base encoders (MiniCPM, E5, BGE) demonstrate that CoDiEmb not only mitigates cross-task trade-offs but actually outperforms both single-task specialists and naive co-training approaches.
📚 https://arxiv.org/abs/2508.11442
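To illustrate delta-guided fusion, here is a toy parameter-merging routine: for each parameter tensor, the two task-specialized checkpoints are blended with weights proportional to how far each has moved from the shared pre-trained initialization. This is a simplified, per-tensor reading of the idea (the paper works at a finer, parameter-level granularity), not its exact formula.

```python
import torch

def delta_guided_merge(pretrained, model_ir, model_sts, eps=1e-12):
    """Merge two task-specialized state dicts, weighting each parameter tensor
    by the magnitude of its deviation (delta) from the pre-trained weights."""
    merged = {}
    for name, w0 in pretrained.items():
        d_ir = model_ir[name] - w0
        d_sts = model_sts[name] - w0
        a_ir, a_sts = d_ir.abs().mean(), d_sts.abs().mean()
        alpha = a_ir / (a_ir + a_sts + eps)            # per-tensor merging weight
        merged[name] = w0 + alpha * d_ir + (1 - alpha) * d_sts
    return merged

# toy usage with three compatible state dicts
base = {"proj.weight": torch.randn(8, 8)}
ir   = {"proj.weight": base["proj.weight"] + 0.10 * torch.randn(8, 8)}
sts  = {"proj.weight": base["proj.weight"] + 0.02 * torch.randn(8, 8)}
fused = delta_guided_merge(base, ir, sts)
```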
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering domains. I explore diverse topics, such as Natural Language Processing, Large Language Models, Recommendation Systems, etc., and conduct in-depth analyses, drawing insights from the latest research papers.