Search Engines in an AI Era, Adapting In-Context Learning for Information Retrieval, and More!
Vol.76 for Oct 28 - Nov 03, 2024
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
A Framework for Evaluating LLM-Based Answer Engines, from Salesforce AI Research
A Comprehensive Survey of LLM-Based Recommender Systems Through an Industrial Lens, from Wang et al.
A Novel Metric for Evaluating Search Result Relevance, from LinkedIn
Enhancing Retrieval Models with In-Context Examples, from Tejaswi et al.
Structured Decomposition for Better Retrieval Augmented Generation, from Verma et al.
Enhancing Retrieval-Augmented Generation Through Contrastive Explanations, from Ranaldi et al.
A Reinforcement Learning Framework for Improving LLM Retrieval, from Hsu et al.
Rule-Guided Retrieval-Augmented Generation for Knowledge-Intensive QA, from Chen et al.
Efficient Collaborative Filtering via Embedding Table Analysis, from Loveland et al.
A Probabilistic Framework for Embedding-Based Retrieval, from Zhang et al.
[1] Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses
Large language model-based answer engines (like Perplexity.ai and BingChat) are increasingly replacing traditional search engines, but their limitations aren't fully understood. The researchers at Salesforce AI Research conducted a study with 21 participants to evaluate how people interact with both answer engines and traditional search engines, particularly focusing on technical queries and debate topics. Based on their findings, they identified 16 key limitations and developed corresponding design recommendations, along with 8 metrics for evaluation. When they tested these metrics on YouChat, Bing Copilot, and Perplexity AI across 303 search queries, they found concerning issues across all platforms. Most notably, the engines showed a strong tendency (50-80%) to generate one-sided answers that agreed with the bias in debate questions, with Perplexity performing particularly poorly despite generating longer responses.
📚 https://arxiv.org/abs/2410.22349
👨🏽💻 https://github.com/SalesforceAIResearch/answer-engine-eval
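To make the bias finding concrete, here is a minimal sketch of how an "answer one-sidedness" check on debate questions could be automated with an LLM judge. The prompt wording, the `call_llm` stub, and the single-word verdict scheme are illustrative assumptions, not the authors' exact protocol or metrics.

```python
# Illustrative sketch (not the authors' code): estimating how often an answer
# engine produces one-sided answers to debate-style questions, using an LLM as
# the judge. `call_llm` is a placeholder for whatever completion API you use.

from typing import Callable, List, Tuple

JUDGE_PROMPT = """You are auditing a search answer engine.
Question (a debatable topic): {question}
Engine answer: {answer}

Does the answer present only one side of the debate, or does it
acknowledge opposing viewpoints? Reply with exactly one word:
ONE-SIDED or BALANCED."""

def one_sided_rate(
    qa_pairs: List[Tuple[str, str]],       # (debate_question, engine_answer) pairs
    call_llm: Callable[[str], str],        # placeholder: prompt in, verdict text out
) -> float:
    """Fraction of answers the judge labels ONE-SIDED."""
    flagged = 0
    for question, answer in qa_pairs:
        verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
        if "ONE-SIDED" in verdict.upper():
            flagged += 1
    return flagged / max(len(qa_pairs), 1)

# Example with a stubbed judge, just to show the call shape:
fake_judge = lambda prompt: "ONE-SIDED"
print(one_sided_rate([("Should X be banned?", "Yes, because ...")], fake_judge))  # -> 1.0
```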
[2] Towards Next-Generation LLM-based Recommender Systems: A Survey and Beyond
This survey paper from Wang et al. examines the emerging role of Large Language Models in recommender systems from a fresh perspective. While previous surveys simply categorize LLM-based recommenders according to NLP techniques, this paper proposes a novel three-tier framework that traces the progression from research to real-world implementation. The tiers - Representing and Understanding, Scheming and Utilizing, and Industrial Deployment - reflect how LLMs can bridge the long-standing gap between academic research and industrial applications in recommendation systems. The authors highlight that while traditional recommender systems struggle with interpretable embeddings and complex user behaviors, LLMs offer unique advantages through their factual knowledge and reasoning capabilities. However, they also note that LLMs face limitations in industrial scenarios, such as difficulties in precise product scoring for advertising.
📚 https://arxiv.org/abs/2410.19744
👨🏽💻 https://github.com/jindongli-Ai/Next-Generation-LLM-based-Recommender-Systems-Survey
[3] Semantic Search Evaluation
This paper from LinkedIn introduces a novel approach to evaluating search system performance using semantic matching between queries and results. The authors developed a metric called "on-topic rate" (OTR) to measure result relevance on LinkedIn's content search platform. Unlike traditional engagement metrics, this method directly assesses search quality using GPT-3.5 to evaluate semantic relevance. The pipeline was specifically designed to address two key challenges: the indirect nature of existing measurement methods and the need for automated, continuous quality assessment as search patterns evolve. Implemented as a weekly benchmark at LinkedIn, the system not only monitors search performance but also helps identify improvement opportunities by analyzing off-topic results.
📚 https://arxiv.org/abs/2410.21549
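Here is a minimal sketch of an OTR-style computation as described above: judge each query-result pair with an LLM, then report the fraction judged on topic. The prompt text and the `call_gpt` stub are assumptions for illustration, not LinkedIn's production pipeline.

```python
# Illustrative "on-topic rate" (OTR) style metric: for each query, ask an LLM
# judge whether each returned result is semantically on topic, then report the
# fraction of results judged on-topic.

from typing import Callable, Dict, List

RELEVANCE_PROMPT = (
    "Query: {query}\n"
    "Result: {result}\n"
    "Is this result on topic for the query? Answer YES or NO."
)

def on_topic_rate(
    results_by_query: Dict[str, List[str]],   # query -> top-k result texts
    call_gpt: Callable[[str], str],            # stub for the LLM judge call
) -> float:
    judged, on_topic = 0, 0
    for query, results in results_by_query.items():
        for result in results:
            answer = call_gpt(RELEVANCE_PROMPT.format(query=query, result=result))
            judged += 1
            on_topic += answer.strip().upper().startswith("YES")
    return on_topic / max(judged, 1)
```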
[4] RARe: Retrieval Augmented Retrieval with In-Context Examples
This paper from Tejaswi et al. introduces RARe (Retrieval Augmented Retrieval with In-Context Examples), an approach to improve retrieval models using in-context examples. While in-context learning has proven effective for large language models, the authors found that simply prepending example query-document pairs to target queries doesn't work well for retrieval tasks. Instead, RARe finetunes pre-trained models using in-context examples that are semantically similar to the target query. Testing this approach across various architectures (including both decoder-only language models and retriever models), they demonstrated consistent performance improvements.
📚 https://arxiv.org/abs/2410.20088
👨🏽💻 https://github.com/atutej/RARe
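A rough sketch of the query-side formatting such an approach implies: pick example query-document pairs whose queries are similar to the target, then prepend them before encoding. The delimiter strings and the nearest-neighbor selection here are illustrative assumptions, not the paper's exact format.

```python
# Sketch: build a retriever input that prepends semantically similar in-context
# (query, document) example pairs to the target query.

from typing import List, Tuple
import numpy as np

def pick_similar_examples(
    target_emb: np.ndarray,                    # embedding of the target query
    pool: List[Tuple[str, str, np.ndarray]],   # (example_query, example_doc, query_emb)
    k: int = 2,
) -> List[Tuple[str, str]]:
    """Select the k example pairs whose queries are closest to the target query."""
    scores = [float(target_emb @ emb) for _, _, emb in pool]
    top = np.argsort(scores)[::-1][:k]
    return [(pool[i][0], pool[i][1]) for i in top]

def build_icl_query(target_query: str, examples: List[Tuple[str, str]]) -> str:
    """Concatenate in-context examples with the target query for the encoder."""
    parts = [f"query: {q} document: {d}" for q, d in examples]
    parts.append(f"query: {target_query}")
    return "\n".join(parts)
```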
[5] Plan×RAG: Planning-guided Retrieval Augmented Generation
This paper from Verma et al. introduces Plan×RAG (Planning-guided Retrieval Augmented Generation), a framework that reimagines how language models interact with external knowledge. Unlike traditional Retrieval Augmented Generation systems that follow a retrieve-then-reason approach, Plan×RAG introduces a plan-then-retrieve paradigm where queries are decomposed into a directed acyclic graph (DAG) of interrelated atomic sub-queries. This structure enables parallel processing and better information sharing between sub-queries. A key innovation is the use of plug-and-play experts (including critic and relevance experts) that work with frozen language models, eliminating the need for expensive fine-tuning.
📚 https://arxiv.org/abs/2410.20753
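To illustrate the plan-then-retrieve idea, here is a small sketch that resolves a DAG of sub-queries in dependency order, feeding parent answers into child sub-queries; the planner, retriever, and answerer are stand-in stubs, and independent nodes could in principle run in parallel.

```python
# Sketch: answer a query that has been decomposed into a DAG of atomic
# sub-queries. Each sub-query is retrieved and answered once its parents'
# answers are available.

from graphlib import TopologicalSorter
from typing import Callable, Dict, List

def answer_with_plan(
    sub_queries: Dict[str, str],               # node id -> sub-query text
    deps: Dict[str, List[str]],                # node id -> parent node ids
    retrieve: Callable[[str], List[str]],      # stub: sub-query -> passages
    answer: Callable[[str, List[str]], str],   # stub: (question, context) -> answer
) -> Dict[str, str]:
    answers: Dict[str, str] = {}
    for node in TopologicalSorter(deps).static_order():
        # Fill parent answers into the sub-query, then retrieve and answer it.
        question = sub_queries[node].format(**answers)
        context = retrieve(question) + [answers[p] for p in deps.get(node, [])]
        answers[node] = answer(question, context)
    return answers

# Example plan: q2 depends on q1's answer.
plan = {"q1": "Who directed Inception?", "q2": "What other films did {q1} direct?"}
dependencies = {"q1": [], "q2": ["q1"]}
```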
[6] Eliciting Critical Reasoning in Retrieval-Augmented Language Models via Contrastive Explanations
This paper from Ranaldi et al. introduces Contrastive-RAG (C-RAG), a framework designed to improve how language models critically analyze retrieved information in retrieval-augmented generation systems. While RAG has become crucial for enhancing LLMs' factual accuracy, existing systems often struggle with noisy or irrelevant retrieved content. C-RAG addresses this through a four-step process: collecting relevant passages, constructing contrastive reasoning arguments about their relevance, consolidating these into a final explanation, and generating an answer. The authors demonstrate that by using larger models to generate contrastive reasoning demonstrations, they can significantly improve smaller models' performance on retrieval-augmented tasks, achieving an average 55.4% accuracy increase over standard RAG.
📚 https://arxiv.org/abs/2410.22874
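A minimal sketch of what the later C-RAG stages might look like as a single prompt: passage collection is assumed to have happened upstream (the `passages` argument), and the template then walks the model through contrastive reasoning, consolidation, and answering. The wording is an illustrative assumption, not the paper's template.

```python
# Sketch: a prompt that asks for contrastive relevance reasoning, a
# consolidated explanation, and a final grounded answer.

from typing import List

def build_crag_prompt(question: str, passages: List[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Question: {question}\n"
        f"Retrieved passages:\n{numbered}\n\n"
        "For each passage, give a contrastive explanation: why it supports "
        "answering the question, and why it might be irrelevant or misleading.\n"
        "Then consolidate these explanations into a short note on which evidence to trust.\n"
        "Finally, answer the question using only the trusted evidence.\n"
        "Answer:"
    )
```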
[7] Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval
This paper from Hsu et al. introduces Learning to Retrieve by Trying (LeReT), a reinforcement learning framework that tackles a persistent challenge in large language models – their tendency to hallucinate when they can't effectively search for and ground their responses in reliable sources. The key innovation lies in treating the search query generation as a reinforcement learning problem, where the model learns to improve its queries through trial and error. LeReT works by diversifying search queries using varied few-shot examples in prompts, followed by context distillation and preference-based optimization. While the paper focuses on search and retrieval, the authors note that their framework could be adapted for any scenario where language models need to learn to use external tools effectively.
📚 https://arxiv.org/abs/2410.23214
👨🏽💻 https://sherylhsu.com/LeReT/
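The trial-and-error loop can be sketched as follows: sample several candidate search queries for the same question, score each by how well its retrieved results cover known gold documents, and keep (preferred, rejected) pairs for preference-based optimization. The recall-style reward and the stubbed samplers are assumptions for illustration.

```python
# Sketch: collect preference pairs over search queries by comparing their
# retrieval rewards.

from typing import Callable, List, Set, Tuple

def retrieval_reward(retrieved_ids: List[str], gold_ids: Set[str]) -> float:
    """Recall of gold documents among the retrieved results (assumed reward)."""
    if not gold_ids:
        return 0.0
    return len(gold_ids.intersection(retrieved_ids)) / len(gold_ids)

def collect_preference_pairs(
    question: str,
    gold_ids: Set[str],
    sample_query: Callable[[str, int], str],   # stub: (question, prompt_variant) -> query
    search: Callable[[str], List[str]],        # stub: query -> retrieved doc ids
    n_variants: int = 4,
) -> List[Tuple[str, str]]:
    """Return (preferred_query, rejected_query) pairs ranked by retrieval reward."""
    candidates = [sample_query(question, i) for i in range(n_variants)]
    scored = sorted(candidates,
                    key=lambda q: retrieval_reward(search(q), gold_ids),
                    reverse=True)
    # Pair the best query against each strictly worse candidate.
    return [(scored[0], worse) for worse in scored[1:]]
```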
[8] RuleRAG: Rule-guided retrieval-augmented generation with language models for question answering
This paper from Chen et al. introduces RuleRAG, a framework that explicitly incorporates symbolic rules to improve both retrieval and answer generation. Traditional RAG systems often struggle because they don't specify retrieval preferences or guide the model on how to use retrieved documents effectively. RuleRAG tackles this with two approaches: RuleRAG-ICL, which uses in-context learning with symbolic rules to guide both retrieval and generation, and RuleRAG-FT, which fine-tunes the models using these rules for better instruction-following. For example, when determining someone's nationality, instead of searching only for direct mentions, the system can apply a rule such as "people often have the nationality of their birth country" to guide more logical and effective retrieval.
📚 https://arxiv.org/abs/2410.22353
👨🏽💻 https://github.com/chenzhongwu20/RuleRAG_ICL_FT
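Here is a small sketch of the in-context flavor of this idea: attach a rule both to the retrieval query (so the retriever looks for premise facts, not just direct mentions) and to the generation prompt (so the model is told how to apply the evidence). The formatting and the example rule are illustrative assumptions, not the repo's implementation.

```python
# Sketch: rule-guided retrieval and rule-guided answer prompting.

from typing import Callable, List

RULE = "People often have the nationality of the country they were born in."

def rule_guided_retrieve(
    question: str,
    rule: str,
    retrieve: Callable[[str], List[str]],   # stub: text query -> passages
) -> List[str]:
    # Steer retrieval toward documents that satisfy the rule's premise.
    return retrieve(f"{rule} {question}")

def rule_guided_prompt(question: str, rule: str, passages: List[str]) -> str:
    context = "\n".join(passages)
    return (
        f"Rule: {rule}\n"
        f"Retrieved evidence:\n{context}\n"
        f"Question: {question}\n"
        "Answer the question by applying the rule to the evidence."
    )
```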
[9] Understanding and Scaling Collaborative Filtering Optimization from the Perspective of Matrix Rank
This paper from Loveland et al. addresses efficiency challenges in Collaborative Filtering (CF) recommender systems by examining the mathematical properties of embedding tables. Through theoretical and empirical analysis, they discovered that the singular values of embedding matrices are inherently connected to different CF loss functions and their negative sampling strategies. This insight led them to develop a novel warm-start strategy that regularizes the stable rank of user and item embeddings during early training phases. Their approach yields impressive results: up to 66% improvement in training speed for complex loss functions like DirectAU, and up to 21% performance gains for simpler loss functions like Bayesian Personalized Ranking (BPR).
📚 https://arxiv.org/abs/2410.23300
👨🏽💻 https://anonymous.4open.science/r/StableRankReg-7BF2/
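For intuition, the stable rank of an embedding table E is the ratio of its squared Frobenius norm to its squared spectral norm. A minimal PyTorch sketch of a warm-start regularizer in that spirit is below; the weighting schedule and the direction of the penalty (whether stable rank is pushed up or down, and for which base loss) are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch: stable rank of an embedding table and a warm-start loss term.

import torch

def stable_rank(emb: torch.Tensor) -> torch.Tensor:
    """||E||_F^2 / sigma_max(E)^2 for an (num_entities x dim) embedding table."""
    fro_sq = emb.pow(2).sum()
    sigma_max = torch.linalg.matrix_norm(emb, ord=2)   # largest singular value
    return fro_sq / (sigma_max ** 2 + 1e-12)

def warm_start_loss(cf_loss: torch.Tensor,
                    user_emb: torch.Tensor,
                    item_emb: torch.Tensor,
                    epoch: int,
                    warmup_epochs: int = 5,
                    weight: float = 1e-3) -> torch.Tensor:
    """Apply the stable-rank term only during the warm-up phase of training.
    The sign of `weight` controls whether stable rank is encouraged or
    suppressed; treat it as a tunable assumption here."""
    if epoch >= warmup_epochs:
        return cf_loss
    reg = stable_rank(user_emb) + stable_rank(item_emb)
    return cf_loss + weight * reg
```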
[10] pEBR: A Probabilistic Approach to Embedding Based Retrieval
This paper from Zhang et al. introduces pEBR (probabilistic Embedding Based Retrieval), a framework that addresses a key limitation of current retrieval systems: they retrieve a fixed number of items for every query. Instead of treating all queries the same way - which leads to insufficient results for broad queries like "gift" and irrelevant results for specific ones like "iPhone 14" - pEBR learns the actual distribution of relevant items for different types of queries. Moving away from traditional frequentist approaches, the system sets dynamic similarity thresholds using the cumulative distribution function (CDF) of a learned probabilistic model. As the first application of probabilistic modeling to embedding-based retrieval, this work replaces conventional heuristic solutions with a more theoretically grounded approach.
📚 https://arxiv.org/abs/2410.19349
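To make the dynamic-threshold idea concrete, here is a toy sketch: model the per-query distribution of relevant-item scores, then cut off candidates at a chosen quantile of that distribution instead of a fixed top-k. The normal distribution, its parameters, and the quantile are stand-in assumptions, not the distribution pEBR actually learns.

```python
# Sketch: per-query dynamic similarity cutoff from an assumed score distribution.

import numpy as np
from scipy import stats

def dynamic_cutoff(scores: np.ndarray, mu: float, sigma: float, p: float = 0.2) -> np.ndarray:
    """
    Keep candidates whose similarity exceeds the threshold where the assumed
    relevant-score CDF equals p.
    scores    : similarity of every candidate item to the query
    mu, sigma : per-query parameters of the assumed relevant-score distribution
    """
    threshold = stats.norm(mu, sigma).ppf(p)
    return np.flatnonzero(scores >= threshold)

# Broad query -> wide, lower-centered distribution -> lower threshold -> more items kept.
broad = dynamic_cutoff(np.random.rand(1000), mu=0.55, sigma=0.15)
# Specific query -> concentrated, higher distribution -> higher threshold -> fewer items kept.
narrow = dynamic_cutoff(np.random.rand(1000), mu=0.85, sigma=0.05)
```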
Extras: Datasets
💾 RecFlow: An Industrial Full Flow Recommendation Dataset
RecFlow is a comprehensive dataset from Kuaishou that captures the full pipeline of an industrial recommendation system across six stages, from retrieval to edge ranking. Unlike traditional recommendation datasets that only contain exposed items (ones shown to users), RecFlow includes both exposed and unexposed items filtered at each stage of the recommendation funnel. The dataset comprises 38M interactions from 42K users across 9M items, plus 1.9B stage samples from 9.3M online requests over 37 days. What makes RecFlow particularly valuable is its inclusion of multiple types of user feedback (views, likes, shares), context features, timestamps, and stage-specific information, enabling researchers to study critical issues like selection bias, multi-stage optimization, and the gap between offline training and online serving environments.
📝 https://arxiv.org/abs/2410.20868
👨🏽💻 https://github.com/RecFlow-ICLR/RecFlow
💾 AmazonQAC: A large-scale, naturalistic query autocomplete dataset
AmazonQAC is a large-scale dataset from Amazon for Query Autocomplete research, containing 395M real-world examples from Amazon Search logs. It captures actual user typing sequences, including the prefixes users type before reaching their final search terms, along with session IDs and timestamps.
👨🏽💻 https://huggingface.co/datasets/amazon/AmazonQAC
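If you want to poke at it, the standard Hugging Face `datasets` loader should work for this repo; note that the split name and the field names mentioned in the comment are assumptions, so check the dataset card for the actual schema.

```python
# Quick, streaming look at a few examples without downloading the full dataset.
from datasets import load_dataset

qac = load_dataset("amazon/AmazonQAC", split="train", streaming=True)
for example in qac.take(3):
    # Expect fields along the lines of typed prefixes, the final search term,
    # a session id, and a timestamp (names per the dataset card).
    print(example)
```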
Extras: Benchmarks
⏱ Long2RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall
Long2RAG is a new benchmark for evaluating retrieval-augmented generation (RAG) systems, specifically focused on handling long-context retrieval and long-form responses. The dataset contains 280 questions across 10 domains and 8 categories, with each question paired with 5 retrieved documents averaging 2,444 words in length.
📝 https://arxiv.org/abs/2410.23000
👨🏽💻 https://github.com/QZH-777/longrag
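A toy sketch of a key-point-recall style score: given the key points annotated for a question, count how many are covered by the generated long-form answer. The benchmark's actual judging is more sophisticated (LLM-based coverage rather than string matching), so the default `covers` check below is a simplifying assumption.

```python
# Sketch: fraction of annotated key points covered by a long-form response.
from typing import Callable, List

def key_point_recall(
    response: str,
    key_points: List[str],
    covers: Callable[[str, str], bool] = lambda resp, kp: kp.lower() in resp.lower(),
) -> float:
    """Fraction of key points the response covers, under the given coverage test."""
    if not key_points:
        return 0.0
    hits = sum(covers(response, kp) for kp in key_points)
    return hits / len(key_points)

print(key_point_recall(
    "The Eiffel Tower opened in 1889 and stands 330 metres tall.",
    ["opened in 1889", "located in Paris"],
))  # -> 0.5
```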
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about the Machine Learning, Deep Learning, MLOps, and Software Engineering domains, exploring topics such as Natural Language Processing, Large Language Models, and Recommendation Systems, and conducting in-depth analyses that draw insights from the latest research papers.