Understanding Context Sufficiency in RAG Systems, A Systematic Review of Language Model Representations, and More!
Vol.78 for Nov 11 - Nov 17, 2024
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
A Systematic Review of Language Model Representations, from Zhang et al.
Adaptive RAG Through Knowledge Boundary Detection, from Alibaba
Understanding Context Sufficiency in RAG Systems, from Joren et al.
A Cache-Driven Approach to Scaling Recommendation Systems, from Kuaishou
A General Framework for Content Annotation and Retrieval, from Clarke et al.
A Comprehensive Framework for Multimodal Long Document Understanding, from Alibaba
Specialized Text Embeddings for Financial Document Understanding, from Balyasny Asset Management
A Theoretical Framework for Comparing Recommendation Loss Functions, from Teodoro et al.
Understanding Document Retrieval Trade-offs in RAG Systems, from Intel Labs
A Systematic Evaluation of Commercial LLM Fine-tuning Services, from Stanford University
[1] From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models
This comprehensive survey from Zhang et al. examines the evolution and current state of word embeddings and language models, from foundational concepts to cutting-edge applications. The paper traces the progression from simple word representations like one-hot encoding to sophisticated contextual embeddings used in models such as BERT, GPT, and XLNet. It explores how these embeddings capture semantic relationships and handle challenges like polysemy and multilinguality. The authors emphasize the grounding of language models in other modalities, including vision and robotics, and discuss how embeddings relate to human brain activity patterns. The survey also dedicates significant attention to advanced topics such as model compression techniques, interpretability challenges, numerical encoding, and bias mitigation strategies.
📚 https://arxiv.org/abs/2411.05036
[2] Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment
This paper from Alibaba explores the effectiveness of Retrieval-Augmented Generation in Large Language Models and introduces a Knowledge Boundary Model (KBM) to optimize retrieval decisions. The authors find that RAG's impact on LLMs can be beneficial, neutral, or harmful depending on the query type, and propose methods to evaluate LLMs' known and unknown knowledge boundaries using confidence and certainty metrics. The KBM is trained to determine whether retrieval is necessary for a given query, thereby reducing unnecessary retrievals while maintaining performance. Experiments across 11 English and Chinese datasets demonstrate that KBM significantly reduces retrieval ratios (up to 43.17% on WebQA) while maintaining comparable performance to full RAG approaches. The model proves particularly effective in handling dynamic knowledge, long-tail static knowledge, and multi-hop problems.
📚 https://arxiv.org/abs/2411.06207
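To make the retrieval-gating idea concrete, here is a minimal sketch of deciding whether a query needs retrieval at all. The helper names (`estimate_confidence`, `retrieve`, `llm_answer`) and the fixed threshold are illustrative assumptions; the paper trains a dedicated Knowledge Boundary Model rather than using a hand-set cutoff.

```python
# Minimal sketch of confidence-gated retrieval (illustrative assumptions only;
# the paper trains a Knowledge Boundary Model instead of a fixed threshold).

CONFIDENCE_THRESHOLD = 0.8  # placeholder value

def answer_with_gated_retrieval(query, llm_answer, retrieve, estimate_confidence):
    """Skip retrieval when the query falls inside the model's knowledge boundary."""
    confidence = estimate_confidence(query)          # e.g., a certainty score for the query
    if confidence >= CONFIDENCE_THRESHOLD:
        return llm_answer(query, context=None)       # answer directly, no retrieval cost
    passages = retrieve(query, top_k=5)              # otherwise fall back to standard RAG
    return llm_answer(query, context=passages)
```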
[3] Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
This paper from Joren et al. introduces the notion of "sufficient context" as a new lens for analyzing RAG systems: does the retrieved information contain enough to answer a query? The authors develop an autorater that classifies context sufficiency with 93% accuracy and use it to analyze several models and datasets. Their analysis reveals several key findings: proprietary LLMs (like Gemini, GPT, and Claude) excel at answering queries with sufficient context but often output incorrect answers instead of abstaining when context is insufficient, while open-source LLMs (like Llama, Mistral, and Gemma) tend to hallucinate or abstain even when the context is sufficient. Surprisingly, models sometimes generate correct answers even with insufficient context. Building on these insights, the authors develop a selective generation method that combines confidence and sufficient-context signals to reduce hallucinations, improving the fraction of correct answers by 2-10% across various models.
📚 https://arxiv.org/abs/2411.06037
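The selective generation method lends itself to a small sketch: combine the autorater's sufficiency label with a model confidence score and abstain when both are weak. The helper names, the threshold, and the simple decision rule below are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of selective generation for RAG (illustrative only; the helper
# functions and the abstention rule are assumptions, not the paper's method).

def selective_generate(query, passages, llm_answer, rate_sufficiency, model_confidence,
                       conf_threshold=0.5):
    sufficient = rate_sufficiency(query, passages)   # autorater: is the context sufficient?
    confidence = model_confidence(query, passages)   # model's own confidence in [0, 1]

    # Abstain when the context is insufficient and the model is also unsure;
    # this is the regime where hallucinations were most common.
    if not sufficient and confidence < conf_threshold:
        return "I don't know."
    return llm_answer(query, context=passages)
```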
[4] MARM: Unlocking the Future of Recommendation Systems through Memory Augmentation and Scalable Complexity
This paper from Kuaishou introduces MARM (Memory Augmented Recommendation Model), a method for scaling recommendation systems by leveraging memory caching to reduce computational complexity. The authors identify key differences between language models and recommendation systems: while recommendation systems enjoy abundant training data and parameter storage, they face strict computational constraints due to millisecond-level response requirements. MARM addresses this by caching intermediate calculation results, effectively reducing the time complexity of attention mechanisms from O(n²d) to O(nd). This enables the extension of single-layer attention-based sequence modeling to multiple layers without significantly increasing computational costs.
📚 https://arxiv.org/abs/2411.09425
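The caching trick is easiest to see in code: if each request only computes attention for the newly arrived behavior while earlier positions are read from a cache, the per-request cost drops from O(n²d) to O(nd). The sketch below uses an in-memory dict and a bare single-head attention; MARM's production cache and model details are not reproduced here.

```python
import numpy as np

# Illustrative sketch of incremental, cache-backed attention (not MARM's actual
# implementation). A per-user cache stores representations of past behaviors, so
# each request only computes attention for the newest position.

CACHE = {}  # user_id -> list of cached behavior representations

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                            # (d,)

def incremental_attention(user_id, new_behavior):
    """Attention output for the newest behavior only; outputs for older positions
    were already produced by earlier requests, so only one query is needed now."""
    history = CACHE.setdefault(user_id, [])
    history.append(new_behavior)                       # cache grows by one entry
    keys = values = np.stack(history)                  # (n, d)
    return attend(new_behavior, keys, values)          # O(n*d) instead of O(n^2*d)
```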
[5] Annotative Indexing
This paper from Clarke et al. introduces "annotative indexing," a novel framework that unifies and generalizes traditional inverted indexes, column stores, object stores, and graph databases. The key innovation here is representing content as a sequence of tokens in an address space, with annotations providing information about intervals over that content. Each annotation is a triple containing a feature, an interval, and a value. This approach allows for flexible handling of heterogeneous data types and formats, while maintaining efficient query processing through minimal-interval semantics. The paper presents a reference implementation called Cottontail that supports both static and dynamic indexing, with the dynamic version supporting ACID properties and multiple concurrent readers and writers.
📚 https://arxiv.org/abs/2411.06256
👨🏽💻 https://github.com/claclark/Cottontail
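A toy version of the annotation triple makes the abstraction tangible: content lives in one token address space, and every piece of structure is a (feature, interval, value) triple over it. This is a deliberately minimal in-memory sketch, not the Cottontail implementation.

```python
from collections import defaultdict

# Toy in-memory sketch of annotative indexing (not the Cottontail implementation).
# Content is a token sequence in one address space; every annotation is a
# (feature, interval, value) triple over that space.

class AnnotativeIndex:
    def __init__(self):
        self.tokens = []                            # the shared address space
        self.annotations = defaultdict(list)        # feature -> [(start, end, value)]

    def append_text(self, text, feature, value=None):
        start = len(self.tokens)
        self.tokens.extend(text.split())
        self.annotations[feature].append((start, len(self.tokens), value))

    def contained_in(self, inner, outer):
        """Intervals of `inner` lying inside some interval of `outer`."""
        outers = self.annotations[outer]
        return [(s, e, v) for (s, e, v) in self.annotations[inner]
                if any(os <= s and e <= oe for (os, oe, _) in outers)]

idx = AnnotativeIndex()
idx.append_text("Annotative Indexing", "title", value="t1")
idx.append_text("a framework unifying inverted indexes and more", "body")
idx.annotations["document"].append((0, len(idx.tokens), "doc1"))
print(idx.contained_in("title", "document"))        # [(0, 2, 't1')]
```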
[6] M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
This paper from Alibaba introduces M-LongDoc, a new benchmark for evaluating how well large multimodal models can understand and answer questions about lengthy documents containing text, figures, and tables. Unlike existing benchmarks that focus on shorter documents and simple extractive answers, M-LongDoc features 851 samples from documents averaging over 200 pages and requires in-depth, open-ended responses. The researchers also developed an automated evaluation framework using multiple "judge" models to assess response quality, showing strong correlation with human judgment. Their analysis revealed that current models struggle particularly with figure and table-based questions compared to text-based ones, highlighting a multimodal bias.
📚 https://arxiv.org/abs/2411.06176
👨🏽💻 https://multimodal-documents.github.io/
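The multi-judge evaluation can be sketched as prompting several judge models for a score and averaging the results; the prompt wording, the 1-5 scale, and the plain average below are assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of multi-judge scoring for open-ended answers (illustrative only;
# the prompt, scale, and aggregation are assumptions, and `judge_models` is a
# list of hypothetical LLM-calling functions that return a numeric rating).

def score_answer(question, evidence, answer, judge_models):
    prompt = (
        f"Question: {question}\n"
        f"Evidence: {evidence}\n"
        f"Answer: {answer}\n"
        "Rate the answer's correctness and depth from 1 to 5. Reply with a number only."
    )
    scores = [float(judge(prompt)) for judge in judge_models]
    return sum(scores) / len(scores)       # average across judges
```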
[7] Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt
This paper from Balyasny Asset Management introduces BAM embeddings, a specialized set of text embeddings optimized for financial document retrieval. The authors constructed a dataset of 14.3M query-passage pairs from 2.8M financial documents, using a carefully designed process that includes synthetic query generation through LLM prompting. Unlike general-purpose text embeddings, BAM embeddings are specifically trained to handle financial terminology, jargon, and acronyms. The model achieves 62.8% Recall@1 on their test set, significantly outperforming general-purpose embeddings like OpenAI's text-embedding-3-large (39.2%). Through ablation studies, they demonstrate the importance of hard negative mining and dataset scale in achieving this performance.
📚 https://arxiv.org/abs/2411.07142
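Since the ablations single out hard negative mining, here is a minimal sketch of the idea: for each query, keep the highest-scoring passages that are not labeled relevant and use them as negatives during fine-tuning. The function name and scoring are illustrative, not BAM's actual pipeline.

```python
import numpy as np

# Illustrative sketch of hard-negative mining for embedding fine-tuning
# (BAM's actual pipeline, filters, and models are not reproduced here).

def mine_hard_negatives(query_vec, passage_vecs, positive_ids, k=5):
    """Return the k highest-scoring passages that are NOT labeled relevant;
    these make far more informative training negatives than random passages."""
    sims = passage_vecs @ query_vec / (
        np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    ranked = np.argsort(-sims)
    return [int(i) for i in ranked if int(i) not in positive_ids][:k]
```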
[8] A Theoretical Analysis of Recommendation Loss Functions under Negative Sampling
This paper from Teodoro et al. presents a theoretical analysis of three common loss functions used in recommender systems: Binary Cross-Entropy (BCE), Categorical Cross-Entropy (CCE), and Bayesian Personalized Ranking (BPR). The authors prove that when using only one negative sample, CCE and BPR are mathematically equivalent, and all three losses share the same global minimum when item scores are bounded. The study also establishes probabilistic lower bounds for each loss function's relationship with ranking metrics like NDCG and MRR, showing that BCE provides stronger bounds than BPR, which in turn provides stronger bounds than CCE. Their results show that while using more negative samples generally improves model performance, the effectiveness of each loss function varies depending on the training stage and the number of negative samples used.
📚 https://arxiv.org/abs/2411.07770
👨🏽💻 https://anonymous.4open.science/r/recsys_losses/README.md
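Writing the three losses out for one positive score and a set of sampled negative scores makes the single-negative equivalence easy to verify; the formulations below are the standard negative-sampling forms, so normalization details may differ from the paper's exact setup.

```python
import numpy as np

# The three losses in their common negative-sampling form, for one positive
# score `pos` (float) and an array of sampled negative scores `negs` (a sketch;
# exact normalizations in the paper may differ).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(pos, negs):
    # Positive scored against label 1, each sampled negative against label 0.
    return -np.log(sigmoid(pos)) - np.sum(np.log(1.0 - sigmoid(negs)))

def cce(pos, negs):
    # Sampled softmax over the positive plus the sampled negatives.
    logits = np.concatenate(([pos], negs))
    return -(pos - np.log(np.sum(np.exp(logits))))

def bpr(pos, negs):
    # Pairwise: the positive should outscore every sampled negative.
    return -np.sum(np.log(sigmoid(pos - negs)))

# With a single negative score n, CCE and BPR coincide:
#   -log(e^pos / (e^pos + e^n)) == -log(sigmoid(pos - n))
```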
[9] Toward Optimal Search and Retrieval for RAG
This paper from Intel Labs investigates how different aspects of retrieval affect the performance of RAG pipelines, particularly for question-answering tasks. Through extensive experiments on multiple datasets (ASQA, QAMPARI, and Natural Questions), the authors make several key findings: (1) RAG performance tends to plateau after including 5-10 retrieved documents in the context, with even a single relevant ("gold") document significantly improving accuracy; (2) using approximate nearest neighbor search with lower accuracy (down to 70% search recall) only minimally impacts RAG performance while potentially offering significant speed and memory benefits; (3) contrary to previous research, adding noisy or less relevant documents to the context consistently degrades performance rather than improving it. The study also evaluated two language models (Mistral and LLaMA) and two retrieval models (BGE and ColBERT), finding that while ColBERT generally performed slightly better, both retrievers showed similar patterns in their relationship to downstream task performance.
📚 https://arxiv.org/abs/2411.07396
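Finding (2) is straightforward to probe with any approximate nearest neighbor index that exposes a search-accuracy knob. The sketch below uses a FAISS IVF index whose nprobe parameter trades recall for speed; the dimensions, cluster count, and random vectors are placeholders, and the paper's actual retrieval stack is not reproduced.

```python
import numpy as np
import faiss  # standard ANN library, used here only as an example

# Sketch of trading search recall for speed with an IVF index (placeholder data
# and parameters; not the paper's exact retrieval setup).
d, nlist = 768, 1024
passages = np.random.rand(100_000, d).astype("float32")   # stand-in passage embeddings
queries = np.random.rand(16, d).astype("float32")         # stand-in query embeddings

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(passages)
index.add(passages)

# Lower nprobe means faster search but lower recall; per the paper, dropping to
# roughly 70% search recall barely moves downstream RAG accuracy.
for nprobe in (1, 8, 64):
    index.nprobe = nprobe
    _, ids = index.search(queries, 10)   # top-10 passage ids per query
    # ...pass the retrieved passages to the generator and measure QA accuracy...
```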
[10] FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?
This paper from Stanford University introduces FineTuneBench, a comprehensive evaluation framework and dataset designed to assess how effectively commercial fine-tuning APIs can inject new knowledge into LLMs. The researchers evaluated five models from OpenAI and Google (GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Gemini 1.5 Pro, and Gemini 1.5 Flash) across four different tasks: learning the latest news, remembering fictional people profiles, updating medical guidelines, and adapting to code changes. The results reveal significant limitations in these models' ability to learn and generalize new information through fine-tuning, with an average generalization accuracy of only 37% for new knowledge and 19% for updating existing knowledge. GPT-4o-mini performed best among all models, while the Gemini models struggled significantly and were barely able to memorize even the training data.
📚 https://arxiv.org/abs/2411.05059
👨🏽💻 https://github.com/kevinwu23/StanfordFineTuneBench
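The memorization-versus-generalization gap that FineTuneBench measures boils down to two evaluation passes: one on the exact training questions and one on rephrased or derived ones. The helpers below are hypothetical stand-ins; the real benchmark spans four task types and uses its own grading.

```python
# Minimal sketch of the memorization-vs-generalization split (illustrative only;
# `model` and `grade` are hypothetical stand-ins for a fine-tuned model API and
# an answer-grading routine).

def accuracy(model, qa_pairs, grade):
    return sum(grade(model(q), a) for q, a in qa_pairs) / len(qa_pairs)

def evaluate_knowledge_infusion(model, train_qa, rephrased_qa, grade):
    return {
        "memorization": accuracy(model, train_qa, grade),        # exact training questions
        "generalization": accuracy(model, rephrased_qa, grade),  # rephrased / derived questions
    }
```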
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering. I explore diverse topics such as Natural Language Processing, Large Language Models, and Recommendation Systems, and conduct in-depth analyses drawing insights from the latest research papers.