Retrieval-Augmented Visual Document Understanding, Long Context RAG Performance of LLMs and More!
Vol.77 for Nov 04 - Nov 10, 2024
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
Bridging Visual and Textual Document Understanding with Multi-modal Retrieval, from Bloomberg
Best Practices for Distilling Large Language Models into BERT for Web Search Ranking, from Tencent
Long Context RAG Performance of Large Language Models, from Databricks
A Pairwise Perspective on Softmax Loss for Robust Recommendation, from Zhejiang University
Efficient HTML Processing for RAG Systems, from Tan et al.
A Universal Framework for Multimodal Retrieval with Large Language Models, from NVIDIA
Rationale-Guided Retrieval Augmented Generation for Medical Question Answering, from Korea University
A Reasoning-Based Framework for Document Reranking with Large Language Models, from Salesforce AI Research
Efficient RAG Through Cost-Constrained Chunk Optimization and Dynamic Configuration, from NTU
A Comprehensive Assessment of LLMs in Recommendation, from Jiang et al.
[1] M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
The paper from Bloomberg introduces M3DocRAG, a framework that combines multi-modal retrieval and language models to handle document-based visual question answering across multiple pages and documents. Unlike existing approaches that either focus on single-page documents or rely on text extraction, M3DocRAG processes documents as images and can efficiently retrieve relevant pages before generating answers. The researchers also present M3DocVQA, a new benchmark dataset containing over 2,400 multi-hop questions across more than 3,000 PDF documents (totaling 41,000+ pages), designed to evaluate open-domain document understanding capabilities. M3DocRAG outperforms existing methods across three benchmarks, achieving state-of-the-art results on MP-DocVQA using ColPali for retrieval and Qwen2-VL 7B for question answering. The framework shows particular strength in handling visual elements like charts and figures that text-based methods typically struggle with.
📚 https://arxiv.org/abs/2411.04952
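To make the retrieval step concrete, here is a minimal sketch of late-interaction (MaxSim) scoring over multi-vector page-image embeddings in the spirit of ColPali. The `embed_query`/`embed_page` helpers mentioned in the comments are hypothetical stand-ins for the actual visual retriever; this is an illustrative sketch, not the paper's reference implementation.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one page image.

    query_emb: (num_query_tokens, dim)  -- multi-vector query embedding
    page_emb:  (num_page_patches, dim)  -- multi-vector page embedding
    """
    # Similarity between every query token and every page patch.
    sim = query_emb @ page_emb.T                      # (q_tokens, patches)
    # For each query token keep its best-matching patch, then sum over tokens.
    return sim.max(dim=1).values.sum()

def retrieve_top_pages(query_emb, page_embs, k=4):
    """Rank all candidate page images and keep the top-k."""
    scores = torch.stack([maxsim_score(query_emb, p) for p in page_embs])
    top = torch.topk(scores, k=min(k, len(page_embs)))
    return top.indices.tolist(), top.values.tolist()

# Hypothetical usage: hypothetical embed_query / embed_page helpers would wrap
# the visual retriever (e.g. ColPali); the top-k retrieved page images are then
# passed to a vision-language model such as Qwen2-VL for answer generation.
```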
[2] Best Practices for Distilling Large Language Models into BERT for Web Search Ranking
This paper from Tencent introduces DisRanker, a novel framework for distilling large language models' ranking capabilities into smaller BERT models for web search applications. The researchers address the practical challenge of deploying LLMs in commercial search systems by developing a three-stage approach: First, they perform domain-specific continued pre-training where LLMs learn to generate clicked titles and summaries from queries. Second, they fine-tune the LLM using a rank loss function, utilizing the end-of-sequence token to represent query-document pairs. Finally, they employ a hybrid distillation approach combining point-wise and margin MSE losses to transfer the LLM's knowledge to a more efficient BERT model.
📚 https://arxiv.org/abs/2411.04539
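As an illustration of the hybrid distillation objective described above, here is a minimal PyTorch sketch that combines a point-wise MSE term with a Margin-MSE term. The weighting factor and score normalization are assumptions for illustration; treat this as a sketch of the general technique rather than DisRanker's implementation.

```python
import torch
import torch.nn.functional as F

def hybrid_distill_loss(student_pos, student_neg, teacher_pos, teacher_neg, alpha=0.5):
    """Distill an LLM ranker (teacher) into a BERT ranker (student).

    All inputs are relevance scores for (query, doc+) and (query, doc-) pairs,
    shape (batch,). `alpha` balances the two terms (a hypothetical default).
    """
    # Point-wise term: student scores regress toward the teacher's scores.
    pointwise = F.mse_loss(student_pos, teacher_pos) + F.mse_loss(student_neg, teacher_neg)

    # Margin-MSE term: match the teacher's score *margin* between the positive
    # and negative document, which transfers the teacher's ranking behaviour.
    margin_mse = F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)

    return alpha * pointwise + (1.0 - alpha) * margin_mse
```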
[3] Long Context RAG Performance of Large Language Models
This paper from Databricks presents a comprehensive study of how increased context length affects Retrieval Augmented Generation performance across 20 popular LLMs. The researchers evaluated the models at context lengths ranging from 2,000 to 128,000 tokens (and up to 2 million tokens where supported) on three domain-specific datasets. Their key findings reveal that while retrieving more documents can improve performance, only the most recent state-of-the-art LLMs (like o1, GPT-4o, Claude 3.5, Gemini 1.5, and Qwen 2 70B) can maintain consistent accuracy at context lengths above 64k tokens. Most other models show performance degradation after 16-32k tokens. The researchers also identified distinct failure modes in long context scenarios: Claude 3 Sonnet often refused to answer due to perceived copyright concerns, Gemini 1.5 Pro encountered issues with overly sensitive safety filters, and open-source models like Mixtral and DBRX showed various patterns of content repetition or instruction-following failures.
📚 https://arxiv.org/abs/2411.03538
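To make the evaluation setup concrete: one simple way to sweep context lengths is to keep adding top-ranked retrieved chunks to the prompt until a target token budget is filled, as in the hedged sketch below. The tokenizer callable and chunk format are assumptions, not details from the paper.

```python
def pack_context(retrieved_chunks, count_tokens, budget_tokens):
    """Fill the prompt with top-ranked chunks until a token budget is reached.

    retrieved_chunks: list of strings, already sorted by retrieval score.
    count_tokens:     callable mapping text -> token count (the model's tokenizer).
    budget_tokens:    e.g. 2_000, 16_000, 64_000, 128_000 for a length sweep.
    """
    packed, used = [], 0
    for chunk in retrieved_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return "\n\n".join(packed)
```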
[4] PSL: Rethinking and Improving Softmax Loss from Pairwise Perspective for Recommendation
This paper from Zhejiang University introduces Pairwise Softmax Loss (PSL), a new family of loss functions that improves upon traditional Softmax Loss (SL) for recommendation systems. The authors identify two key limitations of SL: its loose relationship with ranking metrics like DCG and its sensitivity to false negative instances. PSL addresses these issues by reformulating SL from a pairwise perspective and replacing its exponential function with more appropriate activation functions (like ReLU, Tanh, or Atan). The authors demonstrate three main advantages of PSL: it serves as a tighter surrogate for ranking metrics, provides better control over data contribution weights, and can be interpreted as a BPR loss enhanced by Distributionally Robust Optimization (DRO).
📚 https://arxiv.org/abs/2411.00163
👨🏽💻 https://github.com/Tiny-Snow/IR-Benchmark
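The PyTorch snippet below sketches one plausible instantiation of the pairwise view: the positive-negative score gap is passed through a milder surrogate activation (ReLU, Tanh, or Atan) before the softmax-style aggregation. The exact surrogate form and temperature handling in the paper may differ, so treat this as an illustrative sketch and refer to the authors' repository for their implementation.

```python
import torch

_ACTIVATIONS = {
    "relu": torch.relu,
    "tanh": torch.tanh,
    "atan": torch.atan,
}

def pairwise_softmax_loss(pos_scores, neg_scores, activation="tanh", tau=1.0):
    """Sketch of a PSL-style loss.

    pos_scores: (batch,)            score of the positive item per user
    neg_scores: (batch, num_negs)   scores of sampled negative items
    """
    act = _ACTIVATIONS[activation]
    # Pairwise score gaps d = s_neg - s_pos; standard softmax loss aggregates
    # exp(d / tau), while PSL applies a milder activation to the gap first.
    gaps = neg_scores - pos_scores.unsqueeze(1)        # (batch, num_negs)
    weights = torch.exp(act(gaps) / tau)
    return torch.log1p(weights.sum(dim=1)).mean()
```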
[5] HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
This paper from Tan et al. introduces HtmlRAG, a novel approach that uses HTML instead of plain text as the format for retrieved knowledge in Retrieval-Augmented Generation systems. The authors argue that converting HTML to plain text loses structural and semantic information, such as headings and table structures. To address the challenge of long HTML documents containing noise (like CSS and JavaScript), they propose a pipeline of HTML cleaning to remove irrelevant content, followed by a two-step block-tree-based pruning method that uses both embedding and generative models to retain only relevant HTML content. The pruning process first builds a block tree from the DOM structure, then prunes blocks based on text embedding similarity scores, and finally applies finer-grained pruning using a generative model.
📚 https://arxiv.org/abs/2411.02959
👨🏽💻 https://github.com/plageon/HtmlRAG
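As a rough illustration of the first two stages, the sketch below cleans boilerplate tags with BeautifulSoup and keeps only the blocks most similar to the query. The `embed` argument is a hypothetical embedding function, and HtmlRAG's actual block-tree construction and generative pruning are more involved (see the authors' repository).

```python
from bs4 import BeautifulSoup

def clean_html(raw_html: str) -> BeautifulSoup:
    """Strip tags that carry no retrievable content (CSS, JS, metadata)."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "noscript", "meta", "link"]):
        tag.decompose()
    return soup

def cosine(a, b):
    """Cosine similarity between two plain Python vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def prune_blocks(soup, query, embed, keep=8):
    """Keep the `keep` blocks whose embeddings are closest to the query.

    `embed` is a hypothetical callable text -> vector (e.g. a bi-encoder).
    """
    blocks = [el for el in soup.find_all(["p", "li", "table", "h1", "h2", "h3"])
              if el.get_text(strip=True)]
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(b.get_text(" ", strip=True))), b) for b in blocks]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [b for _, b in scored[:keep]]
```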
[6] MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
This paper from NVIDIA introduces MM-Embed, an approach to universal multimodal retrieval that leverages multimodal large language models (MLLMs) to handle diverse search scenarios involving multiple modalities (text, images, or combinations) and various retrieval tasks. The authors identify that while MLLM-based retrievers excel at handling complex multimodal queries, they underperform CLIP-based models in cross-modal retrieval tasks due to modality bias. To address this, they propose modality-aware hard negative mining and continual text-to-text retrieval fine-tuning. Their final model, MM-Embed, achieves state-of-the-art performance on both the M-BEIR multimodal retrieval benchmark and the MTEB text retrieval benchmark. Additionally, they explore using MLLMs as zero-shot rerankers, finding that this approach can significantly improve performance on challenging tasks.
📚 https://arxiv.org/abs/2411.02571
👨🏽💻 https://huggingface.co/nvidia/MM-Embed
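Below is a hedged sketch of what modality-aware hard negative mining could look like in practice: mined negatives are bucketed by whether their modality matches the task's target modality, so the training recipe can deliberately balance same-modality and cross-modality negatives instead of sampling blindly. The field names and bucketing rule are assumptions for illustration, not NVIDIA's implementation.

```python
def mine_modality_aware_negatives(candidates, target_modality, per_group=3):
    """Partition mined hard negatives by modality relative to the task target.

    candidates: retriever's top candidates, sorted by score descending, as dicts
        like {"id": ..., "modality": "text" | "image" | "image+text",
              "score": float, "is_positive": bool}.
    target_modality: the modality the instruction asks for (e.g. "image").
    """
    same, cross = [], []
    for cand in candidates:
        if cand["is_positive"]:
            continue
        bucket = same if cand["modality"] == target_modality else cross
        if len(bucket) < per_group:
            bucket.append(cand)
    # Training can then mix both pools; penalizing high-scoring negatives in the
    # *wrong* modality is what pushes back against modality bias.
    return {"same_modality": same, "cross_modality": cross}
```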
[7] Rationale-Guided Retrieval Augmented Generation for Medical Question Answering
This paper from Korea University introduces RAG² (RAtionale-Guided RAG), a framework designed to enhance the reliability of retrieval-augmented generation in medical question-answering. The authors address three key challenges in medical RAG: LLMs' vulnerability to irrelevant context, poorly targeted medical queries, and retriever bias toward specific corpora. RAG² incorporates three main innovations: a small filtering model trained on perplexity-based labels of rationales; LLM-generated rationales as queries; and a balanced retrieval system that draws equally from four biomedical corpora to mitigate retriever bias.
📚 https://arxiv.org/abs/2411.00300
👨🏽💻 https://github.com/dmis-lab/RAG2
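A minimal sketch of the balanced-retrieval idea (equal draws from each corpus, then a merge) is shown below. The corpus names and `search` callables are placeholders, and the paper's rationale-generation and perplexity-based filtering stages are not reproduced here.

```python
def balanced_retrieve(query, corpus_searchers, total_k=8):
    """Retrieve an equal share of documents from each biomedical corpus.

    corpus_searchers: dict mapping corpus name -> callable(query, k) returning
                      (doc_id, score) pairs; names like "pubmed" or "textbooks"
                      are placeholders for illustration.
    """
    per_corpus = max(1, total_k // len(corpus_searchers))
    results = []
    for name, search in corpus_searchers.items():
        for doc_id, score in search(query, per_corpus):
            results.append({"corpus": name, "doc_id": doc_id, "score": score})
    # Keeping every corpus represented (rather than re-sorting globally) is what
    # mitigates the retriever's bias toward any single corpus in this sketch.
    return results[:total_k]
```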
[8] JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking
This paper from Salesforce AI Research introduces JudgeRank, a method for improving document reranking in retrieval systems using LLMs. The system mimics human cognitive processes through three key steps: query analysis to identify core problems, document analysis to extract query-relevant summaries, and relevance judgment. Unlike traditional reranking methods that rely on surface-level matching or make immediate judgments, JudgeRank employs a structured reasoning process before determining document relevance. The study also revealed interesting insights about model complementarity, showing that LLMs of different sizes (8B, 70B, and 405B parameters) often make different judgments, leading to improved performance through model ensembling.
📚 https://arxiv.org/abs/2411.00142
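The three-step judging process can be sketched as a sequence of prompts to a generic `llm` completion callable. The callable and the prompt wording below are hypothetical paraphrases for illustration, not the paper's templates.

```python
def judge_relevance(llm, query: str, document: str) -> bool:
    """Reasoning-first relevance judgment in the spirit of JudgeRank.

    `llm` is a hypothetical callable: prompt string -> completion string.
    """
    # Step 1: query analysis -- restate the core problem the query is asking.
    core_problem = llm(
        f"Analyze the following query and state its core information need.\n\nQuery: {query}"
    )

    # Step 2: document analysis -- summarize the document w.r.t. that need.
    doc_summary = llm(
        "Summarize only the parts of the document relevant to this need.\n\n"
        f"Need: {core_problem}\n\nDocument: {document}"
    )

    # Step 3: relevance judgment -- a final yes/no grounded in the two analyses.
    verdict = llm(
        "Given the information need and the focused summary, answer 'yes' or 'no':\n"
        f"Need: {core_problem}\nSummary: {doc_summary}\nIs the document relevant?"
    )
    return verdict.strip().lower().startswith("yes")
```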
[9] CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation
This paper from NTU introduces CORAG, a cost-constrained retrieval optimization system for RAG that addresses three key challenges: correlations between document chunks, non-monotonic utility of chunks, and diverse query types. The system employs a Monte Carlo Tree Search (MCTS)-based framework to find optimal chunk combinations sequentially, alongside a configuration agent that predicts optimal configurations for different query types through contrastive learning. Unlike traditional RAG approaches that select chunks independently or exhaust budgets, CORAG integrates budget constraints into the optimization process and considers chunk correlations through tree-based search. The researchers evaluated CORAG against several baseline methods using the WikiPassageQA and MARCO datasets, demonstrating up to 30% performance improvement over baselines.
📚 https://arxiv.org/abs/2411.00744
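For intuition about cost-constrained chunk selection, here is a deliberately simplified greedy sketch that respects a token budget and re-scores remaining chunks against what has already been chosen. CORAG itself searches over chunk combinations with MCTS and a learned configuration agent, so treat this only as a toy stand-in.

```python
def greedy_budget_selection(chunks, utility, budget_tokens):
    """Greedily pick chunks under a token budget.

    chunks:  list of dicts {"text": str, "tokens": int}
    utility: callable(selected_texts, candidate_text) -> float, so a chunk's
             marginal value depends on what is already selected (a crude way
             to capture correlation/redundancy between chunks).
    """
    selected, used = [], 0
    remaining = list(chunks)
    while remaining:
        best, best_gain = None, float("-inf")
        for cand in remaining:
            if used + cand["tokens"] > budget_tokens:
                continue
            gain = utility([c["text"] for c in selected], cand["text"])
            if gain > best_gain:
                best, best_gain = cand, gain
        # Stop when nothing fits the budget or nothing adds value
        # (chunk utility need not be monotone, as the paper stresses).
        if best is None or best_gain <= 0:
            break
        selected.append(best)
        used += best["tokens"]
        remaining.remove(best)
    return selected
```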
[10] Beyond Utility: Evaluating LLM as Recommender
This paper from Jiang et al. introduces a comprehensive evaluation framework for LLMs as recommender systems that goes beyond traditional utility metrics. The authors identify and explore four new evaluation dimensions specific to LLMs: history length sensitivity, candidate position bias, generation-involved performance, and hallucinations. Through extensive experiments comparing seven LLMs (using three prompting strategies) with six traditional recommendation models across four datasets, the researchers uncover several key findings: LLMs excel in domains where they possess extensive knowledge, are particularly good at recommending niche items, perform well with short user histories but struggle to utilize longer ones, and suffer from position bias in candidate lists.
📚 https://arxiv.org/abs/2411.00331
👨🏽💻 https://github.com/JiangDeccc/EvaLLMasRecommender
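One of the new dimensions, candidate position bias, can be probed with a very small experiment: place the ground-truth item at each position in the candidate list and record how often the LLM still picks it (averaged over many users in a real run). The sketch below assumes a hypothetical `recommend` callable; it is not the authors' evaluation harness.

```python
from collections import defaultdict

def probe_position_bias(recommend, ground_truth, distractors):
    """Record whether the ground-truth item is chosen at each insertion position.

    recommend:    hypothetical callable(candidate_list) -> chosen item
    ground_truth: the item the user actually interacted with
    distractors:  list of other candidate items
    """
    hits = defaultdict(int)
    for pos in range(len(distractors) + 1):
        candidates = list(distractors)
        candidates.insert(pos, ground_truth)      # move the answer around
        if recommend(candidates) == ground_truth:
            hits[pos] += 1
    # Aggregated over users, an unbiased recommender yields a flat profile; a
    # skew toward early (or late) positions indicates candidate position bias.
    return dict(hits)
```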
Extras: Tools
🛠️ Lightning IR: Straightforward Fine-tuning and Inference of Transformer-based Language Models for Information Retrieval
Lightning IR is an open-source PyTorch Lightning-based framework that simplifies the fine-tuning and inference of transformer-based language models for IR tasks. The framework provides a modular and extensible architecture that supports all stages of an IR pipeline - from fine-tuning and indexing to searching and re-ranking.
📝 https://arxiv.org/abs/2411.04677
👨🏽💻 https://github.com/webis-de/lightning-ir
💾 RAG-QA arena: Evaluating domain robustness for long-form retrieval-augmented question answering
LFRQA (Long-form RobustQA) is a new benchmark designed to evaluate RAG-QA systems using LLMs. It comprises 26K queries across seven different domains, featuring human-written long-form answers that coherently integrate information from multiple documents.
👨🏽💻 https://github.com/awslabs/rag-qa-arena
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering domains. I explore diverse topics, such as Natural Language Processing, Large Language Models, Recommendation Systems, etc., and conduct in-depth analyses, drawing insights from the latest research papers.