BM25-Guided LLM Reranking for Complex Queries, Visual Analytics for GraphRAG Pipelines, and More!
Vol.109 for Jun 16 - Jun 22, 2025
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
Zero-Shot Enhancement of LLM Rerankers through BM25 Score Integration, from Seetharaman et al.
A Visual Analytics Framework for Debugging Graph-based RAG Systems, from Zhejiang University
Production-Scale Generative Recommendations with Reinforcement Learning, from Kuaishou
A Generative Approach to Cold-Start Recommendation via Sequential User Modeling, from ByteDance
A Reference-Anchored Approach for Zero-Shot Document Ranking, from Li et al.
Structure-Aware Code Chunking for Enhanced Retrieval-Augmented Generation, from CMU
A Modular Open-Source Framework for Efficient Retrieval-Augmented Generation Research, from Zhang et al.
A Learning-to-Rank Framework for Dynamic Retriever Selection in RAG Systems, from CMU
A Comprehensive Review of Multi-Interest Recommendation, from Li et al.
Hard Negative Mining through Hierarchical User Grouping in Recommendation Systems, from Credit Karma
[1] InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking
This paper from Seetharaman et al. introduces InsertRank to enhance LLM-based document reranking by incorporating BM25 lexical scores directly into the prompt during listwise reranking for reasoning-intensive queries. The method addresses the growing need for more sophisticated retrieval systems as users pose increasingly complex queries that require reasoning over documents rather than simple keyword matching. InsertRank demonstrates consistent improvements across multiple LLM families (GPT, Gemini, and DeepSeek models) on two reasoning-centric benchmarks.
📚 https://arxiv.org/abs/2506.14086
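The core idea is simple enough to sketch: each passage in the listwise prompt is annotated with its (normalized) BM25 score, and the LLM is asked to return a permutation while reasoning over both content and scores. Below is a minimal sketch using the rank_bm25 package; the prompt wording and min-max normalization are my assumptions, not the paper's exact formulation.

```python
# Minimal sketch of BM25-score-augmented listwise reranking.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

docs = [
    "Gradient descent updates parameters along the negative gradient.",
    "BM25 is a lexical ranking function based on term frequency.",
    "Transformers use self-attention to mix token representations.",
]
query = "how does bm25 score documents"

tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)
scores = bm25.get_scores(query.lower().split())

# Normalize scores to [0, 1] so the LLM sees comparable magnitudes
# (the exact normalization InsertRank uses is an assumption here).
lo, hi = min(scores), max(scores)
norm = [(s - lo) / (hi - lo + 1e-9) for s in scores]

# Build a listwise prompt that interleaves each passage with its BM25 score.
lines = [f"[{i + 1}] (BM25: {norm[i]:.2f}) {d}" for i, d in enumerate(docs)]
prompt = (
    f"Query: {query}\n\n"
    + "\n".join(lines)
    + "\n\nRank the passages from most to least relevant, reasoning over both "
      "content and the BM25 scores. Answer with identifiers, e.g. [2] > [1] > [3]."
)
print(prompt)  # send this to any chat-completion API and parse the permutation
```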
[2] XGraphRAG: Interactive Visual Analysis for Graph-based Retrieval-Augmented Generation
This paper from Zhejiang University introduces XGraphRAG, a visual analysis framework designed to help developers understand and debug GraphRAG systems, which enhance LLMs by using knowledge graphs as intermediate representations for more structured information retrieval. The authors conducted a formative study with 5 RAG experts to identify key challenges in analyzing GraphRAG systems, including lack of traceability across complex processing pipelines, difficulty connecting numerous LLM invocations to their graph context, and insufficient support for multi-facet relevance analysis. Based on these findings, they developed XGraphRAG with four interactive views: QA & Inference Trace Views for identifying suspicious recalls through answer comparison, Topic Explore View for global relevance analysis using hierarchical circle packing visualization, Entity Explore View for local relevance analysis via node-link diagrams, and LLM Invocation View for examining detailed LLM behavior across extraction, merge, and summarization stages. Their two-stage workflow first automatically identifies suspicious retrievals by comparing actual answers with ground truth using LLM-assisted evaluation, then enables systematic analysis of missing correlations and LLM behavior tracing.
📚 https://arxiv.org/abs/2506.13782
👨🏽💻 https://github.com/Gk0Wk/XGraphRAG
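The first stage of that workflow, flagging suspicious retrievals via LLM-assisted answer comparison, reduces to a small judging loop. The sketch below is my simplification, not XGraphRAG's implementation; call_llm is a hypothetical stand-in for any chat-completion client.

```python
# Hedged sketch of LLM-assisted answer checking to flag suspicious retrievals.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical stub

JUDGE_TEMPLATE = """Question: {q}
Ground-truth answer: {gold}
System answer: {pred}
Does the system answer agree with the ground truth? Reply YES or NO."""

def flag_suspicious(records):
    """records: iterable of dicts with 'question', 'gold', 'pred', 'retrieved_ids'.
    Returns the records an LLM judge marks as wrong, i.e. the cases worth
    tracing back through the GraphRAG pipeline's views."""
    suspicious = []
    for r in records:
        verdict = call_llm(JUDGE_TEMPLATE.format(
            q=r["question"], gold=r["gold"], pred=r["pred"]))
        if verdict.strip().upper().startswith("NO"):
            suspicious.append(r)  # candidate for entity/invocation-level analysis
    return suspicious
```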
[3] OneRec Technical Report
This paper from Kuaishou presents OneRec, an end-to-end generative recommendation system that addresses fundamental limitations of traditional cascaded architectures through an encoder-decoder framework with reinforcement learning alignment. The system tokenizes videos into semantic IDs using collaborative-aware multimodal representations and RQ-Kmeans clustering, processes multi-scale user behavior through specialized pathways (static, short-term, positive-feedback, and lifelong), and generates recommendations via a Mixture-of-Experts decoder optimized through preference-aligned RL with a comprehensive reward system. Through extensive infrastructure optimizations, OneRec achieves 23.7% and 28.8% Model FLOPs Utilization during training and inference respectively, while operating at only 10.6% the OPEX of traditional recommendation pipelines. The system also exhibits scaling-law behavior for recommendation models as computational FLOPs are scaled up by 10×.
📚 https://arxiv.org/abs/2506.13695
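The semantic-ID tokenization step is worth making concrete. Below is a minimal sketch of RQ-Kmeans-style residual quantization: k-means is fit level by level on the residuals of the previous level, and each item's per-level cluster indices become its discrete tokens. Depth, codebook size, and embedding dimension here are illustrative, not OneRec's actual configuration.

```python
# Sketch of RQ-Kmeans semantic ID tokenization over item embeddings.
import numpy as np
from sklearn.cluster import KMeans

def rq_kmeans_fit(X, levels=3, codebook_size=8, seed=0):
    residual = X.copy()
    codebooks, codes = [], []
    for _ in range(levels):
        km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed).fit(residual)
        idx = km.predict(residual)
        codebooks.append(km.cluster_centers_)
        codes.append(idx)
        residual = residual - km.cluster_centers_[idx]  # quantization error goes to the next level
    return codebooks, np.stack(codes, axis=1)  # codes: (n_items, levels) semantic IDs

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(500, 32)).astype(np.float32)
_, semantic_ids = rq_kmeans_fit(item_embeddings)
print(semantic_ids[:3])  # e.g. [[5 2 7], ...] -> token sequences for the generative decoder
```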
[4] Next-User Retrieval: Enhancing Cold-Start Recommendations via Generative Next-User Modeling
This paper from ByteDance introduces Next-User Retrieval, a generative framework designed to address the item cold-start problem in recommendation systems by modeling lookalike users through transformer-based next-user prediction. The approach treats sequential users who have positively interacted with items (through likes and comments) as lookalike users and employs a transformer encoder-decoder architecture with causal attention to generate embeddings for the next potential user most likely to interact with a cold-start item. The system incorporates prefix prompt embeddings (item features like ID and category) and a learnable [CLS] token to bridge feature domain gaps, while using a combined loss function consisting of contrastive loss, cross-entropy loss, and auxiliary loss to optimize both generative ability and robustness. Deployed on Douyin's short-video platform with over 600 million daily active users, the method integrates with the existing HNSW-based retrieval system and demonstrates significant performance improvements.
📚 https://arxiv.org/abs/2506.15267
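To make the generative next-user idea concrete, here is a compact PyTorch sketch: item features form a prefix prompt, a learnable [CLS]-style token bridges domains, past lookalike users follow under causal attention, and the final hidden state is the predicted next-user embedding trained with an in-batch contrastive loss. Dimensions, depth, and the single contrastive term are simplifying assumptions (the paper combines contrastive, cross-entropy, and auxiliary losses).

```python
# Compact sketch of generative next-user prediction with a transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextUserModel(nn.Module):
    def __init__(self, d=64, n_heads=4):
        super().__init__()
        self.item_proj = nn.Linear(d, d)               # prefix prompt from item features
        self.cls = nn.Parameter(torch.randn(1, 1, d))  # learnable [CLS]-style token
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d, d)

    def forward(self, item_feat, user_seq):
        # item_feat: (B, d); user_seq: (B, T, d) users who already engaged
        prefix = self.item_proj(item_feat).unsqueeze(1)
        x = torch.cat([prefix, self.cls.expand(len(user_seq), -1, -1), user_seq], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))  # causal attention
        h = self.encoder(x, mask=mask)
        return F.normalize(self.out(h[:, -1]), dim=-1)  # predicted next-user embedding

model = NextUserModel()
item, history = torch.randn(8, 64), torch.randn(8, 5, 64)
next_user = F.normalize(torch.randn(8, 64), dim=-1)    # ground-truth next user
pred = model(item, history)
logits = pred @ next_user.T / 0.07                     # in-batch contrastive (InfoNCE) loss
loss = F.cross_entropy(logits, torch.arange(8))
loss.backward()
```

At serving time, the predicted embedding would be matched against real user embeddings in the existing ANN (HNSW) index, which is what lets the method slot into the production retrieval stack.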
[5] Leveraging Reference Documents for Zero-Shot Ranking via Large Language Models
This paper from Li et al. introduces RefRank, a zero-shot document ranking method that addresses the trade-off between computational efficiency and ranking accuracy in LLM-based information retrieval. While pointwise approaches are computationally efficient (O(n)) but lack inter-document comparisons, and pairwise methods provide better accuracy through explicit document comparisons but suffer from quadratic complexity (O(n²)), RefRank achieves linear time complexity while preserving comparative evaluation benefits by using a fixed reference document as an anchor for comparisons. The method selects a reference document from top-ranked initial retrieval results and prompts the LLM to evaluate each candidate document relative to this shared reference, enabling indirect comparisons between documents. To enhance robustness, the authors propose a multiple reference document ensemble strategy using weighted averaging across different reference choices, with experimental analysis suggesting optimal performance when selecting from the top-2 documents and weighting up to 5 references.
📚 https://arxiv.org/abs/2506.11452
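The reference-anchored trick is easy to see in code: n candidates need only n LLM calls per reference, versus O(n²) for full pairwise comparison. The sketch below is my approximation; the prompt wording and the 0/1 preference parsing are assumptions, and call_llm is a hypothetical stand-in for any chat-completion client.

```python
# Sketch of reference-anchored scoring with a weighted multi-reference ensemble.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical stub

PROMPT = """Query: {q}
Passage A: {ref}
Passage B: {cand}
Which passage better answers the query? Reply A or B."""

def refrank(query, candidates, references, weights):
    """Score each candidate against every reference; combine by weighted averaging."""
    totals = [0.0] * len(candidates)
    for ref, w in zip(references, weights):
        for i, cand in enumerate(candidates):
            ans = call_llm(PROMPT.format(q=query, ref=ref, cand=cand))
            totals[i] += w * (1.0 if ans.strip().upper().startswith("B") else 0.0)
    return sorted(range(len(candidates)), key=lambda i: -totals[i])  # best first

# Per the paper's analysis, references would be drawn from the top-2 of the
# initial retrieval, with up to 5 weighted references in the ensemble.
```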
[6] cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree
This paper from CMU introduces cAST (Chunking via Abstract Syntax Trees), a structure-aware method for improving code RAG by intelligently dividing source code into semantically coherent chunks that respect syntactic boundaries. Unlike traditional line-based chunking approaches that often break functions or merge unrelated code segments, cAST leverages Abstract Syntax Tree parsing to recursively split large AST nodes into smaller chunks while merging sibling nodes within size constraints, preserving the structural integrity of code units across programming languages. The method employs a recursive split-then-merge algorithm that maintains syntactic integrity, maximizes information density, ensures language invariance, and provides plug-and-play compatibility with existing RAG pipelines by measuring chunk size through non-whitespace characters rather than line counts.
📚 https://arxiv.org/abs/2506.15655
👨🏽💻 https://github.com/yilinjz/astchunk
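To illustrate the split-then-merge recursion, here is a toy chunker over Python's built-in ast module: oversized function/class nodes are split by their child statements, small siblings are greedily merged, and chunk size is measured in non-whitespace characters. The real astchunk library is language-agnostic (via tree-sitter parsers) and more careful about headers and nesting; this is a deliberately simplified sketch.

```python
# Toy AST-based split-then-merge chunker (Python-only, for illustration).
import ast

def nonws_len(text: str) -> int:
    # cAST measures chunk size in non-whitespace characters, not lines
    return sum(1 for c in text if not c.isspace())

def chunk_node(node, lines, budget):
    text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
    splittable = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    if nonws_len(text) <= budget or not splittable:
        return [text]                         # fits (or is atomic): keep whole
    # keep the def/class header with the first child chunk
    header = "\n".join(lines[node.lineno - 1 : node.body[0].lineno - 1])
    pieces = [header] if header else []
    for child in node.body:                   # split oversized node by children
        pieces.extend(chunk_node(child, lines, budget))
    merged = []                               # greedily merge small siblings
    for p in pieces:
        if merged and nonws_len(merged[-1]) + nonws_len(p) <= budget:
            merged[-1] += "\n" + p
        else:
            merged.append(p)
    return merged

source = '''\
import math

def area(r):
    return math.pi * r * r

class Shape:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return "shape: " + self.name
'''
lines = source.splitlines()
chunks = []
for stmt in ast.parse(source).body:
    chunks.extend(chunk_node(stmt, lines, budget=60))
for i, c in enumerate(chunks):
    print(f"--- chunk {i} ({nonws_len(c)} non-ws chars) ---\n{c}")
```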
[7] FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation
This paper from Zhang et al. introduces FlexRAG, a comprehensive open-source framework for RAG research and prototyping that addresses several limitations in existing frameworks including algorithm reproduction difficulties, lack of advanced techniques, and high system overhead. The framework features a modular architecture comprising twelve core modules organized into four functional groups: models (encoders, generators, rerankers), retrievers (Web Retriever, API-Based Retriever, FlexRetriever), system development components (preprocessors, refiners, assistants), and evaluation tools. FlexRAG distinguishes itself through four key capabilities: research-oriented design with unified configuration management and Hugging Face Hub integration for community sharing, extensive infrastructure including bilingual documentation and command-line tools, comprehensive technical support spanning text-based, multimodal, and web-based RAG with end-to-end pipeline coverage, and superior performance through asynchronous processing, persistent caching, and memory mapping that consumes only one-tenth the CPU and memory resources of comparable frameworks.
📚 https://arxiv.org/abs/2506.12494
👨🏽💻 https://github.com/ictnlp/FlexRAG
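FlexRAG's module groups map onto the familiar retrieve/refine/generate decomposition. The sketch below shows that decomposition generically to illustrate why swapping a component is a one-argument change; all names here are hypothetical and do not reflect FlexRAG's actual API, for which see the repo's documentation.

```python
# Generic sketch of the modular decomposition a framework like FlexRAG
# organizes its components around; names are hypothetical, not FlexRAG's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Passage:
    text: str
    score: float = 0.0

Retriever = Callable[[str], List[Passage]]                # web, API-based, or local index
Refiner = Callable[[str, List[Passage]], List[Passage]]   # rerank / compress context
Generator = Callable[[str, List[Passage]], str]           # LLM answer synthesis

def rag_pipeline(query: str, retrieve: Retriever,
                 refine: Refiner, generate: Generator) -> str:
    passages = retrieve(query)
    passages = refine(query, passages)
    return generate(query, passages)

# Swapping any stage is a one-argument change; assistants and evaluators can
# wrap this pipeline rather than fork it, which is the point of modular design.
```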
[8] LTRR: Learning To Rank Retrievers for LLMs
This paper from CMU introduces LTRR (Learning to Rank Retrievers), a query routing framework that addresses the limitation of relying on single fixed retrievers by dynamically selecting the optimal retriever from a pool based on query characteristics. The authors frame retriever selection as a learning-to-rank problem that directly optimizes for downstream LLM utility rather than traditional retrieval metrics, incorporating both retriever selection ("where to query") and retrieval necessity ("when to query") by including a no-retrieval option.
📚 https://arxiv.org/abs/2506.13743
👨🏽💻 https://github.com/kimdanny/Starlight-LiveRAG
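As a concrete routing sketch: train a model to predict downstream answer utility for each (query, retriever) pair, then route each query to the arm with the highest predicted utility, with "no_retrieval" as one of the arms. A pointwise regressor stands in here for the paper's pairwise/listwise LTR variants, and the random utilities are placeholders for LLM-judged answer quality on training data.

```python
# Sketch of utility-based retriever routing with a no-retrieval option.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

retrievers = ["bm25", "dense", "no_retrieval"]
train_queries = ["what is bm25", "capital of france", "derive the elbo"] * 10
rng = np.random.default_rng(0)
# placeholder: real utilities come from judging downstream answer quality
utilities = rng.random((len(train_queries), len(retrievers)))

vec = TfidfVectorizer().fit(train_queries)
Q = vec.transform(train_queries).toarray()

models = [GradientBoostingRegressor(random_state=0).fit(Q, utilities[:, j])
          for j in range(len(retrievers))]  # one utility model per retriever arm

def route(query: str) -> str:
    q = vec.transform([query]).toarray()
    scores = [m.predict(q)[0] for m in models]
    return retrievers[int(np.argmax(scores))]  # "when to query" falls out of the argmax

print(route("who wrote hamlet"))
```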
[9] Multi-Interest Recommendation: A Survey
This paper from Li et al. presents a comprehensive survey of multi-interest recommendation systems, which model users' diverse and multifaceted preferences in recommendation scenarios. The authors analyze over 170 research works spanning from 2005 to 2025, systematically categorizing multi-interest modeling approaches into explicit methods (using external information like user behaviors, temporal patterns, item attributes, and multi-modal data) and implicit methods (learning directly from interaction histories). The survey establishes a technical framework consisting of two main components: multi-interest extractors (primarily using dynamic routing, attention mechanisms, or non-linear transformations) and multi-interest aggregators (employing either representation aggregation or recommendation aggregation strategies). Key motivations include fine-grained modeling of user preferences and item aspects, enhanced recommendation diversity, and improved explainability.
📚 https://arxiv.org/abs/2506.15284
👨🏽💻 https://github.com/WHUIR/Multi-Interest-Recommendation-A-Survey
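The extractor/aggregator split the survey describes can be shown in a few lines of PyTorch: K learnable interest queries attend over a user's interaction history to extract K interest vectors, and candidate scoring takes the max over interests (a recommendation-aggregation strategy). Sizes are illustrative; this is one minimal instance of the attention-based family, not any specific surveyed model.

```python
# Minimal attention-based multi-interest extractor with max-over-interests scoring.
import torch
import torch.nn as nn

class MultiInterestExtractor(nn.Module):
    def __init__(self, d=64, num_interests=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_interests, d))  # K interest probes
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, seq):                      # seq: (B, T, d) item embeddings
        q = self.queries.unsqueeze(0).expand(seq.size(0), -1, -1)
        interests, _ = self.attn(q, seq, seq)    # (B, K, d): one vector per interest
        return interests

extractor = MultiInterestExtractor()
history = torch.randn(2, 20, 64)                 # 20 past interactions per user
candidates = torch.randn(2, 100, 64)             # 100 candidate items
interests = extractor(history)
scores = torch.einsum("bkd,bnd->bkn", interests, candidates).max(dim=1).values
print(scores.shape)  # (2, 100): each candidate scored by its best-matching interest
```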
[10] Hierarchical Group-wise Ranking Framework for Recommendation Models
This paper from Credit Karma introduces a Hierarchical Group-wise Ranking Framework designed to improve CTR/CVR prediction models in recommender systems by addressing the limitations of traditional in-batch negative sampling, which predominantly surfaces easy negatives that provide minimal learning signals. The framework employs residual vector quantization (RVQ) to encode user embeddings into hierarchical discrete codes, creating a trie-structured clustering system where users are organized into nested groups of increasing similarity. By applying listwise ranking losses within these hierarchical user groups, the method progressively surfaces harder negatives as it moves deeper into the hierarchy, such that shallow levels provide easier negatives from loosely similar users while deeper levels yield challenging negatives from highly similar users who share behavioral patterns and content exposure.
📚 https://arxiv.org/abs/2506.12756
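The trie-structured grouping follows directly from the RVQ codes: residual k-means assigns each user one code per level, and code prefixes define nested groups, so users sharing a longer prefix are more similar and yield harder in-group negatives. The sketch below shows the grouping mechanics with illustrative sizes; the listwise loss applied within each group is described in prose at the end.

```python
# Sketch of RVQ-based hierarchical user grouping for hard negative mining.
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def rvq_codes(U, levels=3, k=4, seed=0):
    residual, codes = U.copy(), []
    for _ in range(levels):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(residual)
        idx = km.predict(residual)
        codes.append(idx)
        residual = residual - km.cluster_centers_[idx]  # quantize the residual next
    return np.stack(codes, axis=1)            # (n_users, levels) hierarchical codes

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(1000, 16))
codes = rvq_codes(user_emb)

# Group users by code prefix: depth-1 groups are coarse (easy negatives),
# depth-3 groups are tight (hard negatives from highly similar users).
for depth in range(1, 4):
    groups = defaultdict(list)
    for uid, code in enumerate(codes):
        groups[tuple(code[:depth])].append(uid)
    sizes = [len(g) for g in groups.values()]
    print(f"depth {depth}: {len(groups)} groups, avg size {np.mean(sizes):.1f}")
# A listwise ranking loss would then rank each user's positives against items
# drawn from peers in the same group at each depth of the trie.
```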
Extras: Benchmarks
⏱️ C-SEO Bench: Does Conversational SEO Work?
The C-SEO Bench benchmark evaluates the effectiveness of conversational search engine optimization (C-SEO) methods in improving the visibility of web documents in LLM responses. It spans question answering and product recommendation tasks across six domains and supports multi-actor scenarios. The benchmark measures whether stylistic or content-based modifications to documents can improve their citation rank in LLM-generated answers.
📝 https://arxiv.org/abs/2506.11097
👨🏽💻 https://github.com/parameterlab/c-seo-bench
⏱️ T²-RAGBench: Text-and-Table Benchmark for Evaluating Retrieval-Augmented Generation
The T²-RAGBench benchmark is introduced to evaluate RAG methods on question answering tasks involving both textual and tabular data, with a focus on financial documents. It comprises 32,908 context-independent question-answer-context triples derived from over 9,000 real-world financial documents. Unlike existing datasets that operate in oracle-context settings with potentially ambiguous questions, T²-RAGBench reformulates questions to ensure a single correct answer, enabling rigorous assessment of both retrieval and numerical reasoning components in RAG pipelines.
📝 https://arxiv.org/abs/2506.12071
👨🏽💻 https://anonymous.4open.science/r/g4kmu-paper-D5F8/README.md
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering domains. I explore diverse topics, such as Natural Language Processing, Large Language Models, Recommendation Systems, etc., and conduct in-depth analyses, drawing insights from the latest research papers.