Efficient Synthetic Data Generation for Text Embeddings, A Survey of Conversational Search, and More!
Vol.75 for Oct 21 - Oct 27, 2024
Stay Ahead of the Curve with the Latest Advancements and Discoveries in Information Retrieval.
This week’s newsletter highlights the following research:
Efficient Synthetic Data Generation for Text Embeddings Using Small Language Models, from Chen et al.
Improving Search Relevance Using Large Language Models, from Pinterest
A Technical Survey of Modern Conversational Search Systems, from Mo et al.
A Comprehensive Analysis of Negative Sampling Methods in Large-Scale Recommendation Systems, from Apple
Optimizing E-commerce Search Results Through User Intent Centrality, from Saadany et al.
Understanding the Benefits of Joint Training in LLM-Based Search and Recommendation Systems, from Spotify
A Simple Training-free Approach for LLM-based Recommendations, from Google DeepMind
Optimizing Multi-Channel Fusion for Large-Scale Recommender Systems, from SJTU
Enhancing Long-Context QA through LongRAG, from Zhao et al.
Enhancing E-commerce Content Curation through Automated High Consideration Query Ranking, from Amazon
[1] Little Giants: Synthesizing High-Quality Embedding Data at Scale
This paper from Chen et al. introduces SPEED, a framework that enables small open-source language models (8B parameters) to generate high-quality synthetic data for training text embedding models, a task that has traditionally required expensive GPT-4 API calls. The framework employs a three-model approach: a junior generator for initial synthesis, a senior generator for advanced generation, and a data revisor for self-improvement. SPEED begins by using GPT-4 to create seed data and task descriptions (sourced from the Open Directory Project for diversity), then uses supervised fine-tuning and preference optimization to train the smaller models. The approach outperforms the state-of-the-art E5-mistral embedding model while using less than a tenth of the GPT API calls.
📚 https://arxiv.org/abs/2410.18634
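The three-model pipeline above can be sketched as a simple data flow. The function names and string-transform stubs below are purely illustrative stand-ins for the actual fine-tuned models; they are not from the paper's code and only show how the junior generator, senior generator, and revisor would compose:

```python
# Illustrative sketch of SPEED's three-model synthesis flow.
# Each "model" is stubbed as a string transform; in the real framework
# these would be fine-tuned 8B language models.

def junior_generate(task_desc):
    # Junior generator: cheap first-pass synthesis from a task description.
    return f"draft example for: {task_desc}"

def senior_generate(task_desc):
    # Senior generator: higher-quality synthesis, trained with
    # supervised fine-tuning plus preference optimization.
    return f"refined example for: {task_desc}"

def revise(example):
    # Data revisor: self-improvement pass over already-generated data.
    return example.replace("draft", "revised")

def synthesize(task_descriptions):
    # Produce revised junior outputs alongside senior outputs.
    data = []
    for task in task_descriptions:
        data.append(revise(junior_generate(task)))
        data.append(senior_generate(task))
    return data
```

The point of the sketch is the division of labor: cheap drafts are revised rather than discarded, while the senior model handles the harder generations.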
[2] Improving Pinterest Search Relevance Using Large Language Models
This paper presents Pinterest's implementation of an LLM-based search relevance system that improves how accurately search results match user queries. The system combines search queries with rich content representations, including AI-generated image captions, link data, user engagement history, board information, and Pin metadata. To overcome the limitations of expensive human-labeled training data, the team employed semi-supervised learning and multilingual LLMs to expand their training dataset beyond English content. A key innovation is their use of knowledge distillation to compress the computationally intensive LLM into a more efficient student model suitable for real-time deployment. The system employs a 5-level relevance scoring guideline for more nuanced relevance judgments than simple binary classifications.
📚 https://arxiv.org/abs/2410.17152
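The distillation step can be illustrated with a generic soft-label objective over the 5 relevance levels. This is a textbook knowledge-distillation sketch, not Pinterest's actual training code; the temperature value and function names are assumptions:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft cross-entropy between temperature-scaled teacher and student
    # distributions over the 5 relevance levels: the student is trained
    # to match the large LLM's soft judgments rather than hard labels.
    t = softmax([x / temperature for x in teacher_logits])
    s = softmax([x / temperature for x in student_logits])
    return -sum(ti * math.log(si) for ti, si in zip(t, s))
```

The loss is minimized when the student reproduces the teacher's distribution, which is what makes the lightweight model serveable online while retaining the LLM's nuanced 5-level judgments.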
[3] A Survey of Conversational Search
This survey paper from Mo et al. examines the evolution and technical foundations of conversational search systems, particularly in light of recent advances in large language models. Unlike traditional keyword-based search engines, these systems enable natural language dialogue and maintain context across multiple interactions, significantly improving information retrieval and user experience. The paper breaks down four essential components: query reformulation, search clarification, conversational retrieval, and response generation. The survey notes how these systems have been successfully adapted for specialized domains like healthcare, finance, and legal services.
📚 https://arxiv.org/abs/2410.15576
[4] Evaluating Performance and Bias of Negative Sampling in Large-Scale Sequential Recommendation Models
This paper from Apple presents a comprehensive analysis of negative sampling methods in large-scale recommendation systems, addressing a crucial challenge in training models that must select from millions or billions of items. The researchers compared six different negative sampling approaches (random, popularity-based, in-batch, mixed, adaptive, and adaptive with mixed variants) using sequential recommendation models. Through extensive experiments involving hyperparameter optimization and multiple iterations across three datasets with varying popularity distributions, they discovered that commonly used random sampling tends to reinforce popularity bias while performing well for frequently chosen items. Popularity-based methods, while offering more balanced performance across item popularity ranges (head, mid, and tail), typically result in lower overall model performance.
📚 https://arxiv.org/abs/2410.17276
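The two most contrasted strategies, uniform and popularity-proportional sampling, can be sketched in a few lines. This is a minimal illustration of the general techniques the paper benchmarks, not Apple's implementation:

```python
import random

def random_negatives(catalog, positives, k, rng):
    # Uniform sampling: every non-interacted item is equally likely,
    # so head items are rarely pushed down -- reinforcing popularity bias.
    pool = [item for item in catalog if item not in positives]
    return rng.sample(pool, k)

def popularity_negatives(catalog, positives, item_counts, k, rng):
    # Popularity-proportional sampling: head items are drawn as negatives
    # more often, balancing head/mid/tail performance at some cost to
    # overall accuracy (the trade-off the paper measures).
    pool = [item for item in catalog if item not in positives]
    weights = [item_counts[item] for item in pool]
    return rng.choices(pool, weights=weights, k=k)
```

Passing an explicit `rng` keeps the sampling reproducible across the repeated runs that a comparison like this requires.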
[5] Centrality-aware Product Retrieval and Ranking
This paper from Saadany et al. addresses the challenge of improving e-commerce product search by better aligning search results with user intent. The authors identify key issues where traditional semantic similarity approaches fall short, particularly with ambiguous queries (like "iPhone 13"), repetitive terms, and alphanumeric product codes. They introduce a User-intent Centrality Optimization (UCO) approach, trained on a carefully curated dataset from eBay that includes both traditional relevance scores (1-5) and binary centrality scores indicating how well products match typical user intentions. The innovation lies in their dual-loss optimization technique, which specifically handles "hard negatives" - products that are semantically relevant but don't match the user's likely intent (like showing an iPhone case when searching for the phone itself).
📚 https://arxiv.org/abs/2410.15930
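The dual-loss idea can be sketched as two hinge terms sharing one positive: a standard relevance margin against random negatives, plus a stricter margin against off-intent hard negatives. The margins, weighting, and structure below are illustrative assumptions, not the paper's exact formulation:

```python
def dual_loss(sim_pos, sim_hard_neg, sim_rand_neg,
              margin_rel=0.2, margin_cent=0.4, alpha=0.5):
    # Relevance term: the intended product should outscore a random
    # negative by at least margin_rel.
    rel = max(0.0, margin_rel - (sim_pos - sim_rand_neg))
    # Centrality term: it should outscore a "hard negative" -- an item
    # that is semantically similar but off-intent (e.g. an iPhone case
    # for the query "iPhone 13") -- by a larger margin.
    cent = max(0.0, margin_cent - (sim_pos - sim_hard_neg))
    return rel + alpha * cent
```

The second term is what differentiates this from a plain triplet loss: semantic similarity alone is not enough to escape the penalty; the model must also separate intent-central products from their accessories.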
[6] Bridging Search and Recommendation in Generative Retrieval: Does One Task Help the Other?
This paper from Spotify investigates the effectiveness of combining search and recommendation tasks into a unified generative model powered by Large Language Models, challenging the traditional approach of using separate specialized models. The research explores two key hypotheses: that joint training helps better estimate item popularity, and that it improves item representations by combining content-based aspects from search with collaborative-filtering aspects from recommendations. Using both simulated and real-world datasets, the study demonstrates that unified models can outperform task-specific approaches, showing an average 16% improvement in retrieval accuracy at rank 30. However, this success depends on specific conditions: the popularity distributions between search and recommendation tasks should have low divergence, and item co-occurrence patterns should align between tasks.
📚 https://arxiv.org/abs/2410.16823
[7] STAR: A Simple Training-free Approach for Recommendations using Large Language Models
This paper from Google DeepMind introduces STAR (Simple Training-free Approach for Recommendation), a framework that leverages Large Language Models for recommendation tasks without requiring expensive fine-tuning. The approach consists of two stages: a retrieval stage that combines LLM-based semantic embeddings with collaborative user information to identify candidate items, and a ranking stage that uses LLMs for pairwise ranking to refine recommendations. The authors emphasize that incorporating collaborative information alongside semantic data is crucial for optimal performance in both stages, and demonstrate that their framework can serve as a versatile, training-free alternative to traditional supervised recommendation systems.
📚 https://arxiv.org/abs/2410.16458
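The retrieval stage's blend of semantic and collaborative signals can be sketched as a weighted combination scored against the user's history. The linear blend, equal weighting, and function names are illustrative assumptions, not DeepMind's released code:

```python
def blended_similarity(sem_sim, collab_sim, lam=0.5):
    # Combine semantic (LLM-embedding) similarity with collaborative
    # (co-engagement) similarity between two items.
    return lam * sem_sim + (1 - lam) * collab_sim

def score_candidate(history, candidate, sem, collab, lam=0.5):
    # A candidate's retrieval score: average blended similarity against
    # every item in the user's interaction history. sem and collab are
    # precomputed pairwise similarity lookups.
    return sum(
        blended_similarity(sem[(h, candidate)], collab[(h, candidate)], lam)
        for h in history
    ) / len(history)
```

Because both similarity tables are precomputed, this stage needs no fine-tuning, which is the "training-free" property the authors emphasize; the LLM is only invoked later, for pairwise ranking of the retrieved candidates.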
[8] Unleashing the Potential of Multi-Channel Fusion in Retrieval for Personalized Recommendations
This paper from SJTU tackles the challenge of multi-channel fusion in recommender systems' retrieval stage, where results from different candidate generators must be efficiently merged while maintaining personalization and performance. The authors point out that despite advances in individual retrieval methods, the fusion process often relies on basic heuristics and manual designs, leading to suboptimal recommendations. To address this, they introduce two novel approaches: a two-stage optimization strategy combining the Cross Entropy Method with Bayesian Optimization for global weight assignments, and a policy gradient-based method for personalized channel merging. The significance of proper weight optimization is demonstrated through experiments showing that different weight combinations can cause Recall@200 variations of up to 79.7% and 86.7% on the Gowalla and Amazon_Books datasets, respectively.
📚 https://arxiv.org/abs/2410.16080
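What "multi-channel fusion with global weights" means in practice can be shown with a small reciprocal-rank blend. The scoring scheme here is a generic illustration of weighted fusion; the paper's actual contribution is *optimizing* such weights (via the Cross Entropy Method plus Bayesian Optimization), which is not reproduced here:

```python
def fuse_channels(channel_results, weights, k):
    # Each retrieval channel returns a ranked item list; an item's fused
    # score sums weight / (rank + 1) over the channels that retrieved it.
    # The large Recall@200 swings the paper reports come from varying
    # the per-channel weights in schemes like this one.
    scores = {}
    for channel, items in channel_results.items():
        w = weights[channel]
        for rank, item in enumerate(items):
            scores[item] = scores.get(item, 0.0) + w / (rank + 1)
    ranked = sorted(scores.items(), key=lambda pair: -pair[1])
    return [item for item, _ in ranked[:k]]
```

Items retrieved by multiple channels accumulate score, so the weights control both which channels dominate and how much cross-channel agreement is rewarded.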
[9] LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering
This paper from Zhao et al. introduces LongRAG, a system designed to improve long-context question answering by addressing key limitations in existing approaches. While current large language models struggle with the "lost in the middle" problem and traditional RAG systems suffer from fragmented context and noisy retrieval, LongRAG tackles these challenges through a dual-perspective approach. It combines a hybrid retriever and LLM-augmented components that preserve global context while precisely identifying relevant details. The system includes an information extractor that maps chunks into a higher-dimensional semantic space, and a Chain-of-Thought-guided filter that helps evaluate evidence quality. Testing on three multi-hop datasets showed LongRAG significantly outperforming both long-context LLMs (by 6.94%) and other RAG systems (by up to 17.25%).
📚 https://arxiv.org/abs/2410.18050
👨🏽‍💻 https://github.com/QingFei1/LongRAG
[10] Identifying High Consideration E-Commerce Search Queries
This paper from Amazon introduces the task of Engagement-based Query Ranking (EQR) to identify High Consideration (HC) queries in e-commerce, which are searches requiring careful decision-making and substantial research from customers. The authors propose a novel approach that uses a combination of behavioral cues, financial signals, and catalog features to rank queries based on their potential engagement with shopping knowledge content. Their method leverages the Click-Through Rate (CTR) of informational content as a proxy for HC query labels, allowing for a generalizable model across all search traffic. The paper presents offline experiments showing that their proposed method outperforms baselines in EQR across all metrics, and human evaluation demonstrates 96% precision in HC query selection.
📚 https://arxiv.org/abs/2410.13951
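The CTR-as-proxy idea reduces to ranking queries by a model's predicted engagement with informational content. The sketch below is a minimal illustration under that assumption; the scorer is an arbitrary callable standing in for the trained model built on behavioral, financial, and catalog features:

```python
def rank_queries_by_predicted_engagement(queries, predict_ctr, top_n):
    # EQR sketch: score each query with predicted CTR on informational
    # content (the proxy label the paper uses for "high consideration")
    # and surface the top queries for content curation.
    ranked = sorted(queries, key=predict_ctr, reverse=True)
    return ranked[:top_n]
```

Using predicted CTR rather than manual labels is what lets the ranking generalize to the long tail of search traffic that annotators never see.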
I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest.
I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering domains. I explore diverse topics such as Natural Language Processing, Large Language Models, and Recommendation Systems, and conduct in-depth analyses, drawing insights from the latest research papers.