Abstract:Traditional enterprises face significant challenges in processing business documents, where tasks like extracting transport references from invoices remain largely manual despite their crucial role in logistics operations. While Large Language Models offer potential automation, their direct application to specialized business domains often yields unsatisfactory results. We introduce Matrix (Memory-Augmented agent Training through Reasoning and Iterative eXploration), a novel paradigm that enables LLM agents to progressively build domain expertise through experience-driven memory refinement and iterative learning. To validate this approach, we collaborate with one of the world's largest logistics companies to create a dataset of Universal Business Language format invoice documents, focusing on the task of transport reference extraction. Experiments demonstrate that Matrix outperforms prompting a single LLM by 30.3%, vanilla LLM agent by 35.2%. We further analyze the metrics of the optimized systems and observe that the agent system requires less API calls, fewer costs and can analyze longer documents on average. Our methods establish a new approach to transform general-purpose LLMs into specialized business tools through systematic memory enhancement in document processing tasks.
Abstract:This paper proposes HistoLens, a multi-layered analysis framework for historical texts based on Large Language Models (LLMs). Using the important Western Han dynasty text "Yantie Lun" as a case study, we demonstrate the framework's potential applications in historical research and education. HistoLens integrates NLP technology (especially LLMs), including named entity recognition, knowledge graph construction, and geographic information visualization. The paper showcases how HistoLens explores Western Han culture in "Yantie Lun" through multi-dimensional, visual, and quantitative methods, focusing particularly on the influence of Confucian and Legalist thoughts on political, economic, military, and ethnic. We also demonstrate how HistoLens constructs a machine teaching scenario using LLMs for explainable analysis, based on a dataset of Confucian and Legalist ideas extracted with LLM assistance. This approach offers novel and diverse perspectives for studying historical texts like "Yantie Lun" and provides new auxiliary tools for history education. The framework aims to equip historians and learners with LLM-assisted tools to facilitate in-depth, multi-layered analysis of historical texts and foster innovation in historical education.
Abstract:Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, UltraFeedback, GSM8K, HH-RLHF, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves a 65% win rate at maximum lengths of 192 and 384 tokens, outperforming standard BoN with the same computational cost. Furthermore, TreeBoN achieves around a 60% win rate across longer responses, showcasing its scalability and alignment efficacy.
Abstract:Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model (LM) alignment. At its core, RLHF uses a margin-based loss for preference optimization, specifying ideal LM behavior only by the difference between preferred and dispreferred responses. In this paper, we identify a common pitfall of margin-based methods -- the under-specification of ideal LM behavior on preferred and dispreferred responses individually, which leads to two unintended consequences as the margin increases: (1) The probability of dispreferred (e.g., unsafe) responses may increase, resulting in potential safety alignment failures. (2) The probability of preferred responses may decrease, even when those responses are ideal. We demystify the reasons behind these problematic behaviors: margin-based losses couple the change in the preferred probability to the gradient of the dispreferred one, and vice versa, often preventing the preferred probability from increasing while the dispreferred one decreases, and thus causing a synchronized increase or decrease in both probabilities. We term this effect, inherent in margin-based objectives, gradient entanglement. Formally, we derive conditions for general margin-based alignment objectives under which gradient entanglement becomes concerning: the inner product of the gradients of preferred and dispreferred log-probabilities is large relative to the individual gradient norms. We theoretically investigate why such inner products can be large when aligning language models and empirically validate our findings. Empirical implications of our framework extend to explaining important differences in the training dynamics of various preference optimization algorithms, and suggesting potential algorithm designs to mitigate the under-specification issue of margin-based methods and thereby improving language model alignment.
Abstract:In recommendation systems, the relevance and novelty of the final results are selected through a cascade system of Matching -> Ranking -> Strategy. The matching model serves as the starting point of the pipeline and determines the upper bound of the subsequent stages. Balancing the relevance and novelty of matching results is a crucial step in the design and optimization of recommendation systems, contributing significantly to improving recommendation quality. However, the typical matching algorithms have not simultaneously addressed the relevance and novelty perfectly. One main reason is that deep matching algorithms exhibit significant uncertainty when estimating items in the long tail (e.g., due to insufficient training samples) items.The uncertainty not only affects the training of the models but also influences the confidence in the index construction and beam search retrieval process of these models. This paper proposes the UICR (Uncertainty-based explore for Index Construction and Retrieval) algorithm, which introduces the concept of uncertainty modeling in the matching stage and achieves multi-task modeling of model uncertainty and index uncertainty. The final matching results are obtained by combining the relevance score and uncertainty score infered by the model. Experimental results demonstrate that the UICR improves novelty without sacrificing relevance on realworld industrial productive environments and multiple open-source datasets. Remarkably, online A/B test results of display advertising in Shopee demonstrates the effectiveness of the proposed algorithm.
Abstract:Ranking passages by prompting a large language model (LLM) can achieve promising performance in modern information retrieval (IR) systems. A common approach is to sort the ranking list by prompting LLMs for pairwise comparison. However, sorting-based methods require consistent comparisons to correctly sort the passages, which we show that LLMs often violate. We identify two kinds of intrinsic inconsistency in LLM-based pairwise comparisons: order inconsistency which leads to conflicting results when switching the passage order, and transitive inconsistency which leads to non-transitive triads among all preference pairs. In this paper, we propose LLM-RankFusion, an LLM-based ranking framework that mitigates these inconsistencies and produces a robust ranking list. LLM-RankFusion mitigates order inconsistency using in-context learning (ICL) to demonstrate order-agnostic comparisons and calibration to estimate the underlying preference probability between two passages. We then address transitive inconsistency by aggregating the ranking results from multiple rankers. In our experiments, we empirically show that LLM-RankFusion can significantly reduce inconsistent pairwise comparison results, and improve the ranking quality by making the final ranking list more robust.
Abstract:Despite extensive pre-training and fine-tuning in moral alignment to prevent generating harmful information at user request, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a response-filtering based multi-agent defense framework that filters harmful responses from LLMs. This framework assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division in tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. AutoDefense can adapt to various sizes and kinds of open-source LLMs that serve as agents. Through conducting extensive experiments on a large scale of harmful and safe prompts, we validate the effectiveness of the proposed AutoDefense in improving the robustness against jailbreak attacks, while maintaining the performance at normal user request. Our code and data are publicly available at https://github.com/XHMY/AutoDefense.
Abstract:In online advertising scenario, sellers often create multiple creatives to provide comprehensive demonstrations, making it essential to present the most appealing design to maximize the Click-Through Rate (CTR). However, sellers generally struggle to consider users preferences for creative design, leading to the relatively lower aesthetics and quantities compared to Artificial Intelligence (AI)-based approaches. Traditional AI-based approaches still face the same problem of not considering user information while having limited aesthetic knowledge from designers. In fact that fusing the user information, the generated creatives can be more attractive because different users may have different preferences. To optimize the results, the generated creatives in traditional methods are then ranked by another module named creative ranking model. The ranking model can predict the CTR score for each creative considering user features. However, the two above stages are regarded as two different tasks and are optimized separately. In this paper, we proposed a new automated Creative Generation pipeline for Click-Through Rate (CG4CTR) with the goal of improving CTR during the creative generation stage. Our contributions have 4 parts: 1) The inpainting mode in stable diffusion is firstly applied to creative generation task in online advertising scene. A self-cyclic generation pipeline is proposed to ensure the convergence of training. 2) Prompt model is designed to generate individualized creatives for different user groups, which can further improve the diversity and quality. 3) Reward model comprehensively considers the multimodal features of image and text to improve the effectiveness of creative ranking task, and it is also critical in self-cyclic pipeline. 4) The significant benefits obtained in online and offline experiments verify the significance of our proposed method.
Abstract:In the e-commerce advertising scenario, estimating the true probabilities (known as a calibrated estimate) on CTR and CVR is critical and can directly affect the benefits of the buyer, seller and platform. Previous research has introduced numerous solutions for addressing the calibration problem. These methods typically involve the training of calibrators using a validation set and subsequently applying these calibrators to correct the original estimated values during online inference. However, what sets e-commerce advertising scenarios is the challenge of multi-field calibration. Multi-field calibration can be subdivided into two distinct sub-problems: value calibration and shape calibration. Value calibration is defined as no over- or under-estimation for each value under concerned fields. Shape calibration is defined as no over- or under-estimation for each subset of the pCTR within the specified range under condition of concerned fields. In order to achieve shape calibration and value calibration, it is necessary to have a strong data utilization ability.Because the quantity of pCTR specified range for single field-value sample is relative small, which makes the calibrator more difficult to train. However the existing methods cannot simultaneously fulfill both value calibration and shape calibration. To solve these problems, we propose a new method named Deep Ensemble Shape Calibration (DESC). We introduce innovative basis calibration functions, which enhance both function expression capabilities and data utilization by combining these basis calibration functions. A significant advancement lies in the development of an allocator capable of allocating the most suitable shape calibrators to different estimation error distributions within diverse fields and values.
Abstract:Graph Neural Network (GNN) is the trending solution for item retrieval in recommendation problems. Most recent reports, however, focus heavily on new model architectures. This may bring some gaps when applying GNN in the industrial setup, where, besides the model, constructing the graph and handling data sparsity also play critical roles in the overall success of the project. In this work, we report how GNN is applied for large-scale e-commerce item retrieval at Shopee. We introduce our simple yet novel and impactful techniques in graph construction, modeling, and handling data skewness. Specifically, we construct high-quality item graphs by combining strong-signal user behaviors with high-precision collaborative filtering (CF) algorithm. We then develop a new GNN architecture named LightSAGE to produce high-quality items' embeddings for vector search. Finally, we design multiple strategies to handle cold-start and long-tail items, which are critical in an advertisement (ads) system. Our models bring improvement in offline evaluations, online A/B tests, and are deployed to the main traffic of Shopee's Recommendation Advertisement system.