Abstract: Retrieving relevant items that match users' queries from a billion-scale corpus forms the core of industrial e-commerce search systems, in which embedding-based retrieval (EBR) methods are prevailing. These methods adopt a two-tower framework to learn embedding vectors for queries and items separately, and thus leverage efficient approximate nearest neighbor (ANN) search to retrieve relevant items. However, existing EBR methods usually ignore inconsistent user behaviors in industrial multi-stage search systems, resulting in insufficient retrieval efficiency and low commercial return. To tackle this challenge, we propose to improve EBR methods by learning Multi-level Multi-Grained Semantic Embeddings (MMSE). We propose multi-stage information mining to exploit the ordered, clicked, unclicked, and randomly sampled items in practical user behavior data, and then capture query-item similarity via a post-fusion strategy. We further propose multi-grained learning objectives that integrate a retrieval loss with global comparison ability and a ranking loss with local comparison ability to generate semantic embeddings. Both experiments on a real-world billion-scale dataset and online A/B tests verify the effectiveness of MMSE in achieving significant performance improvements on metrics such as offline recall and online conversion rate (CVR).
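The abstract describes the two-tower EBR setup and the combination of a global retrieval loss with a local ranking loss only at a high level. The sketch below illustrates one plausible reading of that design in PyTorch; all module names, dimensions, and loss weights are illustrative assumptions, not the authors' MMSE implementation, which additionally involves the multi-stage mining of ordered/clicked/unclicked items and the post-fusion similarity not reproduced here.

```python
# Hypothetical sketch: two-tower embedding model with a combined
# retrieval (global, in-batch softmax) and ranking (local, pairwise margin) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerEBR(nn.Module):
    def __init__(self, vocab_size=50000, dim=128):
        super().__init__()
        # Separate towers: item embeddings can be pre-computed offline
        # and served through an ANN index at query time.
        self.query_tower = nn.Sequential(nn.EmbeddingBag(vocab_size, dim), nn.Linear(dim, dim))
        self.item_tower = nn.Sequential(nn.EmbeddingBag(vocab_size, dim), nn.Linear(dim, dim))

    def encode(self, tower, token_ids):
        return F.normalize(tower(token_ids), dim=-1)

    def forward(self, query_ids, pos_item_ids, neg_item_ids):
        q = self.encode(self.query_tower, query_ids)
        pos = self.encode(self.item_tower, pos_item_ids)
        neg = self.encode(self.item_tower, neg_item_ids)
        return q, pos, neg

def multi_grained_loss(q, pos, neg, tau=0.05, margin=0.1, alpha=0.5):
    # Retrieval loss (global comparison): softmax over in-batch negatives.
    logits = (q @ pos.t()) / tau                                  # [B, B]
    labels = torch.arange(q.size(0), device=q.device)
    retrieval_loss = F.cross_entropy(logits, labels)
    # Ranking loss (local comparison): a clicked item should outscore an
    # unclicked item sampled for the same query by at least `margin`.
    pos_score = (q * pos).sum(dim=-1)
    neg_score = (q * neg).sum(dim=-1)
    ranking_loss = F.relu(margin - pos_score + neg_score).mean()
    return alpha * retrieval_loss + (1.0 - alpha) * ranking_loss

# Toy usage with random token ids (batch of 32 queries, 8 tokens each).
model = TwoTowerEBR()
q_ids = torch.randint(0, 50000, (32, 8))
pos_ids = torch.randint(0, 50000, (32, 8))
neg_ids = torch.randint(0, 50000, (32, 8))
loss = multi_grained_loss(*model(q_ids, pos_ids, neg_ids))
```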
Abstract: Cross-Modal Retrieval (CMR) is an important research topic across multimodal computing and information retrieval, which takes one type of data as the query to retrieve relevant data of another type, and has been widely used in many real-world applications. Recently, the vision-language pre-trained model represented by CLIP has demonstrated its superiority in learning visual and textual representations and its impressive performance on various vision and language related tasks. Although CLIP, as well as previous pre-trained models, has shown great performance improvements in unsupervised CMR, the performance and impact of these pre-trained models on supervised CMR have rarely been explored due to the lack of multimodal class-level associations. In this paper, we take CLIP as the current representative vision-language pre-trained model to conduct a comprehensive empirical study and provide insights on its performance and impact on supervised CMR. To this end, we first propose a novel model CLIP4CMR (\textbf{CLIP For} supervised \textbf{C}ross-\textbf{M}odal \textbf{R}etrieval) that employs the pre-trained CLIP as the backbone network to perform supervised CMR. We then revisit the existing loss function designs in CMR, including the most common pair-wise losses, class-wise losses, and hybrid ones, and provide insights on applying CLIP. Moreover, we investigate several key concerns in supervised CMR via CLIP4CMR and provide new perspectives for this field, including robustness to modality imbalance and sensitivity to hyper-parameters. Extensive experimental results show that CLIP4CMR achieves state-of-the-art results with significant improvements on the benchmark datasets Wikipedia, NUS-WIDE, Pascal-Sentence, and XmediaNet. Our data and code are publicly available at https://github.com/zhixiongz/CLIP4CMR.
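The abstract names pair-wise, class-wise, and hybrid losses on top of a CLIP backbone without implementation details. The sketch below shows one way such a setup could look using Hugging Face's CLIPModel; the projection-head sizes, loss form, and weighting are assumptions for illustration, not the CLIP4CMR reference code (see the linked repository for that).

```python
# Hypothetical sketch: frozen CLIP backbone + modality-specific projection
# heads + a shared classifier, so pair-wise, class-wise, or hybrid losses
# can be applied for supervised cross-modal retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPModel

class CLIP4CMRSketch(nn.Module):
    def __init__(self, num_classes, dim=256, clip_name="openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():           # keep the pre-trained backbone frozen
            p.requires_grad = False
        feat_dim = self.clip.config.projection_dim
        self.img_head = nn.Linear(feat_dim, dim)   # modality-specific projection heads
        self.txt_head = nn.Linear(feat_dim, dim)
        self.classifier = nn.Linear(dim, num_classes)  # shared classifier for the class-wise loss

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids, attention_mask=attention_mask)
        img = F.normalize(self.img_head(img), dim=-1)
        txt = F.normalize(self.txt_head(txt), dim=-1)
        return img, txt

def hybrid_loss(model, img, txt, labels, margin=0.2, beta=0.5):
    # Pair-wise part: a matched image-text pair should be closer than any
    # mismatched pair in the batch by at least `margin`.
    sim = img @ txt.t()
    pos = sim.diag().unsqueeze(1)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    pair_loss = F.relu(margin + sim - pos)[off_diag].mean()
    # Class-wise part: both modalities should map into the shared label space.
    cls_loss = F.cross_entropy(model.classifier(img), labels) + \
               F.cross_entropy(model.classifier(txt), labels)
    return beta * pair_loss + (1.0 - beta) * cls_loss
```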
Abstract: Live streaming is becoming an increasingly popular sales channel in e-commerce. The core of live-streaming sales is to encourage customers to purchase in an online broadcasting room. To enable customers to better understand a product without leaving the room, we propose AliMe MKG, a multi-modal knowledge graph that aims to provide a cognitive profile for products, through which customers can seek information about and understand a product. Based on the MKG, we build an online live assistant that highlights product search, product exhibition, and question answering, allowing customers to skim over the item list, view item details, and ask item-related questions. Our system has been launched online in the Taobao app and currently serves hundreds of thousands of customers per day.
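The abstract does not describe how the MKG is stored or queried; below is a deliberately simplified, hypothetical sketch of answering an item-related question from subject-predicate-object triples, only to make the idea of a product's "cognitive profile" concrete. The triple layout, item id, and lookup are assumptions, not AliMe MKG's actual schema or QA pipeline.

```python
# Toy illustration: a product profile as subject-predicate-object triples,
# with attribute lookup for simple item-related questions.
from collections import defaultdict

triples = [
    ("item_123", "category", "down jacket"),            # hypothetical item and attributes
    ("item_123", "material", "90% white duck down"),
    ("item_123", "image", "https://example.com/item_123.jpg"),  # multi-modal attribute (placeholder URL)
]

profile = defaultdict(dict)
for subj, pred, obj in triples:
    profile[subj][pred] = obj

def answer(item_id, predicate):
    """Answer a simple attribute question by triple lookup."""
    return profile.get(item_id, {}).get(predicate, "unknown")

print(answer("item_123", "material"))   # -> "90% white duck down"
```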