Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

Jun 12, 2021

Ravi Shankar Mishra, Kartik Mehta, Nikhil Rasiwasia

Figure 1 for Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

Figure 2 for Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

Figure 3 for Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

Figure 4 for Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

Share this with someone who'll enjoy it:

Abstract:In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (also referred as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to best results, showing 2.7% improvement over commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly to distinguish between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss. We propose an embedding learning task leveraging raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised embeddings based techniques for attribute normalization. Experiments on a real-world attribute normalization dataset of 50 attributes show that the embeddings trained using our proposed approach obtain 2.3% improvement over best string matching and 19.3% improvement over best unsupervised embeddings.

* Accepted in ECNLP workshop of ACL-IJCNLP 2021 (https://sites.google.com/view/ecnlp)

View paper on

Share this with someone who'll enjoy it:

Title:Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

Paper and Code