H. Bahadir Sahin

Optimizing Product Deduplication in E-Commerce with Multimodal Embeddings

Sep 19, 2025

Aysenur Kulunk, Berk Taskin, M. Furkan Eseoglu, H. Bahadir Sahin

Abstract: In large-scale e-commerce marketplaces, duplicate product listings frequently cause consumer confusion and operational inefficiencies, degrading trust in the platform and increasing costs. Traditional keyword-based search methodologies falter in accurately identifying duplicates because they rely on exact textual matches and neglect the semantic similarities inherent in product titles. To address these challenges, we introduce a scalable, multimodal product deduplication system designed specifically for the e-commerce domain. Our approach employs a domain-specific text model grounded in the BERT architecture in conjunction with MaskedAutoEncoders for image representations. Both of these architectures are augmented with dimensionality reduction techniques to produce compact 128-dimensional embeddings without significant information loss. Complementing this, we also developed a novel decider model that leverages both text and image vectors. By integrating these feature extraction mechanisms with Milvus, an optimized vector database, our system supports efficient, high-precision similarity searches across product catalogs exceeding 200 million items while consuming just 100 GB of system RAM. Empirical evaluations demonstrate that our matching system achieves a macro-average F1 score of 0.90, outperforming third-party solutions, which attain an F1 score of 0.83. Our findings show the potential of combining domain-specific adaptations with state-of-the-art machine learning techniques to mitigate duplicate listings in large-scale e-commerce environments.
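
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how a macro-average F1 score is computed: the F1 score is calculated separately for each class and the per-class scores are averaged with equal weight. The function name `macro_f1`, the class labels, and the example data are hypothetical and chosen only for illustration; they are not taken from the paper, whose evaluation setup is not described in the abstract.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for five candidate product pairs (not data from the paper).
y_true = ["duplicate", "duplicate", "distinct", "distinct", "distinct"]
y_pred = ["duplicate", "distinct", "distinct", "distinct", "duplicate"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because every class contributes equally, a macro average does not let a frequent class (for example, non-duplicate pairs) dominate the score.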

A Morphology-aware Network for Morphological Disambiguation

Feb 13, 2017

Eray Yildiz, Caglar Tirkaz, H. Bahadir Sahin, Mustafa Tolga Eren, Ozan Sonmez

Abstract: Agglutinative languages such as Turkish, Finnish and Hungarian require morphological disambiguation before further processing due to the complex morphology of words. A morphological disambiguator is used to select the correct morphological analysis of a word. Morphological disambiguation is important because it is generally one of the first steps of natural language processing and its performance affects subsequent analyses. In this paper, we propose a system that uses deep learning techniques for morphological disambiguation. Many of the state-of-the-art results in computer vision, speech recognition and natural language processing have been obtained through deep learning models. However, applying deep learning techniques to morphologically rich languages is not well studied. In this work, while we focus on Turkish morphological disambiguation, we also present results for French and German in order to show that the proposed architecture achieves high accuracy with no language-specific feature engineering or additional resources. In the experiments, we achieve 84.12, 88.35 and 93.78 morphological disambiguation accuracy among the ambiguous words for Turkish, German and French, respectively.

* 6 pages, 1 figure, Thirtieth AAAI Conference on Artificial Intelligence. 2016
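
The abstract above reports accuracy "among the ambiguous words". As a rough illustration of what that restriction means, the sketch below scores only words that have more than one candidate analysis. The function name `ambiguous_accuracy`, the simplified analysis tags, and the example words are assumptions made for illustration; they do not reproduce the paper's tag set or evaluation code.

```python
def ambiguous_accuracy(candidates, gold, predicted):
    """Accuracy over words that have more than one candidate analysis."""
    correct = total = 0
    for cands, gold_tag, pred_tag in zip(candidates, gold, predicted):
        if len(cands) > 1:  # unambiguous words are excluded from the score
            total += 1
            correct += int(gold_tag == pred_tag)
    return correct / total if total else 0.0


# Hypothetical, heavily simplified analyses for three Turkish words.
candidates = [
    ["ev+Gen", "ev+P2sg"],               # "evin": ambiguous
    ["gel+Past"],                        # "geldi": unambiguous, ignored
    ["yüz+Noun", "yüz+Num", "yüz+Verb"]  # "yüz": ambiguous
]
gold = ["ev+Gen", "gel+Past", "yüz+Num"]
predicted = ["ev+Gen", "gel+Past", "yüz+Noun"]

accuracy = ambiguous_accuracy(candidates, gold, predicted)
print(accuracy)
```

Excluding unambiguous words keeps trivially correct cases from inflating the reported figure.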

Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers

Feb 09, 2017

H. Bahadir Sahin, Caglar Tirkaz, Eray Yildiz, Mustafa Tolga Eren, Ozan Sonmez

Abstract: The Turkish Wikipedia Named-Entity Recognition and Text Categorization (TWNERTC) dataset is a collection of automatically categorized and annotated sentences obtained from Wikipedia. We constructed large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The constructed gazetteers contain approximately 300K entities with thousands of fine-grained entity types under 77 different domains. Since automated processes are prone to ambiguity, we also introduce two new content-specific noise reduction methodologies. Moreover, we map fine-grained entity types to the equivalent four coarse-grained types: person, loc, org, misc. Finally, we construct six different dataset versions and evaluate the quality of the annotations by comparing them against ground truth from human annotators. We make these datasets publicly available to support studies on Turkish named-entity recognition (NER) and text categorization (TC).

* 10 pages, 1 figure, white paper, update: added correct download link for dataset
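
To make the idea of gazetteer-based annotation with a fine-to-coarse type mapping concrete, here is a minimal sketch using greedy longest-match lookup. It is a generic illustration, not the authors' pipeline: the abstract does not describe their matching or noise reduction procedures, and the names `GAZETTEER`, `COARSE`, and `annotate`, along with all entries, are hypothetical.

```python
# Hypothetical fine-grained type -> coarse type mapping (illustrative only).
COARSE = {"politician": "person", "city": "loc", "football_team": "org"}

# Hypothetical gazetteer: lowercased entity tokens -> fine-grained type.
GAZETTEER = {
    ("mustafa", "kemal", "atatürk"): "politician",
    ("ankara",): "city",
}


def annotate(tokens):
    """Tag tokens by greedy longest-match lookup; unmatched tokens get 'O'."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max(len(key) for key in GAZETTEER)
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in GAZETTEER:
                # Fine-grained types without a mapping fall back to 'misc'.
                coarse = COARSE.get(GAZETTEER[key], "misc")
                tags[i:i + n] = [coarse] * n
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return tags


tokens = ["Mustafa", "Kemal", "Atatürk", "Ankara", "şehrine", "geldi"]
tags = annotate(tokens)
print(list(zip(tokens, tags)))
```

Plain string matching like this is exactly the kind of automated process that is prone to ambiguity, which is why the abstract pairs the gazetteers with dedicated noise reduction steps.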
