Weiwei Cheng

Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation

Sep 18, 2024

Kasra Hosseini, Thomas Kober, Josip Krapac, Roland Vollgraf, Weiwei Cheng, Ana Peleteiro Ramallo

Abstract: Evaluating production-level retrieval systems at scale is a crucial yet challenging task due to the limited availability of a large pool of well-trained human annotators. Large Language Models (LLMs) have the potential to address this scaling issue and offer a viable alternative to humans for the bulk of annotation tasks. In this paper, we propose a framework for assessing product search engines in a large-scale e-commerce setting, leveraging Multimodal LLMs for (i) generating tailored annotation guidelines for individual queries, and (ii) conducting the subsequent annotation task. Our method, validated through deployment on a large e-commerce platform, demonstrates comparable quality to human annotations, significantly reduces time and cost, facilitates rapid problem discovery, and provides an effective solution for production-level quality control at scale.

* 13 pages, 5 figures, 4 tables

What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain

Aug 13, 2024

Antonis Maronikolakis, Ana Peleteiro Ramallo, Weiwei Cheng, Thomas Kober

Abstract: Large language models (LLMs) are poised to revolutionize the domain of online fashion retail, enhancing customer experience and discovery of fashion online. LLM-powered conversational agents introduce a new way of discovery by directly interacting with customers, enabling them to express themselves in their own ways, refine their needs, and obtain fashion and shopping advice that is relevant to their taste and intent. For many tasks in e-commerce, such as finding a specific product, conversational agents need to convert their interactions with a customer into a specific call to different backend systems, e.g., a search system to showcase a relevant set of products. Therefore, evaluating the capabilities of LLMs to perform those tasks related to calling other services is vital. However, those evaluations are generally complex, among other reasons because relevant, high-quality datasets are lacking and the evaluations do not align seamlessly with business needs. To this end, we created a multilingual evaluation dataset of 4k conversations between customers and a fashion assistant in a large e-commerce fashion platform to measure the capabilities of LLMs to serve as an assistant between customers and a backend engine. We evaluate a range of models, showcasing how our dataset scales to business needs and facilitates iterative development of tools.

* Accepted at KDD workshop on Evaluation and Trustworthiness of Generative AI Models

Grounding Natural Language Instructions: Can Large Language Models Capture Spatial Information?

Sep 17, 2021

Julia Rozanova, Deborah Ferreira, Krishna Dubba, Weiwei Cheng, Dell Zhang, Andre Freitas

Abstract: Models designed for intelligent process automation are required to be capable of grounding user interface elements. This task of interface element grounding is centred on linking instructions in natural language to their target referents. Even though BERT and similar pre-trained language models have excelled in several NLP tasks, their use has not been widely explored for the UI grounding domain. This work concentrates on testing and probing the grounding abilities of three different transformer-based models: BERT, RoBERTa and LayoutLM. Our primary focus is on these models' spatial reasoning skills, given their importance in this domain. We observe that LayoutLM has a promising advantage for applications in this domain, even though it was created for a different original purpose (representing scanned documents): the learned spatial features appear to be transferable to the UI grounding setting, especially as they demonstrate the ability to discriminate between target directions in natural language instructions.

* *Equal contribution

Evaluating for Diversity in Question Generation over Text

Aug 17, 2020

Michael Sejr Schlichtkrull, Weiwei Cheng

Abstract: Generating diverse and relevant questions over text is a task with widespread applications. We argue that commonly-used evaluation metrics such as BLEU and METEOR are not suitable for this task due to the inherent diversity of reference questions, and propose a scheme for extending conventional metrics to reflect diversity. We furthermore propose a variational encoder-decoder model for this task. We show through automatic and human evaluation that our variational model improves diversity without loss of quality, and demonstrate how our evaluation scheme reflects this improvement.

On the Bayes-optimality of F-measure maximizers

Mar 06, 2015

Willem Waegeman, Krzysztof Dembczynski, Arkadiusz Jachnik, Weiwei Cheng, Eyke Hüllermeier

Abstract: The F-measure, which was originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate, while relying on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayes-optimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems.

* JMLR 15 (2014) 3333-3388
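
The gap between Hamming-loss minimization and F-measure maximization mentioned in the abstract can be seen on a toy example. The sketch below is not the algorithm from the paper: it enumerates all predictions by brute force (exponential in the number of labels) and uses a made-up joint distribution. The function names, the distribution, and the convention that the F-measure of two all-zero vectors equals 1 are assumptions chosen for illustration.

```python
from itertools import product

def f_measure(y, h):
    """F1 between binary vectors y (truth) and h (prediction); 0/0 is treated as 1."""
    denom = sum(y) + sum(h)
    if denom == 0:
        return 1.0
    return 2.0 * sum(a * b for a, b in zip(y, h)) / denom

def expected_f(joint, h):
    """Expected F1 of prediction h under a joint distribution {y: probability}."""
    return sum(p * f_measure(y, h) for y, p in joint.items())

def brute_force_f_maximizer(joint, m):
    """Illustrative only: try all 2**m predictions and keep the best one."""
    return max(product((0, 1), repeat=m), key=lambda h: expected_f(joint, h))

def hamming_optimal(joint, m):
    """Hamming-loss minimizer: threshold each marginal probability at 0.5."""
    marginals = [sum(p for y, p in joint.items() if y[i] == 1) for i in range(m)]
    return tuple(int(p > 0.5) for p in marginals)

# Hypothetical joint distribution over three binary labels.
joint = {(1, 0, 0): 0.3, (0, 1, 0): 0.3, (0, 0, 1): 0.3, (0, 0, 0): 0.1}

h_f = brute_force_f_maximizer(joint, 3)
h_ham = hamming_optimal(joint, 3)
print(h_f, expected_f(joint, h_f))      # F-maximizing prediction and its expected F1
print(h_ham, expected_f(joint, h_ham))  # Hamming-optimal prediction and its expected F1
```

In this toy distribution every marginal is below 0.5, so the Hamming-optimal prediction is the all-zero vector, whereas predicting all labels yields a much higher expected F-measure.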

Label Ranking with Abstention: Predicting Partial Orders by Thresholding Probability Distributions (Extended Abstract)

Dec 02, 2011

Weiwei Cheng, Eyke Hüllermeier

Abstract: We consider an extension of the setting of label ranking, in which the learner is allowed to make predictions in the form of partial instead of total orders. Predictions of that kind are interpreted as a partial abstention: If the learner is not sufficiently certain regarding the relative order of two alternatives, it may abstain from this decision and instead declare these alternatives as being incomparable. We propose a new method for learning to predict partial orders that improves on an existing approach, both theoretically and empirically. Our method is based on the idea of thresholding the probabilities of pairwise preferences between labels as induced by a predicted (parameterized) probability distribution on the set of all rankings.

* 4 pages, 1 figure, appeared at NIPS 2011 Choice Models and Preference Learning workshop
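
The thresholding idea in the abstract can be sketched as follows. The abstract does not name the ranking distribution, so this example assumes a Plackett-Luce model, for which the pairwise probability that label i precedes label j is v_i / (v_i + v_j). The parameter values, the threshold `q`, and the function names are hypothetical.

```python
def pairwise_probs(skills):
    """P(label i precedes label j) under an assumed Plackett-Luce model
    with positive parameters `skills`."""
    n = len(skills)
    return [[0.0 if i == j else skills[i] / (skills[i] + skills[j])
             for j in range(n)] for i in range(n)]

def threshold_relation(probs, q):
    """Keep the preference (i, j), read as 'i before j', only if its
    probability reaches the threshold q; all other pairs are left incomparable."""
    if not 0.5 < q <= 1.0:
        raise ValueError("q must lie in (0.5, 1]")
    n = len(probs)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and probs[i][j] >= q}

# Hypothetical model parameters for three labels.
skills = [4.0, 3.0, 1.0]
relation = threshold_relation(pairwise_probs(skills), q=0.7)
print(sorted(relation))
```

With these values, labels 0 and 1 are each predicted to precede label 2, while the pair (0, 1) falls below the threshold (4/7 is about 0.57) and is left incomparable, which is the partial abstention the abstract describes.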
