Costanza Conforti

Croissant: A Metadata Format for ML-Ready Datasets

Mar 28, 2024

Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson (+9 more)

Abstract: Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.

* Preprint. Contributors listed in alphabetical order

Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries

Feb 04, 2021

Stephanie Hirmer, Alycia Leonard, Josephine Tumwesige, Costanza Conforti

Abstract: Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when making modeling and system design decisions, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.

* Accepted at EACL 2021

Will-They-Won't-They: A Very Large Dataset for Stance Detection on Twitter

May 01, 2020

Costanza Conforti, Jakob Berndt, Mohammad Taher Pilehvar, Chryssi Giannitsarou, Flavio Toxvaerd, Nigel Collier

Abstract: We present a new challenging stance detection dataset, called Will-They-Won't-They (WT-WT), which contains 51,284 tweets in English, making it by far the largest available dataset of the type. All the annotations are carried out by experts; therefore, the dataset constitutes a high-quality and reliable benchmark for future research in stance detection. Our experiments with a wide range of recent state-of-the-art stance detection systems show that the dataset poses a strong challenge to existing models in this domain.

* 10 pages, accepted at ACL 2020

Natural language processing for achieving sustainable development: the case of neural labelling to enhance community profiling

Apr 27, 2020

Costanza Conforti, Stephanie Hirmer, David Morgan, Marco Basaldella, Yau Ben Or

Abstract: In recent years, there has been an increasing interest in the application of Artificial Intelligence, and especially Machine Learning, to the field of Sustainable Development (SD). However, until now, NLP has not been applied in this context. In this research paper, we show the high potential of NLP applications to enhance the sustainability of projects. In particular, we focus on the case of community profiling in developing countries, where, in contrast to the developed world, a notable data gap exists. In this context, NLP could help to address the cost and time barrier of structuring qualitative data that prohibits its widespread use and associated benefits. We propose the new task of Automatic UPV classification, which is an extreme multi-class multi-label classification problem. We release Stories2Insights, an expert-annotated dataset, provide a detailed corpus analysis, and implement a number of strong neural baselines to address the task. Experimental results show that the problem is challenging, and leave plenty of room for future research at the intersection of NLP and SD.

* 16 pages, 7 figures

Neural Architectures for Open-Type Relation Argument Extraction

Sep 30, 2018

Benjamin Roth, Costanza Conforti, Nina Poerner, Sanjeev Karn, Hinrich Schütze

Abstract: In this work, we introduce the task of Open-Type Relation Argument Extraction (ORAE): given a corpus, a query entity Q and a knowledge base relation (e.g., "Q authored notable work with title X"), the model has to extract an argument of non-standard entity type (entities that cannot be extracted by a standard named entity tagger, e.g., X: the title of a book or a work of art) from the corpus. A distantly supervised dataset based on WikiData relations is obtained and released to address the task. We develop and compare a wide range of neural models for this task, yielding large improvements over a strong baseline obtained with a neural question answering system. The impact of different sentence encoding architectures and answer extraction methods is systematically compared. An encoder based on gated recurrent units combined with a conditional random fields tagger gives the best results.
