Agata Savary

University of Tours, France

Formalising lexical and syntactic diversity for data sampling in French

Jan 14, 2025

Louis Estève, Manon Scholivet, Agata Savary

Abstract: Diversity is an important property of datasets, and sampling data for diversity is useful in dataset creation. Finding the optimally diverse sample is expensive; we therefore present a heuristic that significantly increases diversity relative to random sampling. We also explore whether different kinds of diversity (lexical and syntactic) correlate, with the purpose of sampling for expensive syntactic diversity through inexpensive lexical diversity. We find that correlations fluctuate with different datasets and versions of diversity measures. This shows that an arbitrarily chosen measure may fall short of capturing diversity-related properties of datasets.

To Be or Not To Be a Verbal Multiword Expression: A Quest for Discriminating Features

Jul 22, 2020

Caroline Pasquer, Agata Savary, Jean-Yves Antoine, Carlos Ramisch, Nicolas Labroche, Arnaud Giacometti

Abstract: Automatic identification of multiword expressions (MWEs) is a prerequisite for semantically oriented downstream applications. This task is challenging because MWEs, especially verbal ones (VMWEs), exhibit surface variability. However, this variability is usually more restricted than in regular (non-VMWE) constructions, which leads to various variability profiles. We use this fact to determine the optimal set of features which could be used in a supervised classification setting to solve a subproblem of VMWE identification: the identification of occurrences of previously seen VMWEs. Surprisingly, a simple custom frequency-based feature selection method proves more efficient than other standard methods such as the chi-squared test, information gain or decision trees. An SVM classifier using the optimal set of only 6 features outperforms the best systems from a recent shared task on the French seen data.

Object-oriented lexical encoding of multiword expressions: Short and sweet

Oct 23, 2018

Agata Savary, Simon Petitjean, Timm Lichte, Laura Kallmeyer, Jakub Waszczuk

Abstract: Multiword expressions (MWEs) exhibit both regular and idiosyncratic properties. Their idiosyncrasy requires lexical encoding in parallel with their component words. Their (at times intricate) regularity, on the other hand, calls for means of flexible factorization to avoid redundant descriptions of shared properties. However, so far, non-redundant general-purpose lexical encoding of MWEs has not received a satisfactory solution. We offer a proof of concept that this challenge might be effectively addressed within eXtensible MetaGrammar (XMG), an object-oriented metagrammar framework. We first make an existing metagrammatical resource, the FrenchTAG grammar, MWE-aware. We then evaluate the factorization gain during incremental implementation with XMG on a dataset extracted from an MWE-annotated reference corpus.

* 13 pages, 5 figures, 5 code listings, 1 table
