Nathan Noiry

IDS

Subgroup analysis methods for time-to-event outcomes in heterogeneous randomized controlled trials

Jan 23, 2024

Valentine Perrin, Nathan Noiry, Nicolas Loiseau, Alex Nowak

Abstract: Non-significant randomized controlled trials can hide subgroups of good responders to experimental drugs, thus hindering subsequent development. Identifying such heterogeneous treatment effects is key for precision medicine, and many post-hoc analysis methods have been developed for that purpose. While several benchmarks have been carried out to identify the strengths and weaknesses of these methods, notably for binary and continuous endpoints, similar systematic empirical evaluations of subgroup analysis for time-to-event endpoints are lacking. This work aims to fill this gap by evaluating several subgroup analysis algorithms in the context of time-to-event outcomes through three research questions: Is there heterogeneity? Which biomarkers are responsible for it? Who are the good responders to treatment? In this context, we propose a new synthetic and semi-synthetic data generation process that allows one to explore a wide range of heterogeneity scenarios with precise control over the level of heterogeneity. We provide an open-source Python package, available on GitHub, containing our generation process and our comprehensive benchmark framework. We hope this package will be useful to the research community for future investigations of heterogeneity of treatment effects and for benchmarking subgroup analysis methods.

* 9 pages, 8 figures, 2 tables. Code available at https://github.com/owkin/hte . Comments are welcome!
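
To make "precise control on the level of heterogeneity" concrete, here is a minimal sketch of a synthetic time-to-event generator. It is not the API of the package mentioned in the abstract: the function names `simulate_trial` and `mean_time`, the exponential hazard model, the single binary biomarker, and every parameter name are assumptions chosen for illustration.

```python
import math
import random


def simulate_trial(n, beta_good, beta_other=0.0, base_rate=0.1,
                   censor_rate=0.05, seed=0):
    """Simulate a two-arm trial with one binary biomarker (hypothetical model).

    Event times are exponential with hazard
        base_rate * exp(treated * beta),
    where beta = beta_good if biomarker == 1 else beta_other.
    A negative beta_good means treatment lowers the hazard for
    biomarker-positive patients, i.e. they are the good responders.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        biomarker = rng.randint(0, 1)
        treated = rng.randint(0, 1)  # 1:1 randomization
        beta = beta_good if biomarker == 1 else beta_other
        hazard = base_rate * math.exp(treated * beta)
        event_time = rng.expovariate(hazard)
        censor_time = rng.expovariate(censor_rate)  # independent censoring
        rows.append({
            "biomarker": biomarker,
            "treated": treated,
            "time": min(event_time, censor_time),
            "event": int(event_time <= censor_time),
        })
    return rows


def mean_time(rows, biomarker, treated):
    """Average observed follow-up time in one biomarker/arm cell."""
    times = [r["time"] for r in rows
             if r["biomarker"] == biomarker and r["treated"] == treated]
    return sum(times) / len(times)
```

In this toy model the gap between `beta_good` and `beta_other` plays the role of the heterogeneity level: setting them equal gives a homogeneous trial, and widening the gap makes the biomarker-positive subgroup easier to detect.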

A Novel Information-Theoretic Objective to Disentangle Representations for Fair Classification

Oct 21, 2023

Pierre Colombo, Nathan Noiry, Guillaume Staerman, Pablo Piantanida

Abstract: One of the pursued objectives of deep learning is to provide tools that learn abstract representations of reality from the observation of multiple contextual situations. More precisely, one wishes to extract disentangled representations which are (i) low dimensional and (ii) whose components are independent and correspond to concepts capturing the essence of the objects under consideration (Locatello et al., 2019b). One step towards this ambitious project consists in learning disentangled representations with respect to a predefined (sensitive) attribute, e.g., the gender or age of the writer. Perhaps one of the main applications of such disentangled representations is fair classification. Existing methods extract the last layer of a neural network trained with a loss that is composed of a cross-entropy objective and a disentanglement regularizer. In this work, we adopt an information-theoretic view of this problem, which motivates a novel family of regularizers that minimize the mutual information between the latent representation and the sensitive attribute, conditional on the target. The resulting set of losses, called CLINIC, is parameter-free and thus easier and faster to train. CLINIC losses are studied through extensive numerical experiments by training over 2k neural networks. We demonstrate that our methods offer a better disentanglement/accuracy trade-off than previous techniques, and generalize better than training with cross-entropy loss alone, provided that the disentanglement task is not too constraining.

* Findings AACL 2023

Toward Stronger Textual Attack Detectors

Oct 21, 2023

Pierre Colombo, Marine Picot, Nathan Noiry, Guillaume Staerman, Pablo Piantanida

Abstract: The landscape of available textual adversarial attacks keeps growing, posing severe threats and raising concerns regarding the integrity of deep NLP systems. However, the crucial problem of defending against malicious attacks has drawn only limited attention from the NLP community, even though it is instrumental in developing robust and trustworthy systems. This paper makes two important contributions to this line of research: (i) we introduce LAROUSSE, a new framework to detect textual adversarial attacks, and (ii) we introduce STAKEOUT, a new benchmark composed of nine popular attack methods, three datasets, and two pre-trained models. LAROUSSE is ready to use in production as it is unsupervised, hyperparameter-free, and non-differentiable, protecting it against gradient-based methods. Our new benchmark STAKEOUT provides a robust evaluation framework: we conduct extensive numerical experiments which demonstrate that LAROUSSE outperforms previous methods, and which allow us to identify interesting factors behind variations in detection rate.

* Findings EMNLP 2023

A Functional Data Perspective and Baseline On Multi-Layer Out-of-Distribution Detection

Jun 06, 2023

Eduardo Dadalto, Pierre Colombo, Guillaume Staerman, Nathan Noiry, Pablo Piantanida

Abstract: A key feature of out-of-distribution (OOD) detection is to exploit a trained neural network by extracting statistical patterns and relationships through the multi-layer classifier to detect shifts in the expected input data distribution. Despite achieving solid results, several state-of-the-art methods rely on the penultimate or last layer outputs only, leaving behind valuable information for OOD detection. Methods that explore the multiple layers require either a special architecture or a supervised objective to do so. This work adopts an original approach based on a functional view of the network that exploits the sample's trajectories through the various layers and their statistical dependencies. It goes beyond multivariate feature aggregation and introduces a baseline rooted in functional anomaly detection. In this new framework, OOD detection translates into detecting samples whose trajectories differ from the typical behavior characterized by the training set. We validate our method and empirically demonstrate its effectiveness in OOD detection compared to strong state-of-the-art baselines on computer vision benchmarks.

Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks

May 17, 2023

Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stephan Clemencon, Pierre Colombo

Abstract: The evaluation of natural language processing (NLP) systems is crucial for advancing the field, but current benchmarking approaches often assume that all systems have scores available for all tasks, which is not always practical. In reality, several factors such as the cost of running baselines, private systems, computational limitations, or incomplete data may prevent some systems from being evaluated on entire tasks. This paper formalizes an existing problem in NLP research, benchmarking when some systems' scores are missing on a task, and proposes a novel approach to address it. Our method uses a compatible partial ranking approach to impute missing data, which is then aggregated using the Borda count method. It includes two refinements designed specifically for scenarios where either task-level or instance-level scores are available. We also introduce an extended benchmark, which contains over 131 million scores, an order of magnitude larger than existing benchmarks. We validate our methods and demonstrate their effectiveness in addressing the challenge of missing system evaluation on an entire task. This work highlights the need for more comprehensive benchmarking approaches that can handle real-world scenarios where not all systems are evaluated on the entire task.
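
The combination of partial rankings and Borda count described in this abstract can be illustrated with a small sketch. It shows one simple way to realize the idea and is not necessarily the paper's exact estimator: each task ranks only the systems it has scores for, and a system's Borda score is its expected value over all total orders compatible with that partial ranking, assumed here to be equally likely. The function names `expected_borda` and `aggregate` are hypothetical.

```python
def expected_borda(partial_ranking, all_systems):
    """Expected Borda score of each system for one task.

    partial_ranking lists the systems evaluated on the task, worst first.
    Systems in all_systems but absent from partial_ranking are missing.
    The expectation is taken uniformly over all total orders that keep
    the evaluated systems in the given relative order.
    """
    k = len(partial_ranking)
    n = len(all_systems)
    m = n - k  # number of missing systems
    scores = {}
    for i, system in enumerate(partial_ranking):
        # Beats i evaluated systems for sure, and each missing system
        # with probability (i + 1) / (k + 1).
        scores[system] = i + m * (i + 1) / (k + 1)
    for system in all_systems:
        if system not in scores:
            # A missing system is equally likely to land at any position.
            scores[system] = (n - 1) / 2
    return scores


def aggregate(task_rankings, all_systems):
    """Sum expected Borda scores over tasks and return systems, best first."""
    totals = {s: 0.0 for s in all_systems}
    for ranking in task_rankings:
        for system, value in expected_borda(ranking, all_systems).items():
            totals[system] += value
    return sorted(all_systems, key=lambda s: totals[s], reverse=True)
```

For three systems and a task that only observed B below A, the three compatible orders give A an expected score of 5/3, B of 1/3, and the missing system C of 1, which is what `expected_borda(["B", "A"], ["A", "B", "C"])` returns.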

Beyond Mahalanobis-Based Scores for Textual OOD Detection

Nov 24, 2022

Pierre Colombo, Eduardo D. C. Gomes, Guillaume Staerman, Nathan Noiry, Pablo Piantanida

Abstract: Deep learning methods have boosted the adoption of NLP systems in real-life applications. However, they turn out to be vulnerable to distribution shifts over time, which may cause severe dysfunctions in production systems, urging practitioners to develop tools to detect out-of-distribution (OOD) samples through the lens of the neural network. In this paper, we introduce TRUSTED, a new OOD detector for classifiers based on Transformer architectures that meets operational requirements: it is unsupervised and fast to compute. The efficiency of TRUSTED relies on the fruitful idea that all hidden layers carry relevant information to detect OOD examples. Based on this, for a given input, TRUSTED consists in (i) aggregating this information and (ii) computing a similarity score by exploiting the training distribution, leveraging the powerful concept of data depth. Our extensive numerical experiments involve 51k model configurations, including various checkpoints, seeds, and datasets, and demonstrate that TRUSTED achieves state-of-the-art performance. In particular, it improves on previous AUROC results by more than 3 points.

* NeurIPS 2022
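
The two steps named in the abstract, aggregating hidden layers and scoring an input by its data depth with respect to the training distribution, can be sketched as follows. The abstract does not say how layers are aggregated or which depth function is used, so both choices below are assumptions: layers are averaged, and depth is estimated with random one-dimensional projections. The names `average_layers` and `projection_depth_score` are hypothetical.

```python
import random


def average_layers(hidden_states):
    """Aggregate per-layer embeddings (equal-length vectors) by their mean."""
    n_layers = len(hidden_states)
    return [sum(column) / n_layers for column in zip(*hidden_states)]


def projection_depth_score(x, train, n_dirs=200, seed=0):
    """Monte Carlo depth of x relative to the training vectors.

    For each random direction u, compute the fraction F of training
    projections that are <= <u, x> and keep min(F, 1 - F); the score is
    the average over directions. Scores near 0.5 mean x is central,
    scores near 0 mean x is outlying (a candidate OOD sample).
    """
    rng = random.Random(seed)
    dim = len(x)
    total = 0.0
    for _ in range(n_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        proj_x = sum(a * b for a, b in zip(u, x))
        below = sum(1 for t in train
                    if sum(a * b for a, b in zip(u, t)) <= proj_x)
        frac = below / len(train)
        total += min(frac, 1.0 - frac)
    return total / n_dirs
```

A detector built this way needs no labels and no training beyond storing the aggregated training embeddings, which matches the "unsupervised and fast to compute" requirement stated in the abstract.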

Mitigating Gender Bias in Face Recognition Using the von Mises-Fisher Mixture Model

Oct 24, 2022

Jean-Rémy Conti, Nathan Noiry, Vincent Despiegel, Stéphane Gentric, Stéphan Clémençon

Abstract: In spite of the high performance and reliability of deep learning algorithms in a wide range of everyday applications, many investigations tend to show that many models exhibit biases, discriminating against specific subgroups of the population (e.g. gender, ethnicity). This urges the practitioner to develop fair systems with uniform or comparable performance across sensitive groups. In this work, we investigate the gender bias of deep Face Recognition networks. In order to measure this bias, we introduce two new metrics, $\mathrm{BFAR}$ and $\mathrm{BFRR}$, that better reflect the inherent deployment needs of Face Recognition systems. Motivated by geometric considerations, we mitigate gender bias through a new post-processing methodology which transforms the deep embeddings of a pre-trained model to give more representation power to discriminated subgroups. It consists in training a shallow neural network by minimizing a Fair von Mises-Fisher loss whose hyperparameters account for the intra-class variance of each gender. Interestingly, we empirically observe that these hyperparameters are correlated with our fairness metrics. In fact, extensive numerical experiments on a variety of datasets show that a careful selection significantly reduces gender bias.

* Proceedings of the 39th International Conference on Machine Learning, PMLR 162:4344-4369, 2022

The Glass Ceiling of Automatic Evaluation in Natural Language Generation

Aug 31, 2022

Pierre Colombo, Maxime Peyrard, Nathan Noiry, Robert West, Pablo Piantanida

Abstract: Automatic evaluation metrics capable of replacing human judgments are critical to allowing fast development of new methods. Thus, numerous research efforts have focused on crafting such metrics. In this work, we take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics together. As metrics are used based on how they rank systems, we compare metrics in the space of system rankings. Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans. Automatic metrics are not complementary and rank systems similarly. Strikingly, human metrics predict each other much better than the combination of all automatic metrics used to predict a human metric. This is surprising because human metrics are often designed to be independent, to capture different aspects of quality, e.g. content fidelity or readability. We provide a discussion of these findings and recommendations for future work in the field of evaluation.
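
Comparing metrics "in the space of system rankings" means asking how similarly two metrics order the same systems. A minimal sketch follows, assuming Kendall's rank correlation as the similarity measure; the abstract does not name the measure the authors use, and the function name `kendall_tau` is hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two metrics' system scores.

    scores_a and scores_b map system name -> score under each metric.
    Returns a value in [-1, 1]; 1 means both metrics order every pair of
    systems the same way. Tied pairs count as neither concordant nor
    discordant.
    """
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs
```

With made-up scores for three systems, two automatic metrics that agree on every pair give a correlation of 1, while an automatic metric and a human rating that disagree on one of the three pairs give 1/3. Raw score values never enter the comparison, only the orderings they induce.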

Learning Disentangled Textual Representations via Statistical Measures of Similarity

May 07, 2022

Pierre Colombo, Guillaume Staerman, Nathan Noiry, Pablo Piantanida

Abstract: When working with textual data, a natural application of disentangled representations is fair classification, where the goal is to make predictions without being biased (or influenced) by sensitive attributes that may be present in the data (e.g., age, gender or race). Dominant approaches to disentangling a sensitive attribute from textual representations rely on simultaneously learning a penalization term that involves either an adversarial loss (e.g., a discriminator) or an information measure (e.g., mutual information). However, these methods require the training of a deep neural network with several parameter updates for each update of the representation model. The resulting nested optimization loop is time-consuming, adds complexity to the optimization dynamics, and requires careful hyperparameter selection (e.g., learning rates, architecture). In this work, we introduce a family of regularizers for learning disentangled representations that do not require training. These regularizers are based on statistical measures of similarity between the conditional probability distributions with respect to the sensitive attributes. Our novel regularizers do not require additional training, are faster, and do not involve additional tuning, while achieving better results both when combined with pretrained and with randomly initialized text encoders.

* ACL 2022
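
A regularizer of the kind described in the abstract compares the representations of two sensitive groups with a closed-form statistic, so no auxiliary network has to be trained. The sketch below uses the maximum mean discrepancy (MMD) with a Gaussian kernel as one example of such a statistic; the choice of MMD, the kernel, and the names `rbf` and `mmd_squared` are assumptions for illustration, not details taken from the abstract.

```python
import math


def rbf(x, y, bandwidth=1.0):
    """Gaussian kernel between two vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(group_a, group_b, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of vectors.

    group_a and group_b hold the representations of examples with
    sensitive attribute 0 and 1. A value near 0 means the two groups are
    hard to tell apart in representation space.
    """
    def mean_kernel(xs, ys):
        total = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(group_a, group_a)
            + mean_kernel(group_b, group_b)
            - 2.0 * mean_kernel(group_a, group_b))
```

In training, a term like this would be added to the classification loss with a weighting coefficient, so that the encoder is pushed toward representations whose distribution does not depend on the sensitive attribute.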

What are the best systems? New perspectives on NLP Benchmarking

Feb 10, 2022

Pierre Colombo, Nathan Noiry, Ekhine Irurozki, Stephan Clemencon

Abstract: In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics, together with a way to aggregate the performance of different systems. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and more robust.
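
The scale problem with mean aggregation can be seen on a toy example. The sketch below contrasts averaging raw scores with aggregating per-task rankings; Borda count is used as a familiar stand-in for ranking aggregation, since the abstract does not name the exact procedure, and the function names and scores are made up for illustration.

```python
def mean_aggregation(scores):
    """Order systems by their average raw score, best first.

    scores maps system name -> list of scores, one per task.
    """
    return sorted(scores, key=lambda s: sum(scores[s]) / len(scores[s]),
                  reverse=True)


def borda_aggregation(scores):
    """Order systems by summed per-task rank positions, best first.

    On each task a system earns as many points as systems it beats.
    Ties within a task are not handled specially in this sketch.
    """
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {s: 0 for s in systems}
    for task in range(n_tasks):
        ranked = sorted(systems, key=lambda s: scores[s][task])  # worst first
        for position, system in enumerate(ranked):
            points[system] += position
    return sorted(systems, key=lambda s: points[s], reverse=True)


# Two tasks scored in [0, 1] and one scored in [0, 100] (hypothetical values).
toy_scores = {
    "A": [0.80, 0.82, 60.0],
    "B": [0.70, 0.75, 75.0],
    "C": [0.60, 0.65, 40.0],
}
```

Here `mean_aggregation(toy_scores)` puts B first because the third task's larger scale dominates the average, whereas `borda_aggregation(toy_scores)` puts A first because A wins two of the three tasks.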
