Anthi Papadopoulou

Aristotle University of Thessaloniki

Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis

Oct 22, 2023

Anthi Papadopoulou, Pierre Lison, Mark Anderson, Lilja Øvrelid, Ildikó Pilán

Abstract: Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it. In this paper, we consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance on two recently published datasets: the Text Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022). The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information. This privacy-oriented entity recognizer is trained by combining a standard named entity recognition model with a gazetteer populated by person-related terms extracted from Wikidata. The second step of the text sanitization process consists in assessing the privacy risk associated with each detected text span, either in isolation or in combination with other text spans. We present five distinct indicators of the re-identification risk, respectively based on language model probabilities, text span classification, sequence labelling, perturbations, and web search. We provide a contrastive analysis of each privacy indicator and highlight their benefits and limitations, notably in relation to the available labeled data.
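The two-step structure described in this abstract (detect candidate spans, then decide per span whether the risk justifies masking) can be sketched roughly as follows. This is only an illustration under stated assumptions: the detector is a toy gazetteer lookup rather than the trained entity recognizer, and `detect_spans`, `sanitize`, the `risk` callback and the `threshold` value are hypothetical names, not the authors' code or any of their five indicators.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def detect_spans(text: str, gazetteer: List[str]) -> List[Span]:
    """Step 1 (toy stand-in): find every occurrence of a gazetteer term."""
    spans = []
    for term in gazetteer:
        for match in re.finditer(re.escape(term), text):
            spans.append((match.start(), match.end()))
    return sorted(spans)


def sanitize(
    text: str,
    spans: List[Span],
    risk: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
    mask: str = "***",
) -> str:
    """Step 2 (toy stand-in): mask the spans whose risk reaches the threshold."""
    pieces, cursor = [], 0
    for start, end in spans:
        if start < cursor:  # skip spans overlapping one already handled
            continue
        pieces.append(text[cursor:start])
        span_text = text[start:end]
        pieces.append(mask if risk(span_text) >= threshold else span_text)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical risk indicator: capitalised spans are treated as high risk.
def toy_risk(span_text: str) -> float:
    return 1.0 if span_text[0].isupper() else 0.1


example = "Jane Doe is a nurse from Oslo."
found = detect_spans(example, ["Jane Doe", "nurse", "Oslo"])
sanitized = sanitize(example, found, toy_risk)
```

In this toy run, the name and the city are masked while the occupation is detected but left in place, which is the point of separating detection from risk assessment.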

Bootstrapping Text Anonymization Models with Distant Supervision

May 13, 2022

Anthi Papadopoulou, Pierre Lison, Lilja Øvrelid, Ildikó Pilán

Abstract: We propose a novel method to bootstrap text anonymization models based on distant supervision. Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be publicly available about various individuals. This knowledge graph is employed to automatically annotate text documents including personal data about a subset of those individuals. More precisely, the method determines which text spans ought to be masked in order to guarantee $k$-anonymity, assuming an adversary with access to both the text documents and the background information expressed in the knowledge graph. The resulting collection of labeled documents is then used as training data to fine-tune a pre-trained language model for text anonymization. We illustrate this approach using a knowledge graph extracted from Wikidata and short biographical texts from Wikipedia. Evaluation results with a RoBERTa-based model and a manually annotated collection of 553 summaries showcase the potential of the approach, but also unveil a number of issues that may arise if the knowledge graph is noisy or incomplete. The results also illustrate that, contrary to most sequence labeling problems, the text anonymization task may admit several alternative solutions.
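The $k$-anonymity criterion mentioned in this abstract can be illustrated with a small sketch: keep masking disclosed attributes until at least $k$ individuals in the background knowledge are compatible with what remains visible. The greedy selection rule, the function names and the tiny knowledge base below are all assumptions made for illustration; the abstract does not specify the authors' actual search procedure.

```python
from typing import Dict, List, Set


def matching_individuals(kb: Dict[str, Set[str]], disclosed: Set[str]) -> Set[str]:
    """Individuals whose known attributes include everything still disclosed."""
    return {person for person, attrs in kb.items() if disclosed <= attrs}


def mask_for_k_anonymity(
    kb: Dict[str, Set[str]], disclosed: Set[str], k: int
) -> List[str]:
    """Greedy heuristic (an assumption, not the paper's algorithm):
    repeatedly mask the attribute whose removal leaves the largest
    set of compatible individuals, until at least k remain."""
    remaining = set(disclosed)
    masked: List[str] = []
    while remaining and len(matching_individuals(kb, remaining)) < k:
        best = max(
            sorted(remaining),  # sorted for a deterministic tie-break
            key=lambda a: len(matching_individuals(kb, remaining - {a})),
        )
        remaining.remove(best)
        masked.append(best)
    return masked


# Hypothetical background knowledge about three individuals.
kb = {
    "A": {"physicist", "born 1950", "Oslo"},
    "B": {"physicist", "born 1950", "Bergen"},
    "C": {"physicist", "born 1962", "Oslo"},
}
to_mask = mask_for_k_anonymity(kb, {"physicist", "born 1950", "Oslo"}, k=2)
```

Here masking either "Oslo" or "born 1950" would reach 2-anonymity, and the tie-break simply picks one of them. This is a small instance of the abstract's closing remark that the task may admit several alternative solutions.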

The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization

Jan 25, 2022

Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Anthi Papadopoulou, David Sánchez, Montserrat Batet

Abstract: We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored towards measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus, along with its privacy-oriented annotation guidelines, evaluation scripts and baseline models, is available at: https://github.com/NorskRegnesentral/text-anonymisation-benchmark
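The privacy/utility trade-off that the abstract refers to can be illustrated with plain token-level recall and precision over masked positions: recall stands in for privacy protection (how much of what should be masked was masked) and precision for utility preservation (how much of what was masked needed masking). This is a deliberately simplified sketch; the function name and the example indices are hypothetical, and the metrics actually proposed for TAB are defined in the paper and its evaluation scripts, not here.

```python
from typing import Set, Tuple


def mask_recall_precision(gold: Set[int], predicted: Set[int]) -> Tuple[float, float]:
    """Token-level recall and precision over masked token indices.

    gold      -- indices of tokens the annotators marked for masking
    predicted -- indices of tokens the system actually masked
    Empty sets are scored as 1.0 by convention (an assumption).
    """
    true_positives = len(gold & predicted)
    recall = true_positives / len(gold) if gold else 1.0
    precision = true_positives / len(predicted) if predicted else 1.0
    return recall, precision


# Hypothetical example: three tokens should be masked, the system masked four.
recall, precision = mask_recall_precision({0, 1, 5}, {0, 1, 2, 3})
```

A system that masks everything would reach perfect recall with poor precision, which is why the two quantities need to be reported together.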

RNNoise-Ex: Hybrid Speech Enhancement System based on RNN and Spectral Features

May 25, 2021

Constantine C. Doumanidis, Christina Anagnostou, Evangelia-Sofia Arvaniti, Anthi Papadopoulou

Abstract: Recent interest in exploiting Deep Learning techniques for Noise Suppression has led to the creation of Hybrid Denoising Systems that combine classic Signal Processing with Deep Learning. In this paper, we concentrate our efforts on extending the RNNoise denoising system (arXiv:1709.08243) with the inclusion of complementary features during the training phase. We explain the set-up process of the modified system in detail and report comparative results from a performance evaluation, using a reference version of RNNoise as control.

* 6 pages, 5 figures, presented at ECESCON 12, for code see https://github.com/CedArctic/rnnoise-ex
