Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Arthur Zimek

Achieving Hilbert-Schmidt Independence Under Rényi Differential Privacy for Fair and Private Data Generation

Aug 29, 2025

Tobias Hyrup, Emmanouil Panagiotou, Arjun Roy, Arthur Zimek, Eirini Ntoutsi, Peter Schneider-Kamp

Abstract:As privacy regulations such as the GDPR and HIPAA and responsibility frameworks for artificial intelligence such as the AI Act gain traction, the ethical and responsible use of real-world data faces increasing constraints. Synthetic data generation has emerged as a promising solution to risk-aware data sharing and model development, particularly for tabular datasets that are foundational to sensitive domains such as healthcare. To address both privacy and fairness concerns in this setting, we propose FLIP (Fair Latent Intervention under Privacy guarantees), a transformer-based variational autoencoder augmented with latent diffusion to generate heterogeneous tabular data. Unlike the typical setup in fairness-aware data generation, we assume a task-agnostic setup, not reliant on a fixed, defined downstream task, thus offering broader applicability. To ensure privacy, FLIP employs R\'enyi differential privacy (RDP) constraints during training and addresses fairness in the input space with RDP-compatible balanced sampling that accounts for group-specific noise levels across multiple sampling rates. In the latent space, we promote fairness by aligning neuron activation patterns across protected groups using Centered Kernel Alignment (CKA), a similarity measure extending the Hilbert-Schmidt Independence Criterion (HSIC). This alignment encourages statistical independence between latent representations and the protected feature. Empirical results demonstrate that FLIP effectively provides significant fairness improvements for task-agnostic fairness and across diverse downstream tasks under differential privacy constraints.

Via

Access Paper or Ask Questions

Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation -- Extended Version

Jan 16, 2025

Muhammad Rajabinasab, Anton D. Lautrup, Arthur Zimek

Figure 1 for Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation -- Extended Version

Figure 2 for Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation -- Extended Version

Figure 3 for Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation -- Extended Version

Figure 4 for Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation -- Extended Version

Abstract:Measuring inter-dataset similarity is an important task in machine learning and data mining with various use cases and applications. Existing methods for measuring inter-dataset similarity are computationally expensive, limited, or sensitive to different entities and non-trivial choices for parameters. They also lack a holistic perspective on the entire dataset. In this paper, we propose two novel metrics for measuring inter-dataset similarity. We discuss the mathematical foundation and the theoretical basis of our proposed metrics. We demonstrate the effectiveness of the proposed metrics by investigating two applications in the evaluation of synthetic data and in the evaluation of feature selection methods. The theoretical and empirical studies conducted in this paper illustrate the effectiveness of the proposed metrics.

* This is the extended version of a paper accepted at 2025 SIAM International Conference on Data Mining (SDM)

Via

Access Paper or Ask Questions

Transparent Neighborhood Approximation for Text Classifier Explanation

Nov 25, 2024

Yi Cai, Arthur Zimek, Eirini Ntoutsi, Gerhard Wunder

Figure 1 for Transparent Neighborhood Approximation for Text Classifier Explanation

Figure 2 for Transparent Neighborhood Approximation for Text Classifier Explanation

Figure 3 for Transparent Neighborhood Approximation for Text Classifier Explanation

Figure 4 for Transparent Neighborhood Approximation for Text Classifier Explanation

Abstract:Recent literature highlights the critical role of neighborhood construction in deriving model-agnostic explanations, with a growing trend toward deploying generative models to improve synthetic instance quality, especially for explaining text classifiers. These approaches overcome the challenges in neighborhood construction posed by the unstructured nature of texts, thereby improving the quality of explanations. However, the deployed generators are usually implemented via neural networks and lack inherent explainability, sparking arguments over the transparency of the explanation process itself. To address this limitation while preserving neighborhood quality, this paper introduces a probability-based editing method as an alternative to black-box text generators. This approach generates neighboring texts by implementing manipulations based on in-text contexts. Substituting the generator-based construction process with recursive probability-based editing, the resultant explanation method, XPROB (explainer with probability-based editing), exhibits competitive performance according to the evaluation conducted on two real-world datasets. Additionally, XPROB's fully transparent and more controllable construction process leads to superior stability compared to the generator-based explainers.

* IEEE DSAA'24

Via

Access Paper or Ask Questions

Robust Statistical Scaling of Outlier Scores: Improving the Quality of Outlier Probabilities for Outliers (Extended Version)

Aug 30, 2024

Philipp Röchner, Henrique O. Marques, Ricardo J. G. B. Campello, Arthur Zimek, Franz Rothlauf

Abstract:Outlier detection algorithms typically assign an outlier score to each observation in a dataset, indicating the degree to which an observation is an outlier. However, these scores are often not comparable across algorithms and can be difficult for humans to interpret. Statistical scaling addresses this problem by transforming outlier scores into outlier probabilities without using ground-truth labels, thereby improving interpretability and comparability across algorithms. However, the quality of this transformation can be different for outliers and inliers. Missing outliers in scenarios where they are of particular interest - such as healthcare, finance, or engineering - can be costly or dangerous. Thus, ensuring good probabilities for outliers is essential. This paper argues that statistical scaling, as commonly used in the literature, does not produce equally good probabilities for outliers as for inliers. Therefore, we propose robust statistical scaling, which uses robust estimators to improve the probabilities for outliers. We evaluate several variants of our method against other outlier score transformations for real-world datasets and outlier detection algorithms, where it can improve the probabilities for outliers.

* 15 pages, 4 figures, extended version of an original article accepted for publication in SISAP 2024 by Springer Nature

Via

Access Paper or Ask Questions

FSDEM: Feature Selection Dynamic Evaluation Metric

Aug 26, 2024

Muhammad Rajabinasab, Anton D. Lautrup, Tobias Hyrup, Arthur Zimek

Abstract:Expressive evaluation metrics are indispensable for informative experiments in all areas, and while several metrics are established in some areas, in others, such as feature selection, only indirect or otherwise limited evaluation metrics are found. In this paper, we propose a novel evaluation metric to address several problems of its predecessors and allow for flexible and reliable evaluation of feature selection algorithms. The proposed metric is a dynamic metric with two properties that can be used to evaluate both the performance and the stability of a feature selection algorithm. We conduct several empirical experiments to illustrate the use of the proposed metric in the successful evaluation of feature selection algorithms. We also provide a comparison and analysis to show the different aspects involved in the evaluation of the feature selection algorithms. The results indicate that the proposed metric is successful in carrying out the evaluation task for feature selection algorithms. This paper is an extended version of a paper accepted at SISAP 2024.

* Short version of this paper is accepted at 17th International Conference on Similarity Search and Applications, SISAP 2024

Via

Access Paper or Ask Questions

SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data

Apr 24, 2024

Anton Danholt Lautrup, Tobias Hyrup, Arthur Zimek, Peter Schneider-Kamp

Abstract:With the growing demand for synthetic data to address contemporary issues in machine learning, such as data scarcity, data fairness, and data privacy, having robust tools for assessing the utility and potential privacy risks of such data becomes crucial. SynthEval, a novel open-source evaluation framework distinguishes itself from existing tools by treating categorical and numerical attributes with equal care, without assuming any special kind of preprocessing steps. This~makes it applicable to virtually any synthetic dataset of tabular records. Our tool leverages statistical and machine learning techniques to comprehensively evaluate synthetic data fidelity and privacy-preserving integrity. SynthEval integrates a wide selection of metrics that can be used independently or in highly customisable benchmark configurations, and can easily be extended with additional metrics. In this paper, we describe SynthEval and illustrate its versatility with examples. The framework facilitates better benchmarking and more consistent comparisons of model capabilities.

Via

Access Paper or Ask Questions

Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis

Jan 10, 2024

Alastair Anderberg, James Bailey, Ricardo J. G. B. Campello, Michael E. Houle, Henrique O. Marques, Miloš Radovanović, Arthur Zimek

Figure 1 for Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis

Figure 2 for Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis

Figure 3 for Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis

Figure 4 for Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis

Abstract:We present a nonparametric method for outlier detection that takes full account of local variations in intrinsic dimensionality within the dataset. Using the theory of Local Intrinsic Dimensionality (LID), our 'dimensionality-aware' outlier detection method, DAO, is derived as an estimator of an asymptotic local expected density ratio involving the query point and a close neighbor drawn at random. The dimensionality-aware behavior of DAO is due to its use of local estimation of LID values in a theoretically-justified way. Through comprehensive experimentation on more than 800 synthetic and real datasets, we show that DAO significantly outperforms three popular and important benchmark outlier detection methods: Local Outlier Factor (LOF), Simplified LOF, and kNN.

* 13 pages, 3 figures. Extended version of a paper accepted for publication at the SIAM International Conference on Data Mining (SDM24)

Via

Access Paper or Ask Questions

Sharing is CAIRing: Characterizing Principles and Assessing Properties of Universal Privacy Evaluation for Synthetic Tabular Data

Dec 19, 2023

Tobias Hyrup, Anton Danholt Lautrup, Arthur Zimek, Peter Schneider-Kamp

Figure 1 for Sharing is CAIRing: Characterizing Principles and Assessing Properties of Universal Privacy Evaluation for Synthetic Tabular Data

Figure 2 for Sharing is CAIRing: Characterizing Principles and Assessing Properties of Universal Privacy Evaluation for Synthetic Tabular Data

Figure 3 for Sharing is CAIRing: Characterizing Principles and Assessing Properties of Universal Privacy Evaluation for Synthetic Tabular Data

Figure 4 for Sharing is CAIRing: Characterizing Principles and Assessing Properties of Universal Privacy Evaluation for Synthetic Tabular Data

Abstract:Data sharing is a necessity for innovative progress in many domains, especially in healthcare. However, the ability to share data is hindered by regulations protecting the privacy of natural persons. Synthetic tabular data provide a promising solution to address data sharing difficulties but does not inherently guarantee privacy. Still, there is a lack of agreement on appropriate methods for assessing the privacy-preserving capabilities of synthetic data, making it difficult to compare results across studies. To the best of our knowledge, this is the first work to identify properties that constitute good universal privacy evaluation metrics for synthetic tabular data. The goal of such metrics is to enable comparability across studies and to allow non-technical stakeholders to understand how privacy is protected. We identify four principles for the assessment of metrics: Comparability, Applicability, Interpretability, and Representativeness (CAIR). To quantify and rank the degree to which evaluation metrics conform to the CAIR principles, we design a rubric using a scale of 1-4. Each of the four properties is scored on four parameters, yielding 16 total dimensions. We study the applicability and usefulness of the CAIR principles and rubric by assessing a selection of metrics popular in other studies. The results provide granular insights into the strengths and weaknesses of existing metrics that not only rank the metrics but highlight areas of potential improvements. We expect that the CAIR principles will foster agreement among researchers and organizations on which universal privacy evaluation metrics are appropriate for synthetic tabular data.

Via

Access Paper or Ask Questions

Explaining text classifiers through progressive neighborhood approximation with realistic samples

Feb 11, 2023

Yi Cai, Arthur Zimek, Eirini Ntoutsi, Gerhard Wunder

Abstract:The importance of neighborhood construction in local explanation methods has been already highlighted in the literature. And several attempts have been made to improve neighborhood quality for high-dimensional data, for example, texts, by adopting generative models. Although the generators produce more realistic samples, the intuitive sampling approaches in the existing solutions leave the latent space underexplored. To overcome this problem, our work, focusing on local model-agnostic explanations for text classifiers, proposes a progressive approximation approach that refines the neighborhood of a to-be-explained decision with a careful two-stage interpolation using counterfactuals as landmarks. We explicitly specify the two properties that should be satisfied by generative models, the reconstruction ability and the locality-preserving property, to guide the selection of generators for local explanation methods. Moreover, noticing the opacity of generative models during the study, we propose another method that implements progressive neighborhood approximation with probability-based editions as an alternative to the generator-based solution. The explanation results from both methods consist of word-level and instance-level explanations benefiting from the realistic neighborhood. Through exhaustive experiments, we qualitatively and quantitatively demonstrate the effectiveness of the two proposed methods.

Via

Access Paper or Ask Questions

Power of Explanations: Towards automatic debiasing in hate speech detection

Sep 07, 2022

Yi Cai, Arthur Zimek, Gerhard Wunder, Eirini Ntoutsi

Figure 1 for Power of Explanations: Towards automatic debiasing in hate speech detection

Figure 2 for Power of Explanations: Towards automatic debiasing in hate speech detection

Figure 3 for Power of Explanations: Towards automatic debiasing in hate speech detection

Figure 4 for Power of Explanations: Towards automatic debiasing in hate speech detection

Abstract:Hate speech detection is a common downstream application of natural language processing (NLP) in the real world. In spite of the increasing accuracy, current data-driven approaches could easily learn biases from the imbalanced data distributions originating from humans. The deployment of biased models could further enhance the existing social biases. But unlike handling tabular data, defining and mitigating biases in text classifiers, which deal with unstructured data, are more challenging. A popular solution for improving machine learning fairness in NLP is to conduct the debiasing process with a list of potentially discriminated words given by human annotators. In addition to suffering from the risks of overlooking the biased terms, exhaustively identifying bias with human annotators are unsustainable since discrimination is variable among different datasets and may evolve over time. To this end, we propose an automatic misuse detector (MiD) relying on an explanation method for detecting potential bias. And built upon that, an end-to-end debiasing framework with the proposed staged correction is designed for text classifiers without any external resources required.

* IEEE DSAA'22

Via

Access Paper or Ask Questions