Abstract:Recent advances in machine learning have highlighted Federated Learning (FL) as a promising approach that enables multiple distributed users (so-called clients) to collectively train ML models without sharing their private data. While this privacy-preserving method shows potential, it struggles when data across clients is not independent and identically distributed (non-IID) data. The latter remains an unsolved challenge that can result in poorer model performance and slower training times. Despite the significance of non-IID data in FL, there is a lack of consensus among researchers about its classification and quantification. This systematic review aims to fill that gap by providing a detailed taxonomy for non-IID data, partition protocols, and metrics to quantify data heterogeneity. Additionally, we describe popular solutions to address non-IID data and standardized frameworks employed in FL with heterogeneous data. Based on our state-of-the-art review, we present key lessons learned and suggest promising future research directions.
Abstract:In recent years, many studies have been focusing on predicting the scientific impact of research papers. Most of these predictions are based on citations count or rely on features obtainable only from already published papers. In this study, we predict the likelihood for a research paper of winning an award only relying on information available at publication time. For each paper, we build the citation subgraph induced from its bibliography. We initially consider some features of this subgraph, such as the density and the global clustering coefficient, to make our prediction. Then, we mix this information with textual features, extracted from the abstract and the title, to obtain a more accurate final prediction. We made our experiments considering the ArnetMiner citation graph, while the ground truth on award-winning papers has been obtained from a collection of best paper awards from 32 computer science conferences. In our experiment, we obtained an encouraging F1 score of 0.694. Remarkably, The high recall and the low false negatives rate, show how the model performs very well at identifying papers that will not win an award. This behavior can help researchers in getting a first evaluation of their work at publication time. Lastly, we made some first experiments on interpretability. Our results highlight some interesting patterns both in topological and textual features.
Abstract:Background: AI-based analysis of sufficiently large, curated medical datasets has been shown to be promising for providing early detection, faster diagnosis, better decision-making, and more effective treatment. However, accessing such highly confidential and very sensitive medical data, obtained from a variety of sources, is usually highly restricted since improper use, unsafe storage, data leakage or abuse could violate a person's privacy. In this work we apply a federated learning paradigm over a heterogeneous, siloed sets of high-definition electrocardiogram arriving from 12-leads ECG sensors arrays to train AI models. We evaluated the capacity of the resulting models to achieve equivalent performance when compared to state-of-the-art models trained when the same data is collected in a central place. Methods: We propose a privacy preserving methodology for training AI models based on the federated learning paradigm over a heterogeneous, distributed, dataset. The methodology is applied to a broad range of machine learning techniques based on gradient boosting, convolutional neural network and recurrent neural networks with long short-term memory. The models were trained over a ECG dataset containing 12-leads recordings collected from 43,059 patients from six geographically separate and heterogeneous sources. Findings: The resulting set of AI models for detecting cardiovascular abnormalities achieved comparable predictive performances against models trained using a centralised learning approach. Interpretation: The approach of compute parameters contributing to the global model locally and then exchange only such parameters instead of the whole sensitive data as in ML contributes to preserve medical data privacy.
Abstract:Plants monitor their surrounding environment and control their physiological functions by producing an electrical response. We recorded electrical signals from different plants by exposing them to Sodium Chloride (NaCl), Ozone (O3) and Sulfuric Acid (H2SO4) under laboratory conditions. After applying pre-processing techniques such as filtering and drift removal, we extracted few statistical features from the acquired plant electrical signals. Using these features, combined with different classification algorithms, we used a decision tree based multi-class classification strategy to identify the three different external chemical stimuli. We here present our exploration to obtain the optimum set of ranked feature and classifier combination that can separate a particular chemical stimulus from the incoming stream of plant electrical signals. The paper also reports an exhaustive comparison of similar feature based classification using the filtered and the raw plant signals, containing the high frequency stochastic part and also the low frequency trends present in it, as two different cases for feature extraction. The work, presented in this paper opens up new possibilities for using plant electrical signals to monitor and detect other environmental stimuli apart from NaCl, O3 and H2SO4 in future.
Abstract:Plants sense their environment by producing electrical signals which in essence represent changes in underlying physiological processes. These electrical signals, when monitored, show both stochastic and deterministic dynamics. In this paper, we compute 11 statistical features from the raw non-stationary plant electrical signal time series to classify the stimulus applied (causing the electrical signal). By using different discriminant analysis based classification techniques, we successfully establish that there is enough information in the raw electrical signal to classify the stimuli. In the process, we also propose two standard features which consistently give good classification results for three types of stimuli - Sodium Chloride (NaCl), Sulphuric Acid (H2SO4) and Ozone (O3). This may facilitate reduction in the complexity involved in computing all the features for online classification of similar external stimuli in future.