Aneta Polewko-Klim

Analysis of ensemble feature selection for correlated high-dimensional RNA-Seq cancer data

Apr 28, 2020

Aneta Polewko-Klim, Witold R. Rudnicki

Abstract: Discovery of diagnostic and prognostic molecular markers is an important and actively pursued field in cancer research. For complex diseases, this process is often performed using Machine Learning. The current study compares two approaches for the discovery of relevant variables: application of a single feature selection algorithm versus an ensemble of diverse algorithms. These approaches are used to identify variables that are relevant for discerning four cancer types using RNA-seq profiles from the Cancer Genome Atlas. The comparison is carried out in two directions: evaluating the predictive performance of models and monitoring the stability of selected variables. The most informative features are identified using four feature selection algorithms, namely U-test, ReliefF, and two variants of the MDFS algorithm. Discerning normal and tumor tissues is performed using the Random Forest algorithm. The highest stability of the feature set was obtained when U-test was used. Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than models developed on feature sets obtained from individual algorithms. On the other hand, the feature selectors leading to the best classification results varied between data sets.

* 14 pages, 1 table, 29 figures, submitted to International Conference on Computational Science, Amsterdam 2020
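
The abstract above contrasts a single feature selector with an ensemble and tracks how stable the selected feature sets are. The sketch below illustrates both ideas in plain Python. It is not the paper's procedure: the abstract does not say which stability measure or aggregation rule was used, so the mean pairwise Jaccard index and the simple vote threshold are assumptions, and the gene names and selector outputs are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_stability(feature_sets):
    """Average Jaccard similarity over all pairs of selected feature sets."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def ensemble_by_votes(selections, min_votes):
    """Keep features chosen by at least `min_votes` selectors (assumed rule)."""
    counts = {}
    for selected in selections:
        for feature in set(selected):
            counts[feature] = counts.get(feature, 0) + 1
    return sorted(f for f, c in counts.items() if c >= min_votes)


# Hypothetical top-3 feature sets from three resampling runs of one selector.
runs = [["g1", "g2", "g3"], ["g1", "g2", "g4"], ["g1", "g3", "g4"]]
stability = mean_pairwise_stability(runs)

# Hypothetical top-3 feature sets from three different selectors on one data set.
selectors = {
    "u_test": ["g1", "g2", "g3"],
    "relieff": ["g2", "g3", "g5"],
    "mdfs_2d": ["g3", "g5", "g6"],
}
consensus = ensemble_by_votes(selectors.values(), min_votes=2)

print("stability:", stability)
print("consensus features:", consensus)
```

A stability close to 1 means the selector returns nearly the same features on every resample; the consensus list would then be passed to a classifier such as a Random Forest.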

Bootstrap Bias Corrected Cross Validation applied to Super Learning

Mar 18, 2020

Krzysztof Mnich, Agnieszka Kitlas Golińska, Aneta Polewko-Klim, Witold R. Rudnicki

Abstract: The super learner algorithm can be applied to combine the results of multiple base learners to improve the quality of predictions. The default method for verifying super learner results is nested cross validation. Tsamardinos et al. proposed that nested cross validation can be replaced by resampling when tuning hyper-parameters of learning algorithms. We apply this idea to the verification of the super learner and compare it with other verification methods, including nested cross validation. Tests were performed on artificial data sets of diverse size and on seven real biomedical data sets. The resampling method, called Bootstrap Bias Correction, proved to be a reasonably precise and very cost-efficient alternative to nested cross validation.

* 14 pages, 4 tables, 1 figure, submitted to International Conference on Computational Science, Amsterdam 2020
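
The resampling idea mentioned in the abstract above can be sketched roughly as follows. This is a minimal illustration of bootstrap bias correction as it is generally described for model selection, not the authors' code: it assumes the out-of-sample losses of every candidate configuration have already been pooled from ordinary cross validation into a matrix, and the function name `bbc_cv`, the 0/1 losses, and the demo data are all hypothetical.

```python
import random


def bbc_cv(loss_matrix, n_bootstrap=200, seed=0):
    """Bootstrap-corrected estimate of the loss of the selected configuration.

    loss_matrix[i][c] is the out-of-sample loss of configuration c on
    sample i, pooled over the folds of a single cross-validation run.
    """
    if not loss_matrix or not loss_matrix[0]:
        raise ValueError("loss_matrix must be a non-empty 2D list")
    rng = random.Random(seed)
    n = len(loss_matrix)
    n_conf = len(loss_matrix[0])
    estimates = []
    for _ in range(n_bootstrap):
        # Resample rows with replacement; unseen rows form the out-of-bag set.
        in_bag = [rng.randrange(n) for _ in range(n)]
        in_bag_set = set(in_bag)
        out_of_bag = [i for i in range(n) if i not in in_bag_set]
        if not out_of_bag:
            continue  # rare for realistic n; skip this resample
        # Select the configuration with the lowest in-bag loss ...
        best = min(range(n_conf),
                   key=lambda c: sum(loss_matrix[i][c] for i in in_bag))
        # ... and score that choice only on rows it was not selected on.
        oob_loss = sum(loss_matrix[i][best] for i in out_of_bag)
        estimates.append(oob_loss / len(out_of_bag))
    if not estimates:
        raise ValueError("no bootstrap sample had out-of-bag rows")
    return sum(estimates) / len(estimates)


# Hypothetical pooled 0/1 losses: 40 samples, 5 configurations guessing at random.
demo_rng = random.Random(1)
demo_losses = [[float(demo_rng.random() < 0.5) for _ in range(5)]
               for _ in range(40)]

# Naive estimate: loss of the best configuration on the same data used to pick it.
naive_estimate = min(sum(row[c] for row in demo_losses) / len(demo_losses)
                     for c in range(5))
bbc_estimate = bbc_cv(demo_losses)

print("naive estimate:", naive_estimate)
print("bootstrap-corrected estimate:", bbc_estimate)
```

Because the models are never refit inside the bootstrap loop, the cost is a single cross-validation run plus cheap resampling of the stored losses, which is where the savings over nested cross validation come from.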

MDFS - MultiDimensional Feature Selection

Oct 31, 2018

Radosław Piliszek, Krzysztof Mnich, Szymon Migacz, Paweł Tabaszewski, Andrzej Sułecki, Aneta Polewko-Klim, Witold Rudnicki

Abstract: Identification of informative variables in an information system is often performed using simple one-dimensional filtering procedures that discard information about interactions between variables. Such an approach may result in removing some relevant variables from consideration. Here we present an R package MDFS (MultiDimensional Feature Selection) that performs identification of informative variables taking into account synergistic interactions between multiple descriptors and the decision variable. MDFS is an implementation of an algorithm based on information theory. The computational kernel of the package is implemented in C++. A high-performance version implemented in CUDA C is also available. The applications of MDFS are demonstrated using the well-known Madelon dataset, which has synergistic variables by design. The dataset comes from the UCI Machine Learning Repository. It is shown that multidimensional analysis is more sensitive than one-dimensional tests and returns more reliable rankings of importance.

* 12 pages, 3 figures, 5 tables, license: CC-BY
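
To see why a one-dimensional filter can miss synergistic variables, consider the toy case below. It is a plain-Python illustration and does not use the MDFS package or its API: the functions are hypothetical, and the conditional information gain H(Y|S) - H(Y|X,S) is assumed here as the measure of what a variable adds once a partner variable is known.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(y, *xs):
    """H(Y | X1, ..., Xk) for discrete variables given as equal-length lists."""
    n = len(y)
    groups = {}
    for i in range(n):
        key = tuple(x[i] for x in xs)
        groups.setdefault(key, []).append(y[i])
    return sum(len(g) / n * entropy(g) for g in groups.values())


def info_gain_1d(y, x):
    """Information x carries about y on its own: H(Y) - H(Y | x)."""
    return entropy(y) - conditional_entropy(y, x)


def info_gain_2d(y, x, partner):
    """Information x adds about y once partner is known:
    H(Y | partner) - H(Y | x, partner)."""
    return conditional_entropy(y, partner) - conditional_entropy(y, x, partner)


# XOR: neither variable is informative alone, but together they determine y.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

ig_alone = info_gain_1d(y, x1)
ig_with_partner = info_gain_2d(y, x1, x2)

print("1D information gain of x1:", ig_alone)
print("2D information gain of x1 given x2:", ig_with_partner)
```

A one-dimensional test would discard `x1` here, while the two-dimensional gain recovers the full bit of information about `y`. A real analysis would also need discretization of continuous variables and a statistical test on the gains, which this sketch leaves out.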
