CBIO
Abstract:Inferring the structure of gene regulatory networks (GRN) from gene expression data has many applications, from the elucidation of complex biological processes to the identification of potential drug targets. It is however a notoriously difficult problem, for which the many existing methods reach limited accuracy. In this paper, we formulate GRN inference as a sparse regression problem and investigate the performance of a popular feature selection method, least angle regression (LARS) combined with stability selection. We introduce a novel, robust and accurate scoring technique for stability selection, which improves the performance of feature selection with LARS. The resulting method, which we call TIGRESS (Trustful Inference of Gene REgulation using Stability Selection), was ranked among the top methods in the DREAM5 gene network reconstruction challenge. We investigate in depth the influence of the various parameters of the method and show that a fine parameter tuning can lead to significant improvements and state-of-the-art performance for GRN inference. TIGRESS reaches state-of-the-art performance on benchmark data. This study confirms the potential of feature selection techniques for GRN inference. Code and data are available on http://cbio.ensmp.fr/~ahaury. Running TIGRESS online is possible on GenePattern: http://www.broadinstitute.org/cancer/software/genepattern/.
Abstract:Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. Here we propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which allows to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.
Abstract:We consider the problem of learning a binary classifier from a training set of positive and unlabeled examples, both in the inductive and in the transductive setting. This problem, often referred to as \emph{PU learning}, differs from the standard supervised classification problem by the lack of negative examples in the training set. It corresponds to an ubiquitous situation in many applications such as information retrieval or gene ranking, when we have identified a set of data of interest sharing a particular property, and we wish to automatically retrieve additional data sharing the same property among a large and easily available pool of unlabeled data. We propose a conceptually simple method, akin to bagging, to approach both inductive and transductive PU learning problems, by converting them into series of supervised binary classification problems discriminating the known positive examples from random subsamples of the unlabeled set. We empirically demonstrate the relevance of the method on simulated and real data, where it performs at least as well as existing methods while being faster.