Review Article
Stable feature selection for biomarker discovery

https://doi.org/10.1016/j.compbiolchem.2010.07.002

Abstract

Feature selection techniques have long served as the workhorse of biomarker discovery. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered; only recently has this issue begun to receive serious attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) to provide an overview of this new yet fast-growing topic for convenient reference; (2) to categorize existing methods under an expandable framework for future research and development.

Introduction

Recent advances in genomics and proteomics enable the discovery of biomarkers for diagnosis and treatment of complex diseases at the molecular level (Srinivas et al., 2002). A biomarker may be defined as “a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention” (Biomarkers Definitions Working Group, 2001).

The discovery of biomarkers from high-throughput “omics” data is typically modeled as selecting the most discriminating features (or variables) for classification (e.g., discriminating healthy versus diseased, or different tumor stages) (Hilario and Kalousis, 2008, Azuaje et al., 2009). In the language of statistics and machine learning, this is often referred to as feature selection. Existing feature selection algorithms can generally be organized into three categories: filter methods, wrapper methods and embedded methods. Filter methods evaluate the relevance of features by looking only at the intrinsic properties of the data. Wrapper methods assess the goodness of candidate feature subsets by the performance of a learning algorithm. Embedded methods combine feature selection and classifier construction in a single integrated computational process. Feature selection has attracted strong research interest in the past several decades. For recent reviews of feature selection techniques used in bioinformatics, the reader is referred to Saeys et al. (2007), Ma and Huang (2008), Hilario and Kalousis (2008) and Duval and Hao (2010).
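As a concrete illustration of the filter category, the sketch below scores each feature independently with a t-statistic-like criterion and keeps the top k. The function names and toy data are illustrative only, not taken from any method reviewed here.

```python
# A minimal sketch of a filter-style feature selector: score each feature
# independently (here with a t-statistic-like criterion) and keep the top k.
# All names and the toy data below are illustrative, not from the article.

def t_like_score(values, labels):
    """Absolute standardized mean difference between the two classes."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    mean_p = sum(pos) / len(pos)
    mean_n = sum(neg) / len(neg)
    var_p = sum((v - mean_p) ** 2 for v in pos) / max(len(pos) - 1, 1)
    var_n = sum((v - mean_n) ** 2 for v in neg) / max(len(neg) - 1, 1)
    pooled = (var_p / len(pos) + var_n / len(neg)) ** 0.5
    return abs(mean_p - mean_n) / pooled if pooled > 0 else 0.0

def filter_select(X, y, k):
    """Return indices of the k highest-scoring features (columns of X)."""
    n_features = len(X[0])
    scores = [t_like_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[5.0, 0.1], [5.2, 0.9], [1.0, 0.5], [1.1, 0.4]]
y = [1, 1, 0, 0]
print(filter_select(X, y, 1))  # feature 0 has the larger score
```

Because each feature is scored in isolation, such a filter is fast but, as discussed below, gives no guarantee that the same subset is returned when the sample changes.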

The non-reproducibility of reported markers remains one of the major obstacles to clinical application. The reproducibility requirement on markers can be interpreted as follows: “The identified feature subset should always exhibit good performance at distinguishing cases from controls across different studies.” In other words, if a reported feature subset does not reproduce well, it is not the true marker. Reproducibility is thus a necessary property of real markers.

If the feature selection algorithm is able to find true markers, then the reproducibility issue is resolved as a consequence. To date, we mainly use the classification accuracy of the selected feature subset to facilitate marker identification. While many feature selection algorithms have been proposed, they do not necessarily identify the same candidate feature subsets if we repeat the biomarker discovery procedure (Yu et al., 2008). Even for the same data, one may find many different subsets of features (either from the same feature selection method or from different feature selection methods) that achieve the same or similar predictive accuracy (Ein-Dor et al., 2005, Michiels et al., 2005, Zucknick et al., 2008). If there is only one real marker, such an accuracy-based strategy obviously cannot distinguish the true marker from false ones.
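The point about many equally accurate subsets can be illustrated with a toy experiment (all names and numbers below are made up): two redundant features predict the label about equally well, so which one a simple accuracy-driven selector picks can depend on the particular subsample drawn.

```python
# A toy illustration of why accuracy alone cannot pin down "the" marker:
# feature A and feature B are near-duplicates of the same noisy signal, so
# an accuracy-driven selector may return either one depending on the sample.
import random

def accuracy_of_threshold(values, labels):
    """Best accuracy of a single-threshold rule on one feature."""
    best = 0.0
    for t in values:
        acc = sum((v >= t) == (y == 1) for v, y in zip(values, labels)) / len(labels)
        best = max(best, acc, 1 - acc)
    return best

random.seed(0)
labels = [1] * 20 + [0] * 20
feat_a = [y + random.gauss(0, 0.6) for y in labels]  # redundant copy 1
feat_b = [y + random.gauss(0, 0.6) for y in labels]  # redundant copy 2

selected = set()
for trial in range(30):
    idx = random.sample(range(40), 30)               # subsample the data
    sub_y = [labels[i] for i in idx]
    acc_a = accuracy_of_threshold([feat_a[i] for i in idx], sub_y)
    acc_b = accuracy_of_threshold([feat_b[i] for i in idx], sub_y)
    selected.add("A" if acc_a > acc_b else "B")

print(selected)  # which of the two redundant features won across subsamples
```

Across subsamples the winner can flip between the two redundant features, even though either one alone yields similar accuracy; accuracy by itself cannot tell them apart.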

The deficiency of using only predictive accuracy for marker discovery has motivated the search for additional assessment metrics. One such measure is the stability of feature selection results with respect to sampling variations. “Feature selection stability” is closely related to “marker reproducibility”. We discuss this relationship under two different conditions, on the premise that reported markers are sufficiently accurate.

  • If there is only one true marker, then perfect reproducibility is achieved if and only if this marker is selected. Likewise, obtaining perfect stability also requires selecting this ground-truth marker. That is, stability is a necessary condition for reproducibility in this case.

  • If there is more than one true marker, good stability also correlates with high reproducibility. Since high reproducibility implies high classification accuracy across different data variations, the real markers have a higher probability of being selected if we aim at improving stability.

In summary, stability is a good indicator of marker reproducibility. Good stability of feature selection is as important as high classification accuracy in biomarker discovery (Jurman et al., 2008). Instability of feature selection results reduces our confidence in the discovered markers.

The stability issue in feature selection has received much attention recently. In this article, we review existing methods for stable feature selection in biomarker discovery applications, summarize them within a unified framework, and provide a convenient reference for future research and development.

This article differs from existing review papers on feature selection in the following aspects:

  • Compared to current feature selection reviews (Saeys et al., 2007, Ma and Huang, 2008, Hilario and Kalousis, 2008, Duval and Hao, 2010), this review focuses only on those feature selection approaches that incorporate “stability” into the algorithmic design. Therefore, many methods such as information theory-based feature selection algorithms are not included in this paper.

  • This article mainly focuses on “methods” for finding reliable markers rather than “metrics” for measuring the stability of selected feature subsets (Boulesteix and Slawski, 2009), although we also list these metrics for completeness.

The remainder of the paper is organized as follows. In Section 2, we discuss several sources that cause the instability of feature selection. In Section 3, we summarize available stable feature selection algorithms and describe different classes of methods in detail. In Section 4, we provide a list of stability measures and illustrate their definitions. We give a discussion in Section 5. Finally, we conclude this paper in Section 6.

Section snippets

Causes of instability

There are mainly three sources of instability in biomarker discovery:

  1. Algorithm design without considering stability: Classic feature selection methods aim at selecting a minimum subset of features to construct a classifier of the best predictive accuracy (Yu et al., 2008). They often ignore “stability” in the algorithm design.

  2. The existence of multiple sets of true markers: It is possible that there exist multiple sets of potential true markers in real data. On the one hand, when there are many

Existing methods

To date, many methods are available for stable feature selection. We wish to cover all existing methods in a systematic and expandable manner. Fig. 1 illustrates our approach to summarizing different methods based on the way they treat different sources of instability. Briefly, the ensemble feature selection method and the method using prior feature relevance incorporate the stability consideration into the algorithm design stage. To handle data with highly correlated features, the group
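The ensemble idea mentioned above can be sketched as follows: run a base selector on many bootstrap samples and aggregate the runs by selection frequency. The base selector here (absolute correlation with the label) is an illustrative stand-in, not a method from the article.

```python
# A minimal sketch of ensemble feature selection: resample the data with the
# bootstrap, run a base selector on each sample, and aggregate by selection
# frequency. The base selector below is an illustrative stand-in.
import random

def base_select(X, y, k):
    """Rank features by |Pearson correlation| with the label, keep top k."""
    def corr(j):
        col = [row[j] for row in X]
        mx, my = sum(col) / len(col), sum(y) / len(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        vx = sum((a - mx) ** 2 for a in col) ** 0.5
        vy = sum((b - my) ** 2 for b in y) ** 0.5
        return abs(cov / (vx * vy)) if vx > 0 and vy > 0 else 0.0
    p = len(X[0])
    return sorted(range(p), key=lambda j: -corr(j))[:k]

def ensemble_select(X, y, k, n_boot=50, seed=0):
    """Aggregate bootstrap runs of base_select by selection frequency."""
    rng = random.Random(seed)
    counts = [0] * len(X[0])
    for _ in range(n_boot):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        for j in base_select([X[i] for i in idx], [y[i] for i in idx], k):
            counts[j] += 1
    return sorted(range(len(counts)), key=lambda j: -counts[j])[:k]
```

Aggregating over resampled runs tends to smooth out sample-specific selections, which is precisely the source of instability this family of methods targets.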

Stability measure

In stable feature selection, one important issue is how to measure the “stability” of feature selection algorithms, i.e., how to quantify the sensitivity of the selection to variations in the training set. The stability measure can be used in different contexts. On the one hand, it is indispensable for comparing the performance of different algorithms. On the other hand, it can be used for internal validation within feature selection algorithms that take stability into account.
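As one concrete example of such a measure, the sketch below computes the average pairwise Jaccard index between the feature subsets selected under different training-set variations; the article lists further alternatives.

```python
# One commonly used stability measure: the average pairwise Jaccard index
# between the feature subsets selected on different training-set variations.
# (A sketch; several alternative measures exist.)

def jaccard(a, b):
    """Overlap of two feature subsets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def stability(subsets):
    """Average Jaccard similarity over all pairs of selected subsets."""
    pairs = [(i, j) for i in range(len(subsets)) for j in range(i + 1, len(subsets))]
    if not pairs:
        return 1.0
    return sum(jaccard(subsets[i], subsets[j]) for i, j in pairs) / len(pairs)

# Identical subsets give stability 1.0; disjoint subsets give 0.0.
print(stability([{1, 2, 3}, {1, 2, 3}]))  # 1.0
print(stability([{1, 2}, {3, 4}]))        # 0.0
```

A value near 1 indicates the algorithm keeps selecting essentially the same features as the training set varies; a value near 0 indicates the selections barely overlap.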

Noticing that there

Discussions

We summarize three sources of instability for feature selection in Section 2. Among these sources, the small number of samples in a high-dimensional feature space is probably the most difficult one in biomarker discovery. Besides feature selection, other data analysis tasks also face the same challenge. Research progress in related fields will facilitate the development of effective stable feature selection methods as well.

Group feature selection is the most extensively studied method among

Conclusions

To discover reproducible markers from “omics” data, the stability issue of feature selection has received much attention recently. This review summarizes existing stable feature selection methods and stability measures. Stable feature selection is an important research problem from both theoretical and practical perspectives, and more research effort should be devoted to this challenging topic.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities of China (DUT10JR05 and DUT10ZD110), the general research fund 621707 from the Hong Kong Research Grant Council and the research proposal competition awards RPC07/08.EG25 and RPC10.EG04 from the Hong Kong University of Science and Technology.

References (83)

  • F. Azuaje et al., Computational biology for cardiovascular biomarker discovery, Briefings in Bioinformatics (2009)
  • F.R. Bach, Bolasso: model consistent lasso estimation through the bootstrap
  • S. Baek et al., Development of biomarker classifiers from high-dimensional data, Briefings in Bioinformatics (2009)
  • Biomarkers Definitions Working Group, Biomarkers and surrogate endpoints: preferred definitions and conceptual framework, Clinical Pharmacology and Therapeutics (2001)
  • A.L. Boulesteix et al., Stability and aggregation of ranked gene lists, Briefings in Bioinformatics (2009)
  • L. Breiman, Bagging predictors, Machine Learning (1996)
  • X. Chen et al., Integrating biological knowledge with gene expression profiles for survival prediction of cancer, Journal of Computational Biology (2009)
  • H.Y. Chuang et al., Network-based classification of breast cancer metastasis, Molecular Systems Biology (2007)
  • C.A. Davis et al., Reliable gene signatures for microarray classification: assessment of stability and performance, Bioinformatics (2006)
  • I. Dinu et al., Gene-set analysis and reduction, Briefings in Bioinformatics (2009)
  • K. Dunne et al., Solutions to instability problems with sequential wrapper-based approaches... (2002)
  • J. Dutkowski et al., On consensus biomarker selection, BMC Bioinformatics (2007)
  • B. Duval et al., Advances in metaheuristics for gene selection and classification of microarray data, Briefings in Bioinformatics (2010)
  • C. Dwork et al., Rank aggregation methods for the web
  • B. Efron et al., On testing the significance of sets of genes, Annals of Applied Statistics (2007)
  • L. Ein-Dor et al., Outcome signature genes in breast cancer: is there a unique set?, Bioinformatics (2005)
  • L. Ein-Dor et al., Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer, Proceedings of the National Academy of Sciences of the United States of America (2006)
  • Z. Guo et al., Towards precise classification of cancers based on robust gene functional expression profiles, BMC Bioinformatics (2005)
  • I. Guyon et al., Gene selection for cancer classification using support vector machines, Machine Learning (2002)
  • T. Hastie et al., Supervised harvesting of expression trees, Genome Biology (2001)
  • T. Helleputte et al., Feature selection by transfer learning with linear regularized models
  • T. Helleputte et al., Partially supervised feature selection with regularized linear models
  • M. Hilario et al., Approaches to dimensionality reduction in proteomic biomarker studies, Briefings in Bioinformatics (2008)
  • T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence (1998)
  • D. Huang et al., Effective gene selection method with small sample sets using gradient-based and point injection techniques, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2007)
  • T. Hwang et al., Identification of differentially expressed subnetworks based on multivariate ANOVA, BMC Bioinformatics (2009)
  • T. Hwang et al., Robust and efficient identification of biomarkers by classifying features on graphs, Bioinformatics (2008)
  • G. Jin et al., The knowledge-integrated network biomarkers discovery for major adverse cardiac events, Journal of Proteome Research (2008)
  • R. Jornsten et al., Simultaneous gene clustering and subset selection for sample classification via MDL, Bioinformatics (2003)
  • G. Jurman et al., Algebraic stability indicators for ranked lists in molecular profiling, Bioinformatics (2008)
  • A. Kalousis et al., Stability of feature selection algorithms: a study on high-dimensional spaces, Knowledge and Information Systems (2007)