Elsevier

Journal of Theoretical Biology

Volume 362, 7 December 2014, Pages 53-61

Review on statistical methods for gene network reconstruction using expression data

https://doi.org/10.1016/j.jtbi.2014.03.040

Highlights

  • We review statistical methods for reconstructing gene regulatory networks.

  • We discuss statistical and computational challenges in modeling gene interactions.

  • For each method, we compare its modeling paradigm and the data types it requires.

Abstract

Network modeling has proven to be a fundamental tool in analyzing the inner workings of a cell. It has revolutionized our understanding of biological processes and made significant contributions to the discovery of disease biomarkers. Much effort has been devoted to reconstructing various types of biochemical networks using functional genomic datasets generated by high-throughput technologies. This paper discusses statistical methods used to reconstruct gene regulatory networks from gene expression data. In particular, we highlight the progress made and the challenges yet to be met in estimating gene interactions, inferring causality and modeling temporal changes in regulatory behavior. As rapid advances in technology have made diverse, large-scale genomic data available, we also survey methods for incorporating these additional data to achieve more accurate inference of gene networks.

Introduction

From ecological networks describing biotic interactions of different species to intricate biochemical networks modeling actions of molecules at a cellular level, the study of biology has seen a fast-expanding effort to analyze individual biological components in the context of large-scale, complex systems with interacting constituents. Most notably, rapid advances in genomic technology have generated an enormous wealth of data on which mathematical and statistical tools can be applied to infer qualitative and quantitative relationships between DNA, RNA, proteins and other cellular molecules. Such a process of reconstructing biochemical networks using genomic data, also known as network inference or reverse engineering, has helped to elucidate the nature of complex biological processes and disease mechanisms in a variety of organisms, bringing us one step closer to understanding how genetic blueprints combined with non-genetic, environmental factors influence the characteristics of a living system. In particular, comprehending the associations between genotypic and phenotypic characteristics has important ramifications in pathological studies for explaining disease pathways and identifying biomarkers for prognosis and diagnosis.

At a high level, genes, proteins and metabolites can be conceptualized as nodes and their interactions as edges in a graph. In metabolic networks, reactions are represented as directed edges pointing from reaction substrates to products. While metabolic networks tend to focus on proteins or protein complexes functioning as enzymes, general protein–protein interaction (PPI) networks are undirected graphs in which an edge indicates physical binding between two proteins.
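As an illustration of this graph view, the sketch below stores a directed and an undirected network as adjacency sets. It is a minimal example rather than part of any reviewed method; the function names `build_directed` and `build_undirected` and the toy node labels (the short reaction chain and the proteins `P1`, `P2`, `P3`) are assumptions introduced here.

```python
def build_directed(edges):
    """Adjacency sets for a directed graph: source -> set of targets."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set())  # register nodes that only appear as targets
    return adj


def build_undirected(edges):
    """Adjacency sets for an undirected graph: each edge is stored both ways."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


# Toy metabolic chain (illustrative only): edges point from substrate to product.
metabolic = build_directed([("glucose", "G6P"), ("G6P", "F6P")])

# Toy PPI pairs (hypothetical proteins): an edge means physical binding.
ppi = build_undirected([("P1", "P2"), ("P2", "P3")])
```

In the directed case the last product has no outgoing edges, whereas in the undirected case each binding partner lists the other.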

At a more fundamental level, understanding biological processes requires understanding gene regulatory networks, since all proteins are encoded by genes. In such a network, transcription factors (TFs), RNA and other small molecules act as regulators that activate or repress the expression of genes. Gene interactions can thus take the form of direct physical binding of proteins (TFs) to their target sequences but, in a broader sense, also include indirect interactions in which the expression of one gene influences the expression of others through one or more intermediaries. Although experimental evidence can be gathered to search for and verify gene interactions, computational tools utilizing gene expression data offer a much more time- and cost-efficient way to reconstruct these networks. In the past decade, high-quality gene expression data have become readily available in the form of microarray or RNA-seq data.

The idea of modeling the aforementioned biochemical processes as networks is conceptually appealing, as many biologically interesting questions find their counterparts in graph theory. For example, many biochemical networks demonstrate a high clustering coefficient (Barabási and Oltvai, 2004), indicative of a scale-free topology with a few highly connected nodes, known as hubs. Comparisons with generative network models that give rise to such a topology can help to explain the evolution of organisms at a cellular level. Another important architectural feature of these networks is the presence of modules, in which a number of nodes form a densely connected community with sparser connections to the rest of the network. Community or module detection is of great importance in analyzing biochemical networks, since identifying groups of molecules that perform a specific cellular function is a key issue in systems biology. In a PPI network, highly connected nodes are often proteins interacting as part of a complex or other functional module; such modules are fundamental to cellular function and have been shown to play an important role in disease pathologies (Lim et al., 2006; Soler-López et al., 2011). In gene networks, gene modules are likely to have related biological functions or to participate in the same biological pathway.
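To make the graph-theoretic quantities mentioned above concrete, the following sketch computes node degree and the local clustering coefficient (the fraction of a node's neighbor pairs that are themselves connected) on a small undirected graph. The toy graph and the helper names `degree` and `local_clustering` are assumptions made for illustration; real analyses would typically rely on a dedicated graph library.

```python
from itertools import combinations


def degree(adj, node):
    """Number of neighbors of `node` in an undirected adjacency-set graph."""
    return len(adj[node])


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are linked to each other."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: undefined cases are reported as 0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Toy graph (hypothetical): a triangle a-b-c, with 'a' also linked to d and e.
toy = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": {"a"},
}

# The node with the largest degree is the hub candidate in this toy graph.
hub = max(toy, key=lambda n: degree(toy, n))
```

In this toy graph the highest-degree node has a low local clustering coefficient because most of its neighbors are not connected to each other, while the nodes inside the triangle have the maximum value of 1.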

In this paper we review methods for the reconstruction of biological networks, with an emphasis on gene networks. In reality, the relationships between genes are directional in nature and can change over time or in response to external stimuli. Therefore, when modeling gene networks, a researcher is faced with the choice of whether to include extra features such as causality and temporal behavior in the model. This choice of modeling paradigm depends largely on the type and quality of data available, the biological questions to be addressed, and statistical and computational considerations. In Section 2, we focus on methods used to reconstruct static gene networks, highlighting the statistical and computational challenges in inferring undirected or directed network edges and in identifying tightly connected communities as potential functional groups. In Section 3, we discuss methods that model temporal changes of gene regulation in a dynamic network. In Section 4, we expand the data types under consideration from gene expression to other types of genomic data. We survey some of the methods available for integrating this additional information, and discuss the connection between biochemical networks and disease biomarkers.

Section snippets

Inferring undirected gene association networks

Gene expression data take the form of a matrix with p genes arranged in rows and their expression levels measured under n experimental conditions in columns. A typical feature of this type of data is its high dimensionality, with p much larger than n, which poses many estimation and computational challenges. Most methods for inferring edges in gene networks are based on a notion of similarity or a coexpression measure. Coexpression is one of the earliest tools used to infer edges in a gene network and
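As a minimal sketch of coexpression-based edge inference, the example below connects two genes whenever the absolute Pearson correlation of their expression profiles exceeds a threshold. Pearson correlation with a hard threshold is only one possible choice of similarity measure and is assumed here for illustration; the gene names, expression values, threshold of 0.9 and function names are likewise hypothetical. The toy matrix has p = 4 genes and n = 4 conditions, whereas real data typically have p much larger than n.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: correlation undefined, treated as no association
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(expr, threshold):
    """Return undirected edges (g, h) with |correlation| >= threshold.

    `expr` maps a gene name to its list of expression values across conditions.
    """
    genes = sorted(expr)
    edges = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expr[g], expr[h])) >= threshold:
                edges.add((g, h))
    return edges


# Hypothetical expression matrix: rows are genes, columns are conditions.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],    # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],    # falls as geneA rises
    "geneD": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with the others
}
edges = coexpression_edges(expr, threshold=0.9)
```

Using the absolute value of the correlation links both positively and negatively associated genes, which yields an unsigned, undirected association network.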

Dynamic gene networks

The types of gene networks discussed so far have all been static, describing only the network topologies and qualitative features of gene relationships. They do not capture the dynamic nature of real networks and cannot yield quantitative predictions of gene behaviors.

The Boolean network is one of the earliest dynamic models proposed (Kauffman, 1969). It simplifies regulation dynamics as a directed graph, where each node is a binary variable and its change of state between consecutive time points
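A Boolean network can be sketched in a few lines. The example below assumes the common formulation in which each node's next state is a Boolean function of the current states of its regulators and all nodes are updated synchronously; the three-gene circuit, its rules and the function names `step` and `simulate` are hypothetical and are not taken from any reviewed study.

```python
def step(state, rules):
    """One synchronous update: every node is recomputed from the current state."""
    return {node: bool(rule(state)) for node, rule in rules.items()}


def simulate(state, rules, n_steps):
    """Return the list of states visited, starting from `state`."""
    trajectory = [dict(state)]
    for _ in range(n_steps):
        state = step(state, rules)
        trajectory.append(state)
    return trajectory


# Hypothetical three-gene circuit; each rule reads only the current state.
rules = {
    "g1": lambda s: not s["g3"],          # g3 represses g1
    "g2": lambda s: s["g1"],              # g1 activates g2
    "g3": lambda s: s["g1"] and s["g2"],  # g3 needs both g1 and g2
}

start = {"g1": True, "g2": False, "g3": False}
traj = simulate(start, rules, n_steps=5)
```

With these toy rules the system returns to its starting state after five updates, a simple example of the cyclic attractors that such models can produce.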

Network reconstruction beyond a single data type

Decades of genomic research have fueled the development of numerous experimental and computational techniques and led to the curation of a large number of databases, such as TRANSFAC (Wingender et al., 2001), KEGG (Okuda et al., 2008), DAVID (Dennis et al., 2003), Cytoscape (Kohl et al., 2011) and NCBI GEO (Barrett et al., 2009). These databases have compiled large amounts of information on gene expression profiles, TF binding motifs, SNP data, PPIs and other biochemical interactions.

Conclusion

Statistical methods for network reconstruction were reviewed, with the main focus on those applicable to gene expression data. When inferring an undirected network, the key issues include: (i) the selection of an appropriate coexpression measure and (ii) the selection of a community detection method for identifying gene functional groups. As discussed in Section 2.1, choosing an effective coexpression measure depends on the nature of the gene interactions one wants to capture. The latter

Acknowledgement

This work is partly supported by NIH grant U01HG007031 and NSF grant DMS-1160319.

References (148)

  • U. Alon et al.

    Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays

    Proc. Natl. Acad. Sci.

    (1999)
  • P. Aloy et al.

    InterPreTS: protein interaction prediction through tertiary structure

    Bioinformatics

    (2003)
  • A.A. Amini et al.

    Pseudo-likelihood methods for community detection in large sparse networks

    Ann. Stat.

    (2013)
  • F. Azuaje

    Bioinformatics and Biomarker Discovery: “Omic” Data Analysis for Personalized Medicine

    (2010)
  • Z. Bar-Joseph et al.

    Computational discovery of gene modules and regulatory networks

    Nat. Biotechnol.

    (2003)
  • Z. Bar-Joseph et al.

    Studying and modelling dynamic biological processes using time-series gene expression data

    Nat. Rev. Genetics

    (2012)
  • A.L. Barabási et al.

    Network biology: understanding the cell's functional organization

    Nat. Rev. Genetics

    (2004)
  • T. Barrett et al.

    NCBI GEO: archive for high-throughput functional genomic data

    Nucl. Acids Res.

    (2009)
  • K. Basso et al.

    Reverse engineering of regulatory networks in human B cells

    Nat. Genetics

    (2005)
  • S.A. Becker et al.

    Context-specific metabolic networks are consistent with experiments

    PLoS Comput. Biol.

    (2008)
  • A. Ben-Dor et al.

    Clustering gene expression patterns

    J. Comput. Biol.

    (1999)
  • D.R. Bickel

    Probabilities of spurious connections in gene networks: application to expression time series

    Bioinformatics

    (2005)
  • P. Bickel et al.

    A nonparametric view of network models and Newman–Girvan and other modularities

    Proc. Natl. Acad. Sci.

    (2009)
  • P.J. Bickel et al.

    Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels

    Ann. Stat.

    (2013)
  • K. Bleakley et al.

    Supervised reconstruction of biological networks with local models

    Bioinformatics

    (2007)
  • R. Bonneau et al.

    The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo

    Genome Biol.

    (2006)
  • Butte, A.J., Kohane, I.S., 2000. Mutual information relevance networks: functional genomic clustering using pairwise...
  • L. Cai et al.

    Clustering analysis of SAGE data using a Poisson approach

    Genome Biol.

    (2004)
  • S.L. Carter et al.

    Gene co-expression network topology provides a framework for molecular characterization of cellular state

    Bioinformatics

    (2004)
  • A. Celisse et al.

    Consistency of maximum-likelihood and variational estimators in the stochastic block model

    Electron. J. Stat.

    (2012)
  • L. Cerulo et al.

    Learning gene regulatory networks from only positive and unlabeled data

    BMC Bioinformat.

    (2010)
  • A. Channarond et al.

    Classification and estimation in the stochastic block model based on the empirical degrees

    Electron. J. Stat.

    (2012)
  • K.C. Chen et al.

    A stochastic differential equation model for quantifying transcriptional regulatory network in Saccharomyces cerevisiae

    Bioinformatics

    (2005)
  • Y. Cheng et al.

    Biclustering of expression data

    Int. Conf. Intell. Syst. Mol. Biol.

    (2000)
  • G.F. Cooper et al.

    A Bayesian method for the induction of probabilistic networks from data

    Mach. Learn.

    (1992)
  • C.O. Daub et al.

    Estimating mutual information using B-spline functions — an improved similarity measure for analysing gene expression data

    BMC Bioinformat.

    (2004)
  • J.J. Daudin et al.

    A mixture model for random graphs

    Stat. Comput.

    (2008)
  • A. De La Fuente et al.

    Discovery of meaningful associations in genomic data using partial correlation coefficients

    Bioinformatics

    (2004)
  • R. De Smet et al.

    Advantages and limitations of current network inference methods

    Nat. Rev. Microbiol.

    (2010)
  • A. Decelle et al.

    Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications

    Phys. Rev. E

    (2011)
  • D. Dembélé et al.

    Fuzzy C-means method for clustering microarray data

    Bioinformatics

    (2003)
  • G.J. Dennis et al.

    DAVID: database for annotation, visualization, and integrated discovery

    Genome Biol.

    (2003)
  • P. D'haeseleer et al.

    Genetic network inference: from co-expression clustering to reverse engineering

    Bioinformatics

    (2000)
  • D. di Bernardo et al.

    Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks

    Nat. Biotechnol.

    (2005)
  • D.I. Edwards

    Introduction to Graphical Modelling

    (2000)
  • M.B. Eisen et al.

    Cluster analysis and display of genome-wide expression patterns

    Proc. Natl. Acad. Sci.

    (1998)
  • J. Ernst et al.

    A semi-supervised method for predicting transcription factor-gene interactions in Escherichia coli

    PLoS Comput. Biol.

    (2008)
  • A.M. Feist et al.

    Reconstruction of biochemical networks in microorganisms

    Nat. Rev. Microbiol.

    (2009)
  • V. Filkov et al.

    Analysis techniques for microarray time-series data

    J. Comput. Biol.

    (2002)
  • D. Fishkind et al.

    Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown

    SIAM J. Matrix Anal. Appl.

    (2013)