Review on statistical methods for gene network reconstruction using expression data
Introduction
From ecological networks describing biotic interactions of different species to intricate biochemical networks modeling actions of molecules at a cellular level, the study of biology has seen a fast-expanding effort to analyze individual biological components in the context of large-scale, complex systems with interacting constituents. Most notably, rapid advances in genomic technology have generated an enormous wealth of data on which mathematical and statistical tools can be applied to infer qualitative and quantitative relationships between DNA, RNA, proteins and other cellular molecules. Such a process of reconstructing biochemical networks using genomic data, also known as network inference or reverse engineering, has helped to elucidate the nature of complex biological processes and disease mechanisms in a variety of organisms, bringing us one step closer to understanding how genetic blueprints combined with non-genetic, environmental factors influence the characteristics of a living system. In particular, comprehending the associations between genotypic and phenotypic characteristics has important ramifications in pathological studies for explaining disease pathways and identifying biomarkers for prognosis and diagnosis.
At a high level, genes, proteins and metabolites can be conceptualized as nodes and their interactions as edges in a graph. In metabolic networks, reactions are represented as directed edges pointing from reaction substrates to products. While metabolic networks tend to focus on proteins or protein complexes functioning as enzymes, general protein–protein interaction (PPI) networks are undirected graphs where an edge indicates physical binding between two proteins.
At a more fundamental level, understanding biological processes requires understanding gene regulatory networks, since all proteins are encoded by genes. In such a network, transcription factors (TFs), RNA and other small molecules act as regulators that activate or repress the expression of genes. Gene interactions can thus take the form of direct physical binding of proteins (TFs) to their target sequences, but in a broader sense they also include indirect interactions, in which the expression of one gene influences the expression of others through one or more intermediaries. Although experimental evidence can be gathered to search for and verify gene interactions, computational tools that use gene expression data offer a much more time- and cost-efficient way to reconstruct these networks. In the past decade, high-quality gene expression data have become readily available in the form of microarray or RNA-seq data.
The idea of modeling the aforementioned biochemical processes as networks is conceptually appealing, as many biologically interesting questions find their counterparts in graph theory. For example, many biochemical networks demonstrate a high clustering coefficient (Barabási and Oltvai, 2004), indicative of a scale-free topology with a few highly connected nodes, known as hubs. Comparisons with generative network models that give rise to such a topology can help to explain the evolution of organisms at a cellular level. Another important architectural feature of these networks is the module, in which a number of nodes form a densely connected community that has sparser connections with the rest of the network. Community or module detection is of great importance in analyzing biochemical networks, since identifying groups of molecules that perform a specific cellular function is a key issue in systems biology. In a PPI network, highly connected nodes are often proteins interacting as part of a complex or another functional module; such modules are fundamental to cellular function and have been shown to play an important role in disease pathologies (Lim et al., 2006, Soler-López et al., 2011). In gene networks, genes in the same module are likely to have related biological functions or to participate in the same biological pathway.
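The graph quantities mentioned above (node degree, hubs and the clustering coefficient) can be illustrated with a minimal sketch. The toy network, the node names and the helper functions `build_graph` and `local_clustering` are hypothetical and chosen only for illustration; they are not taken from the reviewed methods.

```python
from itertools import combinations

def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def local_clustering(adj, node):
    """Fraction of pairs of a node's neighbors that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: undefined cases are reported as 0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy network: a triangle module (A, B, C) plus a hub H
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("H", "A"), ("H", "D"), ("H", "E"), ("H", "F")]
adj = build_graph(edges)

# Degree of each node; the node with the largest degree is the hub
degrees = {node: len(nbrs) for node, nbrs in adj.items()}
hub = max(degrees, key=degrees.get)
```

In this toy example the hub has many neighbors that are not connected to one another, whereas nodes inside the triangle have neighbors that are, which is the contrast between hubs and densely connected modules.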
In this paper we review methods for the reconstruction of biological networks, with an emphasis on gene networks. In reality, the relationships between genes are directional in nature, and they can change over time or in response to external stimuli. Therefore, when modeling gene networks, a researcher must choose whether to include extra features such as causality and temporal behavior in the model. This choice of modeling paradigm depends largely on the type and quality of the data available, the biological questions to be addressed, and statistical and computational considerations. In Section 2, we focus on methods used to reconstruct static gene networks, highlighting the statistical and computational challenges in inferring undirected or directed network edges and in identifying tightly connected communities as potential functional groups. In Section 3, we discuss methods that model temporal changes of gene regulation in a dynamic network. In Section 4, we expand the data type under consideration from gene expression to other types of genomic data. We survey some of the methods available for integrating this additional information, and discuss the connection between biochemical networks and disease biomarkers.
Inferring undirected gene association networks
Gene expression data take the form of a matrix with p genes arranged in rows and their expression levels measured under n experimental conditions in columns. A typical feature of this type of data is its high dimensionality, with p much larger than n, which poses many estimation and computation challenges. Most methods for inferring edges in gene networks are based on a notion of similarity or coexpression measure. Coexpression is one of the earliest tools used to infer edges in a gene network and
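As a minimal sketch of the coexpression idea, the example below treats each gene as a row of expression values across conditions and connects two genes when the absolute value of their similarity exceeds a cutoff. Pearson correlation as the similarity measure, the hard threshold of 0.8, the toy data and the function names are all assumptions made for illustration; the review does not prescribe them.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(expr, threshold=0.8):
    """Return (gene, gene, r) for every pair with |r| >= threshold."""
    genes = sorted(expr)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(expr[g], expr[h])
            if abs(r) >= threshold:
                edges.append((g, h, r))
    return edges

# Hypothetical toy matrix: p = 4 genes (rows), n = 4 conditions (columns)
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],    # scaled copy of g1
    "g3": [4.0, 3.0, 2.0, 1.0],    # mirror image of g1
    "g4": [1.0, -1.0, -1.0, 1.0],  # no linear relation to g1
}
edges = coexpression_edges(expr)
```

Using the absolute value keeps both positively and negatively correlated pairs. With real data, where p is much larger than n, the pairwise loop grows quadratically in p and the threshold would need to be chosen with care.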
Dynamic gene networks
The types of gene networks discussed so far have all been static, describing only the network topologies and qualitative features of gene relationships. They do not capture the dynamic nature of real networks and cannot yield quantitative predictions of gene behaviors.
The Boolean network is one of the earliest dynamic models proposed (Kauffman, 1969). It simplifies regulation dynamics as a directed graph, where each node is a binary variable and its change of state between consecutive time points
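A Boolean network of this kind can be sketched in a few lines. The three-gene rule set below, the gene names and the choice of a synchronous update (all genes updated at once from the previous state) are hypothetical assumptions for illustration, not a model from the literature.

```python
def step(state, rules):
    """Synchronous update: every gene's next state is computed from the current state."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}

def simulate(state, rules, n_steps):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [dict(state)]
    for _ in range(n_steps):
        state = step(state, rules)
        trajectory.append(state)
    return trajectory

# Hypothetical regulatory logic: each rule maps the current state to a gene's next value
rules = {
    "A": lambda s: not s["C"],         # C represses A
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # A and B jointly activate C
}

initial = {"A": True, "B": False, "C": False}
trajectory = simulate(initial, rules, 5)
```

Because the state space is finite and the update is deterministic, any trajectory must eventually revisit a state and enter a cycle; with these toy rules the system returns to its initial state after five steps.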
Network reconstruction beyond a single data type
Decades of genomic research have fueled the development of numerous experimental and computational techniques and led to the curation of a large number of databases, such as TRANSFAC (Wingender et al., 2001), KEGG (Okuda et al., 2008), DAVID (Dennis et al., 2003), Cytoscape (Kohl et al., 2011) and NCBI GEO (Barrett et al., 2009). These databases have compiled large amounts of information on gene expression profiles, TF binding motifs, SNP data, PPIs and other biochemical interactions.
Conclusion
Statistical methods for network reconstruction were reviewed, with the main focus on those applicable to gene expression data. When inferring an undirected network, the key issues include: (i) the selection of an appropriate coexpression measure and (ii) the selection of a community detection method for identifying gene functional groups. As discussed in Section 2.1, choosing an effective coexpression measure depends on the nature of the gene interactions one wants to capture. The latter
Acknowledgement
This work is partly supported by NIH grant U01HG007031 and NSF grant DMS-1160319.
References (148)
- et al. Identifying gene regulatory networks from experimental data. Parallel Comput. (2001)
- et al. Whole-genome metabolic network reconstruction and constraint-based modeling. Meth. Enzymol. (2011)
- et al. Stochastic blockmodels: first steps. Soc. Netw. (1983)
- Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theoret. Biol. (1969)
- et al. Techniques for clustering gene expression data. Comput. Biol. Med. (2008)
- et al. Measuring similarities between gene expression profiles through new data transformations. BMC Bioinformat. (2007)
- et al. Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Biosystems (2004)
- et al. A protein–protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration. Cell (2006)
- et al. Mixed membership stochastic blockmodels. J. Mach. Learn. Res. (2008)
(2008) - Akutsu, T., Miyano, S., Kuhara, S., 1999. Identification of genetic networks from a small number of gene expression...