TIGRE Project

TIGRE: Transcription factor Inference through Gaussian process Reconstruction of Expression

Overview

Our goal in this project is to develop new methods for inferring the parameters of mechanistic models of biological systems and to apply them to uncover the mechanisms of transcriptional regulation. We will achieve this by unifying two disparate approaches to network analysis: the systems biology approach of specifying differential equation models of transcription (see this book), and the statistical/machine learning approach of constructing probabilistic models of the data (see Related References below). Specifically, we infer the parameters of differential equation models by constructing probabilistic models that respect the relationships specified by the differential equations, combining information from gene expression time series data with differential equation models of transcriptional regulation. The advantage of using probabilistic models is that we can handle uncertainty in the model parameters simultaneously with experimental and biological noise.

Progress so far is:

New: We have an approach to ranking targets of a transcription factor through a linear differential equation model of transcription and translation (see DISIMRANK software below, and this PNAS paper).
New: The R translation of our GPSIM and DISIMRANK software has been accepted to Bioconductor as the TIGRE package.
We have an approach to inferring protein concentration given the gene expression of known targets in a single input motif. (see GPSIM software below, this NIPS paper and this ECCB paper).
We have shown how multi-output Gaussian processes can be computed efficiently using conditional independence assumptions (see this NIPS paper).
We have shown how sampling can be done efficiently using Gaussian processes even when the posterior over the GP is strongly correlated (see this NIPS paper).
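
As a rough illustration of the kind of model behind the single input motif work listed above, the sketch below draws a latent transcription factor activity f(t) from a Gaussian process prior and drives a linear differential equation for each target gene with it. The form dx/dt = B + S f(t) - D x(t) (baseline expression B, sensitivity S, mRNA decay rate D), the squared-exponential covariance, the forward Euler integrator, and all function names, gene names and parameter values are assumptions made for illustration; this is not code from GPSIM, DISIMRANK or the TIGRE package.

```python
import numpy as np


def rbf_kernel(t, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between all pairs of time points."""
    diff = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def sample_tf_activity(t, rng, lengthscale=2.0):
    """Draw one latent TF activity profile f(t) from a zero-mean GP prior."""
    # A small diagonal jitter keeps the Cholesky factorisation numerically stable.
    K = rbf_kernel(t, lengthscale) + 1e-6 * np.eye(len(t))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(t))


def simulate_mrna(t, f, basal, sensitivity, decay, x0=None):
    """Integrate dx/dt = basal + sensitivity * f(t) - decay * x(t) by forward Euler."""
    x = np.empty_like(t)
    # Default start: the steady state when the TF activity is zero.
    x[0] = basal / decay if x0 is None else x0
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        x[i] = x[i - 1] + dt * (basal + sensitivity * f[i - 1] - decay * x[i - 1])
    return x


rng = np.random.default_rng(0)
t = np.linspace(0.0, 12.0, 121)
f = sample_tf_activity(t, rng)

# Hypothetical targets: (basal rate B, sensitivity S, decay rate D) per gene.
targets = {"gene_a": (0.5, 1.0, 0.8), "gene_b": (0.2, 2.0, 0.3)}
profiles = {name: simulate_mrna(t, f, *params) for name, params in targets.items()}
```

Because every target shares the same latent f(t), differences between the simulated profiles come only from the per-gene parameters, which is what makes it possible in principle to reason about the unobserved activity from the expression of known targets.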

The project is sponsored by EPSRC Grant No EP/F005687/1 "Gaussian Processes for Systems Identification with Applications in Systems Biology" and is a collaboration with Dr Nick Monk of the University of Nottingham, Dr Johannes Jaeger of the University of Cambridge and Dr Antti Honkela of Helsinki University of Technology (visitor and collaborator).

Personnel at Manchester

Michalis Titsias, post-doctoral research assistant

Mauricio Alvarez, PhD student

Software

The following software has been made available either wholly or partly as a result of work on this project:

GPSIM: Gaussian Process Modelling of single input module motif networks.

MULTIGP: Modelling multiple outputs with Gaussian processes (will eventually supersede the gpsim toolbox).

DISIMRANK: Ranking potential targets using a driven input single input model motif.

Publications

The following publications have been produced as a result of this project.

Journal Papers

A. Honkela, C. Girardot, E. H. Gustafson, Y.-H. Liu, E. E. M. Furlong, N. D. Lawrence and M. Rattray. (2010) "A model-based method for transcription factor target identification with limited data" in Proc. Natl. Acad. Sci. USA [Software][DOI][Google Scholar Search]

Abstract

We present a computational method for identifying potential targets of a transcription factor (TF) using wild-type gene expression time series data. For each putative target gene we fit a simple differential equation model of transcriptional regulation, and the model likelihood serves as a score to rank targets. The expression profile of the TF is modeled as a sample from a Gaussian process prior distribution that is integrated out using a nonparametric Bayesian procedure. This results in a parsimonious model with relatively few parameters that can be applied to short time series datasets without noticeable overfitting. We assess our method using genome-wide chromatin immunoprecipitation (ChIP-chip) and loss-of-function mutant expression data for two TFs, Twist, and Mef2, controlling mesoderm development in Drosophila. Lists of top-ranked genes identified by our method are significantly enriched for genes close to bound regions identified in the ChIP-chip data and for genes that are differentially expressed in loss-of-function mutants. Targets of Twist display diverse expression profiles, and in this case a model-based approach performs significantly better than scoring based on correlation with TF expression. Our approach is found to be comparable or superior to ranking based on mutant differential expression scores. Also, we show how integrating complementary wild-type spatial expression data can further improve target ranking performance.

P. Gao, A. Honkela, M. Rattray and N. D. Lawrence. (2008) "Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities" in Bioinformatics 24, pp i70--i75 [Software][PDF][DOI][Google Scholar Search]

Abstract

Motivation: Inference of latent chemical species in biochemical interaction networks is a key problem in estimation of the structure and parameters of the genetic, metabolic and protein interaction networks that underpin all biological processes. We present a framework for Bayesian marginalisation of these latent chemical species through Gaussian process priors.

Results: We demonstrate our general approach on three different biological examples of single input motifs, including both activation and repression of transcription. We focus in particular on the problem of inferring transcription factor activity when the concentration of active protein cannot easily be measured. We show how the uncertainty in the inferred transcription factor activity can be integrated out in order to derive a likelihood function that can be used for the estimation of regulatory model parameters. An advantage of our approach is that we avoid the use of a coarse-grained discretization of continuous-time functions, which would lead to a large number of additional parameters to be estimated. We develop efficient exact and approximate inference schemes, which are much more efficient than competing sampling-based schemes and therefore provide us with a practical toolkit for model-based inference.

Availability: The software and data for recreating all the experiments in this paper are available in MATLAB from http://www.cs.man.ac.uk/%7Eneill/gpsim

Contact: Neil Lawrence

Book

N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti (eds) (2010) "Learning and inference in computational systems biology", MIT Press, Cambridge, MA.

Synopsis

Computational systems biology aims to develop algorithms that uncover the structure and parameterization of the underlying mechanistic model—in other words, to answer specific questions about the underlying mechanisms of a biological system—in a process that can be thought of as learning or inference. This volume offers state-of-the-art perspectives from computational biology, statistics, modeling, and machine learning on new methodologies for learning and inference in biological networks. The chapters offer practical approaches to biological inference problems ranging from genome-wide inference of genetic regulation to pathway-specific studies. Both deterministic models (based on ordinary differential equations) and stochastic models (which anticipate the increasing availability of data from small populations of cells) are considered. Several chapters emphasize Bayesian inference, so the editors have included an introduction to the philosophy of the Bayesian approach and an overview of current work on Bayesian inference. Taken together, the methods discussed by the experts in Learning and Inference in Computational Systems Biology provide a foundation upon which the next decade of research in systems biology can be built.

Book Chapters

N. D. Lawrence (2010) "Introduction to learning and inference in computational systems biology" in N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti (eds) Learning and Inference in Computational Systems Biology, MIT Press, Cambridge, MA. [MIT Press Site][Google Scholar Search]

N. D. Lawrence and M. Rattray. (2010) "A brief introduction to Bayesian inference" in N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti (eds) Learning and Inference in Computational Systems Biology, MIT Press, Cambridge, MA. [MIT Press Site][Google Scholar Search]

N. D. Lawrence, M. Rattray, P. Gao and M. K. Titsias. (2010) "Gaussian processes for missing species in biochemical systems" in N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti (eds) Learning and Inference in Computational Systems Biology, MIT Press, Cambridge, MA. [MIT Press Site][Google Scholar Search]

Conference Papers

M. A. Álvarez, D. Luengo, M. K. Titsias and N. D. Lawrence. (2010) "Efficient multioutput Gaussian processes through variational inducing kernels" in Y. W. Teh and D. M. Titterington (eds) Proceedings of the Thirteenth International Workshop on Artificial Intelligence and Statistics, JMLR W&CP 9, Chia Laguna Resort, Sardinia, Italy, pp 25--32. [Software][PDF][Google Scholar Search]

Abstract

Interest in multioutput kernel methods is increasing, whether under the guise of multitask learning, multisensor networks or structured output data. From the Gaussian process perspective a multioutput Mercer kernel is a covariance function over correlated output functions. One way of constructing such kernels is based on convolution processes (CP). A key problem for this approach is efficient inference. Álvarez and Lawrence [Alvarez:convolved08] recently presented a sparse approximation for CPs that enabled efficient inference. In this paper, we extend this work in two directions: we introduce the concept of variational inducing functions to handle potential non-smooth functions involved in the kernel CP construction and we consider an alternative approach to approximate inference based on variational methods, extending the work by Titsias [Titsias:variational09] to the multiple output case. We demonstrate our approaches on prediction of school marks, compiler performance and financial time series.

M. Álvarez, D. Luengo and N. D. Lawrence. (2009) "Latent force models" in D. van Dyk and M. Welling (eds) Proceedings of the Twelfth International Workshop on Artificial Intelligence and Statistics, JMLR W&CP 5, Clearwater Beach, FL, pp 9--16. [Software][PDF][Google Scholar Search]

Abstract

Purely data driven approaches for machine learning present difficulties when data is scarce relative to the complexity of the model or when the model is forced to extrapolate. On the other hand, purely mechanistic approaches need to identify and specify all the interactions in the problem at hand (which may not be feasible) and still leave the issue of how to parameterize the system. In this paper, we present a hybrid approach using Gaussian processes and differential equations to combine data driven modeling with a physical model of the system. We show how different, physically-inspired, kernel functions can be developed through sensible, simple, mechanistic assumptions about the underlying system. The versatility of our approach is illustrated with three case studies from computational biology, motion capture and geostatistics.

M. K. Titsias (2009) "Variational learning of inducing variables in sparse Gaussian processes" in D. van Dyk and M. Welling (eds) Proceedings of the Twelfth International Workshop on Artificial Intelligence and Statistics, JMLR W&CP 5, Clearwater Beach, FL, pp 567--574. [Google Scholar Search]

M. Álvarez and N. D. Lawrence. (2009) "Sparse convolved Gaussian processes for multi-output regression" in D. Koller, D. Schuurmans, Y. Bengio and L. Bottou (eds) Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, pp 57--64. [Software][PDF][Google Scholar Search]

Abstract

We present a sparse approximation approach for dependent output Gaussian processes (GP). Employing a latent function framework, we apply the convolution process formalism to establish dependencies between output variables, where each latent function is represented as a GP. Based on these latent functions, we establish an approximation scheme using a conditional independence assumption between the output processes, leading to an approximation of the full covariance which is determined by the locations at which the latent functions are evaluated. We show results of the proposed methodology for synthetic data and real world applications on pollution prediction and a sensor network.

M. K. Titsias, N. D. Lawrence and M. Rattray. (2009) "Efficient sampling for Gaussian process inference using control variables" in D. Koller, D. Schuurmans, Y. Bengio and L. Bottou (eds) Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, pp 1681--1688. [PDF][Google Scholar Search]

Abstract

Sampling functions in Gaussian process (GP) models is challenging because of the highly correlated posterior distribution. We describe an efficient Markov chain Monte Carlo algorithm for sampling from the posterior process of the GP model. This algorithm uses control variables which are auxiliary function values that provide a low dimensional representation of the function. At each iteration, the algorithm proposes new values for the control variables and generates the function from the conditional GP prior. The control variable input locations are found by continuously minimizing an objective function. We demonstrate the algorithm on regression and classification problems and we use it to estimate the parameters of a differential equation model of gene regulation.

N. D. Lawrence, G. Sanguinetti and M. Rattray. (2007) "Modelling transcriptional regulation using Gaussian processes" in B. Schölkopf, J. C. Platt and T. Hofmann (eds) Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, pp 785--792. [Errata][Software][Gzipped Postscript][PDF][Google Scholar Search]

Abstract

Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively easy to measure, such as mRNA decay rates and mRNA abundance levels, it is still very hard to measure the active concentration levels of the transcription factor proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor can be inferred from gene expression levels of a set of known target genes. We treat the protein concentration as a latent function with a Gaussian Process prior, and include the sensitivities, mRNA decay rates and baseline expression levels as hyperparameters. We apply this procedure to a human leukemia dataset, focusing on the tumour suppressor p53 and obtaining results in good accordance with recent biological studies.

Related References

G. Della Gatta, M. Bansal, A. Ambesi-Impiombato, D. Antonini, C. Missero and D. di Bernardo. (2008) "Direct targets of the trp63 transcription factor revealed by a combination of gene expression profiling and reverse engineering." in Genome Research 18 (6), pp 939--948 [DOI][Google Scholar Search]

Abstract

Genome-wide identification of bona-fide targets of transcription factors in mammalian cells is still a challenge. We present a novel integrated computational and experimental approach to identify direct targets of a transcription factor. This consists of measuring time-course (dynamic) gene expression profiles upon perturbation of the transcription factor under study, and in applying a novel "reverse-engineering" algorithm (TSNI) to rank genes according to their probability of being direct targets. Using primary keratinocytes as a model system, we identified novel transcriptional target genes of TRP63, a crucial regulator of skin development. TSNI-predicted TRP63 target genes were validated by Trp63 knockdown and by ChIP-chip to identify TRP63-bound regions in vivo. Our study revealed that short sampling times, in the order of minutes, are needed to capture the dynamics of gene expression in mammalian cells. We show that TRP63 transiently regulates a subset of its direct targets, thus highlighting the importance of considering temporal dynamics when identifying transcriptional targets. Using this approach, we uncovered a previously unsuspected transient regulation of the AP-1 complex by TRP63 through direct regulation of a subset of AP-1 components. The integrated experimental and computational approach described here is readily applicable to other transcription factors in mammalian systems and is complementary to genome-wide identification of transcription-factor binding sites.

U. Alon (2006) "An introduction to systems biology: design principles of biological circuits", Chapman and Hall/CRC, London.

G. Sanguinetti, M. Rattray and N. D. Lawrence. (2006) "A probabilistic dynamical model for quantitative inference of the regulatory mechanism of transcription" in Bioinformatics 22 (14), pp 1753--1759 [Software][PDF][DOI][Google Scholar Search]

Abstract

Motivation: Quantitative estimation of the regulatory relationship between transcription factors and genes is a fundamental stepping stone when trying to develop models of cellular processes. This task, however, is difficult for a number of reasons: transcription factors' expression levels are often low and noisy, and many transcription factors are post-transcriptionally regulated. It is therefore useful to infer the activity of the transcription factors from the expression levels of their target genes.

Results: We introduce a novel probabilistic model to infer transcription factor activities from microarray data when the structure of the regulatory network is known. The model is based on regression, retaining the computational efficiency to allow genome-wide investigation, but is rendered more flexible by sampling regression coefficients independently for each gene. This allows us to determine the strength with which a transcription factor regulates each of its target genes, therefore providing a quantitative description of the transcriptional regulatory network. The probabilistic nature of the model also means that we can associate credibility intervals to our estimates of the activities. We demonstrate our model on two yeast data sets. In both cases the network structure was obtained using Chromatin Immunoprecipitation data. We show how predictions from our model are consistent with the underlying biology and offer novel quantitative insights into the regulatory structure of the yeast cell.

Availability: MATLAB code is available from http://umber.sbs.man.ac.uk/resources/puma.

C. Sabatti and G. M. James. (2006) "Bayesian sparse hidden components analysis for transcription regulation networks" in Bioinformatics 22 (6), pp 739--746 [Software][Google Scholar Search]

Abstract

Motivation: In systems like Escherichia coli, the abundance of sequence information, gene expression array studies and small scale experiments allows one to reconstruct the regulatory network and to quantify the effects of transcription factors on gene expression. However, this goal can only be achieved if all information sources are used in concert.

Results: Our method integrates literature information, DNA sequences and expression arrays. A set of relevant transcription factors is defined on the basis of literature. Sequence data are used to identify potential target genes and the results are used to define a prior distribution on the topology of the regulatory network. A Bayesian hidden component model for the expression array data allows us to identify which of the potential binding sites are actually used by the regulatory proteins in the studied cell conditions, the strength of their control, and their activation profile in a series of experiments. We apply our methodology to 35 expression studies in E. coli with convincing results.

Availability: www.genetics.ucla.edu/labs/sabatti/software.html

Supplementary information: Supplementary material is available at Bioinformatics online.

Contact: csabatti@mednet.ucla.edu

G. Sanguinetti, N. D. Lawrence and M. Rattray. (2006) "Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities" in Bioinformatics 22 (22), pp 2275--2281 [Errata][Software][PDF][DOI][Google Scholar Search]

Abstract

Motivation: Quantitative estimation of the regulatory relationship between transcription factors and genes is a fundamental stepping stone when trying to develop models of cellular processes. Recent experimental high-throughput techniques such as Chromatin Immunoprecipitation provide important information about the architecture of the regulatory networks in the cell. However, it is very difficult to measure the concentration levels of transcription factor proteins and determine their regulatory effect on gene transcription. It is therefore an important computational challenge to infer these quantities using gene expression data and network architecture data.

Results: We develop a probabilistic state space model that allows genome-wide inference of both transcription factor protein concentrations and their effect on the transcription rates of each target gene from microarray data. We use variational inference techniques to learn the model parameters and perform posterior inference of protein concentrations and regulatory strengths. The probabilistic nature of the model also means that we can associate credibility intervals to our estimates, as well as providing a tool to detect which binding events lead to significant regulation. We demonstrate our model on artificial data and on two yeast data sets in which the network structure has previously been obtained using Chromatin Immunoprecipitation data. Predictions from our model are consistent with the underlying biology and offer novel quantitative insights into the regulatory structure of the yeast cell.

Availability: MATLAB code is available from http://umber.sbs.man.ac.uk/resources/puma.

I. Nachman, A. Regev and N. Friedman. (2004) "Inferring quantitative models of regulatory networks from expression data" in Bioinformatics 20 (Suppl. 1), pp 248--256 [Google Scholar Search]

Abstract

Motivation: Genetic networks regulate key processes in living cells. Various methods have been suggested to reconstruct network architecture from gene expression data. However, most approaches are based on qualitative models that provide only rough approximations of the underlying events, and lack the quantitative aspects that are critical for understanding the proper function of biomolecular systems.

Results: We present fine-grained dynamical models of gene transcription and develop methods for reconstructing them from gene expression data within the framework of a generative probabilistic model. Unlike previous works, we employ quantitative transcription rates, and simultaneously estimate both the kinetic parameters that govern these rates, and the activity levels of unobserved regulators that control them. We apply our approach to expression data sets from yeast and show that we can learn the unknown regulator activity profiles, as well as the binding affinity parameters. We also introduce a novel structure learning algorithm, and demonstrate its power to accurately reconstruct the regulatory network from those data sets.

Keywords: transcription regulation, parameter learning, structure learning, regulatory networks

Contact: nir@cs.huji.ac.il