Ondex SABR project

From data to knowledge – the Ondex System for integrating Life Sciences data sources

Watch a short video of Prof Chris Rawlings explaining the project at ISMB 2009: http://www.scivee.tv/node/12026

A systems approach to biological research (SABR) requires the integration and analysis of heterogeneous experimental data, which are stored in hundreds of life-science databases and millions of scientific publications. The Ondex SABR project addresses this problem. It is funded by the Biotechnology and Biological Sciences Research Council (BBSRC) for three years, starting on 1 April 2008, under the SABR Systems Biology initiative as an e-tool project to support systems biology research. The project software will be validated by providing direct support to three research challenges from systems biology (see Biological applications for more details):

  • Identifying new genetic and molecular targets to improve bioenergy crops;
  • Integration, augmentation and validation of yeast metabolome models; and
  • Supporting research into the role of telomere function in ageing.

In addition, other biological projects will be supported through existing collaborations and by developing new ones through outreach activities, including:

  • The development of quantitative models of root development (CPIB Nottingham);
  • Integrating data to support circadian clock modelling (CSBE, Edinburgh);
  • A computational pipeline to provide annotations for the Vibrio salmonicida genome and to support bioprospecting of environmental genomics projects to identify novel cold-adapted proteins (University of Tromsø, Norway); and
  • Other systems biology applications identified during the project.

All these biological application cases share a requirement to integrate a wide variety of datasets, and this project has been established to demonstrate that they can be supported using the proposed developments to the Ondex system. These developments build on and combine four major established components from leading experts:

  • Database integration

    A prototype Ondex software framework enables data from diverse biological data sets to be linked, integrated and visualised through graph analysis techniques (http://ondex.sf.net/). Ondex uses a semantically rich core data structure, has explicit support for workflows and can bring together information from structured databases and unstructured sources such as biological sequence data and free text. The prototype has already been successfully applied to a range of biological problems and forms the core of the project.

  • Workflows

    A prerequisite to data integration and analysis is the collection and pre-processing of relevant data sources. Ondex will use Taverna (http://taverna.sourceforge.net/) to combine the data collection and pre-processing steps with the data integration, text mining and analysis steps into workflows that store the data in the same semantically rich, graph-based data structure. In addition, mapping all data into a common core structure improves searchability. Taverna is a popular workflow workbench from the myGrid project (http://www.mygrid.org.uk), now absorbed into OMII-UK, and is widely used in the bioinformatics community (1,500 downloads per month).

  • Graph based analysis

    Graph-based analysis and visualisation methods can be used to extract meaning and new hypotheses from the integrated data. Biological data such as metabolic pathways and protein interactions are best viewed as a network or graph. Ondex is built on such a structure, which is one of the reasons it is well suited to the integrated analysis of biological systems.

  • Text mining

    Text mining, here based on the NaCTeM toolkit (http://www.nactem.ac.uk/), offers the possibility of extracting precise facts from the literature and of finding interesting associations among disparate facts, leading to the discovery of new or unsuspected knowledge.


Ondex data integration platform

The Ondex system stores data as a graph of concepts and relations. Concepts represent data entities, and relations link these entities together. Additional semantic annotation is provided by concept classes, relation types, evidence and controlled vocabularies. Data is imported by data-source-specific parsers, and mapping methods then create new relations between concepts. Local and global consistency checks are performed. Data integration can be configured and executed using web services via Taverna (http://www.mygrid.org.uk/tools/taverna/). The Ondex system is open source and written in Java.
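To make the data model above more concrete, the following is a minimal sketch in Java of a graph of concepts and relations with one simple mapping method. It is not the Ondex API: the class names (`OndexGraphSketch`, `Concept`, `Relation`), the accession-based mapping rule, and the labels `"equ"` and `"ACC"` are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified graph model; NOT the real Ondex API. */
final class OndexGraphSketch {

    /** A data entity annotated with a concept class and its data source. */
    static final class Concept {
        final String id;
        final String conceptClass; // e.g. "Protein" (illustrative value)
        final String dataSource;   // name of the database the parser read it from
        final String accession;    // identifier used here as the mapping key

        Concept(String id, String conceptClass, String dataSource, String accession) {
            this.id = id;
            this.conceptClass = conceptClass;
            this.dataSource = dataSource;
            this.accession = accession;
        }
    }

    /** A typed link between two concepts, with the evidence that supports it. */
    static final class Relation {
        final Concept from;
        final Concept to;
        final String relationType; // "equ" is an assumed label for equivalence
        final String evidence;     // "ACC" is an assumed label for accession matching

        Relation(Concept from, Concept to, String relationType, String evidence) {
            this.from = from;
            this.to = to;
            this.relationType = relationType;
            this.evidence = evidence;
        }
    }

    final List<Concept> concepts = new ArrayList<>();
    final List<Relation> relations = new ArrayList<>();

    /** Stands in for a parser adding one imported entity to the graph. */
    Concept addConcept(String id, String conceptClass, String dataSource, String accession) {
        Concept c = new Concept(id, conceptClass, dataSource, accession);
        concepts.add(c);
        return c;
    }

    /**
     * Assumed mapping rule: link concepts of the same concept class that come
     * from different data sources and share an accession. Returns the number
     * of relations created.
     */
    int mapByAccession() {
        Map<String, List<Concept>> seenByKey = new HashMap<>();
        int created = 0;
        for (Concept c : concepts) {
            String key = c.conceptClass + "|" + c.accession;
            List<Concept> seen = seenByKey.computeIfAbsent(key, k -> new ArrayList<>());
            for (Concept other : seen) {
                if (!other.dataSource.equals(c.dataSource)) {
                    relations.add(new Relation(other, c, "equ", "ACC"));
                    created++;
                }
            }
            seen.add(c);
        }
        return created;
    }

    public static void main(String[] args) {
        OndexGraphSketch graph = new OndexGraphSketch();
        graph.addConcept("c1", "Protein", "UNIPROT", "P12345");
        graph.addConcept("c2", "Protein", "KEGG", "P12345");
        graph.addConcept("c3", "Gene", "KEGG", "P12345");
        System.out.println(graph.mapByAccession() + " relation(s) created");
    }
}
```

In this sketch only the two protein concepts are linked: the gene concept shares the accession but belongs to a different concept class, which shows how the semantic annotation constrains what a mapping method may connect.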

Aims

A wide range of biological applications can be addressed by setting up problem-specific data integration and analysis workflows. Although still a prototype, the current system has already been used:

  • for microarray data analysis;
  • to support the curation of scientific databases;
  • for scoring the quality of terms and definitions in ontologies such as the Gene Ontology;
  • for extracting cell-cell communication networks from scientific literature;
  • for the annotation of the barley microarray (with the IPK Gatersleben, Germany); and
  • for the annotation of the Vibrio salmonicida genome (with the Protein group in Tromsø, Norway).

Both Ondex and Taverna are open source and are freely available to academic and commercial researchers. NaCTeM’s text mining services are freely available to the UK academic community.

The aim of the project is to build on the success of the Ondex prototype and to create a robust, fully featured, extensible, easy-to-use and professionally supported e-tool that will underpin systems biology projects in the UK.

We propose to achieve this by:

  • Extending the core data structures, interfaces and data integration framework to support probabilistic relationships and thereby enable the use of statistical data analysis methods.
  • Upgrading and adapting the existing text mining tools with state-of-the-art techniques using semantic deep parsing to extract more complex relationships and richer semantics from bio-text sources.
  • Upgrading the workflow management to incorporate new developments from myGrid and support long-running and asynchronous workflows needed for compute and data intensive analyses.
  • Developing the user interfaces and other components to support comparative analyses (e.g. for comparing pathways, gene orders and data graphs from different species).
  • Exploiting the new data structures with statistical data analysis methods and associated visualisation methods.
  • Improving the software engineering and usability to make Ondex easier for non-experts to use.
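As an illustration of the first aim in the list above, the sketch below shows one way a relation could carry a probability so that statistical methods can act on it. This is an assumption, not the project's design: the names `WeightedRelationSketch` and `WeightedRelation` are hypothetical, and the noisy-OR rule used to combine independent lines of evidence is simply one common choice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of probabilistic relations; NOT the real Ondex API. */
final class WeightedRelationSketch {

    /** A relation between two concept identifiers with an attached probability. */
    static final class WeightedRelation {
        final String from;
        final String to;
        final String relationType;
        final double probability; // must lie in [0, 1]

        WeightedRelation(String from, String to, String relationType, double probability) {
            if (probability < 0.0 || probability > 1.0) {
                throw new IllegalArgumentException("probability must be in [0, 1]");
            }
            this.from = from;
            this.to = to;
            this.relationType = relationType;
            this.probability = probability;
        }
    }

    /** Keeps only the relations whose probability is at least the threshold. */
    static List<WeightedRelation> filter(List<WeightedRelation> relations, double threshold) {
        List<WeightedRelation> kept = new ArrayList<>();
        for (WeightedRelation r : relations) {
            if (r.probability >= threshold) {
                kept.add(r);
            }
        }
        return kept;
    }

    /**
     * Noisy-OR combination (an assumed rule): the probability that at least
     * one of several independent lines of evidence is correct.
     */
    static double combine(double... probabilities) {
        double noneCorrect = 1.0;
        for (double p : probabilities) {
            noneCorrect *= (1.0 - p);
        }
        return 1.0 - noneCorrect;
    }
}
```

With this rule, two independent pieces of evidence of probability 0.5 each combine to 0.75, and a threshold filter can then select the better-supported relations for downstream analysis.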

These technological developments will make it possible to address a wide range of new biological problems. The Ondex system will provide data integration support not only to the BBSRC Systems Biology Centres (see Biological applications) but also to a range of other systems biology projects that will be supported through the outreach activities.