Projects




Ondex SABR project


From data to knowledge - the Ondex System for integrating Life Sciences data sources

Watch a short video of Prof Chris Rawlings explaining the project at ISMB 2009: http://www.scivee.tv/node/12026

The Ondex SABR project (BB/F006039/1) was funded by the Biotechnology and Biological Sciences Research Council (BBSRC) for 3 years starting on April 1st 2008 under the SABR Systems Biology initiative to create an e-tool project for supporting systems biology research. It was a collaboration between Rothamsted Research, The University of Manchester and Newcastle University. See information on our collaborators. This project addressed the problem that a prerequisite to a systems approach to biological research (SABR) is the integration and analysis of heterogeneous experimental data, which are stored in hundreds of life-science databases and millions of scientific publications.

Validation of the project software will be achieved by providing direct support to three research challenges from systems biology:
  • Identifying new genetic and molecular targets to improve bioenergy crops;
  • Integration, augmentation and validation of yeast metabolome models; and
  • Supporting research into the role of telomere function in ageing.

In addition, other biological projects will be supported through existing collaborations and by developing new ones through outreach activities, including:

  • Integrating data to support circadian clock modelling (CSBE, Edinburgh);
  • A computational pipeline to provide annotations for the Vibrio salmonicida genome and to support bioprospecting of environmental genomics projects to identify novel cold-adapted proteins (University of Tromsø, Norway); and
  • Other systems biology applications identified during the project.

All these biological application cases have a common requirement for integration of a wide variety of datasets, and this project has been established to demonstrate that they can be supported using the proposed developments to the Ondex system. These developments build on and combine four major established components from leading experts:

  • Database integration

    A prototype Ondex software framework enables data from diverse biological data sets to be linked, integrated and visualised through graph analysis techniques. Ondex uses a semantically rich core data structure, has explicit support for workflow and has the ability to bring together information from structured databases and unstructured sources such as biological sequence data and free text. The prototype has already been successfully applied to a range of biological problems and forms the core of the project.

  • Workflows

    A prerequisite to data integration and analysis is the collection and pre-processing of relevant data sources. Ondex will use Taverna (http://taverna.sourceforge.net/) to combine the data collection and pre-processing steps with the data integration, text mining and analysis steps into workflows that store the data in the same semantically rich graph-based data structure. In addition, mapping all data into a common core structure improves searchability. Taverna is a popular workflow workbench from the myGrid project (http://www.mygrid.org.uk), now absorbed into OMII-UK, and is widely used in the bioinformatics community (1,500 downloads per month).

  • Graph-based analysis

    Graph-based analysis and visualisation methods can be used to extract meaning and new hypotheses from the integrated data. Biological data such as metabolic pathways and protein interactions are best seen as a network or graph. Ondex is based upon such a structure, and this is one of the reasons it is well adapted to the integrated analysis of biological systems.

  • Text mining

    Text mining, here exploiting the NaCTeM toolkit (http://www.nactem.ac.uk), offers the possibility of extracting semantic entities and facts from literature and of finding interesting associations among disparate facts, leading to the discovery of new or unsuspected knowledge. Text mining components are selected to create workflows through NaCTeM's U-Compare platform (http://u-compare.org), which links with Taverna.


Ondex data integration platform

The Ondex system stores data as a graph of concepts and relations. Concepts represent data entities, and relations link these entities together. Additional semantic annotation is added using concept classes, relation types, evidences and controlled vocabularies. Data is imported by parsers specific to each data source. Mapping methods create new relations between concepts. Local and global consistency checks are performed. Data integration can be configured and executed using web services via Taverna (http://www.mygrid.org.uk/tools/taverna/). The Ondex system is open source and written in Java.
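The graph model described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Ondex Java API: the class names, the fields and the name-matching mapping method are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A data entity (hypothetical fields, not the Ondex API)."""
    id: str
    concept_class: str  # e.g. "Gene" or "Protein"
    source: str         # controlled-vocabulary term for the originating database
    name: str


@dataclass(frozen=True)
class Relation:
    """A typed, directed link between two concepts, with its evidence."""
    from_id: str
    to_id: str
    relation_type: str  # e.g. "encodes" or "equivalent"
    evidence: str       # how the relation was established


@dataclass
class Graph:
    concepts: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_concept(self, concept):
        self.concepts[concept.id] = concept

    def add_relation(self, relation):
        # Local consistency check: both endpoints must already be in the graph.
        if relation.from_id not in self.concepts or relation.to_id not in self.concepts:
            raise ValueError("relation refers to an unknown concept")
        self.relations.append(relation)


def map_by_shared_name(graph):
    """Toy mapping method: link concepts of the same class that come from
    different sources and share a name."""
    created = []
    ids = sorted(graph.concepts)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            ca, cb = graph.concepts[a], graph.concepts[b]
            if (ca.source != cb.source
                    and ca.concept_class == cb.concept_class
                    and ca.name == cb.name):
                relation = Relation(a, b, "equivalent", "name match")
                graph.add_relation(relation)
                created.append(relation)
    return created
```

In this sketch a parser would call `add_concept` and `add_relation` for each imported record, and a mapping method such as `map_by_shared_name` would then add cross-source relations, each carrying its own evidence label.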

Aims

A wide range of biological applications can be addressed by setting up problem-specific data integration and analysis workflows. Even as a prototype, the current system has been used:

  • for microarray data analysis;
  • to support the curation of scientific databases;
  • for scoring the quality of terms and definitions in ontologies such as the Gene Ontology;
  • for extracting cell-cell communication networks from scientific literature;
  • for the annotation of the barley microarray (with the IPK Gatersleben, Germany); and
  • for the annotation of the Vibrio salmonicida genome (with the Protein group in Tromsø, Norway).

Both ONDEX and Taverna are Open Source and are freely available to academic and commercial researchers. NaCTeM’s text mining services are freely available to the UK academic community.

The aim of the project is to build on the success of the ONDEX prototype, and to create a robust, fully featured, extensible, easy to use and professionally-supported e-tool that will underpin systems biology projects in the UK.

We propose to achieve this by:

  • Extending core data structures, interfaces and data integration framework to support probabilistic relationships and thereby enable the use of statistical data analysis methods.
  • Upgrading and adapting the existing text mining tools with state-of-the-art techniques using semantic deep parsing to extract more complex relationships and richer semantics from bio-text sources.
  • Upgrading the workflow management to incorporate new developments from myGrid and support long-running and asynchronous workflows needed for compute and data intensive analyses.
  • Developing the user interfaces and other components to support comparative analyses (e.g. for comparing pathways, gene orders, data graphs from different species etc).
  • Exploiting the new data structures with statistical data analysis methods and associated visualisation methods.
  • Improving the software engineering and usability to make ONDEX more ready for use by non-experts.

These technological developments will make it possible to address a wide range of new biological problems. The Ondex system will provide data integration support not only to the BBSRC Systems Biology Centres (see Biological applications) but also to a range of other systems biology projects that will be supported through the outreach activities.





Mining Candidate Gene Networks From Genetic Studies of Crops and Animals

The QTLNetMiner project is a spinoff from the Ondex SABR project which re-uses components of the Ondex data integration framework and data visualisation tools to create a specialised resource for researchers working on complex traits in plants and animals. QTLNetMiner has been designed as a web-based resource that supports the identification and prioritization of candidate functional genes using evidence from:

  • Quantitative genetic experiments, in particular QTL studies
  • The scientific literature (using text mining of associations between trait terms and gene names)
  • Lists of candidate genes provided by the user, typically coming from related transcriptomic studies
  • An integrated knowledge base of gene function information including annotations from gene ontologies and biochemical pathway resources in the species of interest and relevant model organism genomes

The achievements of the project include the refinement of the Ondex graph visualisation tool as a Java applet known as OndexWeb, the development of indexing methods for the Ondex knowledge graph so that it can deliver results interactively to the client applications, and the creation of a method to rank genes based on the network of evidence that supports them as being functionally related to the trait terms used in the query.
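The idea of combining a QTL interval with evidence-based ranking can be sketched as follows. The actual QTLNetMiner ranking method is not described on this page, so the weighted sum used here, along with every function name, field name and weight, is purely an illustrative assumption.

```python
def genes_in_qtl(genes, chromosome, start, end):
    """Keep genes whose position lies inside a QTL interval (inclusive)."""
    return [
        g for g in genes
        if g["chromosome"] == chromosome and start <= g["position"] <= end
    ]


def rank_by_evidence(genes, evidence, weights):
    """Order genes by a weighted sum of their evidence items, highest first.

    `evidence` maps a gene id to a list of evidence types; `weights` maps an
    evidence type to a score. Both are hypothetical inputs for this sketch.
    """
    def score(gene):
        return sum(weights.get(kind, 0.0) for kind in evidence.get(gene["id"], []))

    return sorted(genes, key=score, reverse=True)
```

A caller would first restrict the gene list with `genes_in_qtl` and then pass the result to `rank_by_evidence`, so that only genes inside the interval are ranked.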

A QTLNetMiner user is presented with a simple-to-use query and visualisation interface. It supports several different views of the sets of candidate genes, so that the user can explore the source and quality of the evidence that relates them to a trait term or set of terms by traversal of the knowledge graph. These include:

  • a query interface for trait terms and QTL intervals with added trait query builder based on ontologies
  • a ranked table of genes with associated sets of evidence type together with the strength of the evidence
  • a map of the linked genes located on visual representations of the chromosomes
  • an OndexWeb graph visualisation with full interactive access to the evidence found to support the candidate gene and the relationships that link the evidence

The current QTLNetMiner web site has knowledge bases built for Arabidopsis, Poplar, Potato and combined Solanaceous crops, with prototypes for Barley as well as for the livestock species Chicken, Cow and Pig developed in the original BBSRC-funded project.

Acknowledgements

The Ondex knowledge graph traversal and semantic motif query methods used by QTLNetMiner were developed by Matthew Hindle as part of his PhD research at Rothamsted.

The QTLNetMiner project (BB/I023860/1) was originally funded by the BBSRC as a TRDF project for 12 months from January 2011 to 2012. It was a collaboration between Rothamsted Research and the Roslin Institute to demonstrate that the methods and software framework were sufficiently general to be used in crop plant and livestock animal studies. Since then, collaborations with the Feingold laboratory in INTA (Argentina) and with Uwe Scholz from IPK in Gatersleben have enabled us to further develop the software and implement knowledge bases for Tomato, Solanaceae and Barley species respectively.





OndexWeb


Ondex Networks for your Website

OndexWeb is a new web-based implementation of the network visualization and exploration tools from the Ondex data integration platform. New features such as context-sensitive menus and annotation tools provide users with intuitive ways to explore and manipulate the appearance of heterogeneous biological networks. OndexWeb is open source, written in Java and can be easily embedded into web sites as an applet. OndexWeb supports loading data from a variety of network formats, such as XGMML, NWB, Pajek and OXL.

OndexWeb is a key user interface component used in the QTLNetMiner system.

Find out more and try it at the OndexWeb website: http://ondex.rothamsted.ac.uk/OndexWeb

Read our paper on what OndexWeb can do for you

Jan Taubert, Keywan Hassani-Pak, Minja Zorc, Christopher Rawlings (2013) Ondex Web: interactive web-based visualization and exploration of biological networks visualization and analysis. Bioinformatics. doi: 10.1093/bioinformatics/btt740

Acknowledgements

OndexWeb was originally developed within the Ondex SABR project but was substantially extended to meet the needs of the QTLNetMiner project.





Ondex Chemogenomics


Accelerating Discovery by Mining and Visualising Integrated Chemogenomics Data

The motivation behind this project was to enable plant scientists and discovery chemists to collaborate more effectively by exploring a shared knowledge base of molecular and chemical data relating to agrochemistry and biological processes. In many industrial life-science organisations with a pipeline taking bioactive compounds to the market place, there is a problem of delivering research information and knowledge discovery tools that meet the requirements of both biologists and chemists. Both groups of scientists have a shared understanding of biochemical pathways and molecular interaction networks, providing an ideal basis for the development of data visualisation and data analysis methods to reveal new information from integrated genetic, biochemical and chemical data sets.

These requirements are generic to companies working on bioactive compounds, notably agrochemical and pharmaceutical companies and the associated biotechnology SMEs that provide goods and services to them.

The major developments to Ondex, now available in the December 2013 release, support the analysis and visualisation of small molecule chemistry data. New interfaces have been developed to public bioactivity resources (e.g. ChEMBL), providing a link to target proteins. The methods for analysis and visualisation of small chemical compounds and their functional properties use the existing open source Java-based Chemistry Development Kit (CDK). An interface to the European protein structure databank resource PDBe enables Ondex users to visualise the 3D structure of the proteins held in the knowledge base. New parsers now support the import and integration of private data sets in standard chemistry data formats (e.g. SD files, SMILES and InChI).

A summary of these new features is below:

  • Integration of functionality from the EBI Chemistry Development Kit (CDK) into the Ondex front end and backend, including chemical structure drawing via JChemPaint and protein structure rendering via Jmol as part of the attributes on concepts
  • Mouse-over and in-situ drawing of chemical structures for compounds and as thumbnails for nodes representing molecular compounds; the same features are also available in OndexWeb
  • Integration features for ChEMBL substances and bioactivity data for the Ondex backend and front end using ChEMBL RESTful web services; also available for on-the-fly loading using the context-sensitive right-click menu
  • Integration of the complete ChEBI database (3 star and complete) and its ontology
  • New mappings/relations between compounds can be generated based on chemical similarity (Tanimoto distance)
  • Extended search modes available using webservices:
    • Retrieve UniProt proteins directly by searching for their identifier
    • Retrieve ChEMBL compounds directly by searching for their identifier
    • Search for similar compounds with a SMILES or InChI query in an integrated network based on a customizable Tanimoto distance cut-off
  • New parsers for chemical structure data including data in SDF, SMILES and InChI formats
  • New parsers for the ExPASy ENZYME data file, ChEBI and ChEMBL data
  • Extensive updates to UniProt parser
  • Integration of JalView for the generation and analysis of multiple sequence alignments of proteins (right-click context sensitive menu)
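The similarity search with a Tanimoto distance cut-off mentioned in the list above can be sketched as follows. This is not the Ondex or CDK implementation: fingerprints are represented here as plain Python sets of "on" bit positions, and the function names and the default cut-off of 0.3 are assumptions for illustration.

```python
def tanimoto_similarity(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits:
    size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen in this sketch for two empty fingerprints
    return len(a & b) / union


def tanimoto_distance(a, b):
    """Distance form used for cut-offs: 1 minus the Tanimoto coefficient."""
    return 1.0 - tanimoto_similarity(a, b)


def similar_compounds(query, library, max_distance=0.3):
    """Return (name, distance) pairs within the cut-off, closest first.

    `library` maps a compound name to its fingerprint (a set of on-bits).
    """
    hits = [(name, tanimoto_distance(query, fp)) for name, fp in library.items()]
    return sorted((h for h in hits if h[1] <= max_distance), key=lambda h: h[1])
```

A smaller `max_distance` returns only near-identical structures, while a larger value widens the search to more loosely related compounds; the same distances could also be used to decide which pairs of compounds receive a new similarity relation.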

Example screenshots from the new features can be found here.

Acknowledgements

This project was funded by the Technology Strategy Board (TSB; TP Number 5082-33372) and the BBSRC (TS/I003707/1) from November 2012 until March 2013. The lead partner was Syngenta and we gratefully acknowledge the contributions from Mark Forster and Bob Vaughan.