Friday, May 08, 2015

KaBOB VS Bio2RDF


KaBOB: ontology-based semantic integration of biomedical databaseshttp://www.biomedcentral.com/1471-2105/16/126/abstract

KaBOB recent paper describes how a mashup have been created using 14 ontologies and 18 data sources converted to RDF, all loaded into a triplestore which is not made public. Great work, a mashup well designed based on ontologies and data normalization a quality standard never really put into Bio2RDF's triplestores. Nice work but not available to the bioinformatician community and it is a lot of work to rebuild it from scratch.

The first step of my hackhaton project is to rebuil such a mashup from the dame data collection and expose it on the web as linked data, I will use the kabob.bio2rdf.org namespace for it.

In the past I would have created a triplestore for it, Virtuoso can easily handle 500 millions triples beast. I will try differently and will use Elasticsearch instead and Kibana as a user interface available at http://melina.bio2rdf.org.

KaBOB currently imports the following 14 ontologies:

1. Basic Formal Ontology (BFO) [9]
2. BRENDA Tissue / Enzyme Source (BTO) [10]
3. Chemical Entities of Biological Interest (ChEBI) [11] (54,838 from ONTOBEE)
4. Cell Type Ontology (CL) [12]
5. Gene Ontology including biological process, molecular function, and cellular component
(GO) [7] (42,807 from ONTOBEE)
6. Information Artifact Ontology (IAO) [6]
7. Protein-Protein Interaction Ontology (MI) [13]
8. Mammalian Phenotype Ontology (MP) [14]
9. NCBI Taxonomy [15]
10. Ontology for Biomedical Investigation (OBI) [16]
11. Protein Modification (MOD) [17]
12. Protein Ontology (PR) [18]
13. Relation Ontology (RO) [19]
14. Sequence Ontology (SO) [8]

KaBOB currently imports data from the following 18 data sources:

1. Database of Interacting Proteins (DIP) [20]
2. DrugBank [21] (19,844 from Bio2RDF)
3. Genetic Association Database (GAD) [22] ()
4. UniProt Gene Ontology Annotation (GOA) [23]
5. HUGO Gene Nomenclature Committee (HGNC) [24] (43,407 from Bio2RDF)
6. HomoloGene [25] (18,712 from Bio2RDF)
7. Human Protein Reference Database (HPRD) [26]
8. InterPro [27]  (25,272 from Bio2RDF)
9. iRefWeb [28]
10. Mouse Genome Informatics (MGI) [29] ()
11. miRBase [30]
12. NCBI Gene [31] (47,728 from Bio2RDF)
13. Online Mendelian Inheritance in Man (OMIM) [32] (14,609 from Bio2RDF)
14. PharmGKB [33] ()
15. Reactome [34] ()
16. Rat Genome Database (RGD) [35]
17. Transfac [36]
18. UniProt [37] (124,567)

In red is the number of document/graph loaded in ES.

Data source :

OBO : http://www.ontobee.org/sparql

Uniprot : http://beta.sparql.uniprot.org/sparql

and Bio2RDF corresponding SPARQL endpoints.

Bio2RDF 10th birthday this year, and I am back on the biohacking road

This weekend is the first biohackathon about BD2K in San Diego:

https://github.com/Network-of-BioThings/nob-hq/wiki/1st-BD2K-3rd-Network-of-BioThings-Hackathon

It is a good occasion to explore new avenue to expose RDF biological knowledge in the big data era. So let's try Elasticsearch... (https://www.elastic.co/products/elasticsearch)

it is free, fast and it scale. This would not be doable without the recent availability of the RDF version format in JSON, the JSON-LD project (http://json-ld.org/).

I will use the JSON-LD converter written by Peter Ansell, one of the major contributor to Bio2RDF, (https://github.com/jsonld-java).

So let's try to load some of Bio2RDF triples into ElasticSearch ! I have 24 hours to explore this new approach.

Here is what we will try to achieve :

  1. RDF2ES : Bring KaBOB online as RDF REST services using ElasticSearch

    1. Description.  KaBOB is a semantic integration of 18 different biomedically relevant knowledge sources.  The linked paper describes processes for instantiating it as RDF, but does not provide a functional implementation.  This is likely because of the significant challenges involved in stably hosting a very large SPARQL endpoint.  Perhaps SPARQL isn’t the best way to share this content.  This project is to figure out a way to the useful data integration work done in kaBOB available via a set of web services that are both fast and reliable.  Willing to sacrifice some of the flexibility of a full sparql endpoint to gain a functional app.  Perhaps using Elastic Search.
      1. First we will load part of Kabob data source for human into an ElasticSearch cluster. (OMIM, GO, CHEBI, Drugbank, OBO ontologies, Reactome, Uniprot and entrez gene)
      2. Second we will build REST services to access it, there will be available for hacking.
      3. Third we will explore this data using Kibana tool.
      4. Finally, we will illustrate how a Talend workflow consuming RDF data can replace a complex SPARQL query. The querying workflow will be exposed at MyExperiments.
    2. input.  Instructions for integrating 18 different biological data sources + code at: https://github.com/UCDenver-ccp/datasource https://github.com/drlivingston/kr https://github.com/drlivingston/kabob I will use bio2rdf version of kabob selected dataset.

      If someone has access to Kabob RDF data, we could load it into ES triplestore.
output. web services that provide useful answers to questions about genes, biological process, and diseases, Those REST services will be created the way Bio2RDF API have been done, they are generated using Talend ESB tool (http://bio2rdf.org/test) and virtuoso triplestore will be replaced by ES storage.

We will try to create a type ahead user experience over those dataset, a feature that Bio2RDF have always been missing. (bio2rdf.org)

Finally, we will explore the data visualisation potential of the Kabina tool over ElasticSearch data in JSON-LD format.