Friday, May 08, 2015

Bio2RDF's 10th birthday is this year, and I am back on the biohacking road

This weekend is the first biohackathon about BD2K in San Diego:

https://github.com/Network-of-BioThings/nob-hq/wiki/1st-BD2K-3rd-Network-of-BioThings-Hackathon

It is a good occasion to explore new avenues for exposing RDF biological knowledge in the big data era. So let's try Elasticsearch (https://www.elastic.co/products/elasticsearch).

It is free, it is fast, and it scales. This would not be doable without the recent availability of a JSON serialization of RDF, the JSON-LD project (http://json-ld.org/).

I will use the JSON-LD converter written by Peter Ansell, one of the major contributors to Bio2RDF (https://github.com/jsonld-java).
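To give a rough idea of what "RDF as JSON" means here, the sketch below groups a few triples by subject into JSON-LD documents with a shared `@context`. It is a hand-rolled illustration in Python, not the jsonld-java converter; the `CONTEXT` terms, the `triples_to_jsonld` helper and the example.org triples are all made up for the example, and it only handles literal objects.

```python
import json

# Hypothetical context: short field names mapped to predicate IRIs.
CONTEXT = {
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "comment": "http://www.w3.org/2000/01/rdf-schema#comment",
}


def triples_to_jsonld(triples, context):
    """Group (subject, predicate, literal) triples into one JSON-LD node per subject."""
    iri_to_term = {iri: term for term, iri in context.items()}
    nodes = {}
    for subject, predicate, obj in triples:
        node = nodes.setdefault(subject, {"@context": context, "@id": subject})
        # Use the short term when the context defines one, else keep the full IRI.
        key = iri_to_term.get(predicate, predicate)
        node.setdefault(key, []).append(obj)
    return list(nodes.values())


# Made-up triples, for illustration only (not real Bio2RDF data).
TRIPLES = [
    ("http://example.org/gene/1",
     "http://www.w3.org/2000/01/rdf-schema#label", "example gene"),
    ("http://example.org/gene/1",
     "http://www.w3.org/2000/01/rdf-schema#comment", "a placeholder record"),
    ("http://example.org/gene/2",
     "http://www.w3.org/2000/01/rdf-schema#label", "another gene"),
]

docs = triples_to_jsonld(TRIPLES, CONTEXT)
print(json.dumps(docs[0], indent=2))
```

Mapping predicates to short terms keeps the field names readable once the documents are indexed, while the `@context` preserves the link back to the original predicate IRIs.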

So let's try to load some Bio2RDF triples into Elasticsearch! I have 24 hours to explore this new approach.

Here is what we will try to achieve:

  1. RDF2ES: Bring KaBOB online as RDF REST services using Elasticsearch

    1. Description.  KaBOB is a semantic integration of 18 different biomedically relevant knowledge sources.  The linked paper describes processes for instantiating it as RDF, but does not provide a functional implementation.  This is likely because of the significant challenges involved in stably hosting a very large SPARQL endpoint.  Perhaps SPARQL isn’t the best way to share this content.  This project is to figure out a way to make the useful data integration work done in KaBOB available via a set of web services that are both fast and reliable.  We are willing to sacrifice some of the flexibility of a full SPARQL endpoint to gain a functional app, perhaps using Elasticsearch.
      1. First, we will load part of the KaBOB data sources for human into an Elasticsearch cluster (OMIM, GO, ChEBI, DrugBank, OBO ontologies, Reactome, UniProt and Entrez Gene).
      2. Second, we will build REST services to access it; they will be available for hacking.
      3. Third, we will explore this data using the Kibana tool.
      4. Finally, we will illustrate how a Talend workflow consuming RDF data can replace a complex SPARQL query. The querying workflow will be exposed at myExperiment.
    2. Input.  Instructions for integrating 18 different biological data sources, plus code, at: https://github.com/UCDenver-ccp/datasource https://github.com/drlivingston/kr https://github.com/drlivingston/kabob I will use the Bio2RDF version of the datasets selected from KaBOB.

      If someone has access to the KaBOB RDF data, we could load it into ES as well.
    3. Output.  Web services that provide useful answers to questions about genes, biological processes, and diseases. Those REST services will be created the way the Bio2RDF API was: they are generated using the Talend ESB tool (http://bio2rdf.org/test), with the Virtuoso triplestore replaced by ES storage.
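The loading step in the plan above could look roughly like the sketch below, which builds the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint accepts. The index name `bio2rdf-demo`, the `to_bulk_body` helper and the sample documents are assumptions made for illustration; nothing is sent to a cluster here.

```python
import json


def to_bulk_body(docs, index_name):
    """Build the newline-delimited JSON body for the Elasticsearch _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index this document, reusing the RDF subject IRI as its id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["@id"]}}))
        # Source line: the JSON-LD document itself.
        lines.append(json.dumps(doc))
    # The bulk format requires a newline after the last line.
    return "\n".join(lines) + "\n"


# Made-up JSON-LD documents, for illustration only.
DOCS = [
    {"@id": "http://example.org/gene/1", "label": ["example gene"]},
    {"@id": "http://example.org/gene/2", "label": ["another gene"]},
]

body = to_bulk_body(DOCS, "bio2rdf-demo")
print(body)
```

The resulting body would be sent with an HTTP POST to the cluster's `/_bulk` endpoint. Elasticsearch releases from around the time of this post also expect a `_type` in the action line or in the URL, so the action metadata may need adjusting for the version in use.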

We will try to create a type-ahead user experience over those datasets, a feature that Bio2RDF has always been missing (bio2rdf.org).
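One possible way to get type-ahead behaviour is the `match_phrase_prefix` query from the Elasticsearch query DSL, which treats the last word typed as a prefix. The sketch below only builds the search body; the `label` field, the `typeahead_query` helper and the sample input are assumptions, and a dedicated suggester or n-gram analyzer might turn out to be a better fit.

```python
import json


def typeahead_query(prefix, field="label", size=10):
    """Build a search body matching documents whose field contains a phrase
    in which the last typed word is treated as a prefix."""
    return {
        "size": size,  # cap the number of suggestions returned
        "query": {"match_phrase_prefix": {field: prefix}},
    }


# Hypothetical partial input typed by a user.
body = typeahead_query("brc")
print(json.dumps(body))
```

The body would be sent to the `_search` endpoint of the chosen index on every keystroke (or after a short delay), and the hits shown as suggestions.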

Finally, we will explore the data visualisation potential of the Kibana tool over Elasticsearch data in JSON-LD format.
