Friday, May 08, 2015
KaBOB VS Bio2RDF
KaBOB: ontology-based semantic integration of biomedical databases
http://www.biomedcentral.com/1471-2105/16/126/abstract
The recent KaBOB paper describes how a mashup was created from 14 ontologies and 18 data sources converted to RDF, all loaded into a triplestore that has not been made public. It is great work: a well-designed mashup based on ontologies and data normalization, a quality standard never really applied to Bio2RDF's triplestores. But it is not available to the bioinformatics community, and rebuilding it from scratch is a lot of work.
The first step of my hackathon project is to rebuild such a mashup from the same data collection and expose it on the web as linked data. I will use the kabob.bio2rdf.org namespace for it.
In the past I would have created a triplestore for it; Virtuoso can easily handle a 500-million-triple beast. This time I will try a different approach and use Elasticsearch instead, with Kibana as a user interface, available at http://melina.bio2rdf.org.
KaBOB currently imports the following 14 ontologies:
1. Basic Formal Ontology (BFO) [9]
2. BRENDA Tissue / Enzyme Source (BTO) [10]
3. Chemical Entities of Biological Interest (ChEBI) [11] (54,838 from ONTOBEE)
4. Cell Type Ontology (CL) [12]
5. Gene Ontology including biological process, molecular function, and cellular component (GO) [7] (42,807 from ONTOBEE)
6. Information Artifact Ontology (IAO) [6]
7. Protein-Protein Interaction Ontology (MI) [13]
8. Mammalian Phenotype Ontology (MP) [14]
9. NCBI Taxonomy [15]
10. Ontology for Biomedical Investigation (OBI) [16]
11. Protein Modification (MOD) [17]
12. Protein Ontology (PR) [18]
13. Relation Ontology (RO) [19]
14. Sequence Ontology (SO) [8]
KaBOB currently imports data from the following 18 data sources:
1. Database of Interacting Proteins (DIP) [20]
2. DrugBank [21] (19,844 from Bio2RDF)
3. Genetic Association Database (GAD) [22] ()
4. UniProt Gene Ontology Annotation (GOA) [23]
5. HUGO Gene Nomenclature Committee (HGNC) [24] (43,407 from Bio2RDF)
6. HomoloGene [25] (18,712 from Bio2RDF)
7. Human Protein Reference Database (HPRD) [26]
8. InterPro [27] (25,272 from Bio2RDF)
9. iRefWeb [28]
10. Mouse Genome Informatics (MGI) [29] ()
11. miRBase [30]
12. NCBI Gene [31] (47,728 from Bio2RDF)
13. Online Mendelian Inheritance in Man (OMIM) [32] (14,609 from Bio2RDF)
14. PharmGKB [33] ()
15. Reactome [34] ()
16. Rat Genome Database (RGD) [35]
17. Transfac [36]
18. UniProt [37] (124,567)
The figures in parentheses are the numbers of documents/graphs loaded into ES.
Data sources:
OBO: http://www.ontobee.org/sparql
UniProt: http://beta.sparql.uniprot.org/sparql
and the corresponding Bio2RDF SPARQL endpoints.
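As a rough sketch of how one resource could be pulled from these endpoints, the Python below builds a SPARQL 1.1 Protocol GET request for a DESCRIBE query. It only constructs the request and never sends it. The use of DESCRIBE, the `application/ld+json` Accept header (an endpoint may not offer JSON-LD output) and the example GO IRI are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoints listed in the post.
ENDPOINTS = {
    "obo": "http://www.ontobee.org/sparql",
    "uniprot": "http://beta.sparql.uniprot.org/sparql",
}


def describe_request(endpoint, iri, accept="application/ld+json"):
    """Build (but do not send) a GET request asking an endpoint to DESCRIBE one IRI.

    The SPARQL 1.1 Protocol passes the query text in the `query` URL parameter.
    The Accept header is an assumption: check what each endpoint can return.
    """
    query = "DESCRIBE <%s>" % iri
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": accept})


# Illustrative IRI: the GO term for biological_process.
req = describe_request(ENDPOINTS["obo"], "http://purl.obolibrary.org/obo/GO_0008150")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access.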
Bio2RDF's 10th birthday is this year, and I am back on the biohacking road
This weekend is the first biohackathon about BD2K in San Diego:
https://github.com/Network-of-BioThings/nob-hq/wiki/1st-BD2K-3rd-Network-of-BioThings-Hackathon
It is a good occasion to explore a new avenue for exposing RDF biological knowledge in the big data era. So let's try Elasticsearch (https://www.elastic.co/products/elasticsearch):
it is free, it is fast and it scales. This would not be doable without the recent availability of an RDF serialization in JSON, the JSON-LD project (http://json-ld.org/).
I will use the JSON-LD converter written by Peter Ansell, one of the major contributors to Bio2RDF (https://github.com/jsonld-java).
So let's try to load some Bio2RDF triples into Elasticsearch! I have 24 hours to explore this new approach.
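To make the idea concrete, here is a minimal Python sketch of the loading step: it groups the triples about one subject into a JSON-LD document and formats documents for the Elasticsearch bulk API (one action line followed by one source line per document, newline-delimited). It is not the jsonld-java converter mentioned above. The `CONTEXT` terms, the `kabob` index name and the example values are hypothetical, and older Elasticsearch releases also expect a `_type` in the action line.

```python
import json

RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Hypothetical JSON-LD context: short field names mapped to predicate IRIs.
# "@type": "@id" tells JSON-LD that seeAlso values are IRIs, not literals.
CONTEXT = {
    "label": RDFS + "label",
    "seeAlso": {"@id": RDFS + "seeAlso", "@type": "@id"},
}


def to_jsonld(subject, triples, context=CONTEXT):
    """Group (predicate, object) pairs about one subject into one JSON-LD document."""
    term_for = {}
    for term, definition in context.items():
        iri = definition["@id"] if isinstance(definition, dict) else definition
        term_for[iri] = term
    doc = {"@context": context, "@id": subject}
    for predicate, obj in triples:
        # Predicates missing from the context keep their full IRI as the key,
        # and their objects are then read as plain literals.
        key = term_for.get(predicate, predicate)
        doc.setdefault(key, []).append(obj)
    return doc


def bulk_payload(index, docs):
    """Format documents as a newline-delimited bulk request body."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["@id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Illustrative values only.
doc = to_jsonld(
    "http://bio2rdf.org/drugbank:DB00001",
    [(RDFS + "label", "Lepirudin"),
     (RDFS + "seeAlso", "http://www.drugbank.ca/drugs/DB00001")],
)
print(bulk_payload("kabob", [doc]))
```

The resulting string would be POSTed to the cluster's `/_bulk` endpoint. Using short context terms as field names keeps full IRIs, with their dots and slashes, out of the Elasticsearch mapping.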
Here is what we will try to achieve:
RDF2ES: Bring KaBOB online as RDF REST services using Elasticsearch
- Description. KaBOB is a semantic integration of 18 different biomedically relevant knowledge sources. The linked paper describes processes for instantiating it as RDF, but does not provide a functional implementation. This is likely because of the significant challenges involved in stably hosting a very large SPARQL endpoint. Perhaps SPARQL isn't the best way to share this content. This project is to figure out a way to make the useful data integration work done in KaBOB available via a set of web services that are both fast and reliable. We are willing to sacrifice some of the flexibility of a full SPARQL endpoint to gain a functional app, perhaps using Elasticsearch.
- First, we will load the human part of the KaBOB data sources into an Elasticsearch cluster (OMIM, GO, ChEBI, DrugBank, OBO ontologies, Reactome, UniProt and Entrez Gene).
- Second, we will build REST services to access it; they will be available for hacking.
- Third, we will explore this data using the Kibana tool.
- Finally, we will illustrate how a Talend workflow consuming RDF data can replace a complex SPARQL query. The querying workflow will be exposed at myExperiment.
- Input. Instructions for integrating 18 different biological data sources, plus code, at: https://github.com/UCDenver-ccp/datasource https://github.com/drlivingston/kr https://github.com/drlivingston/kabob I will use the Bio2RDF version of the datasets selected by KaBOB.
If someone has access to the KaBOB RDF data, we could load it into ES.
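The REST layer in the second step could start out as a thin mapping from a Bio2RDF-style `namespace:identifier` to an Elasticsearch document URL. The Python sketch below shows one possible mapping. It assumes a local cluster on the default port 9200, a hypothetical `kabob` index, documents stored under their full Bio2RDF IRI as `_id`, and the `_doc` path used by recent Elasticsearch releases (older releases put a type name in that position).

```python
from urllib.parse import quote


def es_document_url(curie, es_base="http://localhost:9200", index="kabob"):
    """Map 'namespace:identifier' to the URL of the matching Elasticsearch document.

    Assumes each document was indexed with its Bio2RDF IRI as the _id.
    """
    namespace, sep, identifier = curie.partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError("expected namespace:identifier, got %r" % curie)
    doc_id = "http://bio2rdf.org/" + curie
    # Percent-encode the whole id so its slashes and colons stay inside one path segment.
    return "%s/%s/_doc/%s" % (es_base.rstrip("/"), index, quote(doc_id, safe=""))


print(es_document_url("drugbank:DB00001"))
```

A GET on the returned URL would fetch the stored JSON-LD document. This trades the flexibility of SPARQL for a single fast lookup per resource, which is the compromise described in the project description.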
We will try to create a type-ahead user experience over those datasets, a feature that Bio2RDF has always been missing (bio2rdf.org).
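One way to get a type-ahead with Elasticsearch is the `match_phrase_prefix` query, which treats the last word typed as a prefix. The Python sketch below only builds the search body. The `label` field name and the limits are assumptions, and other approaches (edge n-gram analyzers, completion suggesters) may suit large indexes better.

```python
import json


def typeahead_query(text, field="label", size=10, max_expansions=50):
    """Build a search body matching documents whose field starts with the typed text."""
    return {
        "size": size,  # number of suggestions to return
        "query": {
            "match_phrase_prefix": {
                field: {"query": text, "max_expansions": max_expansions}
            }
        },
    }


# Body to POST to <index>/_search as the user types.
print(json.dumps(typeahead_query("lepi"), indent=2))
```

Each keystroke would send a new body, so keeping `size` small helps the suggestions stay responsive.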
Finally, we will explore the data visualisation potential of the Kibana tool over Elasticsearch data in JSON-LD format.