Tuesday, December 22, 2015

Happy 10th birthday, Bio2RDF, and welcome to its 500th citation!

That's it: for 10 years, http://bio2rdf.org has been a linked data service returning RDF from dereferenceable URIs, according to the Linked Data principles.

It all started on November 23, 2005

and here we are today with the 500th citation.

Friday, May 08, 2015


KaBOB: ontology-based semantic integration of biomedical databases
http://www.biomedcentral.com/1471-2105/16/126/abstract

The recent KaBOB paper describes how a mashup has been created using 14 ontologies and 18 data sources converted to RDF, all loaded into a triplestore that is not public. It is great work: a well-designed mashup based on ontologies and data normalization, a quality standard never really applied to Bio2RDF's triplestores. But it is not available to the bioinformatics community, and rebuilding it from scratch is a lot of work.

The first step of my hackathon project is to rebuild such a mashup from the same data collection and expose it on the web as linked data. I will use the kabob.bio2rdf.org namespace for it.

In the past I would have created a triplestore for it; Virtuoso can easily handle a 500-million-triple beast. This time I will try something different and use Elasticsearch instead, with Kibana as a user interface, available at http://melina.bio2rdf.org.

KaBOB currently imports the following 14 ontologies:

1. Basic Formal Ontology (BFO) [9]
2. BRENDA Tissue / Enzyme Source (BTO) [10]
3. Chemical Entities of Biological Interest (ChEBI) [11] (54,838 from ONTOBEE)
4. Cell Type Ontology (CL) [12]
5. Gene Ontology including biological process, molecular function, and cellular component (GO) [7] (42,807 from ONTOBEE)
6. Information Artifact Ontology (IAO) [6]
7. Protein-Protein Interaction Ontology (MI) [13]
8. Mammalian Phenotype Ontology (MP) [14]
9. NCBI Taxonomy [15]
10. Ontology for Biomedical Investigation (OBI) [16]
11. Protein Modification (MOD) [17]
12. Protein Ontology (PR) [18]
13. Relation Ontology (RO) [19]
14. Sequence Ontology (SO) [8]

KaBOB currently imports data from the following 18 data sources:

1. Database of Interacting Proteins (DIP) [20]
2. DrugBank [21] (19,844 from Bio2RDF)
3. Genetic Association Database (GAD) [22] ()
4. UniProt Gene Ontology Annotation (GOA) [23]
5. HUGO Gene Nomenclature Committee (HGNC) [24] (43,407 from Bio2RDF)
6. HomoloGene [25] (18,712 from Bio2RDF)
7. Human Protein Reference Database (HPRD) [26]
8. InterPro [27] (25,272 from Bio2RDF)
9. iRefWeb [28]
10. Mouse Genome Informatics (MGI) [29] ()
11. miRBase [30]
12. NCBI Gene [31] (47,728 from Bio2RDF)
13. Online Mendelian Inheritance in Man (OMIM) [32] (14,609 from Bio2RDF)
14. PharmGKB [33] ()
15. Reactome [34] ()
16. Rat Genome Database (RGD) [35]
17. Transfac [36]
18. UniProt [37] (124,567)

The number in parentheses (in red) is the number of documents/graphs loaded into ES.

Data sources:

OBO : http://www.ontobee.org/sparql

Uniprot : http://beta.sparql.uniprot.org/sparql

and Bio2RDF corresponding SPARQL endpoints.
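As a rough illustration of how these endpoints are queried, here is a minimal Python sketch that builds a SPARQL-protocol GET URL, with the query passed in the standard `query` parameter. The endpoint URL is the Ontobee one listed above; the helper name `build_sparql_url` and the example query are illustrative assumptions, not the queries actually used for this project.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the list above.
ONTOBEE_ENDPOINT = "http://www.ontobee.org/sparql"


def build_sparql_url(endpoint: str, query: str) -> str:
    """Return a SPARQL-protocol GET URL carrying the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical query: fetch a handful of classes with their labels.
EXAMPLE_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?cls ?label WHERE {
  ?cls a owl:Class ;
       rdfs:label ?label .
} LIMIT 10
"""

url = build_sparql_url(ONTOBEE_ENDPOINT, EXAMPLE_QUERY)
```

The resulting URL can then be fetched with any HTTP client, asking for the desired result format through the `Accept` header.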

Bio2RDF 10th birthday this year, and I am back on the biohacking road

This weekend is the first biohackathon about BD2K in San Diego:


It is a good occasion to explore new avenues for exposing RDF biological knowledge in the big data era. So let's try Elasticsearch... (https://www.elastic.co/products/elasticsearch)

It is free, it is fast, and it scales. This would not be doable without the recent availability of a JSON serialization of RDF, from the JSON-LD project (http://json-ld.org/).

I will use the JSON-LD converter written by Peter Ansell, one of the major contributors to Bio2RDF (https://github.com/jsonld-java).

So let's try to load some Bio2RDF triples into Elasticsearch! I have 24 hours to explore this new approach.
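To make the idea concrete, here is a minimal Python sketch of how JSON-LD documents could be packed into the body of an Elasticsearch `_bulk` request. It is only a sketch under stated assumptions: the function name `to_bulk_payload`, the index name, and the example document (URI, type and label) are all hypothetical, and the documents are assumed to be already compacted so that their keys are short names rather than full URIs.

```python
import json


def to_bulk_payload(docs, index_name):
    """Build an Elasticsearch _bulk request body (newline-delimited JSON).

    Each JSON-LD document becomes two lines: an action line naming the
    target index and document id, then the document itself.
    """
    lines = []
    for doc in docs:
        # Use the JSON-LD @id (the resource URI) as the document id.
        action = {"index": {"_index": index_name, "_id": doc["@id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical compacted JSON-LD document, for illustration only.
example_docs = [
    {
        "@id": "http://bio2rdf.org/example:1",
        "@type": "Resource",
        "label": "an example resource",
    }
]

payload = to_bulk_payload(example_docs, "bio2rdf-example")
```

The payload would then be POSTed to the cluster's `_bulk` endpoint with a newline-delimited JSON content type. Keeping one document per resource URI is what lets each RDF graph be fetched back by its identifier.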

Here is what we will try to achieve :

  1. RDF2ES : Bring KaBOB online as RDF REST services using ElasticSearch

    1. Description.  KaBOB is a semantic integration of 18 different biomedically relevant knowledge sources.  The linked paper describes processes for instantiating it as RDF, but does not provide a functional implementation.  This is likely because of the significant challenges involved in stably hosting a very large SPARQL endpoint.  Perhaps SPARQL isn’t the best way to share this content.  This project is to figure out a way to make the useful data integration work done in KaBOB available via a set of web services that are both fast and reliable.  We are willing to sacrifice some of the flexibility of a full SPARQL endpoint to gain a functional app, perhaps using Elasticsearch.
      1. First we will load part of the KaBOB data sources for human into an Elasticsearch cluster (OMIM, GO, ChEBI, DrugBank, OBO ontologies, Reactome, UniProt and Entrez Gene).
      2. Second we will build REST services to access it; they will be available for hacking.
      3. Third we will explore this data using the Kibana tool.
      4. Finally, we will illustrate how a Talend workflow consuming RDF data can replace a complex SPARQL query. The querying workflow will be exposed at myExperiment.
    2. input.  Instructions for integrating 18 different biological data sources + code at: https://github.com/UCDenver-ccp/datasource, https://github.com/drlivingston/kr and https://github.com/drlivingston/kabob. I will use the Bio2RDF version of the datasets selected by KaBOB.

      If someone has access to the KaBOB RDF data, we could load it into ES.
    3. output.  Web services that provide useful answers to questions about genes, biological processes, and diseases. These REST services will be created the way the Bio2RDF API was: they are generated using the Talend ESB tool (http://bio2rdf.org/test), and the Virtuoso triplestore will be replaced by ES storage.

We will try to create a type-ahead user experience over those datasets, a feature that Bio2RDF has always been missing. (bio2rdf.org)
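One possible way to get type-ahead behaviour out of Elasticsearch is its `match_phrase_prefix` query, which matches documents whose field contains a phrase beginning with what the user has typed so far. The Python sketch below only builds the search body; the function name `typeahead_query` and the `label` field are assumptions about how the documents might be indexed, not a description of what was built at the hackathon.

```python
import json


def typeahead_query(prefix, field="label", size=10):
    """Build an Elasticsearch search body for prefix-style suggestions.

    `field` is assumed to hold a human-readable label for each document.
    """
    return {
        "size": size,  # cap the number of suggestions returned
        "query": {"match_phrase_prefix": {field: prefix}},
    }


# Example: the user has typed "insul" so far.
body = typeahead_query("insul")
print(json.dumps(body, indent=2))
```

The body would be sent to the `_search` endpoint of the relevant index on every keystroke (or after a short pause), and the labels of the hits shown as suggestions.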

Finally, we will explore the data visualisation potential of the Kibana tool over Elasticsearch data in JSON-LD format.

Sunday, September 11, 2011

Bio2RDF: moving forward as a community

Last week we held our first virtual meeting towards re-invigorating the Bio2RDF project with a significantly larger and vested community. Following the discussions, we plan to establish 3 focus groups around:

A. policy (information, governance, sustainability, outreach)
B. technical (architecture, infrastructure and RDFization)
C. social (user experience and social networking)

The next step then is for groups to:
1. identify and certify discussion leads (responsibilities: set meeting times and agenda, facilitate and encourage discussion among members, draft reports)
2. identify additional people to recruit from the wider community that would provide additional expertise (interested, but didn't attend the first? sign up now !)
3. extend and prioritize discussion items (what exactly will this group focus its efforts on in the short and long term)
4. identify and assign bite-sized tasks (so we can get things done one step at a time :)
5. collate results and present to the wider community

I suggest that groups self-organize a first meeting in the next two weeks to deal with items 1-4, and either meet again or use the Google documents to collaboratively report findings.

Finally, I'd like for us to hold another meeting at times that are much more accommodating for Europe + North America ;)  Please fill in the Doodle poll (http://www.doodle.com/fsuz6mgs5cztf2e2).
As always, feel free to contact me if you have any questions, and please sign up to the Bio2RDF mailing list for all future discussions.

Wednesday, October 20, 2010

Tuesday, October 05, 2010

Bio2RDF returns to Japan

Bio2RDF is returning to Japan again this year. We will give a talk about Bio2RDF at Biocuration 2010. Biocuration runs from October 11th to October 14th in Odaiba, Tokyo.