Tuesday, April 21, 2009

2,4 billions triples of Bioinformatics RAW DATA NOW

In his recent talk at TED, Tim Berner Lee invited the data provider to make available data in RDF format to help the building process of linked data web. He asked them to offer RAW DATA NOW.



We totally share this approach in the Bio2RDF community, our goal is to make public datasets from the bioinformatics community available in RDF format via standard SPARQL endpoints (Virtuoso server is used for that). We strongly believe in the semantic web approach to solve science problem but we do not want to wait for data provider to do the RAW DATA conversion job. Converting data to RDF is not fun, we did a lot of this dirty job, and here are the results for actual Bio2RDF release of 34 data sources.

Our current datasets in N3 format are available here :

http://quebec.bio2rdf.org/download/n3/

We invite semantic search engine provider to index these files.

The way we produce them is documented in our Wiki at SourceForge in the Cookbook section :

http://bio2rdf.wiki.sourceforge.net/Namespace%27s+update

The actual list of SPARQL endpoints in the linked data cloud is hosted here :

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics

Bio2RDF 2,4 billions triples graph of linked data represents 51 % of the actual global linked data graph size.

Finally, this is what this highly connected knowledge world look like.



I would take this occasion to thanks all the enthusiast biologist and researcher who invest themselves by annotating article, protein and gene product. Without this essential work of connecting documents and concepts together, this project would not have been possible.

For the 20th anniversary of the web, I would also want to thanks Tim Berner Lee for his inspiring vision. Bio2RDF may not be the awaited killer app of the life science to demonstrate the semantic web potential, but let's say that it is only the beginning of the linked data cloud build by and for scientists.

The WWW2009 workshop Linked Data on the Web (LDOW2009) was held today, I would like to say how important the work of this community is. Finally a last word to congratulate Virtuoso team and especially Orri Erling for his fantastic work with the new Virtuoso 6.0 server soon to be released. I cannot wait to see Bio2RDF data into this amazing engine.

Bio2RDF's map new graphic representation

This word net represents the actual namespace connection between Bio2RDF SPARQL endpoints. RDF datasets which were analyzed comes from Bio2RDF's download page. These representations are generated with Many Eyes visualization tools.


Static version.

This graph represent connections between namespaces of Bio2RDF's network graph of SPARQL endpoints, highlighted orange dots corresponds to Bio2RDF rdfised database.


Static version.

Thursday, April 02, 2009

New Bio2RDF query services

The 0.3 release provides the ability to link to licence providers, so the applicable license for a namespace may be available by following a URL. The URL syntax for this is /license/namespace:identifier . It was easier to require the identifier to be present than to not have it. So far, the identifier portion is not being used, so it merely has to be present for the URL resolution to occur, but in future there is the allowance to have different licenses being given based on the identifier, which is useful for databases which are not completely released under a single license.

We provide countlinks and countlinksns which count the number of reverse links to a particular namespace and identifier, from all namespaces, or from within a given namespaces respectively. Currently these only function on virtuoso endpoints due to their use of aggregation extensions to SPARQL. The URL syntax is /countlinks/namespace:identifier and /countlinksns/targetnamespace/namespace:identifier

There is also the ability to count the number of triples in each SPARQL endpoint that point to a given Bio2RDF URI (or its equivalent identifier for non-Bio2RDF SPARQL endpoints). This ability is provided using /counttriples/namespace:identifier

We also provide search and searchns, which attempt to search globally using SPARQL (aren't currently linked to the rdfiser search pages which may be accessed using certain searchns URI's), or search within a particular namespace for text searches. The searches are all performed using the virtuoso fulltext search paradigm, ie, bif:contains, and other sparql endpoints haven't yet been implemented even with regex because it is reasonably slow but it would be simple to construct a query if people thought it was necessary. The URL syntax is /search/searchTerm and /searchns/targetnamespace:searchTerm

The coverage of each of these queries over the current Bio2RDF namespaces can be found here.

If anyone has any (possibly already SPARQL) queries on biology related databases that they regularly execute that can either be parameterised or turned into Pipes then it would be great to include them in future distributions for others to use.

RDF use and generation improvements

The 0.3 version of the Bio2RDF Servlet implements true RDF handling in the background to provide consistency of output and the potential to support multiple output formats such as NTriples and Turtle in the future, although the only output currently supported is RDF/XML. The Sesame library is being used to provide this functionality.

Provide more RDFiser scripts as part of the source distribution, including Chebi, GO, Homologene, NCBI Geneid, HGNC, OBO and Ecocyc along with guides on the Bio2RDF wiki about how to use the scripts to regenerate new RDF versions using future versions of each database.

Live recent network statistics available

The 0.3 releases provide the ability to show live statistics to diagnose some network issues without having to look at log files. The URL is /admin/stats
  • Shows the last time the internal provider blacklist reset, indicating how much activity is being displayed as the statistics are reset everytime the blacklist is reset. This blacklist is only implemented to prevent malfunctioning queries from being further communicated with.
  • By default shows the IP's accessing the server, with an indication of the total number and duration of their queries. Can be configured in low use and private situations to also show the queries being performed
  • Shows the servers which have been unresponsive since the last blacklist reset including a basic reason, such as an HTTP 503 or 400 error
There is also a live blacklisting functionality provided in version 0.3.2 to prevent crawlers who regularly utilise functionality that they shouldn't according to the Bio2RDF robots.txt file. The settings for this have been set rather high by default, and this functionality can be turned off completely by people who download and install the package and datasets locally. Specifically, a regular user of the public mirrors should make sure that they are not making either more than 40 requests in each 12 minute statistics period, or if they are making more than 40 requests in each 12 minute period, more than 25% of the queries should be for non-Robots.txt queries. These parameters will possibly change depending on further investigation. An individual can access /error/blacklist even if they are not blacklisted currently to show a list of requests from their IP address since the start of the last 12 minute statistics period.

Support provided for more non-Bio2RDF providers

The 0.3 Bio2RDF Servlet release implements support for more non-Bio2RDF SPARQL endpoints such as LinkedCT, DrugBank, Dailymed, Diseasome, Neurocommons, DBPedia, and Flyted/Flybase .

The relevant namespaces for these inside of Bio2RDF are:
  • DBpedia - dbpedia, dbpedia_property, dbpedia_class
  • LinkedCT - linkedct_ontology, linkedct_intervention, linkedct_trials, linkedct_collabagency, linkedct_condition, linkedct_link, linkedct_location, linkedct_overall_official, linkedct_oversight, linkedct_primary_outcomes, linkedct_reference, linkedct_results_reference, linkedct_secondary_outcomes, linkedct_arm_group
  • Dailymed - dailymed_ontology, dailymed_drugs, dailymed_inactiveingredient, dailymed_routeofadministration, dailymed_organization
  • DrugBank - drugbank_ontology, drugbank_druginteractions, drugbank_drugs, drugbank_enzymes, drugbank_drugtype, drugbank_drugcategory, drugbank_dosageforms, drugbank_targets
  • Diseasome - diseasome_ontology, diseasome_diseases, diseasome_genes, diseasome_chromosomallocation, diseasome_diseaseclass
  • Neurocommons - Uses the equivalent Bio2RDF namespaces, with live owl:sameAs links back to the relevant Neurocommons namespaces. Used for pubmed, geneid, taxonomy, mesh, prosite and go so far
  • Flyted/Flybase - Not converted yet, only direct access provided using search functionalities
Provide live owl:sameAs references which match the URI's used in SPARQL queries to keep linkages to the original databases without leaving the Bio2RDF database:identifier paradigm, so if people know the DBPedia, etc., URI's, the link to their current knowledge is given

Some http://database.bio2rdf.org/database:identifier URI's are produced by the owl:sameAs additions, but these aren't standard, and are only shown where there is still at least one SPARQL endpoint available which still uses them. People should utilise the http://bio2rdf.org/database:identifier versions when linking to Bio2RDF.

Any further contributions to this list, or additions of other datasets which already utilise Bio2RDF URI's would be very useful! See the list of namespaces already implemented here.

Provider, query and namespace statistics now available

At the time of posting Bio2RDF supported:
  • 230 namespaces
  • 35 different internal query titles (some of these map to the same URI pattern, so there are not this many URI query options)
  • 140 provider options, including a large number of /html/database:identifier providers which redirect to HTML pages which describe the Bio2RDF Identifier as well as the Bio2RDF SPARQL endpoints
More statistics can be found here

A list of the actual provider URL's mapped back to namespaces and queries can be found by downloading the Bio2RDF Servlet and changing a setting in log4j.properties to make the page more verbose. If the setting were turned on for the public mirrors it would result in a very large file each time.

LSID support for Bio2RDF

From release 0.3.2 of the Bio2RDF Servlet, any URI similar to http://bio2rdf.org/namespace:identifier will be accessible using its equivalent LSID, with http://bio2rdf.org/ as the proxy, using http://bio2rdf.org/urn:lsid:bio2rdf.org:namespace:identifier . The LSID syntax will not be available for use with custom services such as http://bio2rdf.org/links/namespace:identifier or http://bio2rdf.org/search/searchterm.

This will NOT become the standard identifier, but it provides compatibility with some users who wish to utilise LSID's.