A Bio2RDF URI is formed by taking a datasource and assigning a namespace prefix to it. The prefix is a string that may only contain letters, numbers, the underscore (_), and the hyphen (-). The unique identifier for each object inside the namespace, which acts as the primary key for that object, is then combined with the namespace prefix to make up the Bio2RDF URI, http://bio2rdf.org/namespaceprefix:identifier. As an example, suppose a user wants to find information about Propranolol, and they know there is a Wikipedia article about the topic. Since DBpedia mirrors the Wikipedia structure and represents it using RDF, they could go to http://bio2rdf.org/dbpedia:Propranolol.
If the user then wants to find out where the Wikipedia article on Propranolol is referenced in other databases, they can go to http://bio2rdf.org/links/dbpedia:Propranolol (this may take a long time given the number of databases that are queried). If they only need to find out where the article is referenced in DrugBank, they can use http://bio2rdf.org/linksns/drugbank_drugs/dbpedia:Propranolol (this should be much quicker because fewer databases are queried).
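The URI pattern described above can be sketched in a few lines of Python. The function and constant names are hypothetical, chosen only for illustration, and the sketch assumes that "letters" means ASCII letters; only the base URL and the allowed prefix characters come from the description.

```python
import re

BASE_URL = "http://bio2rdf.org/"
# Prefixes may only contain letters, numbers, the underscore and the hyphen.
# Assumption: "letters" is read here as ASCII letters.
PREFIX_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def bio2rdf_uri(prefix: str, identifier: str) -> str:
    """Build http://bio2rdf.org/namespaceprefix:identifier (hypothetical helper)."""
    if not PREFIX_PATTERN.fullmatch(prefix):
        raise ValueError(f"invalid namespace prefix: {prefix!r}")
    if not identifier:
        raise ValueError("identifier must not be empty")
    return f"{BASE_URL}{prefix}:{identifier}"


print(bio2rdf_uri("dbpedia", "Propranolol"))
# prints http://bio2rdf.org/dbpedia:Propranolol
```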
There is also search functionality embedded into the Bio2RDF system. Searches can be conducted on particular namespaces, or across the entire Bio2RDF system. If a user wants to conduct a search on namespace "chebi" for instance, and they want to search for "propanolol", they could go to http://bio2rdf.org/searchns/chebi/propanolol. If they then also wish to search for "propanolol" including the other namespaces they can go to http://bio2rdf.org/search/propanolol (this may be slow because of the number of databases that are available for search).
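The link and search URLs described so far follow a simple pattern, which the sketch below builds. The helper names are hypothetical, and the sketch assumes simple search terms and identifiers that need no escaping.

```python
from typing import Optional

BASE_URL = "http://bio2rdf.org/"


def links_url(item: str, namespace: Optional[str] = None) -> str:
    """URL listing where an item is referenced, optionally in one namespace only."""
    if namespace is None:
        return f"{BASE_URL}links/{item}"
    return f"{BASE_URL}linksns/{namespace}/{item}"


def search_url(term: str, namespace: Optional[str] = None) -> str:
    """URL searching all namespaces, or a single namespace if one is given."""
    if namespace is None:
        return f"{BASE_URL}search/{term}"
    return f"{BASE_URL}searchns/{namespace}/{term}"


print(links_url("dbpedia:Propranolol", namespace="drugbank_drugs"))
print(search_url("propanolol", namespace="chebi"))
```

Restricting a request to one namespace is the cheaper choice whenever the target database is already known, since fewer databases have to be queried.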
If a namespace has been configured with the ability to redirect to its original interface, the redirection can be triggered by sending users to http://bio2rdf.org/html/namespace:identifier. For example, a user might be interested in http://bio2rdf.org/drugbank_drugs:DB00571 (the DrugBank identifier for Propranolol), and want to see the original DrugBank interface. They could then go to http://bio2rdf.org/html/drugbank_drugs:DB00571 and their browser would be redirected to the description of that drug on the original DrugBank interface. Although not all namespaces have their original HTML interfaces encoded into the Bio2RDF system, some do, and it is a useful way of getting back to the non-RDF web.
If someone is interested in taking the Bio2RDF RDF versions and using them internally, they can request either of the supported RDF formats (RDF/XML and N3) by adding /rdfxml/ or /n3/ to the front of any of the URLs they desire. Each of the URIs linked in this post requests the Bio2RDF HTML version (the /page/ format), but they can equivalently be requested using, for example, http://bio2rdf.org/rdfxml/linksns/drugbank_drugs/dbpedia:Propranolol for RDF/XML or http://bio2rdf.org/n3/search/propanolol for N3.
There are also advanced features for people wanting to determine the provenance of particular documents, since RDF doesn't natively support provenance for individual statements when multiple sources are merged into single documents, as Bio2RDF does. If the user wishes to know which sources of information were used in a particular document, they can insert /queryplan/ at the start of the URI in order to get its provenance information, for example http://bio2rdf.org/queryplan/linksns/drugbank_drugs/dbpedia:Propranolol. This information is returned as a set of objects, including Query Types, Providers and Namespaces, among other things. It can then be used to recreate the exact set of queries, both SPARQL and otherwise, that were used to access the information, as long as the user has access to all of the provider endpoints in the query plan. In order to replicate the queries, users could perform a SPARQL query on the resulting document such as "SELECT ?endpoint ?query WHERE { ?queryBundle a <http://bio2rdf.org/ns/querybundle:QueryBundle> . ?queryBundle <http://bio2rdf.org/ns/querybundle:hasQueryLiteral> ?query . ?queryBundle <http://bio2rdf.org/ns/querybundle:hasQueryBundleEndpoint> ?endpoint . }". Replaying these queries may not return exactly the same results, as there are also normalisation rules, which require knowledge of the Provider configuration in use (all of which is included in the document). To get these, a more advanced query would be required, one that references the http://bio2rdf.org/ns/querybundle:hasProviderConfigurationUri predicate that is also attached to each querybundle, in order to determine which Provider was being used and which RDF Normalisation rules (the predicate to query for is http://bio2rdf.org/ns/provider:needsRdfNormalisation) were required by that provider configuration.
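For readability, the query quoted above is laid out below one triple pattern per line, followed by a rough sketch of what the "more advanced" query might look like. The second query is an assumption: it supposes that hasProviderConfigurationUri points at a resource which, in the same queryplan document, carries the needsRdfNormalisation predicate. The variable names are hypothetical, and running either query needs a SPARQL engine, which is not shown here.

```python
QUERYBUNDLE_NS = "http://bio2rdf.org/ns/querybundle:"
PROVIDER_NS = "http://bio2rdf.org/ns/provider:"

# The query from the text, reformatted only.
ENDPOINT_QUERY = f"""
SELECT ?endpoint ?query WHERE {{
  ?queryBundle a <{QUERYBUNDLE_NS}QueryBundle> .
  ?queryBundle <{QUERYBUNDLE_NS}hasQueryLiteral> ?query .
  ?queryBundle <{QUERYBUNDLE_NS}hasQueryBundleEndpoint> ?endpoint .
}}
"""

# Assumed extension: also fetch the provider configuration and any
# normalisation rules attached to it. The shape of the data is a guess.
PROVIDER_QUERY = f"""
SELECT ?endpoint ?query ?providerConfig ?rule WHERE {{
  ?queryBundle a <{QUERYBUNDLE_NS}QueryBundle> .
  ?queryBundle <{QUERYBUNDLE_NS}hasQueryLiteral> ?query .
  ?queryBundle <{QUERYBUNDLE_NS}hasQueryBundleEndpoint> ?endpoint .
  ?queryBundle <{QUERYBUNDLE_NS}hasProviderConfigurationUri> ?providerConfig .
  OPTIONAL {{ ?providerConfig <{PROVIDER_NS}needsRdfNormalisation> ?rule . }}
}}
"""

print(PROVIDER_QUERY)
```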
If there are too many results to return in one hit from a particular endpoint, the results given to the user will not be complete. Although there is currently no way of signalling this to users in the RDF document, users can manually inspect the queryplan to determine what the maximum will be. If the number of results is equal to or greater than this number, they can request subsequent offsets using the /pageoffsetNN/ mechanism, where NN is one or more digits indicating which page of results is being requested. /pageoffset32/ for instance would be interpreted as the 32nd page of results, while /pageoffset1/ is the first page, which is the default if nothing is specified. Each pageoffset may not return the same number of results, because the resolution is implemented by distributing queries across endpoints, it is not efficient (or possible in some cases) to query endpoints for the number of results before getting the information, and there is no natural ordering between the results returned by different endpoints. The resolver should be interpreted to be returning at least NNNN results (the configured limit) from each endpoint where possible, and the distinct set of RDF statements that occur in these results is included in the document that is shown to the user.
The default limit for the Bio2RDF system is currently 2000, so users who receive more than 2000 results know that they may be able to request the next pageoffset, i.e. /pageoffset2/ and so on, in order to retrieve more results if possible. Some queries may not include the limit as part of the query, and hence will not return different results for each pageoffset, so users should be careful not to request too many pageoffsets for this reason. The HTML interface for paging requests a maximum of 20 pageoffsets if needed, so links to the other pageoffsets are not picked up by robots (although /pageoffsetNN/ links should not be followed by robots, as specified in the Bio2RDF robots.txt file).
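The paging behaviour described above suggests a simple client-side loop, sketched below. The callback `fetch_page` is hypothetical: it stands for whatever code retrieves the statements at a given /pageoffsetNN/ for one fixed query. The two stopping rules are heuristics derived from the description, not behaviour guaranteed by the server.

```python
from typing import Callable, Iterable, Set

DEFAULT_LIMIT = 2000  # default per-endpoint limit mentioned in the text
MAX_PAGES = 20        # same cap the HTML paging interface uses


def fetch_all_pages(fetch_page: Callable[[int], Iterable[str]],
                    limit: int = DEFAULT_LIMIT,
                    max_pages: int = MAX_PAGES) -> Set[str]:
    """Collect the distinct statements found across successive pageoffsets."""
    seen: Set[str] = set()
    for page in range(1, max_pages + 1):
        statements = set(fetch_page(page))
        new_statements = statements - seen
        seen |= statements
        # Fewer results than the limit: there is probably no further page.
        if len(statements) < limit:
            break
        # Nothing new: the query probably ignores the limit and offset,
        # so later pages would only repeat the same results.
        if not new_statements:
            break
    return seen
```

Because results are merged from several endpoints without a shared ordering, the result set is treated as an unordered set of distinct statements rather than a list.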
The pageoffset can be combined with other instructions about the format and whether the query plan is required. The parts go in the following order, with each part optional except for the query: /FORMAT/queryplan/pageoffsetNN/query. Here /FORMAT/ can be /rdfxml/, /n3/ or /page/, /queryplan/ is used to get the information about how the query would be resolved without performing the query, and the NN in the pageoffset section determines which page to resolve. For example, the HTML version of the queryplan for the 2nd pageoffset for the "linksns/drugbank_drugs/dbpedia:Propranolol" query can be found using http://bio2rdf.org/page/queryplan/pageoffset2/linksns/drugbank_drugs/dbpedia:Propranolol. A known issue is that the URL links to the RDF/XML and N3 versions at the bottom of the HTML page will request the actual query instead of the queryplan, and will also omit the pageoffset. This will be fixed in a future version, but a URL constructed in the correct way will still work for now.
Because of the way the HTML redirections have been built into the system, requesting the queryplan for an HTML redirection, encoded in N3, looks like /n3/queryplan/html/drugbank_drugs:DB00571. The query in this case is "html/drugbank_drugs:DB00571", and the other parts define the result format and the provenance record being requested respectively. The full URL is http://bio2rdf.org/n3/queryplan/html/drugbank_drugs:DB00571
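The ordering of the optional parts can be captured in a small URL builder. This is only a sketch: the function name and parameters are hypothetical, while the part order and the three format names are taken from the description above.

```python
from typing import Optional

BASE_URL = "http://bio2rdf.org/"
FORMATS = {"rdfxml", "n3", "page"}


def compose_url(query: str, fmt: Optional[str] = None,
                queryplan: bool = False,
                pageoffset: Optional[int] = None) -> str:
    """Assemble /FORMAT/queryplan/pageoffsetNN/query; every part but query is optional."""
    parts = []
    if fmt is not None:
        if fmt not in FORMATS:
            raise ValueError(f"unsupported format: {fmt!r}")
        parts.append(fmt)
    if queryplan:
        parts.append("queryplan")
    if pageoffset is not None:
        if pageoffset < 1:
            raise ValueError("pageoffset starts at 1")
        parts.append(f"pageoffset{pageoffset}")
    parts.append(query)
    return BASE_URL + "/".join(parts)


print(compose_url("linksns/drugbank_drugs/dbpedia:Propranolol",
                  fmt="page", queryplan=True, pageoffset=2))
```

Building the URL this way also sidesteps the known issue with the links at the bottom of the HTML page, since the queryplan and pageoffset parts are always included when requested.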
Sunday, August 16, 2009
Monday, July 20, 2009
The story so far of Linked Data, Bio2RDF is part of it !
In his latest publication, Tim Berners-Lee tells the recent story of emerging Linked Data, and Bio2RDF is mentioned as an important biology contributor. This paper is a must-read for anyone interested in this fantastic new approach.
http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf
In this map of Linked Data, the Bio2RDF contribution is shown in purple. The corresponding SPARQL endpoints are available here:
http://delicious.com/tag/bio2rdf:sparql
Wednesday, July 01, 2009
Bio2RDF is now using Virtuoso 6 and its new facet browser
Bio2RDF is moving from Virtuoso 5 to Virtuoso 6 server. The new software supports facet browsing in real time.
We invite you to explore our graph with a full text search query for hexokinase. Once the results list is shown try the options in the right menu. Enjoy the discovery experience.
Try the 2009 version of "Atlas about Human and Mouse" :
http://atlas.bio2rdf.org/fct/
The graph can also be queried in SPARQL:
http://atlas.bio2rdf.org/sparql
The list of Bio2RDF converted graphs will be published and updated here:
The facet browsers list :
http://delicious.com/tag/bio2rdf:fct
The SPARQL endpoints list:
http://delicious.com/tag/bio2rdf:sparql
Bio2RDF visit at HCLS annual meeting
Bio2RDF team members Marc-Alexandre Nolin, Michel Dumontier and Francois Belleau were invited to present the current state of the Bio2RDF project at the annual face-to-face meeting of the HCLS community. Here is a link to the presentation:
http://www.slideshare.net/fbelleau/bio2rdf-w3c-hcls2009
Thanks to the organizers of the event.
Monday, June 29, 2009
0.6.1 bug fix release now available
A maintenance release, version 0.6.1, was released today on sourceforge [1]. It fixes a few coding bugs in the 0.6.0 release: a bug relating to the namespace match method "all"; the rdf rule order not being imported from the configuration properly, which meant that queries relying on more than one rule got no results back; and included static RDF/XML sections not actually being included. There was also a fix related to default providers that eliminates duplicate queries for namespaces where a namespace was assigned to a default provider for a query that allowed default providers.
The configuration files have also been updated, although people using the live configuration method (the default) would have received the configuration changes already. Some performance improvements related to logging have also been made that in some circumstances will dramatically improve the performance of the package, although most of the overall request latency is still due to internet latency for the SPARQL queries.
From this version on, I will also be releasing MD5 hashes for each of the downloaded files so people can check that their downloaded file matches the release on sourceforge.
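Checking a download against a published MD5 hash takes only a few lines with Python's standard hashlib module. The function names below are hypothetical, and the file path and published hash are whatever the user has at hand.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, published_md5: str) -> bool:
    """Compare a local file against the hash published with the release."""
    return md5_of_file(path) == published_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for large release archives.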
[1] https://sourceforge.net/project/platformdownload.php?group_id=142631
Tuesday, June 23, 2009
Version 0.6.0 of the Bio2RDF server software released
The next version of the Bio2RDF software, version 0.6.0, was released today on sourceforge [1].
It has some major feature additions over the previous version, with the highlights being an RDF based configuration, the ability to update the configuration while the server is running, and support for sophisticated profiles so that users can pick and choose sources without having to change the basic configuration sources that are shared between different users. If users want to add or subtract from the base configuration they can create a small RDF file on their server and use that file to pick which sources they want to use and which queries they want to be able to execute.
If anyone wants to check out the example [2] and use it as a guide to mock up some SPARQL queries or definitions for endpoints that go with the queries it would be great to see what other resources we can combine into the global Bio2RDF configuration. If you need pointers in how to get your own configuration working feel free to ask me.
[1] https://sourceforge.net/project/platformdownload.php?group_id=142631
[2] http://bio2rdf.wiki.sourceforge.net/sample+configuration
Friday, May 08, 2009
Version 0.5.0 of the Bio2RDF server software released
The next version of the server software has been released on sourceforge. [1]
It contains a number of changes that will hopefully make it more useful for the tasks we want to do with linked rdf queries.
One major change is the introduction of content negotiation, which has been tested for N3 (using text/rdf+n3) and RDF/XML (using application/rdf+xml). It was made possible this quickly after the last release by the use of the content negotiation code from Pubby, the driver behind the DBpedia web interface and URI resolution mechanism. It is also currently possible to explicitly get to the N3 format by prefixing the URL with /n3/. See [2] for an example. The ability to explicitly get to the RDF/XML will be added in future.
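From the client side, content negotiation just means sending an Accept header with one of the two media types named above. The sketch below only constructs the request using Python's standard urllib. The helper name is hypothetical, and the example URL is an assumption derived from [2] by dropping the /n3/ prefix.

```python
import urllib.request

# Media types named in the text for the two tested formats.
ACCEPT_TYPES = {"n3": "text/rdf+n3", "rdfxml": "application/rdf+xml"}


def negotiated_request(url: str, fmt: str) -> urllib.request.Request:
    """Build a request asking for one RDF format via the Accept header."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


request = negotiated_request("http://qut.bio2rdf.org/geneid:14456", "n3")
print(request.get_header("Accept"))  # prints text/rdf+n3
# To send it: urllib.request.urlopen(request).read()
```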
Another change that will hopefully be useful is the introduction of clear RDF level error messages when either the syntax of a URI is not recognised, or the syntax was recognised but there were no providers that were relevant to the URI. See [3], [4] and [5] for a demonstration of the error messages.
There is also the ability to page through the results, which is necessary when there are more than 2000 results to a query from a particular endpoint. To use the paging facility the URI needs to be prefixed by /pageoffsetNN/, where NN is a number indicating which page you would like to look at. The queries are not ordered currently, but in the short term it would be reasonable to believe that they should be consistent enough to get through all of the results. Ordered queries take a lot longer than unordered queries, so it is unlikely that the public mirrors will ever introduce ordered queries. An example of the paging URL could be [6] or [7].
There is also the ability to get an RDF document describing what actions would be taken for a particular query. It is interoperable with the /n3/ and /pageoffsetNN/ URI manipulations so URI's like [8] can be made up and resolved. This RDF document is setup to contain all of the necessary information for the client to then complete the query with their own network resources if necessary. In future, clients should be able to patch into this functionality without having to keep a local copy of the configuration on hand, although a distributed configuration idea is also in the works for sometime in the future. Currently the distribution is readonly from [9]. The [9] URL has also been made content negotiable for HTML/RDFXML/N3 content types, with a default to HTML if the content type is not recognised by the Sesame Rio library, but it can still be accessed in a particular format without content negotiation by appending /html /n3 or /rdfxml .
Since the last release the GeoSpecies dataset has also been partially integrated, although it doesn't seem to have a sparql endpoint so currently it is only available for basic construct queries. [10] Not all of the namespaces inside the geospecies dataset have rules for normalisation to Bio2RDF URI syntax, but the rest will be integrated eventually.
The order of normalisation rules is now respected when applying them, with lower numbers being applied before higher numbers. Numbers with the same order cannot be relied on to be applied in a consistent manner if they overlap syntactically.
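To see why rule order matters, consider the sketch below. Everything in it is hypothetical: the `NormalisationRule` class, the regular-expression form of the rules, and the example.org URIs are illustrations only, not the actual Bio2RDF configuration format.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NormalisationRule:
    order: int        # lower numbers are applied first
    pattern: str      # regular expression to search for
    replacement: str  # text substituted for each match


def apply_rules(text: str, rules: Iterable[NormalisationRule]) -> str:
    """Apply rules in ascending order; ties have no guaranteed meaning."""
    for rule in sorted(rules, key=lambda r: r.order):
        text = re.sub(rule.pattern, rule.replacement, text)
    return text


rules = [
    # Listed out of order on purpose; sorting by `order` fixes that.
    NormalisationRule(20, re.escape("http://bio2rdf.org/resource/"),
                      "http://bio2rdf.org/example:"),
    NormalisationRule(10, re.escape("http://example.org/"),
                      "http://bio2rdf.org/"),
]
print(apply_rules("http://example.org/resource/123", rules))
# prints http://bio2rdf.org/example:123
```

Here rule 20 only matches the output of rule 10, so applying them in the opposite order would leave the URI half normalised.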
The MyExperiment SPARQL endpoint [11] has also been integrated into Bio2RDF since the last release, so for instance, a user in the MyExperiment system can be resolved using [12], but there are also other things like workflows which could in the future provide valuable interconnections for the linked rdf web. Further integration with MyExperiment would be invaluable to the future of the Bio2RDF network I think.
Partial support for InChI resolution has also made it into this release, although there are some syntax bugs with rdf.openmolecules.net that stop Sesame being able to parse the resulting RDF/XML, so the InChIs are only being resolved using pubchem so far. Some InChIs, particularly those which contain + signs, will also be unresolvable for the time being, because the Apache HTTPD, Apache Tomcat and URLRewrite stack we are using URL-decodes the plus signs to spaces somewhere along the line, and it is hard to figure out what configuration is needed to avoid it happening. It was hard enough figuring out how to make encoded slashes (%2F) usable inside identifiers (they need to be double encoded as %252F to avoid detection by the HTTPD/Tomcat/URLRewrite algorithms), so I am not sure what progress will be made with the plus signs in the near future.
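The double encoding of slashes can be reproduced with Python's standard urllib.parse.quote. The helper name is hypothetical, and the methane InChI is used only as a convenient identifier containing slashes. The plus-sign problem described above is not addressed here.

```python
from urllib.parse import quote

# Encoding "/" once gives "%2F"; encoding that again gives "%252F".
SINGLE_ENCODED_SLASH = quote("/", safe="")
DOUBLE_ENCODED_SLASH = quote(SINGLE_ENCODED_SLASH, safe="")


def encode_slashes(identifier: str) -> str:
    """Replace each slash in an identifier with its double-encoded form."""
    return identifier.replace("/", DOUBLE_ENCODED_SLASH)


print(DOUBLE_ENCODED_SLASH)                 # prints %252F
print(encode_slashes("InChI=1S/CH4/h1H4"))  # prints InChI=1S%252FCH4%252Fh1H4
```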
DOI resolution has also been integrated from both the Uniprot Citations database and the BioGuid.info, but will likely only be fully useful for science related DOI's I think.
There are currently 368 namespaces known by the server software for Bio2RDF, with 231 information provider configurations (although the real number of providers is less than this, due to duplication on a few providers to enable reverseconstruct and unpercentencoded queries where necessary). The number of combinations that are currently encapsulated by the server configuration can be found at [13].
It is hard to believe so much could be packed into a new release two weeks after the last release!
See the complete list of changes at [14].
If anyone has alternative configurations that they have made up using the software I am more than willing to include them in the distribution so others can utilise them. The configuration file syntax is still in flux, and won't likely become stable until the 1.0 release, but it is mostly additions to support new features, so configurations based on older software versions are still useful and able to be migrated to the new scheme.
[1] https://sourceforge.net/project/platformdownload.php?group_id=142631
[2] http://qut.bio2rdf.org/n3/geneid:14456
[3] http://qut.bio2rdf.org/dummyquery/go:0004535
[4] http://qut.bio2rdf.org/image/geneid:14936
[5] http://qut.bio2rdf.org/GO:0004535
[6] http://qut.bio2rdf.org/n3/pageoffset2/chr:10090-chr14
[7] http://qut.bio2rdf.org/pageoffset2/chr:10090-chr14
[8] http://qut.bio2rdf.org/n3/queryplan/pageoffset2/chr:10090-chr14
[9] http://qut.bio2rdf.org/admin/configuration
[10] http://qut.bio2rdf.org/geospecies_bioclass:13
[11] http://rdf.myexperiment.org/sparql
[12] http://qut.bio2rdf.org/myexp_user:1177
[13] http://qut.bio2rdf.org/admin/namespaceproviders
[14] http://bio2rdf.wiki.sourceforge.net/Road+map