Sunday, August 16, 2009

HOWTO: Using Bio2RDF

A Bio2RDF URI is formed by taking a datasource and assigning a namespace prefix to it. The prefix is a string that may only contain letters, numbers, the underscore (_), and the hyphen (-). The unique identifier for each object inside the namespace, which acts as the primary key for that object, is then appended to the namespace prefix to make up the Bio2RDF URI, http://bio2rdf.org/namespaceprefix:identifier. In this example a user wants to find information about Propranolol, and they know there is a Wikipedia article about the topic. Since DBpedia mirrors the Wikipedia structure and represents it using RDF, they could go to http://bio2rdf.org/dbpedia:Propranolol.
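As a minimal sketch of the URI pattern described above, the following Python builds a Bio2RDF URI from a prefix and an identifier and rejects prefixes containing characters outside the allowed set. The function name `bio2rdf_uri` and the constant names are hypothetical and only serve the illustration; they are not part of Bio2RDF itself.

```python
import re

BASE = "http://bio2rdf.org/"

# Letters, digits, underscore and hyphen only, following the rule stated above.
PREFIX_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def bio2rdf_uri(prefix: str, identifier: str) -> str:
    """Return http://bio2rdf.org/prefix:identifier (hypothetical helper)."""
    if not PREFIX_PATTERN.fullmatch(prefix):
        raise ValueError(f"invalid namespace prefix: {prefix!r}")
    return f"{BASE}{prefix}:{identifier}"


if __name__ == "__main__":
    print(bio2rdf_uri("dbpedia", "Propranolol"))
```

The identifier is passed through unchanged here, since the post does not say which characters an identifier may contain.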

If the user then wants to find out where the Wikipedia article on Propranolol is referenced in other databases, they can go to http://bio2rdf.org/links/dbpedia:Propranolol (this may take a long time given the number of databases that are queried). If they know they only need to find out where the article is referenced in DrugBank, they can use http://bio2rdf.org/linksns/drugbank_drugs/dbpedia:Propranolol (this should be much quicker because fewer databases are queried).
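The two link lookups differ only in whether a target namespace is given, which a small helper can make explicit. This is a sketch only; `links_url` and its parameter names are hypothetical.

```python
from typing import Optional

BASE = "http://bio2rdf.org/"


def links_url(resource: str, target_namespace: Optional[str] = None) -> str:
    """Build a links URL for a 'prefix:identifier' resource (hypothetical helper).

    Without a target namespace the lookup covers every database (/links/);
    with one it is restricted to that namespace (/linksns/NAMESPACE/).
    """
    if target_namespace is None:
        return f"{BASE}links/{resource}"
    return f"{BASE}linksns/{target_namespace}/{resource}"


if __name__ == "__main__":
    print(links_url("dbpedia:Propranolol"))
    print(links_url("dbpedia:Propranolol", "drugbank_drugs"))
```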

There is also search functionality embedded into the Bio2RDF system. Searches can be conducted on particular namespaces, or across the entire Bio2RDF system. If a user wants to search the namespace "chebi", for instance, for the term "propanolol", they could go to http://bio2rdf.org/searchns/chebi/propanolol. If they then wish to search for "propanolol" across all of the other namespaces as well, they can go to http://bio2rdf.org/search/propanolol (this may be slow because of the number of databases that are available for search).
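The search URLs follow the same shape and can be sketched as below. The helper name `search_url` is hypothetical, and percent-encoding the search term is an assumption made here so that terms with spaces produce a valid URL; the post does not say how Bio2RDF expects such terms to be encoded.

```python
from typing import Optional
from urllib.parse import quote

BASE = "http://bio2rdf.org/"


def search_url(term: str, namespace: Optional[str] = None) -> str:
    """Build a search URL, optionally limited to one namespace (hypothetical helper)."""
    # Assumption: the term is percent-encoded; plain words pass through unchanged.
    encoded = quote(term, safe="")
    if namespace is None:
        return f"{BASE}search/{encoded}"
    return f"{BASE}searchns/{namespace}/{encoded}"


if __name__ == "__main__":
    print(search_url("propanolol", "chebi"))
    print(search_url("propanolol"))
```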

If a namespace has been configured with the ability to redirect to its original interface, the redirection can be triggered by sending users to http://bio2rdf.org/html/namespace:identifier. For example, a user might be interested in http://bio2rdf.org/drugbank_drugs:DB00571 (the DrugBank identifier for Propranolol) and want to see the original DrugBank interface. They could then go to http://bio2rdf.org/html/drugbank_drugs:DB00571 and their browser would be redirected to the description of that drug on the original DrugBank interface. Not all namespaces have their original HTML interfaces encoded into the Bio2RDF system, but some do, and it is a useful way of getting back to the non-RDF web.

If someone is interested in taking the Bio2RDF RDF versions and using them internally, they can request either of the supported RDF formats (RDF/XML and N3) by adding /rdfxml/ or /n3/ to the front of any of the URLs they desire. Each of the links given in this post so far requests the Bio2RDF HTML version, as /page/ does, but the same queries can equivalently be requested as RDF/XML or N3, for example http://bio2rdf.org/rdfxml/linksns/drugbank_drugs/dbpedia:Propranolol or http://bio2rdf.org/n3/search/propanolol respectively.

There are also advanced features for people wanting to determine the provenance of particular documents, since RDF doesn't natively support provenance for individual statements when multiple sources are merged into single documents, as Bio2RDF does. If the user wishes to know which sources of information were used in a particular document, they can insert /queryplan/ at the start of the URI in order to get its provenance information, for example http://bio2rdf.org/queryplan/linksns/drugbank_drugs/dbpedia:Propranolol. This information is returned as a set of objects, including Query Types, Providers and Namespaces, among other things. It can then be used to recreate the exact set of queries, both SPARQL and otherwise, that were used to access the information, as long as the user has access to all of the provider endpoints in the query plan.

In order to replicate the queries, users could perform a SPARQL query on the resulting document such as "SELECT ?endpoint ?query WHERE { ?queryBundle a <http://bio2rdf.org/ns/querybundle:QueryBundle> . ?queryBundle <http://bio2rdf.org/ns/querybundle:hasQueryLiteral> ?query . ?queryBundle <http://bio2rdf.org/ns/querybundle:hasQueryBundleEndpoint> ?endpoint . }". Replaying the queries found this way may not return exactly the same results, as there are also normalisation rules, which require knowledge of the Provider configuration in use (all of which is included in the document). To get these, a more advanced query would be required. It would reference the http://bio2rdf.org/ns/querybundle:hasProviderConfigurationUri predicate, which is also attached to each querybundle, in order to determine which Provider was being used, and which RDF normalisation rules (the predicate to query for is http://bio2rdf.org/ns/provider:needsRdfNormalisation) were required by that provider configuration.
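For readability, the SPARQL query quoted above is laid out over several lines in the sketch below, alongside a helper that builds the matching /queryplan/ URL. The names `QUERY_BUNDLE_SPARQL` and `queryplan_url` are hypothetical; the predicates and class URI are copied from the query above. Running the query is left out because it needs a SPARQL-capable RDF library, which is outside the standard library.

```python
BASE = "http://bio2rdf.org/"

# The query from the post, reformatted only; no terms have been added.
QUERY_BUNDLE_SPARQL = """
SELECT ?endpoint ?query
WHERE {
  ?queryBundle a <http://bio2rdf.org/ns/querybundle:QueryBundle> .
  ?queryBundle <http://bio2rdf.org/ns/querybundle:hasQueryLiteral> ?query .
  ?queryBundle <http://bio2rdf.org/ns/querybundle:hasQueryBundleEndpoint> ?endpoint .
}
""".strip()


def queryplan_url(query: str) -> str:
    """Return the URL of the query plan for a Bio2RDF query path (hypothetical helper)."""
    return f"{BASE}queryplan/{query}"


if __name__ == "__main__":
    print(queryplan_url("linksns/drugbank_drugs/dbpedia:Propranolol"))
    print(QUERY_BUNDLE_SPARQL)
```

With a library such as rdflib, one would typically parse the query plan document into a graph and pass the query string to the graph's query method. That step is an assumption about tooling, not something the Bio2RDF system requires.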

If there are too many results to return in one hit from a particular endpoint, the results given to the user will not be complete. Although there is currently no way of signalling this to users in the RDF document, users can manually inspect the queryplan to determine what the maximum will be. If the number of results is equal to or greater than this number, they can request subsequent offsets using the /pageoffsetNN/ mechanism, where NN is one or more digits indicating which page of results is being requested. /pageoffset32/, for instance, would be interpreted as the 32nd page of results, while /pageoffset1/ is the first page, which is the default if nothing is specified. Each pageoffset may not return the same number of results, because the resolution is implemented by distributing queries across endpoints. It is not efficient (or possible in some cases) to query endpoints for the number of results before getting the information, and there is no natural ordering between the results returned by different endpoints. The resolver should be interpreted to be returning at least NNNN results from each endpoint where possible, and the distinct set of RDF statements that occur in these results is included in the document that is shown to the user.

The default limit for the Bio2RDF system is currently 2000, so if users receive 2000 or more results they know that they may be able to request the next pageoffset, i.e., /pageoffset2/, etc., in order to retrieve more results if possible. Some queries may not include the limit as part of the query, and hence they will not return different results for each pageoffset, so users should be careful that they don't request too many pageoffsets for this reason. The HTML interface for paging requests a maximum of 20 pageoffsets if needed, so links to the other pageoffsets are not picked up by robots (although /pageoffsetNN/ links should not be followed by robots, as specified in the Bio2RDF robots.txt file).
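The paging rule above can be sketched as a small decision function: request another page only when the previous page reached the limit. The helper names are hypothetical, the limit of 2000 is the default stated above, and reusing the HTML interface's cap of 20 pages as a client-side safety stop is an assumption. How a client counts "results" in a returned document is also left open here.

```python
from typing import Optional

BASE = "http://bio2rdf.org/"
DEFAULT_LIMIT = 2000  # default limit stated in the post
MAX_PAGES = 20        # assumption: reuse the HTML interface's cap as a client-side stop


def pageoffset_url(query: str, page: int) -> str:
    """Build the URL for a given page of results (pages start at 1)."""
    if page < 1:
        raise ValueError("pageoffset starts at 1")
    return f"{BASE}pageoffset{page}/{query}"


def next_pageoffset(current_page: int, result_count: int,
                    limit: int = DEFAULT_LIMIT,
                    max_pages: int = MAX_PAGES) -> Optional[int]:
    """Return the next page to request, or None if paging should stop."""
    if current_page < 1:
        raise ValueError("pageoffset starts at 1")
    # Fewer results than the limit suggests the endpoint had nothing more to give.
    if result_count < limit or current_page >= max_pages:
        return None
    return current_page + 1


if __name__ == "__main__":
    page = next_pageoffset(1, 2000)
    if page is not None:
        print(pageoffset_url("links/dbpedia:Propranolol", page))
```

The cap matters because, as noted above, queries that ignore the limit return the same results for every pageoffset, so an uncapped loop would never see a short page.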

The pageoffset can be included together with other instructions about the format and whether the query plan is required, in the following order, with each part optional except for the query: /FORMAT/queryplan/pageoffsetNN/query. Here /FORMAT/ can be /rdfxml/, /n3/ or /page/, /queryplan/ is used to get the information about how the query would be resolved without performing the query, and the NN in the pageoffset section determines which page to resolve. For example, the HTML version of the queryplan for the 2nd pageoffset for the "linksns/drugbank_drugs/dbpedia:Propranolol" query can be found using http://bio2rdf.org/page/queryplan/pageoffset2/linksns/drugbank_drugs/dbpedia:Propranolol. A known issue is that the URL links to the RDF/XML and N3 versions at the bottom of the HTML page will request the actual query instead of the queryplan, and will also omit the pageoffset. This will be fixed in a future version, but a URL constructed by hand in the correct way will still work.

Because of the way the HTML redirections have been included into the system, requesting the queryplan for the HTML redirection encoded in N3 looks like /n3/queryplan/html/drugbank_drugs:DB00571, since the query in this case is "html/drugbank_drugs:DB00571", and the other parts are used to define the result format and provenance record being required respectively. http://bio2rdf.org/n3/queryplan/html/drugbank_drugs:DB00571
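The /FORMAT/queryplan/pageoffsetNN/query ordering described above can be captured in one helper that assembles the optional parts in the required order. This is a sketch under the rules stated in this post; `build_url` and its parameter names are hypothetical.

```python
from typing import Optional

BASE = "http://bio2rdf.org/"
FORMATS = ("rdfxml", "n3", "page")  # the three formats named in the post


def build_url(query: str, fmt: Optional[str] = None,
              queryplan: bool = False,
              pageoffset: Optional[int] = None) -> str:
    """Assemble /FORMAT/queryplan/pageoffsetNN/query, skipping omitted parts."""
    parts = []
    if fmt is not None:
        if fmt not in FORMATS:
            raise ValueError(f"unsupported format: {fmt!r}")
        parts.append(fmt)
    if queryplan:
        parts.append("queryplan")
    if pageoffset is not None:
        if pageoffset < 1:
            raise ValueError("pageoffset starts at 1")
        parts.append(f"pageoffset{pageoffset}")
    parts.append(query)  # the query itself is the only mandatory part
    return BASE + "/".join(parts)


if __name__ == "__main__":
    print(build_url("linksns/drugbank_drugs/dbpedia:Propranolol",
                    fmt="page", queryplan=True, pageoffset=2))
    print(build_url("html/drugbank_drugs:DB00571", fmt="n3", queryplan=True))
```

Treating the HTML redirection as just another query ("html/drugbank_drugs:DB00571") means the same helper covers it without a special case.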