- The document discusses OpenFlyData, a project that integrates biological data from multiple sources using Semantic Web technologies like RDF and SPARQL. It describes applications that allow searching gene expression data across databases.
- Key challenges addressed are that biological data is scattered across sites and integration requires mapping heterogeneous identifiers. The architecture uses a SPARQL endpoint and mappings to expose data from sources like FlyBase, BDGP and FlyAtlas.
- Performance testing showed good query times for real-time user interaction, though some queries took seconds and text matching had issues without custom solutions. Future work aims to add sources and develop more applications.
1. OpenFlyData: the way to go for biological data integration
Dr Jun Zhao, Image Bioinformatics Research Group, Department of Zoology, University of Oxford
8. System architecture
Client side: Web browser running the FlyUI application, composed of FlyUI widgets
HTTP between client and server
Server side: SPARQL endpoint, consisting of a SPARQL server (SPARQLite, Tomcat, Apache) and an RDF cache (Jena TDB)
Data sources: FlyBase, BDGP, FlyTED, FlyAtlas
10. The heterogeneous Drosophila gene names

| Data source | Possible gene identifiers | Examples |
|---|---|---|
| FlyBase | Symbol | schuy |
| FlyBase | Full name | schumacher-levy |
| FlyBase | Annotation symbol | CG17736 |
| FlyBase | Unique FlyBase id | FBgn0036925 |
| FlyBase | Curated synonyms | CG17736, schuy, etc. |
| BDGP | FlyBase id | FBgn0036925 |
| BDGP | Annotation symbol | CG17736 |
| FlyAtlas | Affy microarray probe id | 16166608_a_at |
| FlyTED | Uncontrolled gene name | schuy, CG17736/schuy |
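The slides do not show how OpenFlyData maps these names onto each other, so the following is only a minimal sketch of the idea: index every FlyBase name for a gene under its unique FlyBase id, then resolve an uncontrolled name against that index. The record layout, the function names `build_index` and `resolve`, and the choice to split FlyTED-style names on "/" are all assumptions made for illustration; only the example values come from the table.

```python
# One FlyBase record, using only the example values from the table above.
# The dictionary layout is hypothetical, not FlyBase's actual schema.
FLYBASE_GENES = [
    {
        "id": "FBgn0036925",
        "symbol": "schuy",
        "full_name": "schumacher-levy",
        "annotation_symbol": "CG17736",
        "synonyms": ["CG17736", "schuy"],
    },
]


def build_index(genes):
    """Map every known name of a gene to the set of FlyBase ids using it."""
    index = {}
    for gene in genes:
        names = [
            gene["id"],
            gene["symbol"],
            gene["full_name"],
            gene["annotation_symbol"],
            *gene["synonyms"],
        ]
        for name in names:
            index.setdefault(name, set()).add(gene["id"])
    return index


def resolve(raw_name, index):
    """Resolve an uncontrolled name such as 'CG17736/schuy' to FlyBase ids.

    Assumption: '/' separates alternative names. Matching is case-sensitive.
    """
    ids = set()
    for part in raw_name.split("/"):
        ids |= index.get(part.strip(), set())
    return ids


INDEX = build_index(FLYBASE_GENES)
print(resolve("CG17736/schuy", INDEX))
```

Returning a set rather than a single id leaves room for a name that is a synonym of more than one gene, which a real mapping would need to handle.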
12. SPARQL queries

SPARQL:

    PREFIX chado: <http://purl.org/net/chado/schema>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?flybaseID WHERE {
      ?feature rdf:type chado:Feature ;
               chado:name "schuy"^^xs:string ;
               chado:uniquename ?flybaseID .
    }

SQL:

    SELECT feature.uniquename AS flybaseID
    FROM feature
    WHERE feature.name = 'schuy'
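To make the query on this slide reusable for other gene names, it can be generated from a template. The sketch below is an illustration, not code from OpenFlyData: the function names are hypothetical, the `chado` and `rdf` prefixes are copied from the slide, and the `xs` prefix uses the standard XML Schema namespace (`http://www.w3.org/2001/XMLSchema#`). The gene name is escaped before being placed in the string literal so that quotes in user input cannot break the query.

```python
# Query text follows the slide; {literal} is filled in per gene name.
QUERY_TEMPLATE = """\
PREFIX chado: <http://purl.org/net/chado/schema>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
SELECT ?flybaseID WHERE {{
  ?feature rdf:type chado:Feature ;
           chado:name {literal}^^xs:string ;
           chado:uniquename ?flybaseID .
}}"""


def sparql_string_literal(text):
    """Quote text as a SPARQL string literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def flybase_id_query(gene_name):
    """Build the FlyBase id lookup query for one gene name (hypothetical helper)."""
    return QUERY_TEMPLATE.format(literal=sparql_string_literal(gene_name))


print(flybase_id_query("schuy"))
```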
13. SPARQL protocol

HTTP GET:

    GET /query/flybase?query=[URL encoded query] HTTP/1.1
    Host: openflydata.org
    Accept: application/sparql-results+json

HTTP POST:

    POST /query/flybase HTTP/1.1
    Host: openflydata.org
    Accept: application/sparql-results+json
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 456

    query=[URL encoded query]
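The two request forms above can be built with Python's standard library. This is a sketch under assumptions: the host and path come from the slide, but the `http` scheme and the helper names `build_get` and `build_post` are made up for illustration, and nothing is sent over the network here. Passing a returned request to `urllib.request.urlopen` would send it, at which point the Content-Length header for the POST form is added automatically.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Host and path as on the slide; the http scheme is an assumption.
ENDPOINT = "http://openflydata.org/query/flybase"
ACCEPT = "application/sparql-results+json"


def build_get(query, endpoint=ENDPOINT):
    """SPARQL protocol via GET: the query travels URL-encoded in the query string."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": ACCEPT}, method="GET")


def build_post(query, endpoint=ENDPOINT):
    """SPARQL protocol via POST: the query travels URL-encoded in the form body."""
    body = urlencode({"query": query}).encode("ascii")
    headers = {
        "Accept": ACCEPT,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(endpoint, data=body, headers=headers, method="POST")


request = build_get("ASK {}")
print(request.get_method(), request.full_url)
```

POST is the safer choice for long queries, since some servers and proxies limit URL length.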
Note that the thumbnail images are retrieved from the original web sites.
FlyUI: a library of JavaScript widgets as front ends to SPARQL data sources. Built on the Yahoo User Interface (YUI) library. Widgets are composed in a browser to create the complete application. Each widget provides:
- A Service that implements SPARQL queries
- A Model encapsulating SPARQL query results
- A Renderer
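The Service / Model / Renderer split can be sketched as follows. FlyUI itself is JavaScript built on YUI; this Python version only illustrates the separation of roles and is not FlyUI's API. All class and function names are hypothetical, the renderer's job (turning a model into display text) is an assumption because the note does not describe it, and `fake_endpoint` returns a hand-written response in the standard SPARQL JSON results layout instead of contacting a server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GeneModel:
    """Model: holds query results as plain variable-to-value rows."""
    rows: List[Dict[str, str]]


class GeneService:
    """Service: runs a SPARQL query through an injected callable."""

    def __init__(self, run_query: Callable[[str], dict]):
        self._run_query = run_query

    def find(self, query: str) -> GeneModel:
        results = self._run_query(query)
        bindings = results["results"]["bindings"]
        # Keep only each binding's value; drop type and datatype details.
        rows = [
            {var: cell["value"] for var, cell in binding.items()}
            for binding in bindings
        ]
        return GeneModel(rows)


class TextRenderer:
    """Renderer (assumed role): turns a model into text for display."""

    def render(self, model: GeneModel) -> str:
        if not model.rows:
            return "no results"
        return "\n".join(
            ", ".join(f"{var}={value}" for var, value in sorted(row.items()))
            for row in model.rows
        )


def fake_endpoint(query: str) -> dict:
    """Hypothetical stand-in for a SPARQL endpoint, returning canned JSON."""
    return {
        "head": {"vars": ["flybaseID"]},
        "results": {
            "bindings": [
                {"flybaseID": {"type": "literal", "value": "FBgn0036925"}}
            ]
        },
    }


model = GeneService(fake_endpoint).find("SELECT ?flybaseID WHERE { }")
print(TextRenderer().render(model))
```

Injecting the query function keeps the Service testable without a live endpoint and lets the same Model and Renderer be reused across data sources.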
We initially hoped to use D2R Server's SPARQL query rewriting, but some queries would kill the server, so we went for the SPARQLite alternative. Different techniques for generating RDF are applied to different kinds of data source. The resulting RDF is loaded into the Jena TDB triple store.
This is mostly based on available off-the-shelf software. The choice of triple store is influenced significantly by the speed of loading ~10 million triples. We also performed some experiments with OpenLink Virtuoso; performance looks pretty good. Amazon EC2/EBS has worked well for us.