Data Integration vs Transparency: Tackling the tension
1. Paul Groth
Elsevier Labs
@pgroth | pgroth.com
Data Integration & Transparency
Tackling the tension
University of Fribourg Informatics Colloquium June 15, 2015
3. Outline
• Data integration for analysis
– i.e. remixing data
• The need for transparency
• Provenance as a solution
• The downloads folder problem
6. Why?
Public Domain Drug Discovery Data:
Pharma are accessing, processing, storing & re-processing
Literature
PubChem
Genbank
Patents
Databases
Downloads
Data Integration, Data Analysis, Firewalled Databases
Repeated at each company
7. Prioritised Research Questions
Number | Sum | Nr of 1 | Question
15 | 12 | 9 | All oxido-reductase inhibitors active <100nM in both human and mouse
18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.
37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data
46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease)
59 | 14 | 8 | Identify all known protein-protein interaction inhibitors
www.openphacts.org
8. Research question 15: All oxido reductase inhibitors active < 100nM in both human and mouse
From Mabel Loza - USC team
12. Research question 15: All oxido reductase inhibitors active < 100nM in both human and mouse
ChEMBL:
• Searched for target "Oxidoreductase": 481 targets from different species
• Selected all the oxidoreductases and filtered bioactivities with the criterion IC50 < 100 (no units could be selected): 11,497 data points obtained
• Table exported to an Excel spreadsheet and manually filtered
From Mabel Loza - USC team
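The manual filtering step above is needed because the IC50 < 100 criterion carries no units. A minimal sketch of what a unit-aware filter could look like, assuming hypothetical records with `compound`, `species`, `value` and `unit` fields (this is not the ChEMBL schema, and the values are made up for illustration):

```python
# Conversion factors to nanomolar. The set of units is an assumption.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6}

def to_nanomolar(value, unit):
    """Convert a concentration to nM, or return None if the unit is unknown."""
    factor = TO_NM.get(unit)
    return None if factor is None else value * factor

def active_in_all_species(records, threshold_nm=100.0, species=("human", "mouse")):
    """Return compounds whose IC50 is below threshold_nm in every listed species."""
    hits = {}
    for rec in records:
        ic50_nm = to_nanomolar(rec["value"], rec["unit"])
        # Records with unknown units are skipped rather than guessed at.
        if ic50_nm is not None and ic50_nm < threshold_nm:
            hits.setdefault(rec["compound"], set()).add(rec["species"])
    return sorted(c for c, seen in hits.items() if set(species) <= seen)

# Illustrative records only; compound names and values are invented.
records = [
    {"compound": "A", "species": "human", "value": 50, "unit": "nM"},
    {"compound": "A", "species": "mouse", "value": 0.02, "unit": "uM"},
    {"compound": "B", "species": "human", "value": 50, "unit": "uM"},
    {"compound": "B", "species": "mouse", "value": 10, "unit": "nM"},
    {"compound": "C", "species": "human", "value": 5, "unit": "%"},
]

print(active_in_all_species(records))
```

Compound B shows why the unit matters: its human value of 50 would pass a bare "< 100" test even though it is 50 uM.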
15. Using the Power of Open PHACTS, London, 22-23 April 2013
[Figure: Open PHACTS platform architecture]
• Applications
• Linked Data API (RDF/XML, TTL, JSON)
• Core Platform
– Semantic Workflow Engine
– Data Cache (Virtuoso Triple Store)
– Domain Specific Services
– Identity Resolution Service
– Identifier Management Service
– Chemistry Registration, Normalisation & Q/C
– Index
• Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
• Data sources (Nanopub, Db, VoID): Public Content, Commercial, Public Ontologies, User Annotations
18. Credits: Curt Tilmes, Peter Fox
Tilmes, C.; Fox, P.; Ma, X.; McGuinness, D.L.; Privette, A.P.; Smith, A.; Waple, A.; Zednik, S.; Zheng, J.G., "Provenance Representation for the National Climate Assessment in the Global Change Information System," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 11, pp. 5160-5168, Nov. 2013
35. Towards Workflow Ecosystems Through Semantic and Standard Representations. Garijo, D.; Gil, Y.; and Corcho, O.
In Proceedings of the Ninth Workshop on Workflows in Support of Large-Scale Science (WORKS), held in
conjunction with the IEEE ACM International Conference on High-Performance Computing (SC), New Orleans, LA,
44. Look to OS techniques
1. Taint Tracking
2. Record and Replay
Work with Manolis Stamatogiannakis & Herbert Bos, VU University Amsterdam, Security Group
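As a rough illustration of the taint tracking idea, here is a minimal value-level sketch in which every value carries the set of sources it was derived from. The `Tainted` class and the file names are hypothetical; taint tracking at the operating system level works on raw memory and instructions rather than on wrapped values like this.

```python
class Tainted:
    """A value paired with the set of source labels it was derived from."""

    def __init__(self, value, sources):
        self.value = value
        self.sources = frozenset(sources)

    def __add__(self, other):
        # Combining two tainted values merges their source sets.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value, self.sources | other.sources)
        # An untainted operand contributes no sources.
        return Tainted(self.value + other, self.sources)

    def __radd__(self, other):
        return Tainted(other + self.value, self.sources)

# Hypothetical inputs, each labelled with the file it came from.
a = Tainted(3, {"file:/data/a.csv"})
b = Tainted(4, {"file:/data/b.csv"})

total = a + b + 1
print(total.value, sorted(total.sources))
```

The result remembers both input files, which is the kind of derivation information a provenance record needs.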
50. Use R&R for provenance
1. Execution Capture
2. Application of instrumentation
3. Provenance analysis
4. Selection and iteration
Implemented using plugins for the Platform for Architecture-Neutral Dynamic Analysis (PANDA)
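The four steps above can be sketched as a toy pipeline: capture once, then replay the same trace with whichever analysis is selected. This is only an illustration of the decoupling idea and does not use PANDA's API; the event format, the class names and the `FD1_3451` descriptor are assumptions made for the example.

```python
import json

def record(events):
    """Execution capture: serialise raw events once, with no analysis attached."""
    return [json.dumps(event) for event in events]

def replay(trace, plugins):
    """Replay a recorded trace, handing every event to each analysis plugin."""
    for line in trace:
        event = json.loads(line)
        for plugin in plugins:
            plugin.on_event(event)
    return plugins

class FilesRead:
    """Analysis plugin: which paths did each process read?"""
    def __init__(self):
        self.by_process = {}

    def on_event(self, event):
        if event["op"] == "read":
            self.by_process.setdefault(event["proc"], set()).add(event["path"])

class OpCounter:
    """Analysis plugin: how often did each operation occur?"""
    def __init__(self):
        self.counts = {}

    def on_event(self, event):
        self.counts[event["op"]] = self.counts.get(event["op"], 0) + 1

# 1. Execution capture (illustrative events, not real recorded output).
trace = record([
    {"proc": "getent~3451", "op": "read", "path": "/etc/nsswitch.conf"},
    {"proc": "getent~3451", "op": "write", "path": "FD1_3451"},
    {"proc": "cut~3452", "op": "read", "path": "FD0_3452"},
])

# 2-3. Apply one instrumentation plugin and analyse.
files_read = FilesRead()
replay(trace, [files_read])

# 4. Select a different analysis and iterate, without re-running the program.
op_counter = OpCounter()
replay(trace, [op_counter])

print(files_read.by_process, op_counter.counts)
```

Because the trace is stored, a new analysis can be chosen after the fact without re-running the original program.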
51. An Example (1)
<exe://pam-foreground-~3451> prov:endedAtTime 199090196 .
<exe://getent~3451> a prov:Activity .
<exe://getent~3451> rdf:type dt:getent .
<exe://cut~3452> a prov:Activity .
<exe://cut~3452> rdf:type dt:cut .
<file:/etc/nsswitch.conf> a prov:Entity .
<file:/etc/nsswitch.conf> rdfs:label "/etc/nsswitch.conf" .
<file:/etc/nsswitch.conf> rdf:type dt:Unknown .
<exe://getent~3451> prov:used <file:/etc/nsswitch.conf> .
# unused file:3477815296:getent~3451:/etc/passwd:r0:w0:f524288
<exe://getent~3451> prov:startedAtTime 199090196 .
<exe://getent~3451> prov:endedAtTime 200392668 .
<file:FD0_3452> a prov:Entity .
<file:FD0_3452> rdfs:label "FD0_3452" .
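A minimal sketch of how statements like these could be queried, using a few lines copied from the example. It relies on a simple regular expression rather than a real Turtle parser (the excerpt has no prefix declarations), and the helper names are hypothetical. The unit of the timestamps is not stated in the example, so the duration is reported in the same unspecified unit.

```python
import re
from collections import defaultdict

# A few statements copied from the example output above.
TRACE = """\
<exe://getent~3451> a prov:Activity .
<file:/etc/nsswitch.conf> a prov:Entity .
<exe://getent~3451> prov:used <file:/etc/nsswitch.conf> .
# unused file:3477815296:getent~3451:/etc/passwd:r0:w0:f524288
<exe://getent~3451> prov:startedAtTime 199090196 .
<exe://getent~3451> prov:endedAtTime 200392668 .
"""

# Matches "<subject> predicate object ." and ignores comment lines.
TRIPLE = re.compile(r"^<([^>]+)>\s+(\S+)\s+(.+?)\s*\.\s*$")

def parse(text):
    """Return (subject, predicate, object) tuples for each statement line."""
    triples = []
    for line in text.splitlines():
        match = TRIPLE.match(line.strip())
        if match:
            triples.append(match.groups())
    return triples

def summarise(triples):
    """Collect the entities each activity used and each activity's duration."""
    used = defaultdict(list)
    times = defaultdict(dict)
    for subject, predicate, obj in triples:
        if predicate == "prov:used":
            used[subject].append(obj.strip("<>"))
        elif predicate in ("prov:startedAtTime", "prov:endedAtTime"):
            times[subject][predicate] = int(obj)
    durations = {
        s: t["prov:endedAtTime"] - t["prov:startedAtTime"]
        for s, t in times.items()
        if len(t) == 2
    }
    return dict(used), durations

used, durations = summarise(parse(TRACE))
print(used, durations)
```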
55. Conclusion
• Tension between:
– putting stuff together; and
– documenting what’s been done
• Provenance helps
• Issues in collection
• Standards + Stealth (hat tip: Carole Goble)
56. Questions?
• More info:
– openphacts.org
– data2semantics.org
– provbook.org
– https://github.com/m000/dtracker
– Manolis Stamatogiannakis, Paul Groth and Herbert Bos. Decoupling
Provenance Capture and Analysis from Execution. Theory and Practice
of Provenance 2015
– Luc Moreau, Paul Groth, James Cheney, Timothy Lebo, Simon Miles,
The rationale of PROV, Web Semantics: Science, Services and Agents
on the World Wide Web, Available online 20 April 2015
http://dx.doi.org/10.1016/j.websem.2015.04.001.
– Paul Groth, "Transparency and Reliability in the Data Supply Chain,"
IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April, 2013
– Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013
Editor's notes
A lot of my work is based on the assumption that great stuff is a combo of old stuff and that the web accelerates all aspects of this process.
How do we deal (technically and socially) with the implications of this in terms of usability, credit, responsibility, and transparency?