Graph databases in computational biology:
Neo4j and TitanDB
Andrei Kucharavy 23/08/2013
Why even bother?
● Rigid structure of interactions = Interactome
● Knowledge access structure = GO
● => Those are graphs
Why even bother?
● ~ 1 GB of raw data from Reactome
● ~ 300 MB of data from UniProt / GO / ENSEMBL / … mappings
● => together this is well over the conventional 1024 MB JVM heap limit => heap crash
● ~ 15 minutes to load
● Nightmare to visualize and debug
Relational Databases
Intro to neo4j presentation – jexp @ slideshare
Data models
Graph databases
Intro to neo4j presentation – jexp @ slideshare
Core abstractions
● Objects:
– Nodes (Vertices)
– Relationships between nodes (Edges)
– Properties for Vertices and Edges
(Figure: nodes Node1, Node2 and Node3 with their relationships, each annotated with properties)
● Operations:
– Immediate relations
– Traversals
● Get the shortest path from j to k
● Get the path with least weight from j to k, ...
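The abstractions and the two traversal operations listed above can be sketched in a few lines of plain Python. This is only an illustrative in-memory toy, not how Neo4j or TitanDB store or traverse graphs; the class name `PropertyGraph`, its method names and the `weight` property are assumptions made up for this example.

```python
import heapq
from collections import deque


class PropertyGraph:
    """Toy in-memory property graph: nodes and edges both carry properties."""

    def __init__(self):
        self.nodes = {}  # node id -> properties dict
        self.edges = {}  # node id -> list of (neighbour id, properties dict)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, dst, **props):
        # Relationships are stored in both directions (undirected toy graph).
        self.edges[src].append((dst, props))
        self.edges[dst].append((src, props))

    def neighbours(self, node_id):
        """Immediate relations of a node."""
        return [dst for dst, _ in self.edges[node_id]]

    def shortest_path(self, start, goal):
        """Fewest-hops path, found by breadth-first search."""
        parents = {start: None}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if current == goal:
                return self._rebuild(parents, goal)
            for nxt in self.neighbours(current):
                if nxt not in parents:
                    parents[nxt] = current
                    queue.append(nxt)
        return None

    def lightest_path(self, start, goal, weight_key="weight"):
        """Least-total-weight path, found by Dijkstra (weights must be >= 0)."""
        best = {start: 0.0}
        parents = {start: None}
        heap = [(0.0, start)]
        while heap:
            cost, current = heapq.heappop(heap)
            if current == goal:
                return self._rebuild(parents, goal), cost
            if cost > best[current]:
                continue  # stale heap entry
            for nxt, props in self.edges[current]:
                new_cost = cost + props.get(weight_key, 1.0)
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    parents[nxt] = current
                    heapq.heappush(heap, (new_cost, nxt))
        return None

    @staticmethod
    def _rebuild(parents, node):
        path = []
        while node is not None:
            path.append(node)
            node = parents[node]
        return path[::-1]


g = PropertyGraph()
for name in ("j", "a", "b", "k"):
    g.add_node(name, kind="protein")
g.add_edge("j", "k", weight=10.0)  # direct but heavy
g.add_edge("j", "a", weight=1.0)
g.add_edge("a", "b", weight=1.0)
g.add_edge("b", "k", weight=1.0)

hops = g.shortest_path("j", "k")   # fewest edges
light = g.lightest_path("j", "k")  # smallest summed weight
```

In this toy graph the two questions give different answers: the fewest-hops path is the direct j to k edge, while the least-weight path goes through a and b.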
Main advantages promised
● Increased speed for graph-type applications
– Avoid a “join” over 10M rows to get ~20 “related” elements
– Traversals
● Simplified programming
– Java objects
– XML / RDF / OWL
– Schema alterations
Main advantages promised
● Ease of deployment / maintenance:
– Scalability
– Complexity
– Modifications
– Schema migrations
neo4j
● Started in 2003
● Schema-free
● ACID transactions
● Reasonably scalable, reasonably replicable
● 10 000 open source projects, 1000 commercial customers
● 100 % open source: https://github.com/neo4j/neo4j
● Master-slave replication
● AGPL 3 license: if you are open source, it is free, even the support
● Plus graphical interface => debugging!
Deployment Demo
● cd to the specific DB location (better as a dedicated user)
● ./neo4j start
● ./neo4j stop
● => Serves localhost:7474
● 40 000 files => mainly indexes / user accesses
Under the hood
● Java & JVM
● Split in two
– In-RAM “pre-heated” vs. whole on HDD
● Scalability:
– 32 G nodes / 32 G relations / 64 G properties
– 1 M traversals / sec, independent of the size of the graph
● Lucene index: instant search
Interfaces
● Two-fold interface:
– REST server
– Local instance
● Specific query language: Cypher
● Interoperability: support for TinkerPop stack
TinkerPop stack
REST APIs
What is Gremlin
● Domain-specific graph language
● Built atop Groovy
– JVM
– Dynamically evaluated
– ~ scripting in Java
● Core = Java
– Java
– Scala / Clojure
– JPype / Jython / JRuby
● Supported by most graph databases
Interfaces
● Native bindings:
– Java
– Python, PHP, Ruby / Rails, node.js, .NET
– Scala, Clojure, Haskell, ...
● My stack:
– Native Python and Python through Bulbs and REST
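To make the REST side of the interface concrete, here is a minimal sketch of how a Cypher query could be packaged for a Neo4j server using only the Python standard library. It assumes the legacy `/db/data/cypher` endpoint and the old `START ... {param}` query syntax of the Neo4j 1.x era; both differ in later versions, so check the documentation of the server you run. The helper name `build_cypher_request` is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_cypher_request(query, params, base_url="http://localhost:7474"):
    """Package a parameterised Cypher query as an HTTP POST request.

    Assumption: the server exposes the legacy /db/data/cypher endpoint.
    """
    payload = json.dumps({"query": query, "params": params}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/db/data/cypher",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


# Look up one node by its internal id; the id travels as a parameter,
# not as part of the query string.
request = build_cypher_request("START n=node({id}) RETURN n", {"id": 0})

# Against a live server the request would be sent like this:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping values in `params` rather than pasting them into the query string is also the first line of defence against the injection problem mentioned under Limitations.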
Python + Bulbs + REST + neo4j
● Bulbs = Pythonic wrapper for Gremlin
● Portability (Blueprints + Rexster)
– Titan DB (will be discussed later on)
– Bitsy
– InfiniteGraph
– Sqrrl
– ArangoDB
● Class inheritance and DDT:
– Java-like class inheritance
Demo 2
● Datatype declaration
● GraphDB connection and declaration
● Fill-in
● Graphical Interface
neo4j-specific
● Lucene index in the backend
– Exact indexing => constant-time retrieval
– Full-text indexing => searching partial names and adding the missing links
● SRC = SRC_HUMAN = SRC1
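The difference between the two kinds of lookup can be illustrated with a toy index in plain Python. This is a stand-in for the idea only: it is not Lucene and does not reflect how Neo4j wires Lucene in. The class `NameIndex` and the node ids are invented for the example, and the prefix search here is a simple scan over tokens.

```python
from collections import defaultdict


class NameIndex:
    """Toy index: exact lookup by full name plus prefix search over tokens."""

    def __init__(self):
        self._exact = {}                 # full name -> node id
        self._tokens = defaultdict(set)  # lowercase token -> node ids

    def add(self, name, node_id):
        self._exact[name] = node_id
        for token in name.lower().replace("_", " ").split():
            self._tokens[token].add(node_id)

    def get(self, name):
        """Exact match: a single dictionary lookup."""
        return self._exact.get(name)

    def search(self, fragment):
        """Partial match: every node with a token starting with the fragment."""
        fragment = fragment.lower()
        hits = set()
        for token, ids in self._tokens.items():
            if token.startswith(fragment):
                hits |= ids
        return sorted(hits)


index = NameIndex()
index.add("SRC", 1)
index.add("SRC_HUMAN", 2)
index.add("SRC1", 3)
index.add("EGFR_HUMAN", 4)

exact_hit = index.get("SRC_HUMAN")  # one specific node
candidates = index.search("src")    # nodes that may be the same protein
```

The partial search returns the three SRC-like entries as candidates; deciding that they are the same protein and adding the missing links between them remains a curation step.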
Demo3
● Constant node retrieval time / internode connection distance time
● Performing the partial search
● Adding missing links
● Neo4j server vs. local database
● Performing simple Gremlin queries
Use Case:
● Existing map of correlations:
(Figure: ProteinDomain, Domain Type, Protein, function)
● Wanted map of correlations:
(Figure: ProteinDomain, Domain Type, Protein, function)
Use Case
● SQL Python / SQLAlchemy:
– Create new table
– Add ForeignKeys, Primary key, indexes, ...
– Add the table to the data model,
– Create functions for access/update,
– ...
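The relational steps listed above can be sketched with Python's built-in `sqlite3` module, used here in place of SQLAlchemy so the example has no external dependency. The table and column names (`protein`, `domain_type`, `protein_domain_type`) and the helper functions are hypothetical; they only illustrate how much has to be declared to add one new kind of link.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Pre-existing part of the data model (hypothetical).
conn.executescript("""
CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE domain_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
""")

# New table, with its primary key, foreign keys and index.
conn.executescript("""
CREATE TABLE protein_domain_type (
    protein_id INTEGER NOT NULL REFERENCES protein(id),
    domain_type_id INTEGER NOT NULL REFERENCES domain_type(id),
    PRIMARY KEY (protein_id, domain_type_id)
);
CREATE INDEX idx_pdt_domain_type ON protein_domain_type(domain_type_id);
""")


# Access / update functions for the new link.
def link(connection, protein_id, domain_type_id):
    connection.execute(
        "INSERT INTO protein_domain_type (protein_id, domain_type_id) VALUES (?, ?)",
        (protein_id, domain_type_id),
    )


def domain_types_of(connection, protein_id):
    rows = connection.execute(
        """SELECT d.name FROM domain_type AS d
           JOIN protein_domain_type AS l ON l.domain_type_id = d.id
           WHERE l.protein_id = ? ORDER BY d.name""",
        (protein_id,),
    )
    return [name for (name,) in rows]


conn.execute("INSERT INTO protein (id, name) VALUES (1, 'SRC_HUMAN')")
conn.executemany("INSERT INTO domain_type (id, name) VALUES (?, ?)",
                 [(1, "SH2"), (2, "SH3")])
link(conn, 1, 1)
link(conn, 1, 2)
linked = domain_types_of(conn, 1)
```

In a property graph the same change amounts to creating relationships of a new type between existing nodes, with no new table or migration.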
Use Case
● Bulbs / Neo4j => Live demo
Use case 2
● In the human proteome, find all chemical groups A and B separated by less than x Å
– Database structure:
● Suppose all the proteins are connected to a “Type node”
● Each protein is linked to its domains, each domain to its amino acids, each amino acid to its chemical groups and ultimately atoms
● Chemical groups have an assigned distance between them and the groups they are close to
– Algorithm:
● Select a protein of interest
● Get all of its chemical groups: 1000 (a.a.) * 3 (ch.gr. per a.a.)
● Filter out all of the relations longer than x: 1000 * 3 * 100 (possible contacts per ch.gr.)
● Recover the proteins: 1000 * 3 * 100 * 2
● With 1M traversals per second => 0.6 sec. to execute the query
– With TitanDB, ElasticSearch and geo-queries (everything within a circle of radius x), higher speeds are possible
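The estimate above is simple arithmetic, spelled out here so the individual factors are visible. All numbers are the slide's own round figures; the variable names are just labels chosen for this sketch.

```python
amino_acids_per_protein = 1000
groups_per_amino_acid = 3
contacts_per_group = 100           # possible contacts per chemical group
hops_back_to_protein = 2           # the slide's factor for recovering the proteins
traversals_per_second = 1_000_000  # throughput figure quoted for Neo4j

chemical_groups = amino_acids_per_protein * groups_per_amino_acid
candidate_contacts = chemical_groups * contacts_per_group
total_traversals = candidate_contacts * hops_back_to_protein
query_seconds = total_traversals / traversals_per_second

print(chemical_groups, candidate_contacts, total_traversals, query_seconds)
```

So one protein of interest costs about 600 000 traversals, which at 1M traversals per second gives the 0.6 seconds quoted.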
Limitations
● Node Number:
– 32 Giga nodes / edges is a lot on servers
● ~100 TB of data
● 1 Unix partition
● 40 000+ simultaneously opened files (indexes + users)
– 32 Giga edges is relatively small in biology
● ~ 43 M nodes in UniProt only
● GO x UniProt x EMBL x GeneNames x interaction maps x localisations x names & accesses ....
● All potentially druggable molecules, all protein atoms, all atom-atom interactions
Limitations
● Absence of parallelism/distribution
– One process at a time:
● 1 traversal at a time
● ACID => database locks
● Though master-slave distribution is available
– Single partition
● Replication
● 100 TB + RAID!?
● Though full support for AWS and VMs
Limitations
● Bulbs: Python over Gremlin scripts
– Gremlin → Groovy → JVM → do what you want => SQL-style (Gremlin) injections
– Request sanitization needed:
● Hashes of the queries without variables
● Pre-filtering before query referral to server
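One way to read the two sanitization ideas above is an allowlist: hash each known-good script with its variables left as named placeholders, and refuse to forward anything whose hash is not on the list. The sketch below is an assumption about how that could look, not what Bulbs does; `ScriptAllowlist`, `script_hash` and the Gremlin strings are illustrative only, and nothing is sent to a server.

```python
import hashlib


def script_hash(script):
    """Hash of a script template (variables are named, not substituted)."""
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


class ScriptAllowlist:
    """Pre-filter: only scripts whose hash is known may reach the server."""

    def __init__(self, templates):
        self._allowed = {script_hash(t) for t in templates}

    def prepare(self, script, params):
        if script_hash(script) not in self._allowed:
            raise ValueError("script is not on the allowlist")
        for key, value in params.items():
            # Only plain scalar values may be bound to variables.
            if not isinstance(value, (str, int, float, bool)):
                raise TypeError("unsupported parameter type for %r" % key)
        return {"script": script, "params": dict(params)}


# Hypothetical template: node_id is bound by the server, never concatenated.
NEIGHBOURS = "g.v(node_id).out('interacts')"
allowlist = ScriptAllowlist([NEIGHBOURS])

accepted = allowlist.prepare(NEIGHBOURS, {"node_id": 42})

# A script assembled by string concatenation from user input changes the hash.
user_input = "42); g.clear("
tampered = "g.v(" + user_input + ").out('interacts')"
try:
    allowlist.prepare(tampered, {})
    rejected = False
except ValueError:
    rejected = True
```

The user-supplied value only ever travels in `params`, so it cannot change the script text that the JVM evaluates.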
Limitations
● Bulk insert not natively implemented in Bulbs:
– Insertion rate ~10 nodes / sec
– Naive Python binding tests:
● ~60 msec for ACID compliance (HDD write)
● ~1.8 msec/node for cold insertion routines (HDD sequential write)
● ~0.3 msec/node for hot write insertion routines (RAM buffer)
– 500 - 1500 nodes / sec with packages of 1000
● 6 h to fill the database up to theoretical limit
– github.com/chefjerome/graphalchemy implements an efficient flush based on Bulbs (alpha and thus unstable right now)
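Why batching helps can be seen with a back-of-the-envelope model: if each transaction pays the fixed ~60 msec commit cost once, plus a per-node cost, then larger batches amortize the commit. The additive model and the function name `nodes_per_second` are assumptions of this sketch, not measurements; only the three timing constants come from the slide.

```python
def nodes_per_second(batch_size, commit_ms=60.0, per_node_ms=1.8):
    """Throughput if one commit (commit_ms) covers batch_size node writes."""
    batch_ms = commit_ms + batch_size * per_node_ms
    return batch_size / (batch_ms / 1000.0)


single = nodes_per_second(1)                         # one node per transaction
cold_batch = nodes_per_second(1000)                  # 1000 nodes, cold writes
hot_batch = nodes_per_second(1000, per_node_ms=0.3)  # 1000 nodes, hot writes

print(round(single), round(cold_batch), round(hot_batch))
```

Under this model single inserts land in the low tens of nodes per second and cold batches of 1000 at roughly 500 nodes per second, the same order as the slide's ~10 and 500. The hot-write estimate comes out above the slide's measured 1500, so the model should be taken as an order-of-magnitude guide only.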
Port to TitanDB
TitanDB
● HBase / Cassandra / BerkeleyDB as storage backend
● Lucene / ElasticSearch as indexing backend
● Served over Rexster server
● Full distribution
● > 500 simultaneous connections (5000 is still stable)
● Automatic replication (Hadoop)
● Multiple simultaneous queries
● Sky is the only limit for storage quantities
● => TitanDB / HBase is stable up to 5 PB in production
Neo4j for bioinformatics:
parsing and curating Reactome.org
● Reactome.org structure:
– BioPAX: XML / RDF / OWL
– Physical entities:
● Proteins, small molecules, complexes, RNA, DNA
● Fragments of physical entities
– Interaction:
● Degradation / polymerisation / biochemical reactions
● Molecular interaction
● Genetic interaction
– Pathways, genes, post-translational modifications...
Protege
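As a rough idea of what parsing such a file involves, the sketch below reads a tiny hand-written BioPAX-style RDF/XML snippet with the standard library and turns it into nodes and edges ready for loading into a graph. It assumes BioPAX Level 3 namespaces and element names; the snippet is invented for illustration, is not Reactome data, and real exports are far larger and more deeply nested.

```python
import xml.etree.ElementTree as ET
from collections import Counter

BP = "http://www.biopax.org/release/biopax-level3.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hand-written miniature example, not real Reactome content.
SAMPLE = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:bp="{BP}">
  <bp:Protein rdf:ID="Protein1">
    <bp:displayName>SRC</bp:displayName>
  </bp:Protein>
  <bp:SmallMolecule rdf:ID="SmallMolecule1">
    <bp:displayName>ATP</bp:displayName>
  </bp:SmallMolecule>
  <bp:BiochemicalReaction rdf:ID="BiochemicalReaction1">
    <bp:left rdf:resource="#Protein1"/>
    <bp:left rdf:resource="#SmallMolecule1"/>
  </bp:BiochemicalReaction>
</rdf:RDF>
""".strip()


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tags."""
    return tag.split("}", 1)[1]


root = ET.fromstring(SAMPLE)
nodes = {}  # rdf:ID -> BioPAX class name
edges = []  # (source id, property name, target id)

for element in root:
    element_id = element.get(f"{{{RDF}}}ID")
    nodes[element_id] = local_name(element.tag)
    for child in element:
        target = child.get(f"{{{RDF}}}resource")
        if target is not None:
            edges.append((element_id, local_name(child.tag), target.lstrip("#")))

class_counts = Counter(nodes.values())
```

Each top-level element becomes a node typed by its BioPAX class, and each `rdf:resource` reference becomes a relationship, which is why this format maps naturally onto a graph database.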
Neo4j for bioinformatics:
parsing and curating Reactome.org
● Reality of Reactome.org:
– Main connected component: ~ 22 000 entities, plus 6 others with > 100 elements
– Presence of generic classes: groups of objects
– Proteins = mix of proteins, domains, groups, groups of domains…
– 15 000 proteins, 5000 UniProt references
– 156 genes, 56 RNA molecules => translation / transcription regulation is not well described
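Component sizes like the ones quoted above come from a connected-components pass over the graph. A minimal sketch of that computation on an edge list follows; the function name and the toy edges are assumptions for illustration, and nodes without any edge are not seen by this version.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected graph, largest first."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)


toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
sizes = [len(c) for c in connected_components(toy_edges)]
```

On the toy edge list this yields one component of four nodes and one of two; on the parsed Reactome graph the same pass would give the list of component sizes the slide summarises.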
Neo4j for bioinformatics:
parsing and curating Reactome.org
● Reality of Reactome.org:
– heavily comment-based: case of SRC
Neo4j for bioinformatics:
parsing and curating Reactome.org
● Completed with HINT protein-protein interactions from the Yu lab at Cornell
● Re-indexed:
– SwissProt protein names
– Full names from SwissProt
– Gene Names
– KEGG, GO, EMBL, ChEBI cross-references
– PDB implemented, not re-run
Neo4j for bioinformatics:
parsing and curating Reactome.org
● Example of pathway Parsing
Conclusion
● Systems biology is more about graphs than about systems of tables
● Graph databases are awesome
● Neo4j is terrific
● TitanDB is cool
● You should definitely pick one of them, load the Reactome.org dataset or whatever you are interested in, and play with it.
Questions ?
Thanks
Pr. Philip Bourne
Pr. Bart Deplanke
Cedric Merlot
Li Xie
Spencer Blieven
Jiang Wang
Julia Ponomarenko
Cole Christie
Andreas Prilic
Lilia Iakoucheva

  • 3. Why even bother? ● ~ 1 GB of raw data from Reactome ● ~ 300 MB of data from UniProt / GO / ENSEMBL / … mappings ● => this is way over the conventional 1024 MB JVM heap limit => heap crash ● ~ 15 minutes to load ● Nightmare to visualize and debug
  • 4. Relational Databases Intro to neo4j presentation – jexp @ slideshare
  • 6. Graph databases Intro to neo4j presentation – jexp @ slideshare
  • 7. Core abstractions ● Objects: – Nodes (Vertices) – Relationships between nodes (Edges) – Properties for Vertices and Edges
  • 9. Core abstractions ● Objects: – Nodes (Vertices) – Relationships between nodes (Edges) – Properties for Vertices and Edges ● Operations: – Immediate relations – Traversals ● Get the shortest path from j to k ● Get the path with least weight from j to k, ...
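The two traversals named on slide 9 can be sketched in plain Python, without any graph database. This is a minimal illustration, assuming a toy adjacency-dict graph; the node names, the `weight` property and both function names are invented here and are not taken from the slides or from any Neo4j API.

```python
import heapq
from collections import deque

# Toy property graph: node -> {neighbour: edge properties}.
# Node names and weights are invented for illustration.
GRAPH = {
    "j": {"a": {"weight": 1.0}, "b": {"weight": 5.0}},
    "a": {"b": {"weight": 1.0}, "k": {"weight": 10.0}},
    "b": {"k": {"weight": 1.0}},
    "k": {},
}


def shortest_path(graph, start, goal):
    """Breadth-first search: the path with the fewest edges, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor chain back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def least_weight_path(graph, start, goal):
    """Dijkstra: (total weight, path) minimising the summed 'weight', or None."""
    heap = [(0.0, start, [start])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for neighbour, props in graph[node].items():
            if neighbour not in done:
                heapq.heappush(
                    heap, (cost + props["weight"], neighbour, path + [neighbour])
                )
    return None
```

On this toy graph the two questions give different answers: the fewest-edges path goes through `a` only, while the least-weight path takes the longer but cheaper route through `a` and `b`.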
  • 11. Main advantages promised ● Increased speed for graph-type applications – Avoid a “join” on 10M rows to get ~20 “related” elements – Traversals ● Simplified programming – Java objects – XML / RDF / OWL – Schema alterations
  • 12. Main advantages promised ● Ease of deployment / maintenance: – Scalability – Complexity – Modifications – Schema migrations
  • 13. neo4j ● Started in 2003 ● Schema-free ● ACID transactions ● Reasonably scalable, reasonably replicable ● 10 000 open source projects, 1000 commercial customers
  • 15. neo4j ● Started in 2003 ● Schema-free ● ACID transactions ● 10 000 open source projects, 1000 commercial customers ● 100 % open source
  • 17. neo4j ● Started in 2003 ● Schema-free ● ACID transactions ● 10 000 open source projects, 1000 commercial customers ● 100 % open source ● Master-slave replication ● AGPL 3 license: if you are open source, it is free, even the support
  • 19. neo4j ● Started in 2003 ● Schema-free ● ACID transactions ● 10 000 open source projects, 1000 commercial customers ● 100 % open source ● Master-slave replication ● AGPL 3 license: if you are open source, it is free, even the support ● Plus graphical interface => debug!!!!
  • 21. Deployment Demo ● cd to specific DB location (better as a special user) ● ./neo4j start ● ./neo4j stop ● => Serves localhost:7474 ● 40 000 files => mainly indexes / user accesses
  • 22. Under the hood ● Java & JVM ● Split in two – In-RAM “pre-heated” vs. whole on-HDD ● Scalability: – 32 G nodes / 32 G relations / 64 G properties – 1 M traversals / sec, independent of the graph size ● Lucene index: instant search
  • 23. Interfaces ● Two-fold interface: – REST server – Local instance ● Specific query language: Cypher
  • 24. Interfaces ● Two-fold interface: – REST server – Local instance ● Specific query language: Cypher ● Interoperability: support for TinkerPop stack
  • 26. What is Gremlin ● Domain-specific graph language ● Built atop Groovy – JVM – Dynamically evaluated – ~ scripting in Java ● Core = Java – Java – Scala / Clojure – JPype / Jython / JRuby ● Supported by most graph databases
  • 27. Interfaces ● Two-fold interface: – REST server – Local instance ● Specific query language: Cypher ● Interoperability: support for TinkerPop stack ● Native bindings: – Java – Python, PHP, Ruby / Rails, node.js, .Net – Scala, Clojure, Haskell, ... ● My stack: – Native Python and Python through Bulbs and REST
  • 28. Python + Bulbs + REST + neo4j ● Bulbs = Pythonic wrapper for Gremlin ● Portability (Blueprints + Rexster) – Titan DB (will be discussed later on) – Bitsy – InfiniteGraph – Sqrrl – ArangoDB ● Class inheritance and DDT: – Java-like class inheritance
  • 29. Demo 2 ● Datatype declaration ● GraphDB connection and declaration ● Fill-in ● Graphical Interface
  • 30. neo4j-specific ● Lucene index in the backend – Exact indexing => constant-time retrieval – Full-text indexing => searching partial names and adding the missing links ● SRC = SRC_HUMAN = SRC1
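The difference between the two index types on slide 30 can be illustrated in plain Python. This is only a sketch under stated assumptions: a dict stands in for the exact index, a linear substring scan stands in for Lucene's full-text index (which is a real inverted index, not a scan), and the node ids, the `TP53` entry and all function names are invented for the example.

```python
from itertools import combinations

# Invented sample data: node id -> stored name.
NAMES = {1: "SRC", 2: "SRC_HUMAN", 3: "SRC1", 4: "TP53"}

# "Exact" index: one hash lookup per query, constant time on average.
exact_index = {name: node_id for node_id, name in NAMES.items()}


def partial_search(fragment):
    """Case-insensitive substring match over all names (stand-in for full-text)."""
    fragment = fragment.upper()
    return sorted(name for name in exact_index if fragment in name.upper())


def candidate_alias_links(fragment):
    """Pairs of node ids whose names share the fragment: candidate missing links."""
    hits = partial_search(fragment)
    return [(exact_index[a], exact_index[b]) for a, b in combinations(hits, 2)]
```

Here `partial_search("src")` finds the three spellings listed on the slide, and `candidate_alias_links("src")` proposes the pairwise links between them. Such candidates would still need manual curation before being written back to the graph.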
  • 31. Demo 3 ● Constant node retrieval time / internode connection distance time ● Performing the partial search ● Adding missing links ● Neo4j server vs. local database ● Performing simple Gremlin queries
  • 32. Use Case: ● Existing map of correlations: ProteinDomain Domain Type Protein function
  • 33. Use Case: ● Existing map of correlations: ● Wanted map of correlations: ProteinDomain Domain Type Protein function ProteinDomain Domain Type Protein function
  • 34. Use Case: ● Existing map of correlations: ● Wanted map of correlations: ProteinDomain Domain Type Protein function ProteinDomain Domain Type Protein function
  • 35. Use Case ● SQL Python / SQLAlchemy: – Create new table – Add ForeignKeys, Primary key, indexes, ... – Add the table to the data model, – Create functions for access/update, – ...
  • 36. Use Case ● Bulbs / Neo4j => Live demo
  • 37. Use case 2 ● In the human proteome, find all chemical groups A and B separated by less than x Å – Database structure: ● Suppose all the proteins are connected to a “Type node” ● Each protein is linked to its domains, each domain is linked to its amino acids, each amino acid is linked to its chemical groups and ultimately atoms ● Chemical groups have an assigned distance between them and the groups they are close to – Algorithm: ● Select a protein of interest ● Get all of its chemical groups: 1000 (a.a.) * 3 (ch.gr. / a.a.) ● Filter out all of the relations longer than k: 1000 * 3 * 100 (possible contacts per ch.gr.) ● Recover the proteins: 1000 * 3 * 100 * 2 ● With 1M traversals per second => 0.6 sec. to execute the query – With TitanDB plus ElasticSearch and geo-queries (all within a circle of radius x), higher speeds are possible
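The 0.6 sec. figure on slide 37 is a back-of-the-envelope product, and it can be reproduced as follows. The numbers are the ones quoted on the slide; the function and parameter names are invented for this sketch, and the final factor of 2 is taken from the slide's "recover the proteins" step without further interpretation.

```python
def estimate_query_seconds(
    amino_acids_per_protein=1000,
    groups_per_amino_acid=3,
    contacts_per_group=100,
    recover_factor=2,               # the slide's final "* 2"
    traversals_per_second=1_000_000,
):
    """Return (number of traversals, estimated seconds) for one protein."""
    traversals = (
        amino_acids_per_protein
        * groups_per_amino_acid
        * contacts_per_group
        * recover_factor
    )
    return traversals, traversals / traversals_per_second


traversals, seconds = estimate_query_seconds()
print(traversals, seconds)
```

With the slide's figures this gives 600 000 traversals, hence about 0.6 seconds at 1M traversals per second. The estimate is per protein of interest and scales linearly with any of the four factors.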
  • 38. Limitations ● Node number: – 32 Giga nodes / edges is a lot on servers ● ~100 TB of data ● 1 Unix partition ● 40 000 ++ simultaneously opened files (indexes + users) – 32 Giga edges is relatively small in biology ● ~ 43 M nodes in UniProt only ● GO x UniProt x EMBL x GeneNames x Interaction Maps x Localisations x names & accessions ... ● All potentially druggable molecules, all protein atoms, all atom-atom interactions
  • 39. Limitations ● Absence of parallelism / distribution – One process at a time: ● 1 traversal at a time ● ACID => database locks ● Though master-slave distribution – Single partition ● Replication ● 100 TB + RAID!? ● Though full support for AWS and VM
  • 40. Limitations ● Bulbs: Python over Gremlin scripts – Gremlin → Groovy → JVM → do what you want => SQL-style (Gremlin) injections – Request sanitization needed ● Hashes of the queries without variables ● Pre-filtering before query referral to server
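One possible reading of the two sanitization ideas on slide 40 (hashes of the queries without variables, and pre-filtering before forwarding) is sketched below. It is an assumption-laden illustration, not the author's implementation: the template string is a hypothetical Gremlin-style script that is never executed here, and `ALLOWED_TEMPLATES`, `SAFE_VALUE` and `check_query` are names invented for the example.

```python
import hashlib
import re

# Hypothetical query templates; variables are never spliced into the text,
# they travel separately as named parameters.
ALLOWED_TEMPLATES = [
    "g.V('name', name).out('interacts_with')",
]

# Whitelist of hashes of the variable-free query texts.
ALLOWED_HASHES = {
    hashlib.sha256(t.encode("utf-8")).hexdigest() for t in ALLOWED_TEMPLATES
}

# Pre-filter: parameters must look like plain identifiers.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_\-]{1,64}")


def check_query(template, params):
    """Return (template, params) if safe to forward, else raise ValueError."""
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    if digest not in ALLOWED_HASHES:
        raise ValueError("unknown query template")
    for key, value in params.items():
        if not isinstance(value, str) or not SAFE_VALUE.fullmatch(value):
            raise ValueError("rejected parameter: %s" % key)
    return template, params
```

A call such as `check_query(ALLOWED_TEMPLATES[0], {"name": "SRC_HUMAN"})` passes, while an unlisted script or a parameter containing quotes or parentheses raises `ValueError` before anything reaches the server. The identifier pattern is deliberately strict and would need widening for real protein names.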
  • 41. Limitations ● Bulk insert not natively implemented in Bulbs: – Insertion rate ~10 nodes / sec – Native Python binding tests: ● ~60 ms for ACID compliance (HDD write) ● ~1.8 ms / node cold insertion routines (HDD sequential write) ● ~0.3 ms / node hot write insertion routines (RAM buffer) – 500 - 1500 nodes / sec if packages of 1000 ● 6 h to fill the database up to theoretical limit – github.com/chefjerome/graphalchemy implements efficient flush based on Bulbs (alpha and thus unstable right now)
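The effect of batching on slide 41 can be approximated with a simple cost model: one fixed commit cost per transaction plus a per-node cost. The timings are the ones quoted on the slide; the linear model itself, and the names `nodes_per_second` and `chunked`, are assumptions made for this sketch.

```python
# Timings quoted on the slide, converted to seconds.
COMMIT_OVERHEAD = 0.060   # ~60 ms per transaction (ACID, HDD write)
COLD_PER_NODE = 0.0018    # ~1.8 ms per node, cold insertion
HOT_PER_NODE = 0.0003     # ~0.3 ms per node, hot insertion


def nodes_per_second(batch_size, per_node):
    """Throughput when batch_size nodes share one commit (assumed linear model)."""
    return batch_size / (COMMIT_OVERHEAD + batch_size * per_node)


def chunked(items, size):
    """Yield successive lists of at most `size` items, one per transaction."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


for batch in (1, 1000):
    print(batch, round(nodes_per_second(batch, COLD_PER_NODE)))
```

Under this model one node per commit gives roughly 16 nodes / sec, the same order of magnitude as the ~10 nodes / sec reported for Bulbs, and cold batches of 1000 give roughly 540 nodes / sec, near the lower end of the slide's range. The hot-path estimate comes out above the slide's upper bound, so the model clearly omits some overhead and should be read as an order-of-magnitude guide only.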
  • 43. TitanDB ● HBase / Cassandra / BerkeleyDB as backend
  • 44. TitanDB ● HBase / Cassandra / BerkeleyDB as storage backend ● Lucene / ElasticSearch as indexing backend ● Served over Rexster server ● Full distribution ● > 500 simultaneous connections (5000 is still stable) ● Automatic replication (Hadoop) ● Multiple simultaneous queries ● Sky is the only limit for storage quantities => TitanDB / HBase is stable up to 5 PB in production
  • 45. Neo4j for bioinformatics: parsing and curating Reactome.org ● Reactome.org: – BioPAX: XML / RDF / OWL
  • 46. Neo4j for bioinformatics: parsing and curating Reactome.org ● Reactome.org structure: – BioPAX: XML / RDF / OWL – Physical entities: ● Proteins, small molecules, complexes, RNA, DNA ● Fragments of physical entities – Interaction: ● Degradation / polymerisation / biochemical reactions ● Molecular interaction ● Genetic interaction – Pathways, genes, post-translational modifications...
  • 49. Neo4j for bioinformatics: parsing and curating Reactome.org ● Reality of Reactome.org: – Main connected component: ~ 22 000 entities, but 6 others with > 100 elements – Presence of generic classes: groups of objects – Proteins = mix between proteins, domains, groups, groups of domains… – 15 000 proteins, 5000 UniProt references – 156 genes, 56 RNA molecules => translation / transcription regulation is not well described
  • 50. Neo4j for bioinformatics: parsing and curating Reactome.org ● Reality of Reactome.org: – heavily comment-based: case of SRC
  • 51. Neo4j for bioinformatics: parsing and curating Reactome.org
  • 52. Neo4j for bioinformatics: parsing and curating Reactome.org ● Completed with HINT protein-protein interactions from the Yu lab at Cornell ● Re-indexed: – SwissProt protein names – Full names from SwissProt – Gene names – KEGG, GO, EMBL, ChEBI cross-references – PDB implemented, not re-run
  • 53. Neo4j for bioinformatics: parsing and curating Reactome.org ● Example of pathway Parsing
  • 54. Conclusion ● Systems biology is more about graphs than about systems of tables ● Graph databases are awesome ● Neo4j is terrific ● TitanDB is cool ● You should definitely pick one of them, load the Reactome.org dataset or whatever you are interested in, and play with it.
  • 56. Thanks Pr. Philp Bourne Pr. Bart Deplanke Cedric Merlot Li Xie Spencer Blieven Jiang Wang Julia Ponomarenko Cole Christie Andreas Prilic Lilia Iakoucheva

Editor's Notes

  1. Why join millions of rows if only 10 relationships are interesting? What to do if we want traversals?