SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Downloaden Sie, um offline zu lesen
Scaling up genomic 
analysis with ADAM 
Frank Austin Nothaft, UC Berkeley AMPLab 
fnothaft@berkeley.edu, @fnothaft 
10/27/2014
What is ADAM? 
• An open source, high performance, distributed 
platform for genomic analysis 
• ADAM defines a: 
1. Data schema and layout on disk* 
2. A Scala API 
3. A command line interface 
* Via Avro and Parquet
What’s the big picture? 
ADAM:! 
Core API + 
CLIs 
bdg-formats:! 
Data schemas 
RNAdam:! 
RNA analysis on 
ADAM 
avocado:! 
Distributed local 
assembler 
xASSEMBLEx:! 
GraphX-based de 
novo assembler 
bdg-services:! 
ADAM clusters 
PacMin:! 
String graph 
assembler
Implementation Overview 
• 34k LOC (96% Scala) 
• Apache 2 licensed OSS 
• 23 contributors across 10 institutions 
• Pushing for production 1.0 release towards end of year
Key Observations 
• Current genomics pipelines are I/O limited 
• Most genomics algorithms can be formulated as a 
data or graph parallel computation 
• These algorithms are heavy on iteration/pipelining 
• Data access pattern is write once, read many times 
• High coverage, whole genome will become main 
sequencing target (for human genetics)
Principles for Scalable 
Design in ADAM 
• Parallel FS and data representation (HDFS + 
Parquet) combined with in-memory computing 
eliminates disk bandwidth bottleneck 
• Spark allows efficient implementation of iterative/ 
pipelined Map-Reduce 
• Minimize data movement: send code to data
• An in-memory data parallel computing framework 
• Optimized for iterative jobs —> unlike Hadoop 
• Data maintained in memory unless inter-node 
movement needed (e.g., on repartitioning) 
• Presents a functional programing API, along with 
support for iterative programming via REPL 
• Used at scale on clusters with >2k nodes, 4TB 
datasets
Why Spark? 
• Current leading map-reduce framework: 
• First in-memory map-reduce platform 
• Used at scale in industry, supported in major distros (Cloudera, 
HortonWorks, MapR) 
• The API: 
• Fully functional API 
• Main API in Scala, also support Java, Python, R 
• Manages node/job failures via lineage, data locality/job assignment 
• Downstream tools (GraphX, MLLib)
Data Format 
• Avro schema encoded by 
Parquet 
• Schema can be updated 
without breaking backwards 
compatibility 
• Read schema looks a lot like 
BAM, but renormalized 
• Actively removing tags 
• Variant schema is strictly 
biallelic, a “cell in the matrix” 
record AlignmentRecord { 
union { null, Contig } contig = null; 
union { null, long } start = null; 
union { null, long } end = null; 
union { null, int } mapq = null; 
union { null, string } readName = null; 
union { null, string } sequence = null; 
union { null, string } mateReference = null; 
union { null, long } mateAlignmentStart = null; 
union { null, string } cigar = null; 
union { null, string } qual = null; 
union { null, string } recordGroupName = null; 
union { int, null } basesTrimmedFromStart = 0; 
union { int, null } basesTrimmedFromEnd = 0; 
union { boolean, null } readPaired = false; 
union { boolean, null } properPair = false; 
union { boolean, null } readMapped = false; 
union { boolean, null } mateMapped = false; 
union { boolean, null } firstOfPair = false; 
union { boolean, null } secondOfPair = false; 
union { boolean, null } failedVendorQualityChecks = false; 
union { boolean, null } duplicateRead = false; 
union { boolean, null } readNegativeStrand = false; 
union { boolean, null } mateNegativeStrand = false; 
union { boolean, null } primaryAlignment = false; 
union { boolean, null } secondaryAlignment = false; 
union { boolean, null } supplementaryAlignment = false; 
union { null, string } mismatchingPositions = null; 
union { null, string } origQual = null; 
union { null, string } attributes = null; 
union { null, string } recordGroupSequencingCenter = null; 
union { null, string } recordGroupDescription = null; 
union { null, long } recordGroupRunDateEpoch = null; 
union { null, string } recordGroupFlowOrder = null; 
union { null, string } recordGroupKeySequence = null; 
union { null, string } recordGroupLibrary = null; 
union { null, int } recordGroupPredictedMedianInsertSize = null; 
union { null, string } recordGroupPlatform = null; 
union { null, string } recordGroupPlatformUnit = null; 
union { null, string } recordGroupSample = null; 
union { null, Contig} mateContig = null; 
}
Parquet 
• ASF Incubator project, based on 
Google Dremel 
• http://www.parquet.io 
• High performance columnar 
store with support for projections 
and push-down predicates 
• 3 layers of parallelism: 
• File/row group 
• Column chunk 
• Page 
Image from Parquet format definition: https://github.com/Parquet/parquet-format
Filtering 
• Parquet provides pushdown predication 
• Evaluate filter on a subset of columns 
• Only read full set of projected columns for passing records 
• Full primary/secondary indexing support in Parquet 2.0 
• Very efficient if reading a small set of columns: 
• On disk, contig ID/start/end consume < 2% of space 
Image from Parquet format definition: https://github.com/Parquet/parquet-format
Compression 
• Parquet compresses 
at the column level: 
• RLE for repetitive 
columns 
• Dictionary 
encoding for 
quantized 
columns 
• ADAM uses a fully 
denormalized schema 
• Repetitive columns are 
RLE’d out 
• Delta encoding 
(Parquet 2.0) will aid 
with quality scores 
• ADAM is 5-25% smaller 
than compressed BAM
Parquet/Spark Integration 
• 1 row group in Parquet maps 
to 1 partition in Spark 
• We interact with Parquet via 
input/output formats 
• These apply projections 
and predicates, handle 
(de)compression 
• Spark builds and executes a 
computation DAG, manages 
data locality, errors/retries, etc. 
Parquet 
RG 1 RG 2 RG n … 
Spark 
Parquet Input Format 
Partition 
2 
… 
Parquet Output Format 
Parquet 
Partition 
1 
Partition 
n 
RG 1 RG 2 RG n …
Long-read assembly 
with PacMin
The State of Analysis 
• Conventional short-read alignment based pipelines 
are really good at calling SNPs 
• But, we’re still pretty bad at calling INDELs, and 
SVs 
• And are slow: 2 weeks to sequence, 1 week to 
analyze. Not fast enough for clinical use. 
• If we move away from short reads, do we have other 
options?
Opportunities 
• New read technologies are available 
• Provide much longer reads (250bp vs. >10kbp) 
• Different error model… (15% INDEL errors, vs. 2% 
SNP errors) 
• Generally, lower sequence specific bias 
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
If long reads are available… 
• We can use conventional methods: 
Carneiro et al, Genome Biology 2012
But! 
• Why not make raw assemblies out of the reads? 
Find overlapping reads Find consensus sequence 
for all pairs of reads (i,j): 
i j 
=? 
…ACACTGCGACTCATCGACTC… 
• Problems: 
1. Overlapping is O(n 
2 
) and single evaluation is expensive anyways 
2. Typical algorithms find a single consensus sequence; what if we’ve got 
polymorphisms?
Fast Overlapping with 
MinHashing 
• Wonderful realization by Berlin et al1: overlapping is 
similar to document similarity problem 
• Use MinHashing to approximate similarity: 
1: Berlin et al, bioRxiv 2014 
Per document/read, 
compute signature:! 
! 
1. Cut into shingles 
2. Apply random 
hashes to shingles 
3. Take min over all 
random hashes 
Hash into buckets:! 
! 
Signatures of length l 
can be hashed into b 
buckets, so we expect 
to compare all elements 
with similarity 
≥ (1/b)^(b/l) 
Compare:! 
! 
For two documents with 
signatures of length l, 
Jaccard similarity is 
estimated by 
(# equal hashes) / l 
! 
• Easy to implement in Spark: map, groupBy, map, filter
Overlaps to Assemblies 
• Finding pairwise overlaps gives us a directed 
graph between reads (lots of edges!)
Transitive Reduction 
• We can find a consensus between clique members 
• Or, we can reduce down: 
• Via two iterations of Pregel!
Actually Making Calls 
• From here, we need to call copy number per edge 
• Probably via Newton-Raphson based on coverage; we’re not sure yet. 
• Then, per position in each edge, call alleles: 
Notes:! 
Equation is from Li, Bioinformatics 2011 
g = genotype state 
m = ploidy 
휖 = probability allele was erroneously observed 
k = number of reads observed 
l = number of reads observed matching “reference” allele 
TBD: equation assumes biallelic observations at site and reference allele; we won’t have either of those conveniences…
An aside: Monoallelic 
Genotyping 
• Traditional probabilistic models for variant calling 
assume independence at each site 
• However, this throws away a lot of information 
• Can consider a different formulation of the problem: 
• Build a graph of the alleles 
• Find the allelic copy numbers that maximize 
likelihood
Allelic Graph
Allelic Graph 
ACACTCG 
C 
A 
TCTCA 
G 
C 
TCCACACT 
• Edges of graph define conditional probabilities 
• E.g., if ACACTCG is covered by 30 reads, and 
C is covered by 1 read, P(C | ACACTCG) is low 
• Can efficiently marginalize probabilities over graph 
using Eliminate algorithm1, exactly solve for argmax 
1. Jordan, “Probabilistic Graphical Models.”
Output 
• Current assemblers emit FASTA contigs 
• In layperson’s speak: long strings 
• We’ll emit “multigs”, which we’ll map back to reference 
graph 
• Multig = multi-allelic (polymorphic) contig 
• Working with UCSC, who’ve done some really neat work1 
deriving formalisms & building software for mapping 
between sequence graphs, and GA4GH ref. variation team 
1. Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.
Acknowledgements 
• UC Berkeley: Matt Massie, André Schumacher, 
Jey Kottalam, Christos Kozanitis, Adam Bloniarz! 
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael 
Linderman, Jeff Hammerbacher! 
• GenomeBridge: Timothy Danford, Carl Yeksigian! 
• Cloudera: Uri Laserson! 
• Microsoft Research: Jeremy Elson, Ravi Pandya! 
• And many other open source contributors: 23 
contributors to ADAM/BDG from >10 institutions

Weitere ähnliche Inhalte

Was ist angesagt?

Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Timothy Danford
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014fnothaft
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteMatt Massie
 
Strata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationStrata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationTimothy Danford
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...Sri Ambati
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleDatabricks
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems fnothaft
 
2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata2010 03 Lodoxf Openflydata
2010 03 Lodoxf OpenflydataJun Zhao
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...Sri Ambati
 

Was ist angesagt? (20)

Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Spark Summit East 2015
Spark Summit East 2015Spark Summit East 2015
Spark Summit East 2015
 
Strata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationStrata-Hadoop 2015 Presentation
Strata-Hadoop 2015 Presentation
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at Scale
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems
 
2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
 

Ähnlich wie Scalable up genomic analysis with ADAM

GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processinghuguk
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analysesfnothaft
 
Apache spark on planet scale
Apache spark on planet scaleApache spark on planet scale
Apache spark on planet scaleDenis Chapligin
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Spark Summit
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss
 
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulationsSFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulationsSouth Tyrol Free Software Conference
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and OutTravis Oliphant
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad ranaData Con LA
 
Spark Gotchas and Lessons Learned
Spark Gotchas and Lessons LearnedSpark Gotchas and Lessons Learned
Spark Gotchas and Lessons LearnedJen Waller
 
220 runtime environments
220 runtime environments220 runtime environments
220 runtime environmentsJ'tong Atong
 
Fosdem 2011 - A Common Graph Database Access Layer for .Net and Mono
Fosdem 2011 - A Common Graph Database Access Layer for .Net and MonoFosdem 2011 - A Common Graph Database Access Layer for .Net and Mono
Fosdem 2011 - A Common Graph Database Access Layer for .Net and MonoAchim Friedland
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Максим Харченко. Erlang lincx
Максим Харченко. Erlang lincxМаксим Харченко. Erlang lincx
Максим Харченко. Erlang lincxAlina Dolgikh
 
0.5mln packets per second with Erlang
0.5mln packets per second with Erlang0.5mln packets per second with Erlang
0.5mln packets per second with ErlangMaxim Kharchenko
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Anubhav Jain
 
So You Want to Write a Connector?
So You Want to Write a Connector? So You Want to Write a Connector?
So You Want to Write a Connector? confluent
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Ontico
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Alexey Zinoviev
 
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...Databricks
 

Ähnlich wie Scalable up genomic analysis with ADAM (20)

GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processing
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analyses
 
Apache spark on planet scale
Apache spark on planet scaleApache spark on planet scale
Apache spark on planet scale
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulationsSFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Spark Gotchas and Lessons Learned
Spark Gotchas and Lessons LearnedSpark Gotchas and Lessons Learned
Spark Gotchas and Lessons Learned
 
220 runtime environments
220 runtime environments220 runtime environments
220 runtime environments
 
Fosdem 2011 - A Common Graph Database Access Layer for .Net and Mono
Fosdem 2011 - A Common Graph Database Access Layer for .Net and MonoFosdem 2011 - A Common Graph Database Access Layer for .Net and Mono
Fosdem 2011 - A Common Graph Database Access Layer for .Net and Mono
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Максим Харченко. Erlang lincx
Максим Харченко. Erlang lincxМаксим Харченко. Erlang lincx
Максим Харченко. Erlang lincx
 
0.5mln packets per second with Erlang
0.5mln packets per second with Erlang0.5mln packets per second with Erlang
0.5mln packets per second with Erlang
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
So You Want to Write a Connector?
So You Want to Write a Connector? So You Want to Write a Connector?
So You Want to Write a Connector?
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
 

Mehr von fnothaft

Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Modelsfnothaft
 
CS176: Genome Assembly
CS176: Genome AssemblyCS176: Genome Assembly
CS176: Genome Assemblyfnothaft
 
Execution Environments
Execution EnvironmentsExecution Environments
Execution Environmentsfnothaft
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Handsfnothaft
 
Adam bosc-071114
Adam bosc-071114Adam bosc-071114
Adam bosc-071114fnothaft
 

Mehr von fnothaft (6)

Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Models
 
CS176: Genome Assembly
CS176: Genome AssemblyCS176: Genome Assembly
CS176: Genome Assembly
 
Execution Environments
Execution EnvironmentsExecution Environments
Execution Environments
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Hands
 
Adam bosc-071114
Adam bosc-071114Adam bosc-071114
Adam bosc-071114
 

Kürzlich hochgeladen

Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfChristianCDAM
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate productionChinnuNinan
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectssuserb6619e
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 

Kürzlich hochgeladen (20)

Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate production
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
 

Scalable up genomic analysis with ADAM

  • 1. Scaling up genomic analysis with ADAM Frank Austin Nothaft, UC Berkeley AMPLab fnothaft@berkeley.edu, @fnothaft 10/27/2014
  • 2. What is ADAM? • An open source, high performance, distributed platform for genomic analysis • ADAM defines a: 1. Data schema and layout on disk* 2. A Scala API 3. A command line interface * Via Avro and Parquet
  • 3. What’s the big picture? ADAM:! Core API + CLIs bdg-formats:! Data schemas RNAdam:! RNA analysis on ADAM avocado:! Distributed local assembler xASSEMBLEx:! GraphX-based de novo assembler bdg-services:! ADAM clusters PacMin:! String graph assembler
  • 4. Implementation Overview • 34k LOC (96% Scala) • Apache 2 licensed OSS • 23 contributors across 10 institutions • Pushing for production 1.0 release towards end of year
  • 5. Key Observations • Current genomics pipelines are I/O limited • Most genomics algorithms can be formulated as a data or graph parallel computation • These algorithms are heavy on iteration/pipelining • Data access pattern is write once, read many times • High coverage, whole genome will become main sequencing target (for human genetics)
  • 6. Principles for Scalable Design in ADAM • Parallel FS and data representation (HDFS + Parquet) combined with in-memory computing eliminates disk bandwidth bottleneck • Spark allows efficient implementation of iterative/ pipelined Map-Reduce • Minimize data movement: send code to data
  • 7. • An in-memory data parallel computing framework • Optimized for iterative jobs —> unlike Hadoop • Data maintained in memory unless inter-node movement needed (e.g., on repartitioning) • Presents a functional programing API, along with support for iterative programming via REPL • Used at scale on clusters with >2k nodes, 4TB datasets
  • 8. Why Spark? • Current leading map-reduce framework: • First in-memory map-reduce platform • Used at scale in industry, supported in major distros (Cloudera, HortonWorks, MapR) • The API: • Fully functional API • Main API in Scala, also support Java, Python, R • Manages node/job failures via lineage, data locality/job assignment • Downstream tools (GraphX, MLLib)
  • 9. Data Format • Avro schema encoded by Parquet • Schema can be updated without breaking backwards compatibility • Read schema looks a lot like BAM, but renormalized • Actively removing tags • Variant schema is strictly biallelic, a “cell in the matrix” record AlignmentRecord { union { null, Contig } contig = null; union { null, long } start = null; union { null, long } end = null; union { null, int } mapq = null; union { null, string } readName = null; union { null, string } sequence = null; union { null, string } mateReference = null; union { null, long } mateAlignmentStart = null; union { null, string } cigar = null; union { null, string } qual = null; union { null, string } recordGroupName = null; union { int, null } basesTrimmedFromStart = 0; union { int, null } basesTrimmedFromEnd = 0; union { boolean, null } readPaired = false; union { boolean, null } properPair = false; union { boolean, null } readMapped = false; union { boolean, null } mateMapped = false; union { boolean, null } firstOfPair = false; union { boolean, null } secondOfPair = false; union { boolean, null } failedVendorQualityChecks = false; union { boolean, null } duplicateRead = false; union { boolean, null } readNegativeStrand = false; union { boolean, null } mateNegativeStrand = false; union { boolean, null } primaryAlignment = false; union { boolean, null } secondaryAlignment = false; union { boolean, null } supplementaryAlignment = false; union { null, string } mismatchingPositions = null; union { null, string } origQual = null; union { null, string } attributes = null; union { null, string } recordGroupSequencingCenter = null; union { null, string } recordGroupDescription = null; union { null, long } recordGroupRunDateEpoch = null; union { null, string } recordGroupFlowOrder = null; union { null, string } recordGroupKeySequence = null; union { null, string } recordGroupLibrary = null; union { null, int } recordGroupPredictedMedianInsertSize = null; union { null, string } recordGroupPlatform = null; union { null, string } recordGroupPlatformUnit = null; union { null, string } recordGroupSample = null; union { null, Contig} mateContig = null; }
  • 10. Parquet • ASF Incubator project, based on Google Dremel • http://www.parquet.io • High performance columnar store with support for projections and push-down predicates • 3 layers of parallelism: • File/row group • Column chunk • Page Image from Parquet format definition: https://github.com/Parquet/parquet-format
  • 11. Filtering • Parquet provides pushdown predication • Evaluate filter on a subset of columns • Only read full set of projected columns for passing records • Full primary/secondary indexing support in Parquet 2.0 • Very efficient if reading a small set of columns: • On disk, contig ID/start/end consume < 2% of space Image from Parquet format definition: https://github.com/Parquet/parquet-format
  • 12. Compression • Parquet compresses at the column level: • RLE for repetitive columns • Dictionary encoding for quantized columns • ADAM uses a fully denormalized schema • Repetitive columns are RLE’d out • Delta encoding (Parquet 2.0) will aid with quality scores • ADAM is 5-25% smaller than compressed BAM
  • 13. Parquet/Spark Integration • 1 row group in Parquet maps to 1 partition in Spark • We interact with Parquet via input/output formats • These apply projections and predicates, handle (de)compression • Spark builds and executes a computation DAG, manages data locality, errors/retries, etc. Parquet RG 1 RG 2 RG n … Spark Parquet Input Format Partition 2 … Parquet Output Format Parquet Partition 1 Partition n RG 1 RG 2 RG n …
  • 15. The State of Analysis • Conventional short-read alignment based pipelines are really good at calling SNPs • But, we’re still pretty bad at calling INDELs, and SVs • And are slow: 2 weeks to sequence, 1 week to analyze. Not fast enough for clinical use. • If we move away from short reads, do we have other options?
  • 16. Opportunities • New read technologies are available • Provide much longer reads (250bp vs. >10kbp) • Different error model… (15% INDEL errors, vs. 2% SNP errors) • Generally, lower sequence specific bias Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
  • 17. If long reads are available… • We can use conventional methods: Carneiro et al, Genome Biology 2012
  • 18. But! • Why not make raw assemblies out of the reads? Find overlapping reads Find consensus sequence for all pairs of reads (i,j): i j =? …ACACTGCGACTCATCGACTC… • Problems: 1. Overlapping is O(n 2 ) and single evaluation is expensive anyways 2. Typical algorithms find a single consensus sequence; what if we’ve got polymorphisms?
  • 19. Fast Overlapping with MinHashing • Wonderful realization by Berlin et al1: overlapping is similar to document similarity problem • Use MinHashing to approximate similarity: 1: Berlin et al, bioRxiv 2014 Per document/read, compute signature:! ! 1. Cut into shingles 2. Apply random hashes to shingles 3. Take min over all random hashes Hash into buckets:! ! Signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l) Compare:! ! For two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l ! • Easy to implement in Spark: map, groupBy, map, filter
  • 20. Overlaps to Assemblies • Finding pairwise overlaps gives us a directed graph between reads (lots of edges!)
  • 21. Transitive Reduction • We can find a consensus between clique members • Or, we can reduce down: • Via two iterations of Pregel!
  • 22. Actually Making Calls • From here, we need to call copy number per edge • Probably via Newton-Raphson based on coverage; we’re not sure yet. • Then, per position in each edge, call alleles: Notes:! Equation is from Li, Bioinformatics 2011 g = genotype state m = ploidy 휖 = probability allele was erroneously observed k = number of reads observed l = number of reads observed matching “reference” allele TBD: equation assumes biallelic observations at site and reference allele; we won’t have either of those conveniences…
  • 23. An aside: Monoallelic Genotyping • Traditional probabilistic models for variant calling assume independence at each site • However, this throws away a lot of information • Can consider a different formulation of the problem: • Build a graph of the alleles • Find the allelic copy numbers that maximize likelihood
  • 25. Allelic Graph ACACTCG C A TCTCA G C TCCACACT • Edges of graph define conditional probabilities • E.g., if ACACTCG is covered by 30 reads, and C is covered by 1 read, P(C | ACACTCG) is low • Can efficiently marginalize probabilities over graph using Eliminate algorithm1, exactly solve for argmax 1. Jordan, “Probabilistic Graphical Models.”
  • 26. Output • Current assemblers emit FASTA contigs • In layperson’s speak: long strings • We’ll emit “multigs”, which we’ll map back to reference graph • Multig = multi-allelic (polymorphic) contig • Working with UCSC, who’ve done some really neat work1 deriving formalisms & building software for mapping between sequence graphs, and GA4GH ref. variation team 1. Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.
  • 27. Acknowledgements • UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis, Adam Bloniarz! • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher! • GenomeBridge: Timothy Danford, Carl Yeksigian! • Cloudera: Uri Laserson! • Microsoft Research: Jeremy Elson, Ravi Pandya! • And many other open source contributors: 23 contributors to ADAM/BDG from >10 institutions