SlideShare ist ein Scribd-Unternehmen logo
1 von 17
Downloaden Sie, um offline zu lesen
Why is Bioinformatics 
(well, really, “genomics”) 
a Good Fit for Spark? 
Timothy Danford 
AMPLab
A One-Slide Introduction to Genomics
Bioinformatics computation is batch 
processing and workflows 
● Bioinformatics has a lot of 
“workflow engines” 
○ Galaxy, Taverna, Firehose, Zamboni, 
Queue, Luigi, bPipe 
○ bash scripts 
○ even make, fer cryin’ out loud 
○ a new one every day 
● Bioinformatics software 
development is still largely a 
research activity
State-of-the-Art infrastructure: 
shared filesystems, handwritten parallelism 
● Hand-written task creation 
● File formats instead of APIs or 
data models 
○ formats are poorly defined 
○ contain optional or 
redundant fields 
○ semantics are unclear 
● Workflow engines can’t take 
advantage of common 
parallelism between stages
So, why Spark?
Most of Genomics is 1-D Geometry
Most of Genomics is 1-D Geometry
The rest is iterative evaluation of 
probabilistic models!
Spark RDDs and Partitioners allow 
declarative parallelization for genomics 
● Genomics computation 
is parallelized in a small, 
standard number of 
ways 
○ by position 
○ by sample 
● Declarative, flexible 
partitioning schemes 
are useful
Spark can easily express genomics primitives: 
join by genomic overlap 
1. Calculate disjoint 
regions based on left 
(blue) set 
2. Partition both sets by 
disjoint regions 
3. Merge-join within each 
partition 
4. (Optional) aggregation 
across joined pairs
ADAM is Genomics + Spark 
● A rewrite of core bioinformatics tools and algorithms in Spark 
● Combines three 
technologies 
○ Spark 
○ Parquet 
○ Avro 
● Apache 2-licensed 
● Started at the AMPLab 
http://bdgenomics.org/
Avro and Parquet are just as critical to 
ADAM as Spark 
● Avro to define data models 
● Parquet for serialization format 
● Still need to answer design 
questions 
○ how wide are the schemas? 
○ how much do we follow existing 
formats? 
○ how do carry through projections?
Still need to convince bioinformaticians to 
rewrite their software! 
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
Still need to convince bioinformaticians to 
rewrite their software! 
● A single piece of a 
single filtering stage 
for a somatic variant 
caller 
● “11-base-pair window 
centered on a candidate 
mutation” actually 
turns out to be 
optimized for a 
particular file format 
and sort order 
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
The Future: 
Distributed and Incremental? 
● Today: 5k samples x 20 Gb / sample 
● Tomorrow: 1m+ samples @ 200+ Gb / sample? 
● More and more analysis is aggregative 
○ joint variant calling, 
○ panels of normal samples, 
○ collective variant annotation 
● And “data collection” will never be finished
Acknowledgements 
Matt Massie (AMPLab) 
Frank Nothaft (AMPLab) 
Carl Yeksigian (DataStax) 
Anthony Philippakis (Broad Institute) 
Jeff Hammerbacher (Cloudera / Mt. Sinai) 
Thank you! 
(questions?)

Weitere ähnliche Inhalte

Was ist angesagt?

Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMfnothaft
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadofnothaft
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014fnothaft
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMfnothaft
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...Sri Ambati
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Databricks
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shangSAIL_QU
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...University of California, San Diego
 
Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Kim Herzig
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionDataWorks Summit
 
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Spark Summit
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Spark Summit
 
RDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactRDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactJean-Paul Calbimonte
 

Was ist angesagt? (20)

Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocado
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shang
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
 
Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
 
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
 
RDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactRDF Stream Processing: Let's React
RDF Stream Processing: Let's React
 

Ähnlich wie Why is Bioinformatics a Good Fit for Spark?

Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinityPeterMorrell4
 
Are High Level Programming Languages for Multicore and Safety Critical Conver...
Are High Level Programming Languages for Multicore and Safety Critical Conver...Are High Level Programming Languages for Multicore and Safety Critical Conver...
Are High Level Programming Languages for Multicore and Safety Critical Conver...InfinIT - Innovationsnetværket for it
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesLynn Langit
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingSaliya Ekanayake
 
Parallelization using open mp
Parallelization using open mpParallelization using open mp
Parallelization using open mpranjit banshpal
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibilityc.titus.brown
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Data analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomicsData analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomicsAltuna Akalin
 
Initial steps towards a production platform for DNA sequence analysis on the ...
Initial steps towards a production platform for DNA sequence analysis on the ...Initial steps towards a production platform for DNA sequence analysis on the ...
Initial steps towards a production platform for DNA sequence analysis on the ...Barbera van Schaik
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...VMware Tanzu
 
INTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging Face
INTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging FaceINTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging Face
INTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging Faceapidays
 
Getting Started with SPARK
Getting Started with SPARKGetting Started with SPARK
Getting Started with SPARKAdaCore
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
 
RDF Join Query Processing with Dual Simulation Pruning
RDF Join Query Processing with Dual Simulation PruningRDF Join Query Processing with Dual Simulation Pruning
RDF Join Query Processing with Dual Simulation Pruningwajrcs
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscoverygwprice
 

Ähnlich wie Why is Bioinformatics a Good Fit for Spark? (20)

Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinity
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
 
Are High Level Programming Languages for Multicore and Safety Critical Conver...
Are High Level Programming Languages for Multicore and Safety Critical Conver...Are High Level Programming Languages for Multicore and Safety Critical Conver...
Are High Level Programming Languages for Multicore and Safety Critical Conver...
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examples
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and Benchmarking
 
Parallelization using open mp
Parallelization using open mpParallelization using open mp
Parallelization using open mp
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Data analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomicsData analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomics
 
Initial steps towards a production platform for DNA sequence analysis on the ...
Initial steps towards a production platform for DNA sequence analysis on the ...Initial steps towards a production platform for DNA sequence analysis on the ...
Initial steps towards a production platform for DNA sequence analysis on the ...
 
groovy & grails - lecture 1
groovy & grails - lecture 1groovy & grails - lecture 1
groovy & grails - lecture 1
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
 
INTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging Face
INTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging FaceINTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging Face
INTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging Face
 
Getting Started with SPARK
Getting Started with SPARKGetting Started with SPARK
Getting Started with SPARK
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
RDF Join Query Processing with Dual Simulation Pruning
RDF Join Query Processing with Dual Simulation PruningRDF Join Query Processing with Dual Simulation Pruning
RDF Join Query Processing with Dual Simulation Pruning
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscovery
 

Kürzlich hochgeladen

Call Girls Service in Virugambakkam - 7001305949 | 24x7 Service Available Nea...
Call Girls Service in Virugambakkam - 7001305949 | 24x7 Service Available Nea...Call Girls Service in Virugambakkam - 7001305949 | 24x7 Service Available Nea...
Call Girls Service in Virugambakkam - 7001305949 | 24x7 Service Available Nea...Nehru place Escorts
 
Hemostasis Physiology and Clinical correlations by Dr Faiza.pdf
Hemostasis Physiology and Clinical correlations by Dr Faiza.pdfHemostasis Physiology and Clinical correlations by Dr Faiza.pdf
Hemostasis Physiology and Clinical correlations by Dr Faiza.pdfMedicoseAcademics
 
Call Girls Hebbal Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hebbal Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls Hebbal Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hebbal Just Call 7001305949 Top Class Call Girl Service Availablenarwatsonia7
 
Low Rate Call Girls Mumbai Suman 9910780858 Independent Escort Service Mumbai
Low Rate Call Girls Mumbai Suman 9910780858 Independent Escort Service MumbaiLow Rate Call Girls Mumbai Suman 9910780858 Independent Escort Service Mumbai
Low Rate Call Girls Mumbai Suman 9910780858 Independent Escort Service Mumbaisonalikaur4
 
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...narwatsonia7
 
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original PhotosCall Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original Photosnarwatsonia7
 
College Call Girls Vyasarpadi Whatsapp 7001305949 Independent Escort Service
College Call Girls Vyasarpadi Whatsapp 7001305949 Independent Escort ServiceCollege Call Girls Vyasarpadi Whatsapp 7001305949 Independent Escort Service
College Call Girls Vyasarpadi Whatsapp 7001305949 Independent Escort ServiceNehru place Escorts
 
Glomerular Filtration and determinants of glomerular filtration .pptx
Glomerular Filtration and  determinants of glomerular filtration .pptxGlomerular Filtration and  determinants of glomerular filtration .pptx
Glomerular Filtration and determinants of glomerular filtration .pptxDr.Nusrat Tariq
 
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call NowKolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call NowNehru place Escorts
 
Mumbai Call Girls Service 9910780858 Real Russian Girls Looking Models
Mumbai Call Girls Service 9910780858 Real Russian Girls Looking ModelsMumbai Call Girls Service 9910780858 Real Russian Girls Looking Models
Mumbai Call Girls Service 9910780858 Real Russian Girls Looking Modelssonalikaur4
 
High Profile Call Girls Kodigehalli - 7001305949 Escorts Service with Real Ph...
High Profile Call Girls Kodigehalli - 7001305949 Escorts Service with Real Ph...High Profile Call Girls Kodigehalli - 7001305949 Escorts Service with Real Ph...
High Profile Call Girls Kodigehalli - 7001305949 Escorts Service with Real Ph...narwatsonia7
 
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment BookingCall Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment Bookingnarwatsonia7
 
Call Girls ITPL Just Call 7001305949 Top Class Call Girl Service Available
Call Girls ITPL Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls ITPL Just Call 7001305949 Top Class Call Girl Service Available
Call Girls ITPL Just Call 7001305949 Top Class Call Girl Service Availablenarwatsonia7
 
Call Girls Service in Bommanahalli - 7001305949 with real photos and phone nu...
Call Girls Service in Bommanahalli - 7001305949 with real photos and phone nu...Call Girls Service in Bommanahalli - 7001305949 with real photos and phone nu...
Call Girls Service in Bommanahalli - 7001305949 with real photos and phone nu...narwatsonia7
 
call girls in green park DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
call girls in green park  DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️call girls in green park  DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
call girls in green park DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️saminamagar
 
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Availablenarwatsonia7
 
High Profile Call Girls Mavalli - 7001305949 | 24x7 Service Available Near Me
High Profile Call Girls Mavalli - 7001305949 | 24x7 Service Available Near MeHigh Profile Call Girls Mavalli - 7001305949 | 24x7 Service Available Near Me
High Profile Call Girls Mavalli - 7001305949 | 24x7 Service Available Near Menarwatsonia7
 
Russian Call Girls Chickpet - 7001305949 Booking and charges genuine rate for...
Russian Call Girls Chickpet - 7001305949 Booking and charges genuine rate for...Russian Call Girls Chickpet - 7001305949 Booking and charges genuine rate for...
Russian Call Girls Chickpet - 7001305949 Booking and charges genuine rate for...narwatsonia7
 
Housewife Call Girls Bangalore - Call 7001305949 Rs-3500 with A/C Room Cash o...
Housewife Call Girls Bangalore - Call 7001305949 Rs-3500 with A/C Room Cash o...Housewife Call Girls Bangalore - Call 7001305949 Rs-3500 with A/C Room Cash o...
Housewife Call Girls Bangalore - Call 7001305949 Rs-3500 with A/C Room Cash o...narwatsonia7
 
call girls in Connaught Place DELHI 🔝 >༒9540349809 🔝 genuine Escort Service ...
call girls in Connaught Place  DELHI 🔝 >༒9540349809 🔝 genuine Escort Service ...call girls in Connaught Place  DELHI 🔝 >༒9540349809 🔝 genuine Escort Service ...
call girls in Connaught Place DELHI 🔝 >༒9540349809 🔝 genuine Escort Service ...saminamagar
 

Kürzlich hochgeladen (20)

Call Girls Service in Virugambakkam - 7001305949 | 24x7 Service Available Nea...
Call Girls Service in Virugambakkam - 7001305949 | 24x7 Service Available Nea...Call Girls Service in Virugambakkam - 7001305949 | 24x7 Service Available Nea...
Call Girls Service in Virugambakkam - 7001305949 | 24x7 Service Available Nea...
 
Hemostasis Physiology and Clinical correlations by Dr Faiza.pdf
Hemostasis Physiology and Clinical correlations by Dr Faiza.pdfHemostasis Physiology and Clinical correlations by Dr Faiza.pdf
Hemostasis Physiology and Clinical correlations by Dr Faiza.pdf
 
Call Girls Hebbal Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hebbal Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls Hebbal Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hebbal Just Call 7001305949 Top Class Call Girl Service Available
 
Low Rate Call Girls Mumbai Suman 9910780858 Independent Escort Service Mumbai
Low Rate Call Girls Mumbai Suman 9910780858 Independent Escort Service MumbaiLow Rate Call Girls Mumbai Suman 9910780858 Independent Escort Service Mumbai
Low Rate Call Girls Mumbai Suman 9910780858 Independent Escort Service Mumbai
 
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
 
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original PhotosCall Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
 
College Call Girls Vyasarpadi Whatsapp 7001305949 Independent Escort Service
College Call Girls Vyasarpadi Whatsapp 7001305949 Independent Escort ServiceCollege Call Girls Vyasarpadi Whatsapp 7001305949 Independent Escort Service
College Call Girls Vyasarpadi Whatsapp 7001305949 Independent Escort Service
 
Glomerular Filtration and determinants of glomerular filtration .pptx
Glomerular Filtration and  determinants of glomerular filtration .pptxGlomerular Filtration and  determinants of glomerular filtration .pptx
Glomerular Filtration and determinants of glomerular filtration .pptx
 
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call NowKolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
 
Mumbai Call Girls Service 9910780858 Real Russian Girls Looking Models
Mumbai Call Girls Service 9910780858 Real Russian Girls Looking ModelsMumbai Call Girls Service 9910780858 Real Russian Girls Looking Models
Mumbai Call Girls Service 9910780858 Real Russian Girls Looking Models
 
High Profile Call Girls Kodigehalli - 7001305949 Escorts Service with Real Ph...
High Profile Call Girls Kodigehalli - 7001305949 Escorts Service with Real Ph...High Profile Call Girls Kodigehalli - 7001305949 Escorts Service with Real Ph...
High Profile Call Girls Kodigehalli - 7001305949 Escorts Service with Real Ph...
 
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment BookingCall Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
 
Call Girls ITPL Just Call 7001305949 Top Class Call Girl Service Available
Call Girls ITPL Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls ITPL Just Call 7001305949 Top Class Call Girl Service Available
Call Girls ITPL Just Call 7001305949 Top Class Call Girl Service Available
 
Call Girls Service in Bommanahalli - 7001305949 with real photos and phone nu...
Call Girls Service in Bommanahalli - 7001305949 with real photos and phone nu...Call Girls Service in Bommanahalli - 7001305949 with real photos and phone nu...
Call Girls Service in Bommanahalli - 7001305949 with real photos and phone nu...
 
call girls in green park DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
call girls in green park  DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️call girls in green park  DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
call girls in green park DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
 
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
 
High Profile Call Girls Mavalli - 7001305949 | 24x7 Service Available Near Me
High Profile Call Girls Mavalli - 7001305949 | 24x7 Service Available Near MeHigh Profile Call Girls Mavalli - 7001305949 | 24x7 Service Available Near Me
High Profile Call Girls Mavalli - 7001305949 | 24x7 Service Available Near Me
 
Russian Call Girls Chickpet - 7001305949 Booking and charges genuine rate for...
Russian Call Girls Chickpet - 7001305949 Booking and charges genuine rate for...Russian Call Girls Chickpet - 7001305949 Booking and charges genuine rate for...
Russian Call Girls Chickpet - 7001305949 Booking and charges genuine rate for...
 
Housewife Call Girls Bangalore - Call 7001305949 Rs-3500 with A/C Room Cash o...
Housewife Call Girls Bangalore - Call 7001305949 Rs-3500 with A/C Room Cash o...Housewife Call Girls Bangalore - Call 7001305949 Rs-3500 with A/C Room Cash o...
Housewife Call Girls Bangalore - Call 7001305949 Rs-3500 with A/C Room Cash o...
 
call girls in Connaught Place DELHI 🔝 >༒9540349809 🔝 genuine Escort Service ...
call girls in Connaught Place  DELHI 🔝 >༒9540349809 🔝 genuine Escort Service ...call girls in Connaught Place  DELHI 🔝 >༒9540349809 🔝 genuine Escort Service ...
call girls in Connaught Place DELHI 🔝 >༒9540349809 🔝 genuine Escort Service ...
 

Why is Bioinformatics a Good Fit for Spark?

  • 1. Why is Bioinformatics (well, really, “genomics”) a Good Fit for Spark? Timothy Danford AMPLab
  • 3. Bioinformatics computation is batch processing and workflows ● Bioinformatics has a lot of “workflow engines” ○ Galaxy, Taverna, Firehose, Zamboni, Queue, Luigi, bPipe ○ bash scripts ○ even make, fer cryin’ out loud ○ a new one every day ● Bioinformatics software development is still largely a research activity
  • 4. State-of-the-Art infrastructure: shared filesystems, handwritten parallelism ● Hand-written task creation ● File formats instead of APIs or data models ○ formats are poorly defined ○ contain optional or redundant fields ○ semantics are unclear ● Workflow engines can’t take advantage of common parallelism between stages
  • 5.
  • 7. Most of Genomics is 1-D Geometry
  • 8. Most of Genomics is 1-D Geometry
  • 9. The rest is iterative evaluation of probabilistic models!
  • 10. Spark RDDs and Partitioners allow declarative parallelization for genomics ● Genomics computation is parallelized in a small, standard number of ways ○ by position ○ by sample ● Declarative, flexible partitioning schemes are useful
  • 11. Spark can easily express genomics primitives: join by genomic overlap 1. Calculate disjoint regions based on left (blue) set 2. Partition both sets by disjoint regions 3. Merge-join within each partition 4. (Optional) aggregation across joined pairs
  • 12. ADAM is Genomics + Spark ● A rewrite of core bioinformatics tools and algorithms in Spark ● Combines three technologies ○ Spark ○ Parquet ○ Avro ● Apache 2-licensed ● Started at the AMPLab http://bdgenomics.org/
  • 13. Avro and Parquet are just as critical to ADAM as Spark ● Avro to define data models ● Parquet for serialization format ● Still need to answer design questions ○ how wide are the schemas? ○ how much do we follow existing formats? ○ how do carry through projections?
  • 14. Still need to convince bioinformaticians to rewrite their software! Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
  • 15. Still need to convince bioinformaticians to rewrite their software! ● A single piece of a single filtering stage for a somatic variant caller ● “11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
  • 16. The Future: Distributed and Incremental? ● Today: 5k samples x 20 Gb / sample ● Tomorrow: 1m+ samples @ 200+ Gb / sample? ● More and more analysis is aggregative ○ joint variant calling, ○ panels of normal samples, ○ collective variant annotation ● And “data collection” will never be finished
  • 17. Acknowledgements Matt Massie (AMPLab) Frank Nothaft (AMPLab) Carl Yeksigian (DataStax) Anthony Philippakis (Broad Institute) Jeff Hammerbacher (Cloudera / Mt. Sinai) Thank you! (questions?)