Lightning fast genomics 
With Spark and ADAM
Who are we? 
Andy 
@Noootsab 
@NextLab_be 
@Wajug co-driver 
@Devoxx4Kids organizer 
Maths & CS 
Data lover: geo, open, massive 
Fool 
Xavier 
@xtordoir 
SilicoCloud 
-> Physics 
-> Data analysis 
-> genomics 
-> scalable systems 
-> ...
Genomics 
What is genomics about? 
Medical Diagnostics 
Drug response 
Disease mechanisms
- A human genome is a sequence of 3 billion nucleotides
(“bases”)
- About 1 base in 1,000 varies across the human population
- Genomes encode bio-molecules (tens of thousands) 
- These molecules interact together 
...and with environment 
→ Biological systems are very complex
Genomics 
State of the art 
- growing technological capacity 
- cost reduction 
- growing data volumes
- IT becomes the bottleneck (cost and latency)
- data is sacrificed through sampling or cut-offs
ref: Andrea Sboner et al.
Genomics 
Blocking points 
- “legacy stack” not designed to scale (C, Perl, …)
- HPC approach is a poor fit (the workload is data-intensive)
Genomics 
Future of genomics 
- Personal genomes (e.g. 1,000,000 genomes for cancer 
research) 
- New sequencing technologies 
- Sequence “stuff” as needed (e.g. microbiome, 
diagnostics) 
- medicalCondition = f(genomics, environmentHistory)
Genomics 
Need for scalability → Scala & Spark
Need for simplicity and clarity → ADAM
Parquet 101 
Columnar storage 
Row oriented 
Column oriented
> Homogeneous collocated data 
> Better range access 
> Better encoding
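The three benefits above can be illustrated with a minimal sketch in plain Python (not Parquet itself). The toy rows and the helper names `to_columns` and `run_length_encode` are hypothetical, and run-length encoding is used here only as one example of an encoding that profits from homogeneous, collocated values.

```python
from itertools import groupby

# Hypothetical toy table: (sample, chromosome, genotype code) rows.
rows = [
    ("S1", "6", 0),
    ("S2", "6", 0),
    ("S3", "6", 1),
    ("S4", "6", 1),
    ("S5", "6", 1),
]

def to_columns(rows):
    """Turn a list of row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]

def run_length_encode(values):
    """Collapse runs of equal neighbours into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

samples, chromosomes, genotypes = to_columns(rows)

# Values of one column share a type and often repeat, so they compress well.
encoded_chromosomes = run_length_encode(chromosomes)
encoded_genotypes = run_length_encode(genotypes)
```

In the row-oriented layout the same values are interleaved with unrelated fields, so neither the run-length trick nor a scan of a single column is as cheap.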
Parquet 101 
Efficient encoding of nested typed structures 
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}
Nested structure →Tree 
Empty levels →Branch pruning 
Repetitions →Metadata (index) 
Types → Safe/Fast codec
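To make the "nested structure → tree" point concrete, here is a small sketch that walks the Document schema and lists its leaves; each leaf path corresponds to one column stored on its own. The nested-dict representation and the function name `leaf_columns` are assumptions made for illustration and are not Parquet's API.

```python
# The Document schema from the slide, written as a nested dict (an
# illustrative encoding, not Parquet's own API): groups map to dicts,
# primitive fields map to their type name.
schema = {
    "DocId": "int64",
    "Links": {"Backward": "int64", "Forward": "int64"},
    "Name": {
        "Language": {"Code": "string", "Country": "string"},
        "Url": "string",
    },
}

def leaf_columns(node, prefix=()):
    """Walk the schema tree depth-first and yield (path, type) per leaf."""
    for name, child in node.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            yield from leaf_columns(child, path)
        else:
            yield ".".join(path), child

columns = list(leaf_columns(schema))
for path, type_name in columns:
    print(path, type_name)
```

The repetition and optionality information that this sketch ignores is what the slide calls metadata: it is stored alongside each column so that records can be reassembled.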
ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Parquet 101 
Optimized distributed storage (e.g. in HDFS)
ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Parquet 101 
Efficient (schema based) serialization: AVRO 
JSON Schema:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
IDL:
record User {
  string name;
  union { null, int } favorite_number = null;
  union { null, string } favorite_color = null;
}
The JSON schema is part of the:
● protocol
● serialization
→ less metadata
Define: IDL → JSON
Send: Binary → JSON
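The "less metadata" point can be sketched by hand-encoding one User record following the rules of the Avro binary encoding (zig-zag varints for integers, length-prefixed UTF-8 strings, a branch index before each union value). This is an illustrative re-implementation for the JSON schema above, not the Avro library; the function names are hypothetical and the sample values are made up.

```python
import json

def zigzag_varint(n):
    """Encode a signed integer as zig-zag followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)          # zig-zag keeps small negatives small
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s):
    """Length prefix (as a varint) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data

def encode_optional(value, encode_value):
    """Encode a [T, "null"] union: branch index, then the value if any."""
    if value is None:
        return zigzag_varint(1)       # branch 1 is "null", which has no body
    return zigzag_varint(0) + encode_value(value)

def encode_user(name, favorite_number, favorite_color):
    """Concatenate the fields in schema order; no field names are written."""
    return (
        encode_string(name)
        + encode_optional(favorite_number, zigzag_varint)
        + encode_optional(favorite_color, encode_string)
    )

binary = encode_user("Alyssa", 256, None)
as_json = json.dumps(
    {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}
).encode("utf-8")
print(len(binary), "bytes in binary vs", len(as_json), "bytes as JSON")
```

Because reader and writer share the schema, the payload carries only values; the field names and types that JSON repeats in every record are not sent.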
ADAM 
Credits: AMPLab (UC Berkeley)
ADAM 
Overview (Sequencing) 
- DNA is a molecule 
…or a Seq[Char] 
(A, T, G, C) alphabet
ADAM 
Sequencing 
- Massively parallel sequencing of random reads of 100-150
bases (20,000,000 reads per genome)
- 30-60x coverage for quality 
- All this mess must be re-organised! 
→ ADAM
ADAM 
Variant Calling
- From an organized set of reads (ADAM Pileup) 
- Detect variants (Variant Calling) 
→ AVOCADO
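As a rough illustration of the two steps above, the sketch below builds a pileup (read bases grouped by reference position) and flags positions where the most common base differs from the reference. It is a deliberately naive toy, assuming gap-free aligned reads; it is not how ADAM or AVOCADO work, and the names `pileup`, `call_variants` and `min_support` are hypothetical.

```python
from collections import Counter, defaultdict

def pileup(reads):
    """Group read bases by reference position.

    `reads` is a list of (start, sequence) pairs, assumed to be already
    aligned to the reference without insertions or deletions.
    """
    columns = defaultdict(Counter)
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            columns[start + offset][base] += 1
    return columns

def call_variants(reference, reads, min_support=2):
    """Report positions whose most common read base differs from the reference."""
    variants = {}
    for position, counts in sorted(pileup(reads).items()):
        base, support = counts.most_common(1)[0]
        if support >= min_support and base != reference[position]:
            variants[position] = (reference[position], base)
    return variants

# Made-up reference and reads; position 3 carries a T -> A substitution.
reference = "ACGTACGT"
reads = [(0, "ACGA"), (2, "GAAC"), (3, "AACG")]
variants = call_variants(reference, reads)
print(variants)
```

A real variant caller also has to weigh base qualities, mapping qualities and diploid genotypes, which is why coverage of 30-60x matters.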
ADAM 
Genomics specifications 
- SAM, BAM, VCF 
- Indexable 
- libraries 
- somewhat scalable: Hadoop-BAM
ADAM 
ADAM model 
- schema based (Avro), libraries are generated 
- no storage spec here!
ADAM 
ADAM model 
- Parquet storage 
- evenly distribute data 
- storage optimized for read/query 
- better compression
ADAM 
ADAM API 
- ADAMContext provides functions to read from HDFS
ADAM 
ADAM API 
- Scala classes generated from Avro 
- Data loaded as RDDs (Spark’s Resilient Distributed 
Datasets) 
- functions on RDDs (write to HDFS, genomic objects 
manipulations)
ADAM 
ADAM API 
- e.g. reading genotypes
ADAM 
ADAM Benchmark 
- It scales! 
- Data is more compact 
- Read perf is better 
- Code is simpler
Stratification using 1000Genomes 
As usual… let’s get some data. 
Genomes relate to health and are private. 
Still, there are options!
Stratification using 1000Genomes 
http://www.1000genomes.org/ 
(Nowadays targeting 2000 genomes) 
ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
Stratification using 1000Genomes 
Study genetic variation in populations (healthcare
applications need more contextual data).
To assess the value of ADAM, we’ll do some
qualitative exploration of the data.
Question: is it possible to predict which
subpopulation a given genome belongs to?
Stratification using 1000Genomes 
We can run an unsupervised algorithm on a 
massive number of genomes. 
The idea is to find clusters that would match 
subpopulations. 
This matters because such clusters reflect
population histories: gene flow, selection, ...
Stratification using 1000Genomes 
From the 200 TB of data, we’ll focus on chromosome 6,
and in fact only on its variants
ref: http://en.wikipedia.org/wiki/Chromosome
Genome Data 
Data structure
Panel: Map[SampleID, Population]
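A minimal sketch of building such a panel map, written in plain Python rather than the Scala used in the talk. It assumes a tab-separated panel file with the sample ID in the first column, the population in the second, and a header row starting with "sample"; that layout and the function name `parse_panel` are assumptions, and the sample IDs are the ones shown later in the prediction output.

```python
def parse_panel(text):
    """Map each sample ID to its population from tab-separated panel lines."""
    panel = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("sample"):
            continue                  # skip blank lines and the header row
        fields = line.split("\t")
        panel[fields[0]] = fields[1]
    return panel

# Assumed layout: sample ID, then population code.
panel_text = "sample\tpop\nNA20332\tASW\nHG00120\tGBR\nNA18560\tCHB\n"
panel = parse_panel(panel_text)

# Lookups for unknown samples return None, much like Scala's Option.
print(panel.get("NA20332"), panel.get("unknown"))
```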
Genotypes in VCF format 
Basically a text file. Ours were downloaded from S3. 
Converted to ADAM Genotypes
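Since VCF is a tab-separated text format, one data line can be split into site fields and per-sample genotype calls with a few lines of code. The sketch below is plain Python, not ADAM's converter; the record is made up, the function name `parse_vcf_line` is hypothetical, and only the GT entry of the FORMAT column is read.

```python
def parse_vcf_line(line, sample_ids):
    """Split one VCF data line into site fields and per-sample genotype calls."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    # Column 9 (FORMAT) names the colon-separated entries of each sample column.
    gt_index = fields[8].split(":").index("GT")
    genotypes = {}
    for sample_id, call in zip(sample_ids, fields[9:]):
        gt = call.split(":")[gt_index]
        # Alleles are separated by "|" (phased) or "/" (unphased).
        genotypes[sample_id] = tuple(gt.replace("|", "/").split("/"))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),
        "genotypes": genotypes,
    }

# Made-up record on chromosome 6 with three samples.
sample_ids = ["HG00120", "NA18560", "NA20332"]
line = "6\t12345\t.\tA\tG\t100\tPASS\t.\tGT\t0|0\t0|1\t1|1"
record = parse_vcf_line(line, sample_ids)
print(record["pos"], record["genotypes"])
```

Allele "0" stands for the reference base and "1" for the first alternate base, so the three samples here carry zero, one and two copies of the variant.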
Machine Learning model 
Clustering: KMeans 
ref: http://en.wikipedia.org/wiki/K-means_clustering
Machine Learning model 
Clustering: KMeans 
PreProcess = {A,C,T,G}² → {0,1,2} 
Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰ 
Distance = Euclidean (L2) ⁽*⁾
⁽*⁾ MLlib restriction (SPARK-3012), although here L2 ~ L1
ref: http://en.wikipedia.org/wiki/K-means_clustering
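The preprocessing and distance can be sketched as follows. The slide does not spell out how a pair of bases becomes 0, 1 or 2; the sketch assumes the common convention of counting non-reference alleles, which is an assumption rather than the authors' stated method. All function names and sample values are hypothetical.

```python
import math

def encode_genotype(alleles, reference):
    """Map a pair of bases to 0, 1 or 2: the number of non-reference alleles."""
    return sum(1 for allele in alleles if allele != reference)

def euclidean(u, v):
    """L2 distance between two equal-length genotype vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """L1 distance, shown for comparison with L2."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Made-up reference bases at three variant sites and two samples.
references = ["A", "C", "G"]
sample_a = [("A", "A"), ("C", "T"), ("T", "T")]
sample_b = [("A", "G"), ("C", "C"), ("G", "G")]

vector_a = [encode_genotype(g, r) for g, r in zip(sample_a, references)]
vector_b = [encode_genotype(g, r) for g, r in zip(sample_b, references)]

print(vector_a, vector_b)
print(euclidean(vector_a, vector_b), manhattan(vector_a, vector_b))
```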
Machine Learning model 
MLlib, KMeans
MLlib:
● Machine Learning Algorithms 
● Data structures (e.g. Vector)
Machine Learning model 
MLlib KMeans
DataFrame Map: 
● key = Sample 
● value = Vector of Genotypes alleles (sorted by Variant)
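To show what the clustering step does with such a sample-to-vector map, here is a small pure-Python version of Lloyd's k-means iteration. It is only a sketch of the idea: the talk uses MLlib's KMeans on Spark, whereas this toy runs on a made-up map of four samples with hand-picked initial centroids, and every name in it is hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared L2 distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def kmeans(vectors, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for vector in vectors:
            clusters[nearest(vector, centroids)].append(vector)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [
            mean(members) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids

# Hypothetical input: sample ID -> genotype codes, sorted by variant position.
genotypes_by_sample = {
    "S1": [0, 0, 2, 2],
    "S2": [0, 1, 2, 2],
    "S3": [2, 2, 0, 0],
    "S4": [2, 1, 0, 0],
}
vectors = list(genotypes_by_sample.values())
centroids = kmeans(vectors, centroids=[vectors[0], vectors[2]])
assignment = {s: nearest(v, centroids) for s, v in genotypes_by_sample.items()}
print(assignment)
```

Sorting each vector by variant matters because the distance compares positions index by index: entry i must refer to the same variant in every sample.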
Mashup 
prediction 
Sample [NA20332] is in cluster #0 for population Some(ASW) 
Sample [NA20334] is in cluster #2 for population Some(ASW) 
Sample [HG00120] is in cluster #2 for population Some(GBR) 
Sample [NA18560] is in cluster #1 for population Some(CHB)
Mashup
Population  #0  #1  #2
GBR          0   0  89
ASW         54   0   7
CHB          0  97   0
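The population-by-cluster table is a cross-tabulation of cluster predictions against the panel. The sketch below rebuilds such a table from the four predictions printed earlier and then computes a simple purity score on the counts from the table. The purity measure and the function names are additions made for illustration; they are not part of the talk.

```python
from collections import Counter

def crosstab(predictions, panel):
    """Count samples per (population, cluster) pair."""
    return Counter(
        (panel[sample], cluster) for sample, cluster in predictions.items()
    )

def purity(table):
    """Share of samples that fall in the majority population of their cluster."""
    best = {}
    for (_population, cluster), count in table.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / sum(table.values())

# The four predictions printed on the slide.
predictions = {"NA20332": 0, "NA20334": 2, "HG00120": 2, "NA18560": 1}
panel = {"NA20332": "ASW", "NA20334": "ASW", "HG00120": "GBR", "NA18560": "CHB"}
small_table = crosstab(predictions, panel)

# The non-zero counts of the full table, keyed as (population, cluster).
full_table = Counter({
    ("GBR", 2): 89,
    ("ASW", 0): 54,
    ("ASW", 2): 7,
    ("CHB", 1): 97,
})
full_purity = purity(full_table)
print(small_table, round(full_purity, 3))
```

Only the 7 ASW samples that land in cluster #2 with the GBR samples count against purity in this table.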
Cluster 
4 m3.xlarge instances (ec2) 
16 cores + 60 GB RAM
Cluster 
Performance
Cluster 
40 m3.xlarge 
160 cores + 600 GB RAM
Conclusions and future work 
● ADAM and Spark provide tools to 
manipulate genomics data in a scalable way 
● Simple APIs in Scala 
● MLlib for machine learning
→ implement less naïve algorithms 
→ cross medical and environmental data with 
genomes
Acknowledgments 
Scala.IO 
AMPLab
Matt Massie
Frank Nothaft
Vincent Botta
That’s all Folks 
Apparently, we’re supposed to stay on stage 
Waiting for questions 
Hoping for none 
Looking at the bar 
And the lunch 
Oh there are beers 
And candies 
who can read this?

COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 

Lightning fast genomics with Spark, Adam and Scala

  • 1. Lightning fast genomics With Spark and ADAM
  • 2. Who are we? Andy: @Noootsab, @NextLab_be, @Wajug co-driver, @Devoxx4Kids organizer; Maths & CS; data lover (geo, open, massive); fool. Xavier: @xtordoir, SilicoCloud; Physics -> Data analysis -> genomics -> scalable systems -> ...
  • 3. Genomics What is genomics about? Medical Diagnostics Drug response Disease mechanisms
  • 4. Genomics What is genomics about? - A human genome is a sequence of about 3 billion nucleotides (“bases”) - About 1 base in 1,000 varies across the human population - Genomes encode bio-molecules (tens of thousands) - These molecules interact with each other ...and with the environment → Biological systems are very complex
  • 5. Genomics State of the art - growing technological capacity - cost reduction - growing data
  • 6. Genomics State of the art - IT becomes the bottleneck (cost and latency) - data is sacrificed through sampling or cut-offs (Andrea Sboner et al.)
  • 7. Genomics Blocking points - “legacy stack” not designed to scale (C, Perl, …) - HPC approach is not a fit (the workload is data intensive)
  • 8. Genomics Future of genomics - Personal genomes (e.g. 1,000,000 genomes for cancer research) - New sequencing technologies - Sequence “stuff” as needed (e.g. microbiome, diagnostics) - medicalCondition = f(genomics, environmentHistory)
  • 9. Genomics Need for scalability → Scala & Spark Need for simplicity and clarity → ADAM
  • 10. Parquet 101 Columnar storage Row oriented Column oriented
  • 11. Parquet 101 Columnar storage > Homogeneous collocated data > Better range access > Better encoding
  • 12. Parquet 101 Efficient encoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } }
  • 13. Parquet 101 Efficient encoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Nested structure → Tree. Empty levels → Branch pruning. Repetitions → Metadata (index). Types → Safe/Fast codec.
  • 14. Parquet 101 Efficient encoding of nested typed structures ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 15. Parquet 101 Optimized distributed storage (e.g. in HDFS) ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
  • 16. Parquet 101 Efficient (schema-based) serialization: Avro. JSON Schema: { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } IDL: record User { string name; union { null, int } favorite_number = null; union { null, string } favorite_color = null; }
  • 17. Parquet 101 Efficient (schema-based) serialization: Avro. JSON Schema: { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Part of the: ● protocol ● serialization → less metadata. Define: IDL → JSON. Send: Binary → JSON
  • 18. ADAM Credits: AMPLab (UC Berkeley)
  • 19. ADAM Overview (Sequencing) - DNA is a molecule …or a Seq[Char] over the (A, T, G, C) alphabet
  • 20. ADAM Sequencing - Massively parallel sequencing of random reads of 100-150 bases (20,000,000 reads per genome) - 30-60x coverage for quality - All this mess must be re-organised! → ADAM
  • 21. ADAM Variant Calling - From an organized set of reads (ADAM Pileup) - Detect variants (Variant Calling) → AVOCADO
  • 22. ADAM Genomics specifications - SAM, BAM, VCF - Indexable - libraries - ~ scalable: hadoop-bam
  • 23. ADAM ADAM model - schema based (Avro), libraries are generated - no storage spec here!
  • 24. ADAM ADAM model - Parquet storage - evenly distribute data - storage optimized for read/query - better compression
  • 25. ADAM ADAM API - AdamContext provides functions to read from HDFS
  • 26. ADAM ADAM API - Scala classes generated from Avro - Data loaded as RDDs (Spark’s Resilient Distributed Datasets) - functions on RDDs (write to HDFS, genomic objects manipulations)
  • 27. ADAM ADAM API - e.g. reading genotypes
  • 28. ADAM ADAM Benchmark - It scales! - Data is more compact - Read perf is better - Code is simpler
  • 29. Stratification using 1000Genomes As usual… let’s get some data. Genomes relate to health and are private. Still, there are options!
  • 30. Stratification using 1000Genomes http://www.1000genomes.org/ (Nowadays targeting 2000 genomes) ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
  • 33. Stratification using 1000Genomes Study genetic variations in populations (needs more contextual data for healthcare). To assess the value of ADAM, we’ll do some qualitative exploration of the data. Question: is it possible to predict which subpopulation a given genome belongs to?
  • 34. Stratification using 1000Genomes We can run an unsupervised algorithm on a massive number of genomes. The idea is to find clusters that would match subpopulations. This matters because it reflects population histories: gene flows, selection, ...
  • 35. Stratification using 1000Genomes From the 200 TB of data, we’ll focus on chromosome 6, and only on its variants ref: http://en.wikipedia.org/wiki/Chromosome
  • 36. Genome Data Data structure
  • 37. Genome Data Data structure Panel: Map[SampleID, Population]
  • 38. Genome Data Data structure Genotypes in VCF format: basically a text file. Ours were downloaded from S3 and converted to ADAM Genotypes.
  • 39. Machine Learning model Clustering: KMeans ref: http://en.wikipedia.org/wiki/K-means_clustering
  • 40. Machine Learning model Clustering: KMeans PreProcess = {A,C,T,G}² → {0,1,2} Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰ Distance = Euclidean (L2) ⁽*⁾ ⁽*⁾ MLlib restriction (SPARK-3012), although here L2 ~ L1 ref: http://en.wikipedia.org/wiki/K-means_clustering
  • 41. Machine Learning model MLlib, KMeans MLlib: ● Machine Learning Algorithms ● Data structures (e.g. Vector)
  • 42. Machine Learning model MLlib KMeans DataFrame Map: ● key = Sample ● value = Vector of genotype alleles (sorted by Variant)
  • 43. Mashup prediction
      Sample [NA20332] is in cluster #0 for population Some(ASW)
      Sample [NA20334] is in cluster #2 for population Some(ASW)
      Sample [HG00120] is in cluster #2 for population Some(GBR)
      Sample [NA18560] is in cluster #1 for population Some(CHB)
  • 44. Mashup: samples per population (rows) and cluster (columns)
             #0   #1   #2
      GBR     0    0   89
      ASW    54    0    7
      CHB     0   97    0
  • 45. Cluster: 4 m3.xlarge instances (EC2), 16 cores + 60 GB
  • 47. Cluster: 40 m3.xlarge instances, 160 cores + 600 GB
  • 48. Conclusions and future work ● ADAM and Spark provide tools to manipulate genomics data in a scalable way ● Simple APIs in Scala ● MLlib for machine learning → implement less naïve algorithms → cross medical and environmental data with genomes
  • 49. Acknowledgements: Scala.IO, AMPLab, Matt Massie, Frank Nothaft, Vincent Botta
  • 50. That’s all Folks Apparently, we’re supposed to stay on stage Waiting for questions Hoping for none Looking at the bar And the lunch Oh there are beers And candies who can read this?
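
The stratification steps on slides 40 to 44 (encode genotypes as {0,1,2}, cluster with KMeans under Euclidean distance, then cross-tabulate clusters against the panel's populations) can be sketched in plain Python. This is only an illustration, not the ADAM or MLlib API used in the talk. Several things here are assumptions: the encoding is taken to be the count of non-reference alleles, the reference string, sample names, populations and genotypes are invented toy data, and the farthest-point initialization is chosen only to keep this small example deterministic.

```python
from collections import Counter

def encode_genotype(alleles, ref):
    """Assumed encoding: number of alleles (0, 1 or 2) that differ from the reference."""
    return sum(1 for a in alleles if a != ref)

def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def init_centers(points, k):
    """Deterministic farthest-point initialization (a choice made for this sketch)."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(squared_l2(p, c) for c in centers))
        centers.append(list(farthest))
    return centers

def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; returns one cluster index per input point."""
    centers = init_centers(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center under squared L2 distance.
        assignment = [min(range(k), key=lambda c: squared_l2(p, centers[c]))
                      for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical toy data: one reference allele per variant, six samples.
refs = "ACGTAC"
samples = {
    "s1": ("P1", ["AA", "CC", "GG", "TT", "AA", "CC"]),
    "s2": ("P1", ["AA", "CC", "GG", "TT", "AA", "CT"]),
    "s3": ("P2", ["GG", "TT", "AA", "TT", "AA", "CC"]),
    "s4": ("P2", ["GG", "TT", "GA", "TT", "AA", "CC"]),
    "s5": ("P3", ["AA", "CC", "GG", "CC", "GG", "TT"]),
    "s6": ("P3", ["AA", "CC", "GG", "CC", "AG", "TT"]),
}

names = sorted(samples)
populations = [samples[n][0] for n in names]
# One vector per sample, with variants kept in a fixed order.
vectors = [[encode_genotype(g, r) for g, r in zip(samples[n][1], refs)]
           for n in names]

assignment = kmeans(vectors, k=3)
# Count samples per (population, cluster) pair, as in the table on slide 44.
table = Counter(zip(populations, assignment))

for population in sorted(set(populations)):
    print(population, [table[(population, c)] for c in range(3)])
```

At the scale described in the talk, the vectors would be distributed across a cluster rather than held in a list, which is the role Spark's RDDs and MLlib play.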