This document discusses using Apache Spark and ADAM to perform scalable genomic analysis. It provides an overview of genomics and challenges with existing approaches. ADAM uses Apache Spark and Parquet to efficiently store and query large genomic datasets. The document demonstrates clustering genomic data from the 1000 Genomes Project to predict populations, showing ADAM and Spark can handle large genomic workloads. It concludes these tools provide scalable genomic data processing but future work is needed to implement more advanced algorithms.
2. Who are we?
Andy
@Noootsab
@NextLab_be
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool
Xavier
@xtordoir
SilicoCloud
-> Physics
-> Data analysis
-> genomics
-> scalable systems
-> ...
3. Genomics
What is genomics about?
Medical Diagnostics
Drug response
Disease mechanisms
4. Genomics
What is genomics about?
- A human genome is a sequence of 3 billion nucleic
acids (“bases”)
- About 1 base in 1,000 varies across the human population
- Genomes encode bio-molecules (tens of thousands)
- These molecules interact together
...and with environment
→ Biological systems are very complex
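As a quick worked example of the two figures above (3 billion bases, about 1 variable base in 1,000), the arithmetic below gives the order of magnitude of variable positions per genome. The constant names are illustrative only.

```python
# Figures quoted on the slide
GENOME_LENGTH = 3_000_000_000  # bases in a human genome
VARIABLE_ONE_IN = 1_000        # about 1 base in 1,000 is variable

# Integer arithmetic keeps the result exact
variable_positions = GENOME_LENGTH // VARIABLE_ONE_IN
print(f"{variable_positions:,} variable positions per genome")
```

So even when only the variable positions are kept, each genome still contributes millions of values.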
5. Genomics
State of the art
- growing technological capacity
- cost reduction
- growing data
6. Genomics
State of the art
- I.T. becomes the bottleneck (cost and latency)
- sacrifice data with sampling or cut-offs
Andrea Sboner et al
7. Genomics
Blocking points
- “legacy stack” not designed to scale (C, perl, …)
- HPC approach not a fit (data intensive)
8. Genomics
Future of genomics
- Personal genomes (e.g. 1,000,000 genomes for cancer
research)
- New sequencing technologies
- Sequence “stuff” as needed (e.g. microbiome,
diagnostics)
- medicalCondition = f(genomics, environmentHistory)
9. Genomics
Need for scalability → Scala & Spark
Need for simplicity, clarity → ADAM
20. ADAM
Sequencing
- Massively parallel sequencing of random reads of
100-150 bases (20,000,000 reads per genome)
- 30-60x coverage for quality
- All this mess must be re-organised!
→ ADAM
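The relation between read count, read length and coverage can be sketched with the usual idealised formula, coverage = reads × read length / genome length. This is a minimal sketch that assumes reads are spread uniformly over a 3-billion-base genome; the function names are hypothetical and are not part of ADAM.

```python
import math

GENOME_LENGTH = 3_000_000_000  # bases, as quoted earlier in the talk


def mean_coverage(n_reads, read_length, genome_length=GENOME_LENGTH):
    """Average number of reads covering each base (idealised, uniform reads)."""
    return n_reads * read_length / genome_length


def reads_needed(coverage, read_length, genome_length=GENOME_LENGTH):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(coverage * genome_length / read_length)


# Reads of 150 bases needed for the 30x and 60x targets on the slide
for target in (30, 60):
    print(f"{target}x -> {reads_needed(target, 150):,} reads")
```

Under these assumptions, the read count grows linearly with the coverage target, which is one reason the raw data volume is so large.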
21. ADAM
Variant Calling
- From an organized set of reads (ADAM Pileup)
- Detect variants (Variant Calling)
→ AVOCADO
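To make the pileup idea concrete, here is a deliberately naive sketch of a variant call at a single position: count the bases that the reads stack up at that position and report the most frequent non-reference base if it has enough support. It is not the algorithm used by AVOCADO; the function name and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter


def call_variant(ref_base, pileup_bases, min_depth=10, min_alt_fraction=0.2):
    """Return the best-supported non-reference base at one position, or None.

    pileup_bases holds the base observed in each read covering the position.
    min_depth and min_alt_fraction are illustrative thresholds.
    """
    depth = len(pileup_bases)
    if depth < min_depth:
        return None  # too few reads to decide
    alt_counts = Counter(b for b in pileup_bases if b != ref_base)
    if not alt_counts:
        return None  # every read agrees with the reference
    alt, count = alt_counts.most_common(1)[0]
    return alt if count / depth >= min_alt_fraction else None


# 30 reads cover the position; 12 of them show G where the reference has A
print(call_variant("A", "A" * 18 + "G" * 12))
# A single discordant read out of 30 is treated as a sequencing error
print(call_variant("A", "A" * 29 + "T"))
```

Real variant callers also model base qualities, mapping qualities and diploid genotypes, which this sketch ignores.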
23. ADAM
ADAM model
- schema based (Avro), libraries are generated
- no storage spec here!
24. ADAM
ADAM model
- Parquet storage
- evenly distribute data
- storage optimized for read/query
- better compression
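The points about read/query optimisation and compression follow from storing data column by column. The sketch below illustrates the idea in plain Python: pivoting records into columns lets a query touch only the fields it needs, and a column with many repeated values collapses under run-length encoding. The records and field names are hypothetical, and Parquet's actual on-disk format (row groups, pages, dictionary encoding) is more elaborate than this.

```python
from itertools import groupby

# Hypothetical genotype-like records, stored row by row
records = [
    {"sample": "S1", "contig": "6", "position": 100, "allele": "A"},
    {"sample": "S2", "contig": "6", "position": 100, "allele": "G"},
    {"sample": "S1", "contig": "6", "position": 250, "allele": "T"},
    {"sample": "S2", "contig": "6", "position": 250, "allele": "T"},
]


def to_columns(rows):
    """Pivot a list of row dictionaries into one list per field."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(records)

# A query on positions reads one column instead of every full record
positions = columns["position"]

# Repetitive columns shrink to a handful of runs
encoded_contig = run_length_encode(columns["contig"])
print(positions)
print(encoded_contig)
```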
25. ADAM
ADAM API
- AdamContext provides functions to read from HDFS
26. ADAM
ADAM API
- Scala classes generated from Avro
- Data loaded as RDDs (Spark’s Resilient Distributed
Datasets)
- functions on RDDs (write to HDFS, genomic object
manipulation)
33. Stratification using 1000Genomes
Study genetic variation in populations (needs
more contextual data for healthcare).
To validate ADAM's usefulness, we’ll do some
qualitative exploration of the data.
Question: is it possible to predict which
subpopulation a given genome belongs to?
34. Stratification using 1000Genomes
We can run an unsupervised algorithm on a
massive number of genomes.
The idea is to find clusters that would match
subpopulations.
This matters because the clusters reflect
population histories: gene flows, selection, ...
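The clustering idea can be sketched with a small k-means (Lloyd's algorithm) in plain Python. This is an illustration of the technique, not Spark's distributed implementation: the toy vectors are hypothetical genomes encoded as counts of non-reference alleles (0, 1 or 2) at three variants, and the initial centers are passed explicitly so the run is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def kmeans(points, initial_centers, iterations=10):
    """Run Lloyd's algorithm; return the final centers and each point's cluster."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # An empty cluster keeps its previous center
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    labels = [closest(p, centers) for p in points]
    return centers, labels


# Hypothetical genomes: three near "all reference", three near "all alternate"
genomes = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [2, 2, 2], [2, 1, 2], [1, 2, 2]]
centers, labels = kmeans(genomes, initial_centers=[genomes[0], genomes[3]])
print(labels)
```

Cluster numbers carry no meaning by themselves; they only become interpretable once they are compared with known population labels.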
35. Stratification using 1000Genomes
From the 200 TB of data, we’ll focus on
chromosome 6, and only on its variants
ref: http://en.wikipedia.org/wiki/Chromosome
41. Machine Learning model
MLlib, KMeans
MLlib:
● Machine Learning Algorithms
● Data structures (e.g. Vector)
42. Machine Learning model
MLlib KMeans
DataFrame Map:
● key = Sample
● value = Vector of genotype alleles (sorted by Variant)
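The key/value structure above can be sketched as follows: group genotype records by sample, then lay out each sample's alleles in one shared variant order so that position i means the same variant in every vector. The records, the sample names and the choice to encode a genotype as a count of alternate alleles are assumptions made for illustration, not the ADAM schema.

```python
from collections import defaultdict

# Hypothetical records: (sample id, variant position, number of alternate alleles)
genotypes = [
    ("S2", 300, 1),
    ("S1", 100, 0),
    ("S1", 300, 2),
    ("S2", 100, 1),
]


def to_feature_vectors(records):
    """Build {sample: [allele counts]} with variants in sorted order."""
    by_sample = defaultdict(dict)
    for sample, variant, alt_count in records:
        by_sample[sample][variant] = alt_count
    variants = sorted({variant for _, variant, _ in records})
    # Assumption: a genotype missing for a sample is encoded as 0
    return {
        sample: [calls.get(variant, 0) for variant in variants]
        for sample, calls in by_sample.items()
    }


vectors = to_feature_vectors(genotypes)
for sample in sorted(vectors):
    print(sample, vectors[sample])
```

Sorting by variant matters because a clustering algorithm compares vectors coordinate by coordinate.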
43. Mashup
prediction
Sample [NA20332] is in cluster #0 for population Some(ASW)
Sample [NA20334] is in cluster #2 for population Some(ASW)
Sample [HG00120] is in cluster #2 for population Some(GBR)
Sample [NA18560] is in cluster #1 for population Some(CHB)
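One simple way to read such output is to cross-tabulate populations against clusters. The sketch below does this for the four predictions printed above; the tuple layout and function name are illustrative.

```python
from collections import Counter, defaultdict

# The four predictions shown on the slide: (sample, cluster, population)
predictions = [
    ("NA20332", 0, "ASW"),
    ("NA20334", 2, "ASW"),
    ("HG00120", 2, "GBR"),
    ("NA18560", 1, "CHB"),
]


def cross_tabulate(rows):
    """Count, for each population, how many samples fall in each cluster."""
    table = defaultdict(Counter)
    for _, cluster, population in rows:
        table[population][cluster] += 1
    return table


table = cross_tabulate(predictions)
for population in sorted(table):
    print(population, dict(table[population]))
```

In this four-sample excerpt the two ASW samples fall in different clusters, so the full table over all samples is what shows how well clusters and populations line up.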
48. Conclusions and future work
● ADAM and Spark provide tools to
manipulate genomics data in a scalable way
● Simple APIs in Scala
● MLlib for machine learning
→ implement less naïve algorithms
→ cross-reference medical and environmental data
with genomes
50. That’s all Folks
Apparently, we’re supposed to stay on stage
Waiting for questions
Hoping for none
Looking at the bar
And the lunch
Oh there are beers
And candies
who can read this?