Scalable Genome Analysis with ADAM
1. Scalable Genome Analysis
With ADAM
Frank Austin Nothaft, UC Berkeley AMPLab
fnothaft@berkeley.edu, @fnothaft
7/23/2015
2. Analyzing genomes:
What is our goal?
• Genomes are the “source” code for life:
• The human genome is a 3.2B character
“program”, split across 46 “files”
• Within a species, genomes are ~99.9% similar
• The 0.1% variance gives rise to diverse traits, as
well as diseases
3. The Sequencing Abstraction
It was the best of times, it was the worst of times…
Metaphor borrowed from Michael Schatz
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
• Sequencing is a Poisson substring sampling process
• For $1,000, we can sequence a 30x copy of your genome
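The sampling metaphor above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how a sequencer or ADAM works: the function names `sample_reads` and `depth` are hypothetical, reads have a fixed length, and start positions are drawn uniformly at random. With many reads over a long text, the depth at each position is approximately Poisson distributed; positions near the ends are covered less often.

```python
import random


def sample_reads(text, read_length, coverage, seed=0):
    """Draw fixed-length substrings ("reads") uniformly at random from text.

    The number of reads is chosen so that each position is covered
    roughly `coverage` times on average.
    """
    rng = random.Random(seed)
    num_reads = (coverage * len(text)) // read_length
    last_start = len(text) - read_length
    reads = []
    for _ in range(num_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append((start, text[start:start + read_length]))
    return reads


def depth(text, reads):
    """Count how many reads cover each position of text."""
    counts = [0] * len(text)
    for start, read in reads:
        for i in range(start, start + len(read)):
            counts[i] += 1
    return counts


text = "It was the best of times, it was the worst of times"
reads = sample_reads(text, read_length=10, coverage=30)
coverage_per_position = depth(text, reads)
mean_depth = sum(coverage_per_position) / len(text)
```

Here `mean_depth` is the analogue of the "30x" figure: the average number of reads that cover each character.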
4. Genome Resequencing
• The Human Genome Project identified the “average”
genome from 20 individuals at $1B cost
• To make this process cheaper, we use our knowledge
of the “average” genome to calculate a diff
• Two problems:
• How do we compute this diff?
• How do we make sense of the differences?
5. Alignment and Assembly
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
11. First, define a schema
record AlignmentRecord {
union { null, Contig } contig = null;
union { null, long } start = null;
union { null, long } end = null;
union { null, int } mapq = null;
union { null, string } readName = null;
union { null, string } sequence = null;
union { null, string } mateReference = null;
union { null, long } mateAlignmentStart = null;
union { null, string } cigar = null;
union { null, string } qual = null;
union { null, string } recordGroupName = null;
union { int, null } basesTrimmedFromStart = 0;
union { int, null } basesTrimmedFromEnd = 0;
union { boolean, null } readPaired = false;
union { boolean, null } properPair = false;
union { boolean, null } readMapped = false;
union { boolean, null } mateMapped = false;
union { boolean, null } firstOfPair = false;
union { boolean, null } secondOfPair = false;
union { boolean, null } failedVendorQualityChecks = false;
union { boolean, null } duplicateRead = false;
union { boolean, null } readNegativeStrand = false;
union { boolean, null } mateNegativeStrand = false;
union { boolean, null } primaryAlignment = false;
union { boolean, null } secondaryAlignment = false;
union { boolean, null } supplementaryAlignment = false;
union { null, string } mismatchingPositions = null;
union { null, string } origQual = null;
union { null, string } attributes = null;
union { null, string } recordGroupSequencingCenter = null;
union { null, string } recordGroupDescription = null;
union { null, long } recordGroupRunDateEpoch = null;
union { null, string } recordGroupFlowOrder = null;
union { null, string } recordGroupKeySequence = null;
union { null, string } recordGroupLibrary = null;
union { null, int } recordGroupPredictedMedianInsertSize = null;
union { null, string } recordGroupPlatform = null;
union { null, string } recordGroupPlatformUnit = null;
union { null, string } recordGroupSample = null;
union { null, Contig } mateContig = null;
}
[Figure: the ADAM stack, shown as layer (implementation): Application (Transformations), Presentation (Enriched Models), Evidence Access (MapReduce/DBMS), Schema (Data Models), Materialized Data (Columnar Storage), Data Distribution (Parallel FS), Physical Storage (Attached Storage)]
12. A schema provides a
narrow waist
13. Accelerate common
access patterns
• In genomics, we commonly
have to find observations that
overlap in a coordinate plane
• This coordinate plane is
genomics specific, and is
known a priori
• We can use our knowledge of
the coordinate plane to
implement a fast overlap join
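One way to exploit a coordinate plane that is known in advance is to cut each contig into fixed-size bins and only compare regions that share a bin. The sketch below illustrates that idea in plain Python; it is not ADAM's implementation. The helper names (`overlaps`, `bins_for`, `overlap_join`), the `bin_size` parameter, and the `(contig, start, end)` tuples with an exclusive, non-empty end are all assumptions made for the example.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (contig, start, end) regions overlap; end is exclusive."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def bins_for(region, bin_size):
    """Yield the (contig, bin index) keys that a non-empty region touches."""
    contig, start, end = region
    for index in range(start // bin_size, (end - 1) // bin_size + 1):
        yield (contig, index)


def overlap_join(left, right, bin_size=1000):
    """Return all overlapping (left, right) pairs, comparing only within bins."""
    index = defaultdict(list)
    for right_region in right:
        for key in bins_for(right_region, bin_size):
            index[key].append(right_region)

    pairs = set()  # a pair that shares several bins is reported once
    for left_region in left:
        for key in bins_for(left_region, bin_size):
            for right_region in index.get(key, ()):
                if overlaps(left_region, right_region):
                    pairs.add((left_region, right_region))
    return sorted(pairs)


reads = [("chr1", 100, 200), ("chr1", 950, 1050), ("chr2", 100, 200)]
genes = [("chr1", 150, 1000), ("chr2", 500, 600)]
joined = overlap_join(reads, genes)
```

In a distributed setting the bin key could serve as the partitioning key, so that each worker only sees regions from its own part of the genome.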
14. Pick appropriate storage
• When accessing scientific
datasets, we frequently slice and
dice the dataset:
• Algorithms may touch
subsets of columns
• We don’t always touch the
whole dataset
• This is a good match for
columnar storage
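A toy comparison of the two layouts shows why column subsets matter. The records below reuse a few field names from the AlignmentRecord schema; the values and the helper `to_columns` are made up for illustration, and real columnar formats add encoding and compression that this sketch ignores.

```python
# Row-oriented layout: each record stores all of its fields together.
rows = [
    {"readName": "r1", "start": 100, "mapq": 60, "sequence": "ACGT"},
    {"readName": "r2", "start": 250, "mapq": 10, "sequence": "TTGA"},
    {"readName": "r3", "start": 400, "mapq": 42, "sequence": "GGCA"},
]


def to_columns(records):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for field in records[0]:
        columns[field] = [record[field] for record in records]
    return columns


# Column-oriented layout: each field is stored contiguously.
columns = to_columns(rows)

# A query that only needs `mapq` reads a single column and never
# touches the (much larger) sequence data.
high_quality = sum(1 for quality in columns["mapq"] if quality >= 30)
```

With the row layout, the same count would have to read every record in full, including the sequence strings it does not use.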
15. Is introducing a new data
model really a good idea?
Source: XKCD, http://xkcd.com/927/
16. A subtle point:
Proper stack design can simplify
backwards compatibility
To support legacy data formats, you define a way to
serialize/deserialize the schema into/from the
legacy flat file format
[Figure: two stacks side by side, each with Data Distribution, Materialized Data, and Schema (Data Models) layers; Materialized Data is a Legacy File Format in one stack and Columnar Storage in the other]
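The serialize/deserialize pair described on this slide can be sketched as two small functions around a schema object. Everything here is hypothetical and heavily simplified: the `Alignment` class keeps only a handful of fields (`contigName` stands in for the schema's `contig` record), and the tab-separated line with `*` for missing values is loosely modeled on flat-file read formats rather than being a faithful implementation of any of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """A cut-down stand-in for the AlignmentRecord schema."""
    readName: Optional[str] = None
    contigName: Optional[str] = None
    start: Optional[int] = None
    mapq: Optional[int] = None
    cigar: Optional[str] = None
    sequence: Optional[str] = None
    qual: Optional[str] = None


FIELDS = ["readName", "contigName", "start", "mapq", "cigar", "sequence", "qual"]
INT_FIELDS = {"start", "mapq"}
MISSING = "*"  # assumed placeholder for a null field


def to_legacy_line(record):
    """Serialize a schema object into one tab-separated legacy line."""
    values = []
    for name in FIELDS:
        value = getattr(record, name)
        values.append(MISSING if value is None else str(value))
    return "\t".join(values)


def from_legacy_line(line):
    """Deserialize one tab-separated legacy line into a schema object."""
    kwargs = {}
    for name, text in zip(FIELDS, line.rstrip("\n").split("\t")):
        if text == MISSING:
            kwargs[name] = None
        elif name in INT_FIELDS:
            kwargs[name] = int(text)
        else:
            kwargs[name] = text
    return Alignment(**kwargs)


record = Alignment(readName="r1", contigName="chr1", start=100, mapq=60,
                   cigar="4M", sequence="ACGT")
line = to_legacy_line(record)
round_tripped = from_legacy_line(line)
```

Because the layers above the schema only ever see `Alignment` objects, they do not need to know whether the data came from the legacy file or from columnar storage.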
17. A subtle point:
Proper stack design can simplify
backwards compatibility
This is a view!
19. What are the challenges?
• The differences between people’s genomes lead
to various traits and diseases, but:
• Variants don’t always have straightforward
explanations
• A 3B base genome with variation at 0.1% of
locations means roughly 3 million variants. How do we
find the important ones?
20. Making Sense of Variation
• Variation in the genome can affect biology in
several ways:
• A variant can modify or break a protein
• A variant can modify how much of a protein is
created
• The subset of your genome that encodes proteins
is the exome. This is ~1% of your genome!
21. How do we link
variants to traits?
• Statistical modeling!
• We may use a simple model (e.g., a χ² test)
• Or, we may use a more complex model (e.g., linear
regression)
• This is known as a genotype-to-phenotype
association test
• Most of these tests can be expressed as aggregation
functions, i.e., as a reduction. This maps well to Spark!
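To make the "test as a reduction" point concrete, here is a small sketch of a χ² association test for a single variant, written with Python's built-in `map` and `functools.reduce`. The sample data is invented, the function names are hypothetical, and the statistic is the plain 2x2 χ² without continuity correction. Each sample maps to a one-hot count vector, and vectors are summed with an associative operator, which is the shape of computation that a map followed by a reduce in Spark would distribute.

```python
from functools import reduce


def to_counts(sample):
    """Map one (has_variant, has_trait) sample to a one-hot 2x2 table.

    Cell order: (variant & trait, variant & no trait,
                 no variant & trait, no variant & no trait).
    """
    has_variant, has_trait = sample
    index = (0 if has_variant else 2) + (0 if has_trait else 1)
    counts = [0, 0, 0, 0]
    counts[index] = 1
    return tuple(counts)


def add_counts(x, y):
    """Associative, commutative merge of two tables (the reduction step)."""
    return tuple(a + b for a, b in zip(x, y))


def chi_squared(table):
    """Chi-squared statistic of a 2x2 table, without continuity correction."""
    a, b, c, d = table
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association is measurable
    return n * (a * d - b * c) ** 2 / denominator


# Invented cohort: 80 samples as (has_variant, has_trait) pairs.
samples = ([(True, True)] * 30 + [(True, False)] * 10
           + [(False, True)] * 10 + [(False, False)] * 30)

table = reduce(add_counts, map(to_counts, samples), (0, 0, 0, 0))
statistic = chi_squared(table)
```

Because `add_counts` is associative, the samples can be split across partitions, reduced locally, and the partial tables merged afterwards without changing the result.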
22. What if two variants are
close to each other?
• This phenomenon is known as linkage disequilibrium: if
two variants are close to each other, they are likely to be
inherited together
Levo and Segal, Nat Rev Gen, 2014.
• We can often link expression changes and trait changes
to blocks of the genome, but not to specific variants
23. Can we break up
these blocks?
• When a variant modifies a gene, we can predict how the
gene is modified:
• Does this variant change how the protein is spliced?
• Does this variant change an amino acid?
• We can discard variants that do not change proteins
• However, if the variant is outside of a gene, how do we
make sense of it?
• Are these variants important?
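For variants inside a gene, the amino-acid question above comes down to a lookup in the genetic code. The sketch below uses a deliberately partial codon table (only the codons needed for the example; a real annotator needs all 64) and hypothetical helper names. It also ignores splicing, strand, and reading-frame detection, all of which a real tool has to handle.

```python
# Partial standard genetic code: just enough codons for this example.
CODON_TABLE = {
    "GAG": "E",  # glutamate
    "GAA": "E",  # glutamate
    "GTG": "V",  # valine
    "TGG": "W",  # tryptophan
    "TGA": "*",  # stop
    "TAA": "*",  # stop
    "TAG": "*",  # stop
}


def apply_variant(codon, offset, alt_base):
    """Return the codon with the base at `offset` (0, 1, or 2) replaced."""
    return codon[:offset] + alt_base + codon[offset + 1:]


def classify(ref_codon, alt_codon):
    """Classify a single-codon change by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"  # protein unchanged
    if alt_aa == "*":
        return "nonsense"    # premature stop
    if ref_aa == "*":
        return "stop_lost"   # stop codon replaced by an amino acid
    return "missense"        # one amino acid swapped for another


effects = [
    classify("GAG", apply_variant("GAG", 1, "T")),  # GAG -> GTG
    classify("GAG", apply_variant("GAG", 2, "A")),  # GAG -> GAA
    classify("TGG", apply_variant("TGG", 2, "A")),  # TGG -> TGA
]
```

Only the synonymous change would be discarded under the rule on this slide; the other two alter the protein.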
24. Mutations in AML
There is a big “long tail”, including people who have
cancer, but who have no “modified” genes!
Ravi Pandya, BeatAML.
25. Looking outside
of the Exome
• We analyze mutations
in the exome using the
grammar for protein
creation
• Can we apply a similar
approach outside of
the exome?
• Let’s use the grammar
for regulation instead!
S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.
27. You Can Help!
• All of our projects are open source:
• https://www.github.com/bigdatagenomics
• Apache 2 licensed
• For a tutorial, see Chapter 10 of Advanced Analytics with
Spark
• More details on ADAM in our latest paper
• The GA4GH is looking for coders to help enable genomic
data sharing: https://github.com/ga4gh/server
28. Acknowledgements
• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey
Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony
Joseph, Dave Patterson
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman,
Jeff Hammerbacher
• GenomeBridge: Carl Yeksigian
• Cloudera: Uri Laserson, Tom White
• Microsoft Research: Ravi Pandya, Bill Bolosky
• UC Santa Cruz: Benedict Paten, David Haussler, Hannes Schmidt, Beau
Norgeot
• And many other open source contributors, especially Michael Heuer, Neil
Ferguson, Andy Petrella, Xavier Tordoir
• Total of 40 contributors to ADAM/BDG from >12 institutions