Scaling up genomic analysis with ADAM
Frank Austin Nothaft, UC Berkeley AMPLab
fnothaft@berkeley.edu, @fnothaft
10/27/2014

What is ADAM?
• An open source, high performance, distributed
platform for genomic analysis
• ADAM defines:
1. A data schema and layout on disk*
2. A Scala API
3. A command line interface
* Via Avro and Parquet

What’s the big picture?
• ADAM: Core API + CLIs
• bdg-formats: Data schemas
• RNAdam: RNA analysis on ADAM
• avocado: Distributed local assembler
• xASSEMBLEx: GraphX-based de novo assembler
• bdg-services: ADAM clusters
• PacMin: String graph assembler

Implementation Overview
• 34k LOC (96% Scala)
• Apache 2 licensed OSS
• 23 contributors across 10 institutions
• Pushing for production 1.0 release towards end of year

Key Observations
• Current genomics pipelines are I/O limited
• Most genomics algorithms can be formulated as a
data or graph parallel computation
• These algorithms are heavy on iteration/pipelining
• Data access pattern is write once, read many times
• High-coverage whole-genome sequencing will become the main
sequencing target (for human genetics)

Principles for Scalable
Design in ADAM
• Parallel FS and data representation (HDFS +
Parquet) combined with in-memory computing
eliminates disk bandwidth bottleneck
• Spark allows efficient implementation of iterative/
pipelined Map-Reduce
• Minimize data movement: send code to data

• Spark is an in-memory, data-parallel computing framework
• Optimized for iterative jobs, unlike Hadoop
• Data maintained in memory unless inter-node
movement needed (e.g., on repartitioning)
• Presents a functional programming API, along with
support for iterative programming via REPL
• Used at scale on clusters with >2k nodes, 4TB
datasets

Why Spark?
• Current leading map-reduce framework:
• First in-memory map-reduce platform
• Used at scale in industry, supported in major distros (Cloudera,
HortonWorks, MapR)
• The API:
• Fully functional API
• Main API in Scala, also supports Java, Python, R
• Manages node/job failures via lineage, data locality/job assignment
• Downstream tools (GraphX, MLlib)

Data Format
• Avro schema encoded by
Parquet
• Schema can be updated
without breaking backwards
compatibility
• Read schema looks a lot like
BAM, but renormalized
• Actively removing tags
• Variant schema is strictly
biallelic, a “cell in the matrix”
record AlignmentRecord {
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}

Parquet
• ASF Incubator project, based on
Google Dremel
• http://www.parquet.io
• High performance columnar
store with support for projections
and push-down predicates
• 3 layers of parallelism:
• File/row group
• Column chunk
• Page
Image from Parquet format definition: https://github.com/Parquet/parquet-format

Filtering
• Parquet provides predicate pushdown
• Evaluate filter on a subset of columns
• Only read full set of projected columns for passing records
• Full primary/secondary indexing support in Parquet 2.0
• Very efficient if reading a small set of columns:
• On disk, contig ID/start/end consume < 2% of space
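The two-step filtering idea on this slide can be illustrated with a toy column store. This is a minimal sketch in plain Python and not Parquet's actual API: the `columns` dictionary, the `scan_with_pushdown` helper, and the sample values are all hypothetical and exist only to show the access pattern.

```python
# Toy column store: each column is stored as its own list, as in a
# columnar layout. All names and values here are illustrative only.
columns = {
    "contig": ["chr1", "chr1", "chr2", "chr2"],
    "start": [100, 5000, 150, 9000],
    "end": [200, 5100, 250, 9100],
    "sequence": ["ACGT", "TTGA", "CCAT", "GGTA"],
}


def scan_with_pushdown(columns, filter_cols, predicate, projection):
    """Evaluate the predicate on filter_cols only, then materialize the
    projected columns for the rows that passed."""
    n_rows = len(next(iter(columns.values())))
    # Step 1: touch only the (small) filter columns to find passing rows.
    passing = [
        i for i in range(n_rows)
        if predicate({c: columns[c][i] for c in filter_cols})
    ]
    # Step 2: read the projected columns for the passing rows only.
    return [{c: columns[c][i] for c in projection} for i in passing]


# Select reads on chr1 that start before position 1000.
hits = scan_with_pushdown(
    columns,
    filter_cols=["contig", "start"],
    predicate=lambda r: r["contig"] == "chr1" and r["start"] < 1000,
    projection=["start", "end", "sequence"],
)
```

The wide `sequence` column is never read for rows that fail the filter, which is why filtering on small columns such as contig/start/end is cheap.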

Compression
• Parquet compresses at the column level:
• RLE for repetitive columns
• Dictionary encoding for quantized columns
• ADAM uses a fully denormalized schema
• Repetitive columns are RLE’d out
• Delta encoding (Parquet 2.0) will aid with quality scores
• ADAM is 5-25% smaller than compressed BAM
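To make the two encodings concrete, here is a minimal sketch of run-length and dictionary encoding on toy columns. It illustrates the general techniques only, not Parquet's on-disk format; the helper names and the sample values are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [v for v, n in pairs for _ in range(n)]


def dict_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


# A denormalized column (e.g. the sample name) repeats the same value for
# many consecutive reads, so it collapses to very few runs.
sample_column = ["sampleA"] * 5 + ["sampleB"] * 3
runs = rle_encode(sample_column)

# A column with few distinct values maps to small integer codes.
mapq_column = [60, 60, 0, 29, 60, 0]
dictionary, codes = dict_encode(mapq_column)
```

Here eight sample names shrink to two runs, which is why denormalizing record group fields into every read costs little once the column is run-length encoded.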

Parquet/Spark Integration
• 1 row group in Parquet maps
to 1 partition in Spark
• We interact with Parquet via
input/output formats
• These apply projections
and predicates, handle
(de)compression
• Spark builds and executes a
computation DAG, manages
data locality, errors/retries, etc.
[Diagram: Parquet row groups (RG 1, RG 2, …, RG n) → Parquet Input Format → Spark partitions (Partition 1, Partition 2, …, Partition n) → Parquet Output Format → Parquet row groups (RG 1, RG 2, …, RG n)]

Long-read assembly
with PacMin

The State of Analysis
• Conventional short-read alignment based pipelines
are really good at calling SNPs
• But we’re still pretty bad at calling INDELs and
SVs
• And they are slow: 2 weeks to sequence, 1 week to
analyze. Not fast enough for clinical use.
• If we move away from short reads, do we have other
options?

Opportunities
• New read technologies are available
• Provide much longer reads (>10 kbp vs. 250 bp)
• Different error model… (15% INDEL errors, vs. 2%
SNP errors)
• Generally, lower sequence specific bias
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/

If long reads are available…
• We can use conventional methods:
Carneiro et al, Genome Biology 2012

But!
• Why not make raw assemblies out of the reads?
[Figure, two panels. Left, “Find overlapping reads”: for all pairs of reads (i, j), test i =? j. Right, “Find consensus sequence”: …ACACTGCGACTCATCGACTC…]
• Problems:
1. Overlapping is O(n^2), and a single evaluation is expensive anyway
2. Typical algorithms find a single consensus sequence; what if we’ve got
polymorphisms?

Fast Overlapping with
MinHashing
• Wonderful realization by Berlin et al. [1]: overlapping is
similar to the document similarity problem
• Use MinHashing to approximate similarity:
Per document/read, compute signature:
1. Cut into shingles
2. Apply random hashes to shingles
3. Take min over all random hashes
Hash into buckets:
Signatures of length l can be hashed into b buckets, so we expect
to compare all elements with similarity ≥ (1/b)^(b/l)
Compare:
For two documents with signatures of length l, Jaccard similarity is
estimated by (# equal hashes) / l
• Easy to implement in Spark: map, groupBy, map, filter
[1] Berlin et al., bioRxiv 2014
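The three stages above (signature, bucketing, comparison) can be sketched in a few lines. This is a sequential, plain-Python illustration under stated assumptions, not the Spark implementation the slide refers to: the hash family `(a*x + b) mod PRIME`, the use of `zlib.crc32` to turn shingles into integers, the banding of signatures, and all function names and parameter values are choices made for this example.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # large prime for the illustrative hash family


def shingles(read, k):
    """Cut a read into its set of overlapping k-mers (shingles)."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def make_hashes(l, seed=0):
    """Draw l random hash functions of the form (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(l)]


def signature(read, k, hashes):
    """MinHash signature: per hash function, the min over all shingles."""
    xs = [zlib.crc32(s.encode()) for s in shingles(read, k)]
    return [min((a * x + b) % PRIME for x in xs) for a, b in hashes]


def estimate_jaccard(sig1, sig2):
    """Estimated Jaccard similarity: (# equal hashes) / l."""
    return sum(h1 == h2 for h1, h2 in zip(sig1, sig2)) / len(sig1)


def candidate_pairs(signatures, bands):
    """Split signatures into bands, bucket reads by band contents, and
    return the pairs of reads that share at least one bucket."""
    l = len(next(iter(signatures.values())))
    rows = l // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


reads = {
    "r1": "ACACTGCGACTCATCGACTC",
    "r2": "ACACTGCGACTCATCGACTC",
    "r3": "GGGGTTTTGGGGTTTTAAAA",
}
hashes = make_hashes(l=20)
sigs = {name: signature(seq, 4, hashes) for name, seq in reads.items()}
candidates = candidate_pairs(sigs, bands=5)
similarity = estimate_jaccard(sigs["r1"], sigs["r2"])
```

Only pairs that land in a common bucket are compared, which is what avoids evaluating all O(n^2) pairs. In the Spark formulation, computing signatures corresponds to a map, bucketing to a groupBy, and the comparison and thresholding to a further map and filter.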

Overlaps to Assemblies
• Finding pairwise overlaps gives us a directed
graph between reads (lots of edges!)

Transitive Reduction
• We can find a consensus between clique members
• Or, we can reduce down:
• Via two iterations of Pregel!
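As an illustration of what "reducing down" means, here is a minimal sequential sketch of transitive reduction on a small overlap graph. It is not the Pregel formulation mentioned above; it assumes the overlap graph is acyclic, and the read names and the `transitive_reduction` helper are hypothetical.

```python
def transitive_reduction(edges):
    """Drop every edge (u, w) for which w is also reachable from u through
    some other successor of u. Assumes the graph is acyclic."""
    succ = {}
    for u, w in edges:
        succ.setdefault(u, set()).add(w)
        succ.setdefault(w, set())

    def reachable_from(start):
        # Depth-first search collecting every node reachable from start.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reduced = set()
    for u, w in edges:
        # (u, w) is redundant if another successor v of u can reach w.
        redundant = any(v != w and w in reachable_from(v) for v in succ[u])
        if not redundant:
            reduced.add((u, w))
    return reduced


# r1 overlaps r2 and r3, r2 overlaps r3 and r4, r3 overlaps r4.
overlaps = {("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r3", "r4"), ("r2", "r4")}
reduced = transitive_reduction(overlaps)
```

The edges r1→r3 and r2→r4 are implied by longer paths and are removed, leaving the simple chain r1→r2→r3→r4.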

Actually Making Calls
• From here, we need to call copy number per edge
• Probably via Newton-Raphson based on coverage; we’re not sure yet.
• Then, per position in each edge, call alleles:
Notes:
Equation is from Li, Bioinformatics 2011
g = genotype state
m = ploidy
ε = probability allele was erroneously observed
k = number of reads observed
l = number of reads observed matching “reference” allele
TBD: equation assumes biallelic observations at site and reference allele; we won’t have either of those conveniences…
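The equation itself is not in the text of this slide. As a hedged sketch, assuming it is the biallelic genotype likelihood from Li (2011), with g taken to be the number of reference alleles in the genotype and ε allowed to vary per read, the likelihood would be L(g) = (1/m^k) · Π_{j≤l} [(m−g)·ε_j + g·(1−ε_j)] · Π_{j>l} [(m−g)·(1−ε_j) + g·ε_j]. The function name and the example numbers below are hypothetical.

```python
def genotype_likelihood(g, m, eps, l):
    """Likelihood of genotype g (assumed here: number of reference alleles,
    0..m) given k = len(eps) reads, the first l of which match the
    reference allele. eps[j] is the error probability of read j."""
    k = len(eps)
    likelihood = 1.0 / m ** k
    for j, e in enumerate(eps):
        if j < l:
            # Read matches the reference allele.
            likelihood *= (m - g) * e + g * (1.0 - e)
        else:
            # Read matches the alternate allele.
            likelihood *= (m - g) * (1.0 - e) + g * e
    return likelihood


# Diploid site (m = 2) with 6 reads: 3 match the reference, 3 do not.
eps = [0.01] * 6
likelihoods = {g: genotype_likelihood(g, m=2, eps=eps, l=3) for g in range(3)}
best = max(likelihoods, key=likelihoods.get)
```

With an even split of reference and alternate reads, the heterozygous state (g = 1) has the highest likelihood, as expected. As the TBD note says, this form relies on biallelic observations and a reference allele, neither of which is available in the assembly setting.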

An aside: Monoallelic
Genotyping
• Traditional probabilistic models for variant calling
assume independence at each site
• However, this throws away a lot of information
• Can consider a different formulation of the problem:
• Build a graph of the alleles
• Find the allelic copy numbers that maximize
likelihood

Allelic Graph
[Figure: allelic graph with nodes ACACTCG, C, A, TCTCA, G, C, TCCACACT]
• Edges of graph define conditional probabilities
• E.g., if ACACTCG is covered by 30 reads, and
C is covered by 1 read, P(C | ACACTCG) is low
• Can efficiently marginalize probabilities over the graph
using the Eliminate algorithm [1], and exactly solve for argmax
[1] Jordan, “Probabilistic Graphical Models.”

Output
• Current assemblers emit FASTA contigs
• In layperson’s terms: long strings
• We’ll emit “multigs”, which we’ll map back to reference
graph
• Multig = multi-allelic (polymorphic) contig
• Working with UCSC, who’ve done some really neat work [1]
deriving formalisms & building software for mapping
between sequence graphs, and GA4GH ref. variation team
[1] Paten et al., “Mapping to a Reference Genome Structure”, arXiv 2014.

Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher,
Jey Kottalam, Christos Kozanitis, Adam Bloniarz
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael
Linderman, Jeff Hammerbacher
• GenomeBridge: Timothy Danford, Carl Yeksigian
• Cloudera: Uri Laserson
• Microsoft Research: Jeremy Elson, Ravi Pandya
• And many other open source contributors: 23
contributors to ADAM/BDG from >10 institutions
