ADAM: Fast, Scalable Genome Analysis
http://bigdatagenomics.github.io

Matt Massie
Twitter: @matt_massie
Email: massie@berkeley.edu
University of California, Berkeley
http://amplab.cs.berkeley.edu
Design
• Create a platform with an easy programming environment for developers
• Provide both single- and multi-sample methods that are fast and scalable for whole-genome, high-coverage data
• Allow for multiple views of the same data, e.g. SQL/table, graph analysis, iterator on records, Resilient Distributed Datasets
• Leverage existing open-source systems and plug into current “Big Data” ecosystems
• Deployable on an in-house cluster or on any cloud vendor: Amazon EC2, Google Compute Engine, or Microsoft Azure
• Everything is a file, so bulk data transfer requires only standard tools like rsync, scp, distcp, s3sync, etc.
Implementation
[Chart: project commit history]
• Accelerated work began September 2013
• Built using the Apache Spark execution engine, with Apache Avro and Parquet for file formats
• 20K lines of Scala code
• Nine contributors from Mt. Sinai, GenomeBridge, the Broad Institute, and others
• Apache-licensed open source
Read Pre-Processing
Raw Reads → Mapping → Sorted Mapping → Local Alignment → Mark Duplicates → Base Quality Score Recalibration → Calling-Ready Reads

Features
• ADAM
  • Read pre-processing: sort, mark duplicates, BQSR
  • Read comparison across multiple covariates
  • Converters between legacy and ADAM formats
• Avocado: a distributed variant caller
  • SNP caller, with a fully configurable pipeline via a config file
  • Local assembler
  • Support for integrating aligners in M/R frameworks
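To make the "mark duplicates" step above concrete, here is a minimal plain-Scala sketch of the idea (not ADAM's actual implementation, and greatly simplified — real duplicate marking also considers mate position and read-group library): reads that align to the same reference position and orientation are grouped, the highest-quality read in each group is kept as primary, and the rest are flagged. The `Read` class and its fields are hypothetical.

```scala
// Sketch of duplicate marking: group reads by (reference, start, strand),
// keep the read with the highest summed base quality, flag the rest.
case class Read(name: String, ref: String, start: Long,
                negativeStrand: Boolean, qualSum: Int,
                duplicate: Boolean = false)

def markDuplicates(reads: Seq[Read]): Seq[Read] =
  reads
    .groupBy(r => (r.ref, r.start, r.negativeStrand))
    .values
    .flatMap { group =>
      val best = group.maxBy(_.qualSum)                 // primary read of the group
      group.map(r => if (r eq best) r else r.copy(duplicate = true))
    }
    .toSeq

val reads = Seq(
  Read("a", "chrom20", 100L, negativeStrand = false, qualSum = 120),
  Read("b", "chrom20", 100L, negativeStrand = false, qualSum = 90),  // duplicate of "a"
  Read("c", "chrom20", 250L, negativeStrand = true,  qualSum = 110)
)
val marked = markDuplicates(reads)
```

In ADAM the same grouping runs as a distributed shuffle over an RDD of reads rather than over an in-memory `Seq`.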
http://avro.apache.org/
Avro
• Serialization system similar to Google Protobuf and Apache Thrift
• Data formats are fully described with a schema
• Data file format is self-descriptive and record-oriented
• Bindings for Java, C, C++, C#, JavaScript, Python, Ruby, PHP, and Perl (R in the works)
• Provides schema evolution, resolution, and projection
• Numerous conversion utilities: print Avro as JSON, extract a schema from JAXB, turn XSD/XML into Avro
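As an illustration of "data formats are fully described with a schema", a minimal Avro record schema might look like the following. The record and field names here are invented for illustration; they are not the actual ADAM schema.

```json
{
  "type": "record",
  "name": "Read",
  "namespace": "example.genomics",
  "fields": [
    {"name": "referenceName", "type": ["null", "string"], "default": null},
    {"name": "start",         "type": ["null", "long"],   "default": null},
    {"name": "sequence",      "type": ["null", "string"], "default": null},
    {"name": "cigar",         "type": ["null", "string"], "default": null}
  ]
}
```

Because every field is declared in the schema, readers can evolve independently of writers (new fields get defaults) and can project out just the fields they need.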
Parquet
http://parquet.io/
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
• Based on the Google Dremel design
• Columnar file format
• Created by Twitter and Cloudera with contributions from dozens of open-source developers
• Limits I/O to only the data that is needed
• Fast scans: load only the columns you need, e.g. scan a read flag on a whole-genome, high-coverage file in less than a minute
• Compresses very well: ADAM files are 5-25% smaller than BAM files without loss of data
• Integrates easily with Avro, Hadoop, Hive, Shark, Impala, Pig, Jackson/JSON, Scrooge, and others
Read Data Example (with projection and predicate support)

Three reads:
chrom20  TCGA   4M
chrom20  GAAT   4M1D
chrom20  CCGAT  5M

Row-oriented layout:
chrom20 | TCGA | 4M | chrom20 | GAAT | 4M1D | chrom20 | CCGAT | 5M

Column-oriented layout:
chrom20 | chrom20 | chrom20 | TCGA | GAAT | CCGAT | 4M | 4M1D | 5M
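The example above can be sketched in a few lines of Scala. This is an in-memory illustration of the layout difference, not Parquet's on-disk format: a projection on one field reads just one column in the columnar layout, while row storage must deserialize every record.

```scala
// The three reads from the slide, as row-oriented records.
case class Rec(chrom: String, sequence: String, cigar: String)

val rows = Seq(
  Rec("chrom20", "TCGA",  "4M"),
  Rec("chrom20", "GAAT",  "4M1D"),
  Rec("chrom20", "CCGAT", "5M")
)

// Column-oriented: one array per field.
val chromCol = rows.map(_.chrom)
val seqCol   = rows.map(_.sequence)
val cigarCol = rows.map(_.cigar)

// Projecting "cigar only": the columnar layout touches cigarCol alone,
// whereas the row layout has to walk every Rec.
val cigarsFromColumns = cigarCol
val cigarsFromRows    = rows.map(_.cigar)
```

The same contrast is what makes the "scan a read flag on a whole genome in under a minute" claim plausible: the flag column is a tiny fraction of the file.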
http://spark.apache.org/
Apache Spark
• Grew out of Berkeley AMPLab research; now a top-level Apache project with commercial support
• Ease of use: Spark offers over 80 high-level operators that make it easy to build parallel apps in Scala, Java, Python, or R
• Easy to test code in “local” mode
• Speed: Spark has an advanced DAG execution engine that is 10-100x faster than Hadoop M/R
• Runs well on in-house clusters, Amazon EC2, and Google Compute Engine
• Can be used interactively for ad hoc analysis from the Scala, Python, and R shells or an IPython notebook
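One reason Spark code is easy to test locally is that its RDD operators mirror Scala's collection API, so the same logic can be exercised on a plain `Seq` before being pointed at a cluster. A sketch (plain collections standing in for an RDD; `reduceByKey` has no direct `Seq` equivalent, so `groupBy` plus a sum plays its role here):

```scala
// Count read flags, RDD-style, on an ordinary Scala collection.
// On Spark this would be: rdd.map(...).reduceByKey(_ + _).
val flags = Seq(("mapped", 1), ("duplicate", 1), ("mapped", 1), ("mapped", 1))

val counts: Map[String, Int] =
  flags.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2).sum }
```

Swapping the `Seq` for an RDD (and `groupBy`/`sum` for `reduceByKey`) distributes the identical computation across the cluster.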
Performance as Proof
[Bar chart: runtime in hours for Sort, Mark Duplicates, and BQSR across Picard, ADAM on a single node, and ADAM on 100 EC2 nodes; recovered values include 20.37, 17.73, and 8.93 hours versus 0.33, 0.47, and 0.75 hours on 100 EC2 nodes.]
1000 Genomes NA12878 whole genome, 60x coverage.
For comparison, Bina Technologies quotes 0.94 hours for BQSR at only 37x coverage.
Summary
• Schema-driven design allows developers to think at the logical layer
• Well-designed execution systems allow developers to focus on science and algorithms instead of implementation details
• Modern data formats enable distributed, fast computation and easier integration
• Moving computation to the data reduces transfers and improves performance
Thank you
Extra slides
Rank variants by read depth
and print the top 100

val join: RDD[(ADAMVariant, ADAMRecord)] =
  partitionAndJoin(sc, dict, variants, reads)
// Count reads per variant, then sort by count (descending) to get the top 100
val readCounts = join.map(p => (p._1, 1)).reduceByKey(_ + _)
val sorted = readCounts.map(p => (p._2, p._1)).sortByKey(ascending = false)
val top100 = sorted.take(100)
top100.foreach {
  case (count, variant) =>
    println("%d\t%s".format(count, variant.getId))
}
Flagstat
$ time adam flagstat NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.adam
757704193 + 0 in total (QC-passed reads + QC-failed reads)
8158052 + 0 primary duplicates
7594332 + 0 primary duplicates - both read and mate mapped
563720 + 0 primary duplicates - only read mapped
10344 + 0 primary duplicates - cross chromosome
10227903 + 0 secondary duplicates
10142158 + 0 secondary duplicates - both read and mate mapped
85745 + 0 secondary duplicates - only read mapped
4026853 + 0 secondary duplicates - cross chromosome
750027254 + 0 mapped (98.99%:0.00%)
757704193 + 0 paired in sequencing
377464374 + 0 read1
380239819 + 0 read2
724651663 + 0 properly paired (95.64%:0.00%)
745340038 + 0 with itself and mate mapped
4687216 + 0 singletons (0.62%:0.00%)
11135947 + 0 with mate mapped to a different chr
5557972 + 0 with mate mapped to a different chr (mapQ>=5)

real    1m58.688s
user    25m52.453s
sys     0m43.879s

Would take 40 minutes just to read from a single disk (assuming 100 MB/s)
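A quick sanity check of that 40-minute figure (the file size is implied by the claim rather than stated on the slide, so the ~240 GB below is an assumption derived from it):

```scala
// Back-of-the-envelope: reading ~240 GB sequentially at 100 MB/s.
val throughputMBperSec = 100.0
val fileSizeMB         = 240000.0   // ~240 GB, inferred from the 40-minute claim
val singleDiskMinutes  = fileSizeMB / throughputMBperSec / 60.0

// The flagstat run above finished in roughly 2 minutes of wall-clock time.
val speedup = singleDiskMinutes / 2.0
```

That is, the distributed run beats even a best-case single-disk sequential read by roughly 20x, because many disks are read in parallel and only the needed columns are scanned.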
Concordance between ADAM and GATK BQSR
RMSE: 1.48, exact matches: 50.06%
[Scatter plot of recalibrated quality scores: ADAM (y-axis, 0-50) vs. GATK (x-axis, 0-50)]
http://hadoop.apache.org/
Hadoop Distributed
File System (HDFS)

• Based on GoogleFS
• Single namespace across the entire cluster
• Uses commodity hardware (JBOD)
• Files are broken into blocks (e.g. 128 MB)
• Blocks are replicated for durability and performance
• Write-once, read-many access pattern
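The block model above is easy to quantify. A sketch of the accounting for a large input file (the file size and replication factor here are illustrative assumptions, not values from the slides, though 3x replication is a common HDFS default):

```scala
// How many blocks and replicas a large file occupies in HDFS.
val blockSizeMB = 128L
val fileSizeMB  = 240000L   // hypothetical ~240 GB input
val replication = 3L        // common HDFS default

// Ceiling division: the last block may be partially filled.
val blocks   = (fileSizeMB + blockSizeMB - 1) / blockSizeMB
val replicas = blocks * replication
```

Each block can be read by a different worker, which is what lets Spark move computation to the data instead of moving the data to the computation.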
$ adam

[ASCII-art “ADAM” banner]

Choose one of the following commands:

transform         : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations
print_tags        : Print the values and counts of all tags in a set of records
flagstat          : Print statistics on reads in an ADAM file (similar to samtools flagstat)
reads2ref         : Convert an ADAM read-oriented file to an ADAM reference-oriented file
mpileup           : Output the samtools mpileup text from ADAM reference-oriented data
print             : Print an ADAM formatted file
aggregate_pileups : Aggregate pileups in an ADAM reference-oriented file
listdict          : Print the contents of an ADAM sequence dictionary
compare           : Compare two ADAM files based on read name
compute_variants  : Compute variant data from genotypes
bam2adam          : Single-node BAM to ADAM converter (Note: the 'transform' command can take SAM or BAM as input)
adam2vcf          : Convert an ADAM variant to the VCF ADAM format
vcf2adam          : Convert a VCF file to the corresponding ADAM format
findreads         : Find reads that match particular individual or comparative criteria
fasta2adam        : Convert a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences
plugin            : Execute an AdamPlugin

GA4GH meeting at the Sanger Institute
