Genome Analysis Pipelines with Spark and ADAM

Spark is a powerful new tool for processing large volumes of data quickly across a cluster of networked computers. Typical bioinformatics workflow requirements are well matched to Spark’s capabilities. However, Spark is not commonly used because many legacy bioinformatics applications make assumptions about their computing environment. These assumptions present a barrier to integrating the tools into more modern computing environments. These barriers are quickly coming down. ADAM is a software library and set of tools built on top of Spark that make it easy to work with file formats commonly used for genome analysis, such as FastQ, BAM, and VCF. In this presentation, we’ll explore how a step that is common to many bioinformatics workflows, sequence alignment, can be done with Bowtie and ADAM inside a Spark environment to quickly align short reads to a reference genome. A complete code example is demonstrated and provided at https://github.com/allenday/spark-genome-alignment-demo

© 2015 MapR Technologies 3
Alignment
[Diagram: DNA Reads + Reference Sequences → Aligned Reads → Downstream Applications…]

Alignment
[Diagram: DNA Reads + Reference Sequences → Align() → Aligned Reads → Downstream Applications…]

Possible Align() Outcomes
[Diagram: Unaligned DNA Reads + Reference Sequences → Align() → one of:]
• Single Location Reads
• Multiple Location Reads
• Unlocatable Reads

Many-to-Many Relationship Between Reads and Locations
[Diagram: reads linked to locations]
Reads:
• Read1
• Read2
• Read3
• Read4
• NULL
Locations:
• LocationA
• LocationB
• LocationC
• LocationD
• LocationA
• NULL
• LocationE
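
One way to picture the read/location relationship in code is as a list of (read, location) pairings where either side may be missing. The sketch below is only an illustration: the `Hit` and `Outcome` types and the function names are hypothetical, not part of ADAM or Bowtie, and `None` stands in for NULL.

```scala
// Hypothetical data model: one row per (read, location) pairing.
// None on either side plays the role of NULL.
object AlignOutcomes {
  final case class Hit(read: Option[String], location: Option[String])

  sealed trait Outcome
  case object SingleLocation extends Outcome
  case object MultipleLocation extends Outcome
  case object Unlocatable extends Outcome

  // Classify each read by how many distinct locations it was paired with.
  def classify(hits: Seq[Hit]): Map[String, Outcome] =
    hits
      .collect { case Hit(Some(read), location) => read -> location }
      .groupBy(_._1)
      .map { case (read, pairs) =>
        val locations = pairs.collect { case (_, Some(loc)) => loc }.distinct
        val outcome: Outcome = locations.size match {
          case 0 => Unlocatable
          case 1 => SingleLocation
          case _ => MultipleLocation
        }
        read -> outcome
      }

  // Locations that no read was paired with (rows whose read is NULL).
  def uncoveredLocations(hits: Seq[Hit]): Set[String] =
    hits.collect { case Hit(None, Some(loc)) => loc }.toSet
}
```

Grouping by read is what turns the flat many-to-many list into the three Align() outcomes: single-location, multiple-location, and unlocatable reads.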

Parallelizing Alignment
[Diagram: Unaligned DNA Reads → Split() into Part1, Part2, Part3 → Align() each part to Locations → Concat() → Sort() → Etc… → Aligned DNA Reads]
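
The Split() / Align() / Concat() / Sort() flow can be sketched with plain Scala collections. This is a toy, assuming an exact substring search as a stand-in for a real aligner such as Bowtie; `ParallelAlignSketch`, `Aligned`, and `partSize` are hypothetical names used only for illustration.

```scala
object ParallelAlignSketch {
  final case class Aligned(readId: String, position: Int)

  // Toy stand-in for a real aligner: exact substring search in the reference.
  def align(reference: String)(read: (String, String)): Option[Aligned] = {
    val (id, sequence) = read
    val pos = reference.indexOf(sequence)
    if (pos >= 0) Some(Aligned(id, pos)) else None
  }

  def run(reference: String,
          reads: Seq[(String, String)],
          partSize: Int): Seq[Aligned] = {
    // Split(): cut the reads into independent parts.
    val parts = reads.grouped(partSize).toSeq
    // Align(): each part is aligned on its own; unlocatable reads are dropped.
    val alignedParts = parts.map(_.flatMap(r => align(reference)(r).toList))
    // Concat(): merge the per-part results.
    val combined = alignedParts.flatten
    // Sort(): order alignments by position on the reference.
    combined.sortBy(_.position)
  }
}
```

Because each part is aligned without reference to the others, the Align() step is the one that can be spread across cluster nodes.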

Using HPC+SAN has Bottlenecks (SGE, PBS, Condor, Etc)
[Diagram: Part1, Part2, Part3, annotated with: Volume Read Bottleneck, Volume Write Bottleneck, Read & Write Bottleneck]

Using Spark Eliminates Bottlenecks
[Diagram: Split() → Align() → Concat() → Sort()]

How to Do it? FastQ => BAM
Prepare Environment
• Get Input Reference Sequences in FastA format (one time step)
• Index Reference Sequences with Bowtie (one time step)
Get Input Data
• Get Input Unaligned Reads in FastQ format (per input data set)
Do the Work
• Convert Reads to ADAM Format with Spark (per input data set)
• Use Spark Pipe to Align the Reads (per input data fragment)
• Write the Alignments to SAM or BAM format (per input data, optional)
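
Spark's `RDD.pipe` hands each partition's elements to an external process as lines of text, so a four-line FastQ record has to become a single line before it can be piped. The sketch below shows that idea in plain Scala without Spark. It is not the demo's driver script: `FastqSketch`, `FastqRecord`, and the tab-separated encoding are assumptions made for illustration.

```scala
object FastqSketch {
  final case class FastqRecord(id: String, sequence: String, quality: String)

  // FastQ stores each read as four lines: "@id", bases, "+", qualities.
  // Incomplete or malformed groups of four lines are skipped.
  def parse(lines: Iterator[String]): Iterator[FastqRecord] =
    lines.grouped(4).collect {
      case Seq(header, sequence, plus, quality)
          if header.startsWith("@") && plus.startsWith("+") =>
        FastqRecord(header.drop(1), sequence, quality)
    }

  // Hypothetical one-line encoding so a record survives line-oriented piping.
  def toLine(record: FastqRecord): String =
    Seq(record.id, record.sequence, record.quality).mkString("\t")
}
```

With one record per line, splitting the input into fragments can never cut a read in half, which is what makes the per-fragment alignment step safe.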

Let’s look at some code…
• Let’s check on the build we kicked off earlier.
• And the driver script:
• cat $DEMO/bin/bowtie_pipe_single.scala
• I said before that it’s possible to be 100% compatible with legacy tools by writing out to SAM or BAM.
• You might not want to do this…
Benefits of Keeping Data in Spark / ADAM
• ADAM file sizes are about 20% smaller than BAM.
• ADAM files are faster for common operations (filters, range queries).
• ADAM files sort more quickly than `samtools sort …`. This can be a huge time savings.

• Some other commands built-in, take a look:
• $ADAM_HOME/bin/adam-submit
• $ADAM_HOME/bin/adam-submit count_kmers $DEMO/build/data/reads.sam /tmp/count_kmers 3
• head /tmp/count_kmers/part-00000
• All of this is easier without leaving the Spark shell
• And it’s faster, too…
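
For intuition about what a k-mer count computes, here is a minimal sketch in plain Scala. It assumes the trailing `3` in the command above is the k-mer length; `KmerSketch` and `countKmers` are hypothetical names, and this is not ADAM's implementation.

```scala
object KmerSketch {
  // Count every length-k substring (k-mer) across all sequences.
  def countKmers(sequences: Seq[String], k: Int): Map[String, Int] = {
    require(k > 0, "k must be positive")
    sequences
      // sliding(k) can yield a short final window; keep only full k-mers.
      .flatMap(_.sliding(k).filter(_.length == k))
      .groupBy(identity)
      .map { case (kmer, occurrences) => kmer -> occurrences.size }
  }
}
```

The same flatMap-then-group shape maps directly onto a distributed collection, which is why this kind of count parallelizes well.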

And it’s faster, too… Using Spark Eliminates Bottlenecks
[Diagram: Split() → Align() → Concat() → Sort()]

Thanks! Questions?
@allenday, @mapr
aday@mapr.com
linkedin.com/in/allenday

2. Let’s kick off a build while we talk…
• Follow the instructions at
• https://github.com/allenday/spark-genome-alignment-demo
• While that’s cooking…
• What’s this all about?
