SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
Rethinking Data-Intensive
Science Using Scalable
Analytics Systems
Frank Austin Nothaft
UC Berkeley AMP/ASPIRE Lab, @fnothaft
With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja,
Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson
Scientific revolutions are
driven by data acquisition
revolutions
Genome Sequencing
Source: NIH National Genome Research Institute
2014: ~230,000 genomes sequenced
15-250GB/genome = ~30TB/day
= ~10PB/year
Human Genome!
Project: ~10GB
1000 Genomes: 15TB
TCGA: 3PB
Sequencing advances line up well
with scalable analytics software
Source: NIH National Genome Research Institute
Google
MapReduce
Hadoop MR
Spark
Parquet
Mapping scientific systems to
commodity analytics systems
• Contemporary scientific systems are custom-built
• Leads to functionality from commodity systems being rebuilt
• We have an opportunity to rethink the abstractions that
scientific systems use:
• Migrate from a flat architecture to a stacked
architecture
• Expose higher level programming primitives
• Use commodity tools wherever possible
Common Traits of Legacy Data
Intensive Scientific Systems
1. Computation is workflow/pipeline oriented
2. Processing system has monolithic/flat architecture
3. Data is stored in flat files
Genomics Pipelines
Source: The Broad Institute of MIT/Harvard
Flat File Formats
• Scientific data is typically stored in application
specific file formats:
• Genomic reads: SAM/BAM, CRAM
• Genomic variants: VCF/BCF, MAF
• Genomic features: BED, NarrowPeak, GTF
• Centralized metadata makes it difficult to parallelize
applications
Flat Architectures
• APIs present very barebones abstractions:
• GATK: Sorted iterator over the genome
• Why are flat architectures bad?
1. Trivial: low level abstractions are not productive
2. Trivial: flat architectures create technical lock-in
3. Subtle: low level abstractions can introduce bugs
The perils of flattening…
• The trivial:
• You can improve performance by pushing data
access order into your data layout
• But now, you can’t easily compose pipeline stages
that have different access orders
• The obscure:
• If you access data via a sorted iterator, will you
incorrectly implement your algorithm?
A green field approach
First, define a schema
record AlignmentRecord {	
union { null, Contig } contig = null;	
union { null, long } start = null;	
union { null, long } end = null;	
union { null, int } mapq = null;	
union { null, string } readName = null;	
union { null, string } sequence = null;	
union { null, string } mateReference = null;	
union { null, long } mateAlignmentStart = null;	
union { null, string } cigar = null;	
union { null, string } qual = null;	
union { null, string } recordGroupName = null;	
union { int, null } basesTrimmedFromStart = 0;	
union { int, null } basesTrimmedFromEnd = 0;	
union { boolean, null } readPaired = false;	
union { boolean, null } properPair = false;	
union { boolean, null } readMapped = false;	
union { boolean, null } mateMapped = false;	
union { boolean, null } firstOfPair = false;	
union { boolean, null } secondOfPair = false;	
union { boolean, null } failedVendorQualityChecks = false;	
union { boolean, null } duplicateRead = false;	
union { boolean, null } readNegativeStrand = false;	
union { boolean, null } mateNegativeStrand = false;	
union { boolean, null } primaryAlignment = false;	
union { boolean, null } secondaryAlignment = false;	
union { boolean, null } supplementaryAlignment = false;	
union { null, string } mismatchingPositions = null;	
union { null, string } origQual = null;	
union { null, string } attributes = null;	
union { null, string } recordGroupSequencingCenter = null;	
union { null, string } recordGroupDescription = null;	
union { null, long } recordGroupRunDateEpoch = null;	
union { null, string } recordGroupFlowOrder = null;	
union { null, string } recordGroupKeySequence = null;	
union { null, string } recordGroupLibrary = null;	
union { null, int } recordGroupPredictedMedianInsertSize = null;	
union { null, string } recordGroupPlatform = null;	
union { null, string } recordGroupPlatformUnit = null;	
union { null, string } recordGroupSample = null;	
union { null, Contig } mateContig = null;	
}
Application
Transformations
Physical Storage
Attached Storage
Data Distribution
Parallel FS
Materialized Data
Columnar Storage
Evidence Access
MapReduce/DBMS
Presentation
Enriched Models
Schema
Data Models
Application
Transformations
Physical Storage
Attached Storage
Data Distribution
Parallel FS
Materialized Data
Columnar Storage
Evidence Access
MapReduce/DBMS
Presentation
Enriched Models
Schema
Data Models
A schema provides a
narrow waistrecord AlignmentRecord {	
union { null, Contig } contig = null;	
union { null, long } start = null;	
union { null, long } end = null;	
union { null, int } mapq = null;	
union { null, string } readName = null;	
union { null, string } sequence = null;	
union { null, string } mateReference = null;	
union { null, long } mateAlignmentStart = null;	
union { null, string } cigar = null;	
union { null, string } qual = null;	
union { null, string } recordGroupName = null;	
union { int, null } basesTrimmedFromStart = 0;	
union { int, null } basesTrimmedFromEnd = 0;	
union { boolean, null } readPaired = false;	
union { boolean, null } properPair = false;	
union { boolean, null } readMapped = false;	
union { boolean, null } mateMapped = false;	
union { boolean, null } firstOfPair = false;	
union { boolean, null } secondOfPair = false;	
union { boolean, null } failedVendorQualityChecks = false;	
union { boolean, null } duplicateRead = false;	
union { boolean, null } readNegativeStrand = false;	
union { boolean, null } mateNegativeStrand = false;	
union { boolean, null } primaryAlignment = false;	
union { boolean, null } secondaryAlignment = false;	
union { boolean, null } supplementaryAlignment = false;	
union { null, string } mismatchingPositions = null;	
union { null, string } origQual = null;	
union { null, string } attributes = null;	
union { null, string } recordGroupSequencingCenter = null;	
union { null, string } recordGroupDescription = null;	
union { null, long } recordGroupRunDateEpoch = null;	
union { null, string } recordGroupFlowOrder = null;	
union { null, string } recordGroupKeySequence = null;	
union { null, string } recordGroupLibrary = null;	
union { null, int } recordGroupPredictedMedianInsertSize = null;	
union { null, string } recordGroupPlatform = null;	
union { null, string } recordGroupPlatformUnit = null;	
union { null, string } recordGroupSample = null;	
union { null, Contig } mateContig = null;	
}
Application
Transformations
Physical Storage
Attached Storage
Data Distribution
Parallel FS
Materialized Data
Columnar Storage
Evidence Access
MapReduce/DBMS
Presentation
Enriched Models
Schema
Data Models
Accelerate common
access patterns
• In genomics, we commonly
have to find observations that
overlap in a coordinate plane
• This coordinate plane is
genomics specific, and is
known a priori
• We can use our knowledge of
the coordinate plane to
implement a fast overlap join
Application
Transformations
Physical Storage
Attached Storage
Data Distribution
Parallel FS
Materialized Data
Columnar Storage
Evidence Access
MapReduce/DBMS
Presentation
Enriched Models
Schema
Data Models
Application
Transformations
Physical Storage
Attached Storage
Data Distribution
Parallel FS
Materialized Data
Columnar Storage
Evidence Access
MapReduce/DBMS
Presentation
Enriched Models
Schema
Data Models
Pick appropriate storage
• When accessing scientific
datasets, we frequently slice and
dice the dataset:
• Algorithms may touch
subsets of columns
• We don’t always touch the
whole dataset
• This is a good match for
columnar storage
Application
Transformations
Physical Storage
Attached Storage
Data Distribution
Parallel FS
Materialized Data
Columnar Storage
Evidence Access
MapReduce/DBMS
Presentation
Enriched Models
Schema
Data Models
Application
Transformations
Physical Storage
Attached Storage
Data Distribution
Parallel FS
Materialized Data
Columnar Storage
Evidence Access
MapReduce/DBMS
Presentation
Enriched Models
Schema
Data Models
Is introducing a new data
model really a good idea?
Source: XKCD, http://xkcd.com/927/
A subtle point:!
Proper stack design can simplify
backwards compatibility
To support legacy data formats, you define a way to
serialize/deserialize the schema into/from the
legacy flat file format!
Data Distribution
Materialized Data
Legacy File Format
Schema
Data Models
Data Distribution
Materialized Data
Columnar Storage
Schema
Data Models
A subtle point:!
Proper stack design can simplify
backwards compatibility
This is a view!
Data Distribution
Materialized Data
Legacy File Format
Schema
Data Models
Data Distribution
Materialized Data
Columnar Storage
Schema
Data Models
A well designed stack
simplifies application design
Application
Transformations
Physical Storage
Attached Storage
Data Distribution
Parallel FS
Materialized Data
Columnar Storage
Evidence Access
MapReduce/DBMS
Presentation
Enriched Models
Schema
Data Models
Variant calling & analysis,
RNA-seq analysis, etc.
Disk, SDD, block
store, memory cache
HDFS, Tachyon, HPC file
systems, S3
Load data from Parquet and
legacy formats
Spark, Spark-SQL,
Hadoop
Enriched Read/Variant
Avro Schema for reads,
variants, and genotypes
Users define analyses
via transformations
Enriched models provide convenient
methods on common models
The evidence access layer
efficiently executes transformations
Schemas define the logical
structure of basic genomic objects
Common interfaces map logical
schema to bytes on disk
Parallel file system layer
coordinates distribution of data
Decoupling storage enables
performance/cost tradeoff
How does this perform
on real scientific data?
ADAM performs genomic
preprocessing
Source: The Broad Institute of MIT/Harvard
ADAM’s Performance
• Achieve linear scalability out
to 128 nodes for most tasks
• Up to 3x improvement over
current tools on a single node
Analysis run using Amazon EC2, single node was i2.8xlarge, cluster was r3.2xlarge
Scripts available at https://www.github.com/bigdatagenomics/bdg-services.git
Astronomy Pipelines
Source: The LSST Project
Astronomy Image
Co-addition Performance
• Scales out to 16 nodes
• ~3x improvement over extant
tool on a single node
Analysis run using Amazon EC2, cluster was c3.8xlarge (HPC optimized)
Conclusions
• There is a huge increase in the amount of scientific
data being processed
• Although scientific processing pipelines tend to be
custom solutions, we can replace these pipelines
with general, DBMS backed solutions
• When we move to a general solution, we can gain
performance without losing correctness
Acknowledgements
• ADAM (https://www.github.com/bigdatagenomics/adam):!
• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey
Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony
Joseph, Dave Patterson!
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman,
Jeff Hammerbacher!
• GenomeBridge: Carl Yeksigian!
• Cloudera: Uri Laserson!
• Microsoft Research: Ravi Pandya!
• UC Santa Cruz: Benedict Paten, David Haussler!
• KIRA (https://www.github.com/BIDS/Kira):!
• UC Berkeley: Zhao Zhang, Mike Franklin, Evan Sparks, Kyle Barbary,
Oliver Zahn, Saul Perlmutter!
• PoC code at https://github.com/zhaozhang/SparkMontage

Weitere ähnliche Inhalte

Was ist angesagt?

Twinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query ToolTwinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query ToolLeigh Dodds
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using RVictoria López
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionDataWorks Summit
 
12. Heaps - Data Structures using C++ by Varsha Patil
12. Heaps - Data Structures using C++ by Varsha Patil12. Heaps - Data Structures using C++ by Varsha Patil
12. Heaps - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.Tatiana Tarasova
 
Data structures cs301 power point slides lecture 01
Data structures   cs301 power point slides lecture 01Data structures   cs301 power point slides lecture 01
Data structures cs301 power point slides lecture 01shaziabibi5
 
Analysis of different similarity measures: Simrank
Analysis of different similarity measures: SimrankAnalysis of different similarity measures: Simrank
Analysis of different similarity measures: SimrankAbhishek Mungoli
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsEdwin de Jonge
 
Presentation on data preparation with pandas
Presentation on data preparation with pandasPresentation on data preparation with pandas
Presentation on data preparation with pandasAkshitaKanther
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on RAjay Ohri
 
Practical dimensions
Practical dimensionsPractical dimensions
Practical dimensionstholem
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 
2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata2010 03 Lodoxf Openflydata
2010 03 Lodoxf OpenflydataJun Zhao
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query OptimizationAli Usman
 
Querying Linked Data with SPARQL
Querying Linked Data with SPARQLQuerying Linked Data with SPARQL
Querying Linked Data with SPARQLOlaf Hartig
 

Was ist angesagt? (20)

Twinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query ToolTwinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query Tool
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
SPARQL Cheat Sheet
SPARQL Cheat SheetSPARQL Cheat Sheet
SPARQL Cheat Sheet
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
 
12. Heaps - Data Structures using C++ by Varsha Patil
12. Heaps - Data Structures using C++ by Varsha Patil12. Heaps - Data Structures using C++ by Varsha Patil
12. Heaps - Data Structures using C++ by Varsha Patil
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.
 
Data structures cs301 power point slides lecture 01
Data structures   cs301 power point slides lecture 01Data structures   cs301 power point slides lecture 01
Data structures cs301 power point slides lecture 01
 
Analysis of different similarity measures: Simrank
Analysis of different similarity measures: SimrankAnalysis of different similarity measures: Simrank
Analysis of different similarity measures: Simrank
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
 
Presentation on data preparation with pandas
Presentation on data preparation with pandasPresentation on data preparation with pandas
Presentation on data preparation with pandas
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
Practical dimensions
Practical dimensionsPractical dimensions
Practical dimensions
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata2010 03 Lodoxf Openflydata
2010 03 Lodoxf Openflydata
 
Data Structure
Data StructureData Structure
Data Structure
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query Optimization
 
Querying Linked Data with SPARQL
Querying Linked Data with SPARQLQuerying Linked Data with SPARQL
Querying Linked Data with SPARQL
 

Ähnlich wie Rethinking Data-Intensive Science Using Scalable Analytics Systems

Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMfnothaft
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Spark Summit
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadofnothaft
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013Sanjeev Mishra
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
What's next in Julia
What's next in JuliaWhat's next in Julia
What's next in JuliaJiahao Chen
 
Barga Data Science lecture 5
Barga Data Science lecture 5Barga Data Science lecture 5
Barga Data Science lecture 5Roger Barga
 
Workflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingWorkflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingRayhan Ferdous
 
Bringing OpenClinica Data into SAS
Bringing OpenClinica Data into SASBringing OpenClinica Data into SAS
Bringing OpenClinica Data into SASRick Watts
 
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...NoSQLmatters
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Rakebul Hasan
 
Computer notes - data structures
Computer notes - data structuresComputer notes - data structures
Computer notes - data structuresecomputernotes
 
Accessing hemibrain data using Neuprintr
Accessing hemibrain data using Neuprintr Accessing hemibrain data using Neuprintr
Accessing hemibrain data using Neuprintr Alexander Bates
 
computer notes - Data Structures - 1
computer notes - Data Structures - 1computer notes - Data Structures - 1
computer notes - Data Structures - 1ecomputernotes
 
Data structure and algorithm.
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm. Abdul salam
 
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...Dominic Suciu
 

Ähnlich wie Rethinking Data-Intensive Science Using Scalable Analytics Systems (20)

Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocado
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
What's next in Julia
What's next in JuliaWhat's next in Julia
What's next in Julia
 
Barga Data Science lecture 5
Barga Data Science lecture 5Barga Data Science lecture 5
Barga Data Science lecture 5
 
Workflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingWorkflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to Reporting
 
Bringing OpenClinica Data into SAS
Bringing OpenClinica Data into SASBringing OpenClinica Data into SAS
Bringing OpenClinica Data into SAS
 
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
 
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
 
User biglm
User biglmUser biglm
User biglm
 
Computer notes - data structures
Computer notes - data structuresComputer notes - data structures
Computer notes - data structures
 
Accessing hemibrain data using Neuprintr
Accessing hemibrain data using Neuprintr Accessing hemibrain data using Neuprintr
Accessing hemibrain data using Neuprintr
 
computer notes - Data Structures - 1
computer notes - Data Structures - 1computer notes - Data Structures - 1
computer notes - Data Structures - 1
 
Data structure and algorithm.
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm.
 
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Rea...
 
ORM JPA
ORM JPAORM JPA
ORM JPA
 

Mehr von fnothaft

Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMfnothaft
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analysesfnothaft
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Modelsfnothaft
 
CS176: Genome Assembly
CS176: Genome AssemblyCS176: Genome Assembly
CS176: Genome Assemblyfnothaft
 
Execution Environments
Execution EnvironmentsExecution Environments
Execution Environmentsfnothaft
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Handsfnothaft
 
Adam bosc-071114
Adam bosc-071114Adam bosc-071114
Adam bosc-071114fnothaft
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014fnothaft
 

Mehr von fnothaft (9)

Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAM
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analyses
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Models
 
CS176: Genome Assembly
CS176: Genome AssemblyCS176: Genome Assembly
CS176: Genome Assembly
 
Execution Environments
Execution EnvironmentsExecution Environments
Execution Environments
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Hands
 
Adam bosc-071114
Adam bosc-071114Adam bosc-071114
Adam bosc-071114
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 

Kürzlich hochgeladen

Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosVictor Morales
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfManish Kumar
 
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书rnrncn29
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTSneha Padhiar
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewsandhya757531
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsResearcher Researcher
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfDrew Moseley
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSneha Padhiar
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 

Kürzlich hochgeladen (20)

Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitos
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
 
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
『澳洲文凭』买麦考瑞大学毕业证书成绩单办理澳洲Macquarie文凭学位证书
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overview
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending Actuators
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 

Rethinking Data-Intensive Science Using Scalable Analytics Systems

  • 1. Rethinking Data-Intensive Science Using Scalable Analytics Systems Frank Austin Nothaft UC Berkeley AMP/ASPIRE Lab, @fnothaft With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson
  • 2. Scientific revolutions are driven by data acquisition revolutions
  • 3. Genome Sequencing Source: NIH National Genome Research Institute 2014: ~230,000 genomes sequenced 15-250GB/genome = ~30TB/day = ~10PB/year Human Genome! Project: ~10GB 1000 Genomes: 15TB TCGA: 3PB
  • 4. Sequencing advances line up well with scalable analytics software Source: NIH National Genome Research Institute Google MapReduce Hadoop MR Spark Parquet
  • 5. Mapping scientific systems to commodity analytics systems • Contemporary scientific systems are custom-built • Leads to functionality from commodity systems being rebuilt • We have an opportunity to rethink the abstractions that scientific systems use: • Migrate from a flat architecture to a stacked architecture • Expose higher level programming primitives • Use commodity tools wherever possible
  • 6. Common Traits of Legacy Data Intensive Scientific Systems 1. Computation is workflow/pipeline oriented 2. Processing system has monolithic/flat architecture 3. Data is stored in flat files
  • 7. Genomics Pipelines Source: The Broad Institute of MIT/Harvard
  • 8. Flat File Formats • Scientific data is typically stored in application specific file formats: • Genomic reads: SAM/BAM, CRAM • Genomic variants: VCF/BCF, MAF • Genomic features: BED, NarrowPeak, GTF • Centralized metadata makes it difficult to parallelize applications
  • 9. Flat Architectures • APIs present very barebones abstractions: • GATK: Sorted iterator over the genome • Why are flat architectures bad? 1. Trivial: low level abstractions are not productive 2. Trivial: flat architectures create technical lock-in 3. Subtle: low level abstractions can introduce bugs
  • 10. The perils of flattening… • The trivial: • You can improve performance by pushing data access order into your data layout • But now, you can’t easily compose pipeline stages that have different access orders • The obscure: • If you access data via a sorted iterator, will you incorrectly implement your algorithm?
  • 11. A green field approach
  • 12. First, define a schema record AlignmentRecord { union { null, Contig } contig = null; union { null, long } start = null; union { null, long } end = null; union { null, int } mapq = null; union { null, string } readName = null; union { null, string } sequence = null; union { null, string } mateReference = null; union { null, long } mateAlignmentStart = null; union { null, string } cigar = null; union { null, string } qual = null; union { null, string } recordGroupName = null; union { int, null } basesTrimmedFromStart = 0; union { int, null } basesTrimmedFromEnd = 0; union { boolean, null } readPaired = false; union { boolean, null } properPair = false; union { boolean, null } readMapped = false; union { boolean, null } mateMapped = false; union { boolean, null } firstOfPair = false; union { boolean, null } secondOfPair = false; union { boolean, null } failedVendorQualityChecks = false; union { boolean, null } duplicateRead = false; union { boolean, null } readNegativeStrand = false; union { boolean, null } mateNegativeStrand = false; union { boolean, null } primaryAlignment = false; union { boolean, null } secondaryAlignment = false; union { boolean, null } supplementaryAlignment = false; union { null, string } mismatchingPositions = null; union { null, string } origQual = null; union { null, string } attributes = null; union { null, string } recordGroupSequencingCenter = null; union { null, string } recordGroupDescription = null; union { null, long } recordGroupRunDateEpoch = null; union { null, string } recordGroupFlowOrder = null; union { null, string } recordGroupKeySequence = null; union { null, string } recordGroupLibrary = null; union { null, int } recordGroupPredictedMedianInsertSize = null; union { null, string } recordGroupPlatform = null; union { null, string } recordGroupPlatformUnit = null; union { null, string } recordGroupSample = null; union { null, Contig } mateContig = null; } Application Transformations Physical Storage Attached Storage Data Distribution Parallel FS Materialized Data Columnar Storage Evidence Access MapReduce/DBMS Presentation Enriched Models Schema Data Models
  • 13. Application Transformations Physical Storage Attached Storage Data Distribution Parallel FS Materialized Data Columnar Storage Evidence Access MapReduce/DBMS Presentation Enriched Models Schema Data Models A schema provides a narrow waistrecord AlignmentRecord { union { null, Contig } contig = null; union { null, long } start = null; union { null, long } end = null; union { null, int } mapq = null; union { null, string } readName = null; union { null, string } sequence = null; union { null, string } mateReference = null; union { null, long } mateAlignmentStart = null; union { null, string } cigar = null; union { null, string } qual = null; union { null, string } recordGroupName = null; union { int, null } basesTrimmedFromStart = 0; union { int, null } basesTrimmedFromEnd = 0; union { boolean, null } readPaired = false; union { boolean, null } properPair = false; union { boolean, null } readMapped = false; union { boolean, null } mateMapped = false; union { boolean, null } firstOfPair = false; union { boolean, null } secondOfPair = false; union { boolean, null } failedVendorQualityChecks = false; union { boolean, null } duplicateRead = false; union { boolean, null } readNegativeStrand = false; union { boolean, null } mateNegativeStrand = false; union { boolean, null } primaryAlignment = false; union { boolean, null } secondaryAlignment = false; union { boolean, null } supplementaryAlignment = false; union { null, string } mismatchingPositions = null; union { null, string } origQual = null; union { null, string } attributes = null; union { null, string } recordGroupSequencingCenter = null; union { null, string } recordGroupDescription = null; union { null, long } recordGroupRunDateEpoch = null; union { null, string } recordGroupFlowOrder = null; union { null, string } recordGroupKeySequence = null; union { null, string } recordGroupLibrary = null; union { null, int } recordGroupPredictedMedianInsertSize = null; union { null, string } recordGroupPlatform = null; union { null, string } recordGroupPlatformUnit = null; union { null, string } recordGroupSample = null; union { null, Contig } mateContig = null; } Application Transformations Physical Storage Attached Storage Data Distribution Parallel FS Materialized Data Columnar Storage Evidence Access MapReduce/DBMS Presentation Enriched Models Schema Data Models
  • 14. Accelerate common access patterns • In genomics, we commonly have to find observations that overlap in a coordinate plane • This coordinate plane is genomics specific, and is known a priori • We can use our knowledge of the coordinate plane to implement a fast overlap join Application Transformations Physical Storage Attached Storage Data Distribution Parallel FS Materialized Data Columnar Storage Evidence Access MapReduce/DBMS Presentation Enriched Models Schema Data Models Application Transformations Physical Storage Attached Storage Data Distribution Parallel FS Materialized Data Columnar Storage Evidence Access MapReduce/DBMS Presentation Enriched Models Schema Data Models
  • 15. Pick appropriate storage • When accessing scientific datasets, we frequently slice and dice the dataset: • Algorithms may touch subsets of columns • We don’t always touch the whole dataset • This is a good match for columnar storage Application Transformations Physical Storage Attached Storage Data Distribution Parallel FS Materialized Data Columnar Storage Evidence Access MapReduce/DBMS Presentation Enriched Models Schema Data Models Application Transformations Physical Storage Attached Storage Data Distribution Parallel FS Materialized Data Columnar Storage Evidence Access MapReduce/DBMS Presentation Enriched Models Schema Data Models
  • 16. Is introducing a new data model really a good idea? Source: XKCD, http://xkcd.com/927/
  • 17. A subtle point:! Proper stack design can simplify backwards compatibility To support legacy data formats, you define a way to serialize/deserialize the schema into/from the legacy flat file format! Data Distribution Materialized Data Legacy File Format Schema Data Models Data Distribution Materialized Data Columnar Storage Schema Data Models
  • 18. A subtle point:! Proper stack design can simplify backwards compatibility This is a view! Data Distribution Materialized Data Legacy File Format Schema Data Models Data Distribution Materialized Data Columnar Storage Schema Data Models
  • 19. A well designed stack simplifies application design Application Transformations Physical Storage Attached Storage Data Distribution Parallel FS Materialized Data Columnar Storage Evidence Access MapReduce/DBMS Presentation Enriched Models Schema Data Models Variant calling & analysis, RNA-seq analysis, etc. Disk, SDD, block store, memory cache HDFS, Tachyon, HPC file systems, S3 Load data from Parquet and legacy formats Spark, Spark-SQL, Hadoop Enriched Read/Variant Avro Schema for reads, variants, and genotypes Users define analyses via transformations Enriched models provide convenient methods on common models The evidence access layer efficiently executes transformations Schemas define the logical structure of basic genomic objects Common interfaces map logical schema to bytes on disk Parallel file system layer coordinates distribution of data Decoupling storage enables performance/cost tradeoff
  • 20. How does this perform on real scientific data?
  • 21. ADAM performs genomic preprocessing Source: The Broad Institute of MIT/Harvard
  • 22. ADAM’s Performance • Achieve linear scalability out to 128 nodes for most tasks • Up to 3x improvement over current tools on a single node Analysis run using Amazon EC2, single node was i2.8xlarge, cluster was r3.2xlarge Scripts available at https://www.github.com/bigdatagenomics/bdg-services.git
  • 24. Astronomy Image Co-addition Performance • Scales out to 16 nodes • ~3x improvement over extant tool on a single node Analysis run using Amazon EC2, cluster was c3.8xlarge (HPC optimized)
  • 25. Conclusions • There is a huge increase in the amount of scientific data being processed • Although scientific processing pipelines tend to be custom solutions, we can replace these pipelines with general, DBMS backed solutions • When we move to a general solution, we can gain performance without losing correctness
  • 26. Acknowledgements • ADAM (https://www.github.com/bigdatagenomics/adam):! • UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson! • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher! • GenomeBridge: Carl Yeksigian! • Cloudera: Uri Laserson! • Microsoft Research: Ravi Pandya! • UC Santa Cruz: Benedict Paten, David Haussler! • KIRA (https://www.github.com/BIDS/Kira):! • UC Berkeley: Zhao Zhang, Mike Franklin, Evan Sparks, Kyle Barbary, Oliver Zahn, Saul Perlmutter! • PoC code at https://github.com/zhaozhang/SparkMontage