SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Downloaden Sie, um offline zu lesen
Doing next-gen sequencing
   analysis in the cloud.
       C. Titus Brown
       ctb@msu.edu
Acknowledgements
Lab members involved        Collaborators
   Adina Howe (w/Tiedje)    Jim Tiedje, MSU
   Jason Pell
   ArendHintze              Billie Swalla, UW
   RosangelaCanino-         Janet Jansson, LBNL
    Koning
   Qingpeng Zhang           Susannah Tringe, JGI
   Elijah Lowe
   LikitPreeyanon          Funding
   JiarongGuo
   Tim Brom                USDA NIFA; NSF IOS;
   KanchanPavangadkar           BEACON.
   Eric McDonald
CT Brown - Doing next-gen sequencing analysis in the cloud
“Be the change you want to see”
         We are aggressivelyopen…

Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog („titus brown blog‟)
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
  (What‟s a good license??)
 Preprints: on arXiv, q-bio:
  „kmer-percolation arxiv‟
  „diginormarxiv‟
The data catastrophe!
 Data set sizes growing faster than compute capacity
  (esp RAM).
 Many biological algorithms don‟t scale all that well,
  anyway.
 Algorithmically, we want:
   Single-pass.
   Compression approaches (lossy or otherwise).
   Low-memory data structures


 I, personally, think the last thing in the world we need
  is another standalone package: pre-filtering
  approaches.
    “Run our nifty approaches first, then feed into the
Digital normalization


                   Suppose you have a
                dilution factor of A (10) to
                B(1). To get 10x of B you
                  need to get 100x of A!
                          Overkill!!

                 This 100x will consume
                disk space and, because
                   of errors, memory.
Downsample based on de Bruijn
graph structure (which can be
derived online)
Digital normalization algorithm

for read in dataset:
  if median_kmer_count(read) < CUTOFF:
update_kmer_counts(read)
save(read)
  else:
       # discard read

            Note, single pass; fixed memory.
Digital normalization is efficient &
effective
                       • Single pass algorithm
                       • Fixed memory;
 Algorithmic nerdvana! • Cheaper than assembly;
                       • Reduces assembly time;
                       • Scales assembly memory.




                                 Brown et al., in review, PLoS On
Digital normalization removes errors
Shotgun data is often (1) high
coverage and (2) biased in coverage.
…here we discard > 95% of data!
Other key points
 Virtually identical contigassembly; scaffolding works
  but is not yet cookie-cutter.

 Digital normalization changes the way de Bruijn graph
  assembly scales from the size of your data set to
  the size of the source sample.

 Alwayslower memory than assembly: we never
  collect most erroneous k-mers.

 Digital normalization can be done once– and then
  assembly parameter exploration can be done.
Quotable quotes.
Comment: “This looks like a great solution for
  people who can’t afford real computers”.

                     OK, but:

   “Buying ever bigger computers is a great
   solution for people who don’t want to think
                      hard.”

To be less snide: both kinds of scaling are needed,
                      of course.
Why use diginorm?
 Use the cloud to assemble any microbial
 genomes incl. single-cell, many eukaryotic
 genomes, most mRNAseq, and many
 metagenomes.

 Seems to provide leverage on addressing many
 biological or sample prep problems (single-cell &
 genome amplification MDA; metagenome;
 heterozygosity).

 And, well, the general idea of locus specific
 graph analysis solves lots of things…
Some interim concluding
thoughts
 Digital normalization-like approaches provide a
 path to solving the majority of assembly scaling
 problems, and will enable assembly on current
 cloud computing hardware.
   This is not true for highly diverse metagenome
    environments…
   For soil, we estimate that we need 50 Tbp / gram
    soil. Sigh.

 Biologists and bioinformaticianshate:
   Throwing away data
   Caveats in bioinformatics papers (which reviewers
   like, note)
Streaming error correction.




We can do error trimming of genomic, MDA, transcriptomic,
     metagenomic data in < 2 passes, fixed memory.
   We have just submitted a proposal to adapt Euler or
   Quake-like error correction (e.g. spectral alignment
               problem) to this framework.
Side note: error correction is the
biggest “data” problem left in
sequencing.




        Both for mapping & assembly.
Replication fu
 In December 2011, I met Wes McKinney on a
 train and he convinced me that I should look at
 IPython Notebook.

 This is an interactive Web notebook for data
 analysis…

 Hey, neat! We can use this for replication!
   All of our figures can be regenerated from scratch,
    on an EC2 instance, using a Makefile (data
    pipeline) and IPython Notebook (figure generation).
   Everything is version controlled.
   Honestly not much work, and will be less the next
    time.
CT Brown - Doing next-gen sequencing analysis in the cloud
So… how‟d that go?
 People who already cared thought it was nifty.
       http://ivory.idyll.org/blog/replication-i.html
 Almost nobody else cares ;(
   Presub enquiry to editor: “Be sure that your paper can
    be reproduced.” Uh, please read my letter to the end?
   “Could you improve your Makefile? I want to
    reimplementdiginorm in another language and reuse
    your pipeline, but your Makefile is a mess.”
 Incredibly useful, nonetheless. Already part of
  undergraduate and graduate training in my lab;
  helping us and others with next parpes; etc. etc. etc.

   Life is way too short to waste on unnecessarily
   replicating your own workflows, much less other
                       people’s.
Acknowledgements
Lab members involved        Collaborators
   Adina Howe (w/Tiedje)    Jim Tiedje, MSU
   Jason Pell
   ArendHintze              Billie Swalla, UW
   RosangelaCanino-         Janet Jansson, LBNL
    Koning
   Qingpeng Zhang           Susannah Tringe, JGI
   Elijah Lowe
   LikitPreeyanon          Funding
   JiarongGuo
   Tim Brom                USDA NIFA; NSF IOS;
   KanchanPavangadkar           BEACON.
   Eric McDonald
Advertisement!
 Qingpeng Zhang (QP) will talk about our very
 useful „khmer‟ software for efficiently counting k-
 mers.

 Want a simple Python lib for reading & indexing
 FASTA/FASTQ? Check out screed.

  “Better science through superior software.”
Advertisement


Panel on “Should we have voluntary review
       standards for bioinformatics?”

           Tomorrow, 4:30pm.
We are aggressivelyopen
Everything discussed here:
 Code: github.com/ged-lab/ ; BSD license
 Blog: http://ivory.idyll.org/blog („titus brown blog‟)
 Twitter: @ctitusbrown
 Grants on Lab Web site:
  http://ged.msu.edu/interests.html
  (What‟s a good license??)
 Preprints: on arXiv, q-bio:
  „kmer-percolation arxiv‟
  „diginormarxiv‟

Weitere ähnliche Inhalte

Andere mochten auch

The human genome project
The human genome projectThe human genome project
The human genome project14pascba
 
Human genome project
Human genome projectHuman genome project
Human genome projectShital Pal
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencingDayananda Salam
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshopc.titus.brown
 
Gene therapy
Gene therapyGene therapy
Gene therapydamarisb
 

Andere mochten auch (7)

The human genome project
The human genome projectThe human genome project
The human genome project
 
Human genome project
Human genome projectHuman genome project
Human genome project
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
 
Basic Steps of the NGS Method
Basic Steps of the NGS MethodBasic Steps of the NGS Method
Basic Steps of the NGS Method
 
Gene therapy
Gene therapy Gene therapy
Gene therapy
 
Gene therapy
Gene therapyGene therapy
Gene therapy
 

Ähnlich wie CT Brown - Doing next-gen sequencing analysis in the cloud

2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4c.titus.brown
 
Scaling metagenome assembly
Scaling metagenome assemblyScaling metagenome assembly
Scaling metagenome assemblyc.titus.brown
 
ASM 2013 Metagenomic Assembly Workshop Slides
ASM 2013 Metagenomic Assembly Workshop SlidesASM 2013 Metagenomic Assembly Workshop Slides
ASM 2013 Metagenomic Assembly Workshop SlidesAdina Chuang Howe
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Clouds: All fluff and no substance?
Clouds: All fluff and no substance?Clouds: All fluff and no substance?
Clouds: All fluff and no substance?Guy Coates
 
Chalmers microprocessor sept 2010
Chalmers microprocessor sept 2010Chalmers microprocessor sept 2010
Chalmers microprocessor sept 2010parallellabs
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 
Probabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphsProbabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphsc.titus.brown
 
Data-Intensive Text Processing with MapReduce
Data-Intensive Text Processing with MapReduceData-Intensive Text Processing with MapReduce
Data-Intensive Text Processing with MapReduceGeorge Ang
 
Clouds, Grids and Data
Clouds, Grids and DataClouds, Grids and Data
Clouds, Grids and DataGuy Coates
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizonac.titus.brown
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
Art Of Distributed P0
Art Of Distributed P0Art Of Distributed P0
Art Of Distributed P0George Ang
 

Ähnlich wie CT Brown - Doing next-gen sequencing analysis in the cloud (20)

2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
Scaling metagenome assembly
Scaling metagenome assemblyScaling metagenome assembly
Scaling metagenome assembly
 
PyData Paris 2015 - Closing keynote Francesc Alted
PyData Paris 2015 - Closing keynote Francesc AltedPyData Paris 2015 - Closing keynote Francesc Alted
PyData Paris 2015 - Closing keynote Francesc Alted
 
ASM 2013 Metagenomic Assembly Workshop Slides
ASM 2013 Metagenomic Assembly Workshop SlidesASM 2013 Metagenomic Assembly Workshop Slides
ASM 2013 Metagenomic Assembly Workshop Slides
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Clouds: All fluff and no substance?
Clouds: All fluff and no substance?Clouds: All fluff and no substance?
Clouds: All fluff and no substance?
 
Chalmers microprocessor sept 2010
Chalmers microprocessor sept 2010Chalmers microprocessor sept 2010
Chalmers microprocessor sept 2010
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Probabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphsProbabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphs
 
Data-Intensive Text Processing with MapReduce
Data-Intensive Text Processing with MapReduceData-Intensive Text Processing with MapReduce
Data-Intensive Text Processing with MapReduce
 
Clouds, Grids and Data
Clouds, Grids and DataClouds, Grids and Data
Clouds, Grids and Data
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
Art Of Distributed P0
Art Of Distributed P0Art Of Distributed P0
Art Of Distributed P0
 

Mehr von Jan Aerts

VIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationVIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationJan Aerts
 
Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Jan Aerts
 
Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Jan Aerts
 
Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Jan Aerts
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Jan Aerts
 
Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data AnalysisJan Aerts
 
Intro to data visualization
Intro to data visualizationIntro to data visualization
Intro to data visualizationJan Aerts
 
L Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsL Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsJan Aerts
 
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...Jan Aerts
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloudJan Aerts
 
B Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumB Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumJan Aerts
 
J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJan Aerts
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloudJan Aerts
 
B Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisB Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisJan Aerts
 
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...Jan Aerts
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...Jan Aerts
 
S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...Jan Aerts
 
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...Jan Aerts
 
A Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsA Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsJan Aerts
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesJan Aerts
 

Mehr von Jan Aerts (20)

VIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationVIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic Variation
 
Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?
 
Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?
 
Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)
 
Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data Analysis
 
Intro to data visualization
Intro to data visualizationIntro to data visualization
Intro to data visualization
 
L Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsL Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformatics
 
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
B Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumB Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing Consortium
 
J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis Framework
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
B Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisB Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysis
 
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
 
S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...
 
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
 
A Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsA Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining components
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutes
 

Kürzlich hochgeladen

Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 

Kürzlich hochgeladen (20)

Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 

CT Brown - Doing next-gen sequencing analysis in the cloud

  • 1. Doing next-gen sequencing analysis in the cloud. C. Titus Brown ctb@msu.edu
  • 2. Acknowledgements Lab members involved Collaborators  Adina Howe (w/Tiedje)  Jim Tiedje, MSU  Jason Pell  ArendHintze  Billie Swalla, UW  RosangelaCanino-  Janet Jansson, LBNL Koning  Qingpeng Zhang  Susannah Tringe, JGI  Elijah Lowe  LikitPreeyanon Funding  JiarongGuo  Tim Brom USDA NIFA; NSF IOS;  KanchanPavangadkar BEACON.  Eric McDonald
  • 4. “Be the change you want to see” We are aggressivelyopen… Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog („titus brown blog‟)  Twitter: @ctitusbrown  Grants on Lab Web site: http://ged.msu.edu/interests.html (What‟s a good license??)  Preprints: on arXiv, q-bio: „kmer-percolation arxiv‟ „diginormarxiv‟
  • 5. The data catastrophe!  Data set sizes growing faster than compute capacity (esp RAM).  Many biological algorithms don‟t scale all that well, anyway.  Algorithmically, we want:  Single-pass.  Compression approaches (lossy or otherwise).  Low-memory data structures  I, personally, think the last thing in the world we need is another standalone package: pre-filtering approaches. “Run our nifty approaches first, then feed into the
  • 6. Digital normalization Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
  • 7. Downsample based on de Bruijn graph structure (which can be derived online)
  • 8. Digital normalization algorithm for read in dataset: if median_kmer_count(read) < CUTOFF: update_kmer_counts(read) save(read) else: # discard read Note, single pass; fixed memory.
  • 9. Digital normalization is efficient & effective • Single pass algorithm • Fixed memory; Algorithmic nerdvana! • Cheaper than assembly; • Reduces assembly time; • Scales assembly memory. Brown et al., in review, PLoS On
  • 11. Shotgun data is often (1) high coverage and (2) biased in coverage.
  • 12. …here we discard > 95% of data!
  • 13. Other key points  Virtually identical contigassembly; scaffolding works but is not yet cookie-cutter.  Digital normalization changes the way de Bruijn graph assembly scales from the size of your data set to the size of the source sample.  Alwayslower memory than assembly: we never collect most erroneous k-mers.  Digital normalization can be done once– and then assembly parameter exploration can be done.
  • 14. Quotable quotes. Comment: “This looks like a great solution for people who can’t afford real computers”. OK, but: “Buying ever bigger computers is a great solution for people who don’t want to think hard.” To be less snide: both kinds of scaling are needed, of course.
  • 15. Why use diginorm?  Use the cloud to assemble any microbial genomes incl. single-cell, many eukaryotic genomes, most mRNAseq, and many metagenomes.  Seems to provide leverage on addressing many biological or sample prep problems (single-cell & genome amplification MDA; metagenome; heterozygosity).  And, well, the general idea of locus specific graph analysis solves lots of things…
  • 16. Some interim concluding thoughts  Digital normalization-like approaches provide a path to solving the majority of assembly scaling problems, and will enable assembly on current cloud computing hardware.  This is not true for highly diverse metagenome environments…  For soil, we estimate that we need 50 Tbp / gram soil. Sigh.  Biologists and bioinformaticianshate:  Throwing away data  Caveats in bioinformatics papers (which reviewers like, note)
  • 17. Streaming error correction. We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2 passes, fixed memory. We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment problem) to this framework.
  • 18. Side note: error correction is the biggest “data” problem left in sequencing. Both for mapping & assembly.
  • 19. Replication fu  In December 2011, I met Wes McKinney on a train and he convinced me that I should look at IPython Notebook.  This is an interactive Web notebook for data analysis…  Hey, neat! We can use this for replication!  All of our figures can be regenerated from scratch, on an EC2 instance, using a Makefile (data pipeline) and IPython Notebook (figure generation).  Everything is version controlled.  Honestly not much work, and will be less the next time.
  • 21. So… how‟d that go?  People who already cared thought it was nifty. http://ivory.idyll.org/blog/replication-i.html  Almost nobody else cares ;(  Presub enquiry to editor: “Be sure that your paper can be reproduced.” Uh, please read my letter to the end?  “Could you improve your Makefile? I want to reimplementdiginorm in another language and reuse your pipeline, but your Makefile is a mess.”  Incredibly useful, nonetheless. Already part of undergraduate and graduate training in my lab; helping us and others with next parpes; etc. etc. etc. Life is way too short to waste on unnecessarily replicating your own workflows, much less other people’s.
  • 22. Acknowledgements Lab members involved Collaborators  Adina Howe (w/Tiedje)  Jim Tiedje, MSU  Jason Pell  ArendHintze  Billie Swalla, UW  RosangelaCanino-  Janet Jansson, LBNL Koning  Qingpeng Zhang  Susannah Tringe, JGI  Elijah Lowe  LikitPreeyanon Funding  JiarongGuo  Tim Brom USDA NIFA; NSF IOS;  KanchanPavangadkar BEACON.  Eric McDonald
  • 23. Advertisement!  Qingpeng Zhang (QP) will talk about our very useful „khmer‟ software for efficiently counting k- mers.  Want a simple Python lib for reading & indexing FASTA/FASTQ? Check out screed. “Better science through superior software.”
  • 24. Advertisement Panel on “Should we have voluntary review standards for bioinformatics?” Tomorrow, 4:30pm.
  • 25. We are aggressivelyopen Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog („titus brown blog‟)  Twitter: @ctitusbrown  Grants on Lab Web site: http://ged.msu.edu/interests.html (What‟s a good license??)  Preprints: on arXiv, q-bio: „kmer-percolation arxiv‟ „diginormarxiv‟