2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Data, and Biology"
1. Like the Dog that Caught the Bus:
Sequencing, Big Data, and Biology
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Jan 2014
ctb@msu.edu
2. 20 years in…
Started working in Dr. Koonin's group in 1993;
First publication was submitted almost exactly 20 years ago!
3. Like the Dog that Caught the Bus:
Sequencing, Big Data, and Biology
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Jan 2014
ctb@msu.edu
4. Analogy: we seek an understanding
of humanity via our libraries.
http://eofdreams.com/library.html;
5. But, our only observation tool is
shredding a mixture of all of the
books & digitizing the shreds.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
6. Points:
Lots of fragments needed! (Deep sampling.)
Having read and understood some books will help
quite a bit (Prior knowledge.)
Rare books will be harder to reconstruct than
common books.
Errors in OCR process matter quite a bit.
The more, different specialized libraries you
sample, the more likely you are to discover valid
correlations between topics and books.
A categorization system would be an invaluable but
not infallible guide to book topics.
Understanding the language would help you validate
& understand the books.
7. Biological analog: shotgun
metagenomics
Collect samples;
Extract DNA;
Feed into sequencer;
Computationally analyze.
“Sequence it all and let the
bioinformaticians sort it
out”
Wikipedia: Environmental shotgun
sequencing.png
8. Investigating soil microbial
communities
95% or more of soil microbes cannot be cultured
in lab.
Very little transport in soil and sediment =>
slow mixing rates.
Estimates of immense diversity:
Billions of microbial cells per gram of soil.
Million+ microbial species per gram of soil (Gans et
al, 2005)
One observed lower bound for genomic sequence
complexity => 26 Gbp (Amazon Rain Forest
Microbial Observatory)
9. “By 'soil' we understand (Vil'yams, 1931) a loose
surface layer of earth capable of yielding plant
crops. In the physical sense the soil represents a
complex disperse system consisting of three
phases: solid, liquid, and gaseous.”
Microbes live in & on:
• Surfaces of
aggregate particles;
• Pores within
microaggregates;
N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
10. Questions to address
Role of soil microbes in nutrient cycling:
How does agricultural soil differ from native soil?
How do soil microbial communities respond to
climate perturbation?
Genome-level questions:
What kind of strain-level heterogeneity is present in
the population?
What are the phage and viral populations &
dynamics?
What species are where, and how much is shared
between different geographical locations?
11. Must use culture independent and
metagenomic approaches
Many reasons why you can't or don't want to
culture:
Syntrophic relationships
Niche-specificity or unknown physiology
Dormant microbes
Abundance within communities
If you want to get at underlying function, 16S
analysis alone is not sufficient.
Single-cell sequencing & shotgun metagenomics
are two common ways to investigate complex
microbial communities.
12. Shotgun metagenomics
Collect samples;
Extract DNA;
Feed into sequencer;
Computationally analyze.
“Sequence it all and let the
bioinformaticians sort it
out”
Wikipedia: Environmental shotgun
sequencing.png
13. Computational reconstruction of
(meta)genomic content.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
14. Points:
Lots of fragments needed! (Deep sampling.)
Having read and understood some books will help
quite a bit (Reference genomes.)
Rare books will be harder to reconstruct than
common books.
Errors in OCR process matter quite a bit.
(Sequencing error)
The more, different specialized libraries you sample,
the more likely you are to discover valid correlations
between topics and books. (We don’t understand
most microbial function.)
A categorization system would be an invaluable but
not infallible guide to book topics. (Phylogeny can
guide interpretation.)
Understanding the language would help you validate
& understand the books.
17. Why do we need so much data?!
20-40x coverage is necessary; 100x is ~sufficient.
Mixed population sampling => sensitivity driven by
lowest abundance.
For example, for E. coli at a 1/1000 dilution, you would
need approximately 100x coverage of a 5 Mbp genome,
or 500 Gbp of sequence!
(For soil, estimate is 50 Tbp)
Sequencing is straightforward; data analysis is not.
“$1000 genome with $1m analysis”
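The back-of-the-envelope arithmetic behind that 500 Gbp figure can be checked directly; a quick sketch (not from the original talk) using the slide's numbers:

```python
# Back-of-the-envelope check of the slide's numbers: sequencing needed to
# reach 100x coverage of E. coli present at a 1/1000 dilution.
genome_size = 5_000_000   # ~5 Mbp E. coli genome
target_coverage = 100     # coverage needed for robust assembly
abundance = 1 / 1000      # fraction of the sample that is E. coli

# Only 1 in 1000 sequenced bases comes from E. coli, so the total must be
# scaled up by the inverse of its abundance.
total_bases = genome_size * target_coverage / abundance
print(f"{total_bases / 1e9:.0f} Gbp")  # 500 Gbp
```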
18. Great Prairie Grand Challenge goals
How much of the source metagenome can we
reconstruct from ~300-600 Gbp+ of shotgun
sequencing? (Largest data sets thus far.)
What can we learn about soil from looking at the
reconstructed metagenome? (See list of
questions)
19. Great Prairie Grand Challenge goals
How much of the source metagenome can we
reconstruct from ~300-600 Gbp+ of shotgun
sequencing? (Largest data sets thus far.)
What can we learn about soil from looking at the
reconstructed metagenome? (See list of
questions)
(For complex ecological and evolutionary
systems, we're just starting to get past the first
question. More on that later.)
20. So, we want to go from raw data:
@SRR606249.17/1                  (read name)
GAGTATGTTCTCATAGAGGTTGGTANNNNT   (sequence)
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1   (quality scores)
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
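Each FASTQ record on the slide is four lines: name, sequence, a '+' separator, and per-base quality scores. A minimal parser sketch (illustrative only, not part of any named tool):

```python
# Minimal FASTQ parser: each record is four lines
# (name, sequence, '+' separator, per-base quality string).
def parse_fastq(lines):
    it = iter(lines)
    for name in it:
        seq = next(it)
        next(it)  # '+' separator line, ignored
        qual = next(it)
        yield name.strip().lstrip('@'), seq.strip(), qual.strip()

records = list(parse_fastq([
    "@SRR606249.17/1",
    "GAGTATGTTCTCATAGAGGTTGGTANNNNT",
    "+",
    "B@BDDFFFHHHHHJIJJJJGHIJHJ####1",
]))
name, seq, qual = records[0]
# Phred+33 encoding: quality = ord(char) - 33; '#' encodes quality 2,
# which is why the N-heavy read ends carry runs of '#'.
print(name, len(seq), ord(qual[0]) - 33)
```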
22. De Bruijn graphs – assemble on
overlaps
J.R. Miller et al. / Genomics (2010)
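The "assemble on overlaps" idea can be sketched in a few lines: nodes are k-mers, and an edge joins two k-mers that overlap by k-1 bases; assembly then walks unambiguous paths. A toy construction (real assemblers use far more compact representations):

```python
# Toy De Bruijn graph: nodes are k-mers, and an edge joins two k-mers
# that overlap by k-1 bases. Assembly walks unambiguous paths here.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

# Two overlapping reads produce one unambiguous path:
# ATGG -> TGGC -> GGCG -> GCGT -> CGTA
g = de_bruijn(["ATGGCG", "GGCGTA"], k=4)
print(dict(g))
```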
23. Two problems: (1) variation/error
Single nucleotide variations cause long branches;
They don't rejoin quickly.
24. Two problems: (2) No graph
locality.
Assembly is inherently an all-by-all process. There
is no good way to subdivide the reads without
potentially missing a key connection.
28. The Problem
We can cheaply gather DNA data in quantities
sufficient to swamp straightforward assembly
algorithms running on commodity hardware.
No locality to the data in terms of graph structure.
Since ~2008:
The field has engaged in lots of engineering
optimization…
…but the data generation rate has consistently
outstripped Moore's Law.
30. 1. Data partitioning
(a computational version of cell sorting)
Split reads into "bins" belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
"Divide and conquer"
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
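A toy version of the partitioning idea: reads that share any k-mer land in the same bin, found as connected components via union-find. The published method (Pell et al., 2012) does this memory-efficiently with a probabilistic graph representation; this exact-dictionary sketch only illustrates the concept:

```python
# Toy read partitioning: reads sharing any k-mer end up in the same bin.
def partition_reads(reads, k):
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])  # union
            else:
                first_seen[kmer] = idx
    return [find(i) for i in range(len(reads))]

# Reads 0 and 1 share k-mers; read 2 comes from a different "species".
print(partition_reads(["ATGGCGT", "GGCGTAA", "TTTTTTT"], k=4))  # [0, 0, 2]
```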
31. Our two solutions.
1. Subdivide data (~20x scaling; 2 years to develop;
100x data increase)
2. Discard redundant data.
32. 2. Approach: Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!
Diversity vs richness.
The high-coverage reads in sample A are unnecessary for assembly.
33. Shotgun sequencing and
coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the
top through all of the reads.
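The definition above can be computed directly; a toy sketch (not from the talk) with reads modeled as (start, length) intervals along a small genome:

```python
# Coverage at a position = number of reads overlapping it. Reads are
# modeled as (start, length) intervals along a toy genome.
def per_base_coverage(genome_len, reads):
    cov = [0] * genome_len
    for start, length in reads:
        for pos in range(start, min(start + length, genome_len)):
            cov[pos] += 1
    return cov

cov = per_base_coverage(10, [(0, 5), (2, 5), (4, 5), (5, 5)])
# Average coverage also equals total sequenced bases / genome length.
print(cov, sum(cov) / len(cov))
```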
34. Most shotgun data is redundant.
You only need 5-10 reads at a locus to
assemble or call (diploid) SNPs… but
because sampling is random, and you
need 5-10 reads at every locus, you end
up sampling most loci far more deeply than necessary.
41. Coverage estimation
If you can estimate the coverage of a read in a
data set without a reference, this is
straightforward:
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        save(read)
(Trick: the read coverage estimator needs to be error-tolerant.)
42. The median k-mer count in a read is a good
approximate estimator of coverage.
This gives us a reference-free measure of coverage.
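Putting slides 41 and 42 together, digital normalization can be sketched in a few lines. khmer uses a probabilistic counting structure rather than an exact dictionary, but the logic is the same, and taking the median makes the estimate robust to a few erroneous k-mers per read:

```python
# Sketch of digital normalization: estimate a read's coverage as the
# median count (so far) of its k-mers, and keep the read only if that
# estimate is below the cutoff.
from statistics import median

def diginorm(reads, k=20, cutoff=20):
    counts, kept = {}, []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts.get(km, 0) for km in kmers) < cutoff:
            kept.append(read)
            for km in kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept

# Five identical reads: the first two pass, the redundant rest are dropped.
print(len(diginorm(["ATGGCGTA"] * 5, k=4, cutoff=2)))  # 2
```

Note the single streaming pass: each read is examined once, and reads that are discarded never contribute their (mostly erroneous, low-count) k-mers to the table.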
43. Diginorm builds a De Bruijn graph & then
downsamples based on observed coverage.
Corresponds exactly to
underlying abstraction used for
assembly; retains graph
structure.
44. Digital normalization approach
Is streaming and single pass: looks at each read
only once;
Does not “collect” the majority of errors;
Keeps all low-coverage reads;
Smooths out coverage of regions.
…raw data can be retained for later abundance
estimation.
45. Contig assembly now scales with richness
(information), not diversity (data).
Most samples can be assembled in < 50 GB of
memory.
46. Diginorm is widely useful:
1. Assembly of the H. contortus parasitic nematode
genome, a “high polymorphism/variable coverage”
problem.
(Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P.
marinus) transcriptome, a “big assembly” problem.
(in prep)
3. Osedax symbiont metagenome, a “contaminated
metagenome” problem (Goffredi et al, 2013)
47. Diginorm is “lossy compression”
Nearly perfect from an information theoretic
perspective:
Discards 95% or more of data for genomes.
Loses < 0.02% of information.
48. Prospective: sequencing tumor cells
Goal: phylogenetically reconstruct causal “driver
mutations” in the face of passenger mutations.
1000 cells x 3 Gbp x 20 coverage: 60 Tbp of
sequence.
Most of this data will be redundant and not useful.
Developing diginorm-based algorithms to
eliminate data while retaining variant information.
49. Where are we taking this?
Streaming online algorithms only look at data
~once.
Diginorm is streaming, online…
Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational
efficiency.
51. What about the assembly results for Iowa
corn and prairie?

Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill
3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill

Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Adina Howe
52. Resulting contigs are low coverage.
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
53. So, for soil:
We really do need more data;
But at least now we can assemble what we
already have.
Estimate required sequencing depth at 50 Tbp;
Now also have 2-8 Tbp from Amazon Rain Forest
Microbial Observatory.
…still not saturated coverage, but getting closer.
But, diginorm approach turns out to be widely
useful.
54. Biogeography: Iowa sample overlap?
Corn and prairie De Bruijn graphs have 51% overlap.
Suggests that at greater depth, samples may have similar genomic content.
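The overlap measurement can be sketched as the shared fraction of k-mers between the two samples' graphs (a toy illustration with stand-in reads; the real corn and prairie data sets are vastly larger):

```python
# Overlap between two samples, measured as the fraction of one sample's
# k-mers that also appear in the other (the slide reports ~51% for the
# corn and prairie De Bruijn graphs; these reads are toy stand-ins).
def kmer_set(reads, k):
    return {read[i:i + k] for read in reads
            for i in range(len(read) - k + 1)}

corn = kmer_set(["ATGGCGTA", "GGCGTACC"], k=4)
prairie = kmer_set(["ATGGCGTA", "TTTTACGG"], k=4)
overlap = len(corn & prairie) / len(corn)
print(f"{overlap:.0%}")
```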
55. Concluding thoughts
Empirically effective tools, in reasonably wide
use.
Diginorm provides streaming, online algorithmic
basis for
Coverage downsampling/lossy compression
Error identification (sublinear)
Error correction
Variant calling?
Enables analyses that would otherwise be hard or
impossible.
Most assembly doable in cloud or on commodity
hardware;
56. The real challenge:
understanding
We have gotten distracted by shiny toys:
sequencing!! Data!!
Data is now plentiful! But:
We typically have no knowledge of what > 50% of
an environmental metagenome “means”,
functionally.
Most data is not openly available, so we cannot
mine correlations across data sets.
Most computational science is not reproducible,
so I can't reuse other people's tools or
approaches.
57. Data intensive biology & hypothesis
generation
My interest in biological data is to enable better
hypothesis generation.
58. My interests
Open source ecosystem of analysis tools.
Loosely coupled APIs for querying databases.
Publishing reproducible and reusable analyses,
openly.
Education and training.
“Platform perspective”
59. Practical implications of diginorm
Data is (essentially) free;
For some problems, analysis is now cheaper
than data gathering (i.e. essentially free);
…plus, we can run most of our approaches in
the cloud.
60. khmer-protocols
Effort to provide standard “cheap”
assembly protocols for the cloud.
Entirely copy/paste; ~2-6 days from
raw reads to assembly, annotations, and
differential expression analysis.
~$150 on Amazon per data set.
Open, versioned, forkable, citable.
Pipeline: Read cleaning => Diginorm => Assembly => Annotation => RSEM differential expression
63. My interests
Open source ecosystem of analysis tools.
Loosely coupled APIs for querying databases.
Publishing reproducible and reusable
analyses, openly.
Education and training.
“Platform perspective”
64. We practice open science!
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ('titus brown blog')
Twitter: @ctitusbrown
Grants on Lab Web site:
http://ged.msu.edu/research.html
Preprints: on arXiv, q-bio:
'diginorm arxiv'
65. Acknowledgements
Lab members involved
Adina Howe (w/Tiedje)
Jason Pell
Arend Hintze
Qingpeng Zhang
Elijah Lowe
Likit Preeyanon
Jiarong Guo
Tim Brom
Kanchan Pavangadkar
Eric McDonald
Camille Scott
Jordan Fish
Michael Crusoe
Leigh Sheneman
Collaborators
Jim Tiedje, MSU
Susannah Tringe and Janet
Jansson (JGI, LBNL)
Erich Schwarz, Caltech /
Cornell
Paul Sternberg, Caltech
Robin Gasser, U. Melbourne
Weiming Li, MSU
Shana Goffredi, Occidental
Funding
USDA NIFA; NSF IOS;
NIH; BEACON.
Editor's notes
@@ change slide up => more complex diversity
Taking advantage of structure within read
Note that any such measure will do.
Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.