2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Data, and Biology"
1. Like the Dog that Caught the Bus:
Sequencing, Big Data, and Biology
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Jan 2014
ctb@msu.edu
2. 20 years in…
Started working in Dr. Koonin's group in 1993;
First publication was submitted almost exactly 20 years ago!
3. Like the Dog that Caught the Bus:
Sequencing, Big Data, and Biology
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Jan 2014
ctb@msu.edu
4. Analogy: we seek an understanding
of humanity via our libraries.
http://eofdreams.com/library.html;
5. But, our only observation tool is
shredding a mixture of all of the
books & digitizing the shreds.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
6. Points:
Lots of fragments needed! (Deep sampling.)
Having read and understood some books will help
quite a bit (Prior knowledge.)
Rare books will be harder to reconstruct than
common books.
Errors in OCR process matter quite a bit.
The more, different specialized libraries you
sample, the more likely you are to discover valid
correlations between topics and books.
A categorization system would be an invaluable but
not infallible guide to book topics.
Understanding the language would help you validate
& understand the books.
7. Biological analog: shotgun
metagenomics
Collect samples;
Extract DNA;
Feed into sequencer;
Computationally analyze.
“Sequence it all and let the
bioinformaticians sort it
out”
Wikipedia: Environmental shotgun
sequencing.png
8. Investigating soil microbial
communities
95% or more of soil microbes cannot be cultured
in lab.
Very little transport in soil and sediment =>
slow mixing rates.
Estimates of immense diversity:
Billions of microbial cells per gram of soil.
Million+ microbial species per gram of soil (Gans et
al, 2005)
One observed lower bound for genomic sequence
complexity => 26 Gbp (Amazon Rain Forest
Microbial Observatory)
9. “By 'soil' we understand (Vil'yams, 1931) a loose
surface layer of earth capable of yielding plant
crops. In the physical sense the soil represents a
complex disperse system consisting of three
phases: solid, liquid, and gaseous.”
Microbes live in & on:
• Surfaces of
aggregate particles;
• Pores within
microaggregates;
N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
10. Questions to address
Role of soil microbes in nutrient cycling:
How does agricultural soil differ from native soil?
How do soil microbial communities respond to
climate perturbation?
Genome-level questions:
What kind of strain-level heterogeneity is present in
the population?
What are the phage and viral populations &
dynamics?
What species are where, and how much is shared
between different geographical locations?
11. Must use culture independent and
metagenomic approaches
Many reasons why you can't or don't want to
culture:
Syntrophic relationships
Niche-specificity or unknown physiology
Dormant microbes
Abundance within communities
If you want to get at underlying function, 16S
analysis alone is not sufficient.
Single-cell sequencing & shotgun metagenomics
are two common ways to investigate complex
microbial communities.
12. Shotgun metagenomics
Collect samples;
Extract DNA;
Feed into sequencer;
Computationally analyze.
“Sequence it all and let the
bioinformaticians sort it
out”
Wikipedia: Environmental shotgun
sequencing.png
13. Computational reconstruction of
(meta)genomic content.
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
14. Points:
Lots of fragments needed! (Deep sampling.)
Having read and understood some books will help
quite a bit (Reference genomes.)
Rare books will be harder to reconstruct than
common books.
Errors in OCR process matter quite a bit.
(Sequencing error)
The more, different specialized libraries you sample,
the more likely you are to discover valid correlations
between topics and books. (We don’t understand
most microbial function.)
A categorization system would be an invaluable but
not infallible guide to book topics. (Phylogeny can
guide interpretation.)
Understanding the language would help you validate
& understand the books.
17. Why do we need so much data?!
20-40x coverage is necessary; 100x is ~sufficient.
Mixed population sampling => sensitivity driven by
lowest abundance.
For example, for E. coli at a 1/1000 dilution, you would
need approximately 100x coverage of a 5 Mbp genome,
or 500 Gbp of sequence!
(For soil, estimate is 50 Tbp)
Sequencing is straightforward; data analysis is not.
“$1000 genome with $1m analysis”
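The back-of-the-envelope arithmetic behind that 500 Gbp figure can be checked directly; a quick sketch (not from the original talk) using the slide's numbers:

```python
# Back-of-the-envelope check of the slide's numbers: sequencing needed to
# reach 100x coverage of E. coli present at a 1/1000 dilution.
genome_size = 5_000_000   # ~5 Mbp E. coli genome
target_coverage = 100     # coverage needed for robust assembly
abundance = 1 / 1000      # fraction of the sample that is E. coli

# Only 1 in 1000 sequenced bases comes from E. coli, so the total must be
# scaled up by the inverse of its abundance.
total_bases = genome_size * target_coverage / abundance
print(f"{total_bases / 1e9:.0f} Gbp")  # 500 Gbp
```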
18. Great Prairie Grand Challenge goals
How much of the source metagenome can we
reconstruct from ~300-600 Gbp+ of shotgun
sequencing? (Largest data sets thus far.)
What can we learn about soil from looking at the
reconstructed metagenome? (See list of
questions)
19. Great Prairie Grand Challenge goals
How much of the source metagenome can we
reconstruct from ~300-600 Gbp+ of shotgun
sequencing? (Largest data sets thus far.)
What can we learn about soil from looking at the
reconstructed metagenome? (See list of
questions)
(For complex ecological and evolutionary
systems, we're just starting to get past the first
question. More on that later.)
20. So, we want to go from raw data:
@SRR606249.17/1                  (read name)
GAGTATGTTCTCATAGAGGTTGGTANNNNT   (sequence)
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1   (quality scores)
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
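Each FASTQ record on the slide is four lines: name, sequence, a '+' separator, and per-base quality scores. A minimal parser sketch (illustrative only, not part of any named tool):

```python
# Minimal FASTQ parser: each record is four lines
# (name, sequence, '+' separator, per-base quality string).
def parse_fastq(lines):
    it = iter(lines)
    for name in it:
        seq = next(it)
        next(it)  # '+' separator line, ignored
        qual = next(it)
        yield name.strip().lstrip('@'), seq.strip(), qual.strip()

records = list(parse_fastq([
    "@SRR606249.17/1",
    "GAGTATGTTCTCATAGAGGTTGGTANNNNT",
    "+",
    "B@BDDFFFHHHHHJIJJJJGHIJHJ####1",
]))
name, seq, qual = records[0]
# Phred+33 encoding: quality = ord(char) - 33; '#' encodes quality 2,
# which is why the N-heavy read ends carry runs of '#'.
print(name, len(seq), ord(qual[0]) - 33)
```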
22. De Bruijn graphs – assemble on
overlaps
J.R. Miller et al. / Genomics (2010)
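The "assemble on overlaps" idea can be sketched in a few lines: nodes are k-mers, and an edge joins two k-mers that overlap by k-1 bases; assembly then walks unambiguous paths. A toy construction (real assemblers use far more compact representations):

```python
# Toy De Bruijn graph: nodes are k-mers, and an edge joins two k-mers
# that overlap by k-1 bases. Assembly walks unambiguous paths here.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

# Two overlapping reads produce one unambiguous path:
# ATGG -> TGGC -> GGCG -> GCGT -> CGTA
g = de_bruijn(["ATGGCG", "GGCGTA"], k=4)
print(dict(g))
```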
23. Two problems: (1) variation/error
Single nucleotide variations cause long branches;
They don't rejoin quickly.
24. Two problems: (2) No graph
locality.
Assembly is inherently an all-by-all process. There
is no good way to subdivide the reads without
potentially missing a key connection.
28. The Problem
We can cheaply gather DNA data in quantities
sufficient to swamp straightforward assembly
algorithms running on commodity hardware.
No locality to the data in terms of graph structure.
Since ~2008:
The field has engaged in lots of engineering
optimization…
…but the data generation rate has consistently
outstripped Moore's Law.
30. 1. Data partitioning
(a computational version of cell sorting)
Split reads into "bins" belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
"Divide and conquer"
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
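A toy version of the partitioning idea: reads that share any k-mer land in the same bin, found as connected components via union-find. The published method (Pell et al., 2012) does this memory-efficiently with a probabilistic graph representation; this exact-dictionary sketch only illustrates the concept:

```python
# Toy read partitioning: reads sharing any k-mer end up in the same bin.
def partition_reads(reads, k):
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])  # union
            else:
                first_seen[kmer] = idx
    return [find(i) for i in range(len(reads))]

# Reads 0 and 1 share k-mers; read 2 comes from a different "species".
print(partition_reads(["ATGGCGT", "GGCGTAA", "TTTTTTT"], k=4))  # [0, 0, 2]
```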
31. Our two solutions.
1. Subdivide data (~20x scaling; 2 years to develop;
100x data increase)
2. Discard redundant data.
32. 2. Approach: Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!
Diversity vs richness.
The high-coverage reads in sample A are unnecessary for assembly.
33. Shotgun sequencing and
coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the
top through all of the reads.
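The definition above can be computed directly; a toy sketch (not from the talk) with reads modeled as (start, length) intervals along a small genome:

```python
# Coverage at a position = number of reads overlapping it. Reads are
# modeled as (start, length) intervals along a toy genome.
def per_base_coverage(genome_len, reads):
    cov = [0] * genome_len
    for start, length in reads:
        for pos in range(start, min(start + length, genome_len)):
            cov[pos] += 1
    return cov

cov = per_base_coverage(10, [(0, 5), (2, 5), (4, 5), (5, 5)])
# Average coverage also equals total sequenced bases / genome length.
print(cov, sum(cov) / len(cov))
```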
34. Most shotgun data is redundant.
You only need 5-10 reads at a locus to
assemble or call (diploid) SNPs… but
because sampling is random, and you
need 5-10 reads at every locus, you end
up sampling most loci far more deeply than necessary.
41. Coverage estimation
If you can estimate the coverage of a read in a
data set without a reference, this is
straightforward:
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        save(read)
(Trick: the read coverage estimator needs to be error-tolerant.)
42. The median k-mer count in a read is a good
approximate estimator of coverage.
This gives us a reference-free measure of coverage.
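Putting slides 41 and 42 together, digital normalization can be sketched in a few lines. khmer uses a probabilistic counting structure rather than an exact dictionary, but the logic is the same, and taking the median makes the estimate robust to a few erroneous k-mers per read:

```python
# Sketch of digital normalization: estimate a read's coverage as the
# median count (so far) of its k-mers, and keep the read only if that
# estimate is below the cutoff.
from statistics import median

def diginorm(reads, k=20, cutoff=20):
    counts, kept = {}, []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts.get(km, 0) for km in kmers) < cutoff:
            kept.append(read)
            for km in kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept

# Five identical reads: the first two pass, the redundant rest are dropped.
print(len(diginorm(["ATGGCGTA"] * 5, k=4, cutoff=2)))  # 2
```

Note the single streaming pass: each read is examined once, and reads that are discarded never contribute their (mostly erroneous, low-count) k-mers to the table.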
43. Diginorm builds a De Bruijn graph & then
downsamples based on observed coverage.
Corresponds exactly to
underlying abstraction used for
assembly; retains graph
structure.
44. Digital normalization approach
Is streaming and single pass: looks at each read
only once;
Does not “collect” the majority of errors;
Keeps all low-coverage reads;
Smooths out coverage of regions.
…raw data can be retained for later abundance
estimation.
45. Contig assembly now scales with richness
(information), not diversity (data).
Most samples can be assembled in < 50 GB of
memory.
46. Diginorm is widely useful:
1. Assembly of the H. contortus parasitic nematode
genome, a “high polymorphism/variable coverage”
problem.
(Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P.
marinus) transcriptome, a “big assembly” problem.
(in prep)
3. Osedax symbiont metagenome, a “contaminated
metagenome” problem (Goffredi et al, 2013)
47. Diginorm is “lossy compression”
Nearly perfect from an information theoretic
perspective:
Discards 95% or more of data for genomes.
Loses < 0.02% of information.
48. Prospective: sequencing tumor cells
Goal: phylogenetically reconstruct causal “driver
mutations” in the face of passenger mutations.
1000 cells x 3 Gbp x 20 coverage: 60 Tbp of
sequence.
Most of this data will be redundant and not useful.
Developing diginorm-based algorithms to
eliminate data while retaining variant information.
49. Where are we taking this?
Streaming online algorithms only look at data
~once.
Diginorm is streaming, online…
Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational
efficiency.
51. What about the assembly results for Iowa
corn and prairie?

Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill
3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill

Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Adina Howe
52. Resulting contigs are low coverage.
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
53. So, for soil:
We really do need more data;
But at least now we can assemble what we
already have.
Estimate required sequencing depth at 50 Tbp;
Now also have 2-8 Tbp from Amazon Rain Forest
Microbial Observatory.
…still not saturated coverage, but getting closer.
But, diginorm approach turns out to be widely
useful.
54. Biogeography: Iowa sample overlap?
Corn and prairie De Bruijn graphs have 51% overlap.
Suggests that at greater depth, samples may have similar genomic content.
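The overlap measurement can be sketched as the shared fraction of k-mers between the two samples' graphs (a toy illustration with stand-in reads; the real corn and prairie data sets are vastly larger):

```python
# Overlap between two samples, measured as the fraction of one sample's
# k-mers that also appear in the other (the slide reports ~51% for the
# corn and prairie De Bruijn graphs; these reads are toy stand-ins).
def kmer_set(reads, k):
    return {read[i:i + k] for read in reads
            for i in range(len(read) - k + 1)}

corn = kmer_set(["ATGGCGTA", "GGCGTACC"], k=4)
prairie = kmer_set(["ATGGCGTA", "TTTTACGG"], k=4)
overlap = len(corn & prairie) / len(corn)
print(f"{overlap:.0%}")
```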
55. Concluding thoughts
Empirically effective tools, in reasonably wide
use.
Diginorm provides streaming, online algorithmic
basis for
Coverage downsampling/lossy compression
Error identification (sublinear)
Error correction
Variant calling?
Enables analyses that would otherwise be hard or
impossible.
Most assembly doable in cloud or on commodity
hardware;
56. The real challenge:
understanding
We have gotten distracted by shiny toys:
sequencing!! Data!!
Data is now plentiful! But:
We typically have no knowledge of what > 50% of
an environmental metagenome “means”,
functionally.
Most data is not openly available, so we cannot
mine correlations across data sets.
Most computational science is not reproducible,
so I can't reuse other people's tools or
approaches.
57. Data intensive biology & hypothesis
generation
My interest in biological data is to enable better
hypothesis generation.
58. My interests
Open source ecosystem of analysis tools.
Loosely coupled APIs for querying databases.
Publishing reproducible and reusable analyses,
openly.
Education and training.
“Platform perspective”
59. Practical implications of diginorm
Data is (essentially) free;
For some problems, analysis is now cheaper
than data gathering (i.e. essentially free);
…plus, we can run most of our approaches in
the cloud.
60. khmer-protocols
Effort to provide standard “cheap”
assembly protocols for the cloud.
Entirely copy/paste; ~2-6 days from
raw reads to assembly, annotations, and
differential expression analysis.
~$150 on Amazon per data set.
Open, versioned, forkable, citable.
Pipeline: Read cleaning => Diginorm => Assembly => Annotation => RSEM differential expression
63. My interests
Open source ecosystem of analysis tools.
Loosely coupled APIs for querying databases.
Publishing reproducible and reusable
analyses, openly.
Education and training.
“Platform perspective”
64. We practice open science!
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ('titus brown blog')
Twitter: @ctitusbrown
Grants on Lab Web site:
http://ged.msu.edu/research.html
Preprints: on arXiv, q-bio:
'diginorm arxiv'
65. Acknowledgements
Lab members involved
Adina Howe (w/Tiedje)
Jason Pell
Arend Hintze
Qingpeng Zhang
Elijah Lowe
Likit Preeyanon
Jiarong Guo
Tim Brom
Kanchan Pavangadkar
Eric McDonald
Camille Scott
Jordan Fish
Michael Crusoe
Leigh Sheneman
Collaborators
Jim Tiedje, MSU
Susannah Tringe and Janet
Jansson (JGI, LBNL)
Erich Schwarz, Caltech /
Cornell
Paul Sternberg, Caltech
Robin Gasser, U. Melbourne
Weiming Li, MSU
Shana Goffredi, Occidental
Funding
USDA NIFA; NSF IOS;
NIH; BEACON.
Editor's notes
@@ change slide up => more complex diversity
Taking advantage of structure within read
Note that any such measure will do.
Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.