The document describes a reference database of genome sequences for Streptococcus species. It lists 26 Streptococcus genome assemblies from NCBI, including assemblies for S. massiliensis, S. intermedius, S. constellatus, S. anginosus, and some unnamed Streptococcus species. This database can be used as a reference for comparing new Streptococcus genome sequences and identifying species.
3. Our Current Model – Publicly available data
• DATA ACQUISITION: National Network of Sequencers; International Network of Sequencers
• DATA ASSEMBLY AND STORAGE and Analysis: NCBI, EMBL, DDBJ (INSDC) (Public Access Database)
• Additional DATA ANALYSIS: FDA, USDA, CDC; State, Local and Foreign Public Health Agencies; Industry/Academia
4. Automated Bacterial Assembly
• Input: SRA reads (sample 1)
• Trim reads (Ns, adaptor)
• Find closest reference genome(s) using the reference distance tree
• Argo (reference assisted assembly)
• De novo assembly panel: SOAPdenovo, GS-assembler (newbler), MaSuRCA, Celera Assembler, SPAdes
• ArgoCA (combined assembly)
• Reads remapped to combined assembly
• Outputs: contig fasta, read placements (bam), quality profile
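The trimming step (Ns, adaptor) can be illustrated with a minimal sketch. The slides do not say which trimming tool or rules the pipeline uses, so the function name `trim_read`, the exact-match adaptor search, and the example adaptor string are all assumptions made for illustration.

```python
def trim_read(seq, qual, adaptor):
    """Clip a 3' adaptor, then strip leading and trailing N bases.

    Assumption: the adaptor is matched exactly; production trimmers
    usually tolerate mismatches and partial adaptors at the read end.
    """
    # Cut the read at the first exact occurrence of the adaptor, if any.
    idx = seq.find(adaptor)
    if idx != -1:
        seq, qual = seq[:idx], qual[:idx]

    # Strip N bases from both ends, keeping the quality string in sync.
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    if end < start:  # read consisted only of Ns
        return "", ""
    return seq[start:end], qual[start:end]


trimmed_seq, trimmed_qual = trim_read("NNACGTNAGATCGG", "I" * 14, "AGATCGG")
```

Keeping the untrimmed reads and applying a deterministic function like this is what allows the trimmed input, and therefore the assembly, to be regenerated later.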
5. WGS & Epidemiologically Relevant Distance (ERD)
• WGS allows high resolution genotypic comparison of pathogen isolates
• What is the epidemiological relevance of genotypic distance?
• Many methods to compute – we need some common principles…
6. Since all approaches start with sequence reads, we must retain for independent confirmation
[Figure: FDA-CFSAN, microbial foodborne pathogen research. SRA format bytes per sequenced base versus number of bases (millions) in MiSeq runs. Series: With Quality, Without Qualities.]
[Figure: Oxford University, Population Genomics of Mycobacterium tuberculosis. SRA format bytes per sequenced base versus number of bases (millions) in MiSeq and HiSeq runs. Series: With Quality, Without Quality.]
Storage is manageable…
7. Reliable, transparent, high throughput, high
resolution ERDs?
Major challenge is to distinguish independent
events (SNPs) from single events that generate
multiple nucleotide differences
i.e. collapsed repeats and other artifacts,
alignment errors (reference-based alignments),
sequence quality, & recombination
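One common heuristic for separating independent SNPs from single events that produce several nearby differences is to discard SNPs that sit too close together. The sketch below illustrates that idea only; the slides do not state which filter, if any, the pipeline applies, and the function name and the `min_spacing` default of 10 bp are hypothetical.

```python
def filter_clustered_snps(positions, min_spacing=10):
    """Keep SNP positions that have no neighbouring SNP within min_spacing bp.

    Assumption: densely clustered differences are more likely to come from
    one event (recombination, collapsed repeat, misalignment) than from
    independent point mutations. The 10 bp default is illustrative.
    """
    pos = sorted(positions)
    kept = []
    for i, p in enumerate(pos):
        left_ok = i == 0 or p - pos[i - 1] >= min_spacing
        right_ok = i == len(pos) - 1 or pos[i + 1] - p >= min_spacing
        if left_ok and right_ok:
            kept.append(p)
    return kept


isolated = filter_clustered_snps([100, 103, 105, 500, 900, 905])
```

A filter of this kind trades sensitivity for specificity: it removes some genuine SNPs that happen to lie near each other.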
11. Processing Status
Table: Samples currently processed (as of Sept 5, 2014) in NCBI Pathogen Pipeline

                       Organisms
Center                 Listeria  Salmonella  E. coli  Total
CDC                         903           -        -    903
FDA + State Partners*       858        6129      307   7294
100K                          -         565       34    599
FERA                         14           -        -     14
Total                      1775        6694      341   8810
12. How to measure the system?
• need the raw data (sequence reads) in unprocessed form
• any read trimming/filtering along with the assembly can be regenerated
13. Assembly metrics
• map the reads back to the assembly and generate a profile of each position (coverage, alleles, qualities)
• compare the assembly against other assemblies of the same organism (genus, species) and check the expected genome size, or similarity to related genomes
• annotation metrics such as frameshifted proteins
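The per-position profile can be sketched for the coverage component alone. This is a simplified illustration: read placements are assumed to be given as 0-based half-open `(start, end)` intervals that lie inside the assembly, and the function names and the depth threshold are hypothetical. Alleles and qualities would need the actual read bases from the bam file.

```python
def coverage_profile(assembly_length, alignments):
    """Per-position read depth from 0-based half-open (start, end) placements.

    Assumption: every interval satisfies 0 <= start <= end <= assembly_length.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (assembly_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth, running = [], 0
    for d in diff[:assembly_length]:
        running += d
        depth.append(running)
    return depth


def low_coverage_positions(depth, min_depth=5):
    """Positions whose depth falls below min_depth (threshold is illustrative)."""
    return [i for i, d in enumerate(depth) if d < min_depth]


depth = coverage_profile(6, [(0, 4), (2, 6), (3, 5)])
weak = low_coverage_positions(depth, min_depth=2)
```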
14. What is the actual measurement for sequence similarity?
• the number of pairwise SNPs between two genomes
What is the threshold?
• a pairwise distance (an observationally determined cutoff below which a cluster of 2 or more isolates are considered significantly close enough to warrant further investigation)
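The measurement and threshold described above can be sketched as follows. This is a minimal illustration, assuming the genomes are already aligned to equal length, that positions with N or gap characters are skipped, and that isolates are grouped by single linkage when their distance is at or below the cutoff. None of these choices is stated in the slides, and the names `snp_distance` and `clusters_within` are hypothetical.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count positions where two aligned sequences differ (A/C/G/T only)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1 for x, y in zip(a, b)
        if x != y and x in "ACGT" and y in "ACGT"
    )


def clusters_within(seqs, threshold):
    """Single-linkage clusters of 2 or more isolates.

    Two isolates are linked when their pairwise SNP distance is <= threshold
    (assumption: the cutoff is inclusive).
    """
    parent = {name: name for name in seqs}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return [g for g in groups.values() if len(g) >= 2]


isolates = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TCGAACTT",
    "D": "ACGNACGA",
}
close_clusters = clusters_within(isolates, threshold=1)
```

Single linkage is the most permissive choice: an isolate joins a cluster if it is close to any one member, so chains of near neighbours can merge into one cluster.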
15. Sensitivity vs. Specificity
sequence clustering
• sensitivity – measure of isolates which belong to the cluster within epidemiologically relevant distance: (true positives) / (true positives + false negatives (not correctly identified))
• specificity – measure of isolates which are excluded from a cluster within epidemiologically relevant distance: (true negatives) / (true negatives + false positives)
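The two ratios can be computed directly once cluster calls are compared with a known answer. The sketch below assumes isolates are identified by name and that a trusted set of truly related isolates is available (for example from an outbreak investigation); the function names and the example data are hypothetical.

```python
def confusion_counts(predicted, truth, all_isolates):
    """Return (tp, fp, fn, tn) for one cluster call against a known answer."""
    tp = len(predicted & truth)                  # in cluster, truly related
    fp = len(predicted - truth)                  # in cluster, not related
    fn = len(truth - predicted)                  # related, but missed
    tn = len(all_isolates - predicted - truth)   # correctly excluded
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """true positives / (true positives + false negatives)"""
    return tp / (tp + fn)


def specificity(tn, fp):
    """true negatives / (true negatives + false positives)"""
    return tn / (tn + fp)


all_isolates = set("abcdefghij")
truth = {"a", "b", "c", "d"}
predicted = {"a", "b", "c", "e"}

tp, fp, fn, tn = confusion_counts(predicted, truth, all_isolates)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```

Raising the distance threshold generally moves isolates from false negative to true positive (higher sensitivity) at the cost of more false positives (lower specificity).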
16. Processing Problems
Columns: Organism; Total Samples; Not expected species(1); Mixed organisms; Less than 5X coverage; Duplicates; PacBio; Poor 2nd read; Failed assembly stage
Row values are listed in column order; empty cells are not shown.
• Listeria: 1775; 20; 2 (?); 1; 5; 1
• Salmonella: 6694; 35; 5; 9; 12
• E. coli: 341; 8; 1
(1) not L. monocytogenes, S. enterica, or E. coli
29. Assembly for sample SAMN02727350

Type                              Number of contigs  Sum of contig lengths
Full assembly                                   667                5251272
contigs with Listeria hits                       37                3031650
contigs with Staphylococcus hits                630                2203573
34. Table: Assembly stats for SAMN02693748

measurement                  result
num_input_reads              4212706
aligned_reads                4040070
assembly_num_bases           3180478
assembly_num_contigs              50
assembly_N50                 2817733
poor_quality_support_bases    132321
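Two of these statistics are easy to derive or cross-check. The sketch below uses the standard definition of N50 (the contig length at which the cumulative length of contigs, sorted longest first, reaches half of the assembly) and computes the aligned-read fraction from the table values. Whether the pipeline computes `assembly_N50` in exactly this way is an assumption, and the contig lengths in the example are made up.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not contig_lengths:
        raise ValueError("no contigs")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths, only to show the calculation.
example_n50 = n50([2, 3, 4, 5, 6])

# Values copied from the table for SAMN02693748.
num_input_reads = 4212706
aligned_reads = 4040070
assembly_num_bases = 3180478
poor_quality_support_bases = 132321

aligned_fraction = aligned_reads / num_input_reads
poor_quality_fraction = poor_quality_support_bases / assembly_num_bases
```

For this sample, about 95.9% of the input reads align back to the assembly and about 4.2% of assembly bases have poor quality support.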
40.
Organism                                        Biosample     SRA Run     Similarity to:
Listeria monocytogenes IEH-NGS-LIS-00100        SAMN02567873  SRR1207486  Listeria SLCC7179
                                                              SRR1220750  Listeria J0161
Salmonella enterica Enteritidis MDH-2014-00798  SAMN02741943  SRR1553852  Schwarzengrund str. CVM19633
                                                              SRR1272871  Enteritidis str. P125109
Salmonella enterica Fluntern MDH-2013-00153     SAMN02378158  SRR1067624  Javiana and Schwarzengrund
                                                              SRR1395304  Cubana and Agona
42. Proficiency Testing
• Replicate results (phylogeny, SNPs) from published studies
• Resequencing
  ✓ same isolate on multiple platforms
  ✓ same isolate in multiple libraries
  ✓ same isolate in multiple labs
• Blinded submissions
  ✓ already-characterized isolates
  ✓ mixed sample isolates
  ✓ metagenomic isolates
• Corner cases
  ✓ Extreme coverage
  ✓ Duplicates
  ✓ Sample mixups
47. Acknowledgements
National Center for Biotechnology Information – National Library of Medicine – Bethesda MD 20892 USA
Richa Agarwala
Azat Badretdin
Slava Brover
Joshua Cherry
Vyacheslav Chetvernin
Robert Cohen
Michael DiCuccio
Mike Feldgarden
Dan Haft
William Klimke
Arjun Prasad
Edward Rice
Kirill Rotmistrovskyy
Stephen Sherry
Sergey Shiryev
Martin Shumway
Tatiana Tatusova
Igor Tolstoy
Chunlin Xiao
Leonid Zaslavsky
Alexander Zasypkin
Alejandro A. Schaffer
Lukas Wagner
Aleksandr Morgulis
David Lipman
James Ostell
NCBI
• This research was supported by the Intramural
Research Program of the NIH, National Library of
Medicine. http://www.ncbi.nlm.nih.gov
CDC
FDA/CFSAN
NIHGRI
UC-Davis
USDA
Vendors: PacBio, Illumina, Roche