whole genome analysis
history
needs
steps involved
human genome data
NGS
pyrosequencing
Illumina
SOLiD
Ion Torrent
PacBio
applications
problems
benefits
2. A genome is the entire complement of genetic material of an
organism, virus or organelle, or the haploid set of chromosomes in a
eukaryotic organism.
The whole genome is the complete genome set of an organism.
Whole genome sequencing is a laboratory process that determines the
complete DNA sequence of an organism’s genome at a single time.
3. The term "sequence analysis" implies subjecting a DNA
or peptide sequence to sequence alignment, sequence
databases, repeated sequence searches, or other
bioinformatics methods on a computer.
Sequence analysis in bioinformatics is an automated,
computer-based examination of characteristic
fragments, e.g. of a DNA strand.
4. WHOLE GENOME ANALYSIS
Genomic analysis is the identification, measurement or comparison of
genomic features such as DNA sequence, structural variation, gene
expression, or regulatory and functional element annotation at a
genomic scale.
Methods for genomic analysis typically require high-throughput
sequencing or microarray hybridization and bioinformatics.
5. • Allan Maxam and Walter Gilbert developed a chemical method
of DNA sequencing in 1976-1977.
• Because this method was technically complex and made extensive use of
hazardous chemicals, it fell out of favor.
• Frederick Sanger and co-workers, including Alan Coulson, developed the
chain-termination method in 1977.
• It can only be used for fairly short strands (100 to 1000 base pairs);
longer sequences must be subdivided into smaller
fragments.
6. • Bacteriophage φX174 was the first DNA genome to be sequenced, in
1977, with 11 genes and 5,386 base pairs (bp). It is a viral
genome smaller than those of the T phages, and the virus is polyhedral.
The sequencing was done by Frederick Sanger and colleagues.
• Haemophilus influenzae was the first bacterium to have its genome
sequenced, in 1995, by J. Craig Venter's team. It is a Gram-negative
bacterium with about 1.8 million bp.
• The first nearly complete human genomes sequenced were those of J.
Craig Venter, James Watson, a Yoruban individual and Seong-Jin Kim.
7. WHY WHOLE GENOME SEQUENCING?
• Information about the coding and non-coding parts of an organism's genome.
• To find out important pathways in microbes.
• For evolutionary study and species comparison.
• For more effective personalized medicine (why a drug works for person X
and not for Y).
• Identification of important secondary metabolite pathways (e.g. in plants).
• Disease-susceptibility prediction based on gene sequence variation.
8. 1) Genome sequence assembly
2) Identify repetitive sequences – mask out
3) Gene prediction – train a model for each genome
4) Genome annotation- process of attaching biological information to
sequence
5) Metabolic pathways and regulation- to find missing genes
6) Protein 2D gel electrophoresis- to detect translational product
7) Functional genomics
8) Gene location/gene map
9) Self-comparison of proteome
10) Comparative genomics
11) Identify clusters of functionally related genes
12) Evolutionary modeling- to analyze chromosomal arrangement,
duplications, predictions can be made
10. • Easy for prokaryotes (single cell) – one gene,
one protein
• More difficult for eukaryotes (multicell) – one
gene, many proteins
• Very difficult for Human – short exons
separated by non-coding long introns
• Gene recognition is by sequence alignment
11. E.g. 3.1 × 10^9 bp in the human genome
Difficulties:
• Small genes are hard to identify
• Some genes are rarely expressed and do not
have normal codon usage patterns – thus hard
to detect
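As a rough illustration of why small genes are hard to identify, here is a naive open-reading-frame (ORF) scan on the forward strand only. The function name `find_orfs`, the toy sequence and the length threshold are assumptions made for this sketch; real gene predictors use trained statistical models, both strands and, in eukaryotes, splice-site models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons):
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand.

    min_codons counts the codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)

# Toy sequence (hypothetical data): a 2-codon ORF, then a 5-codon ORF.
toy = "ATGAAATAG" + "CC" + "ATGGCTGCTGCTGCTTGA"
print(find_orfs(toy, min_codons=1))  # both ORFs are reported
print(find_orfs(toy, min_codons=4))  # the short ORF is filtered out
```

A length threshold is needed to suppress the many short ORFs that occur by chance, and that same threshold is what makes genuinely small genes easy to miss.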
13. • Recently a number of faster and cheaper sequencing methods have been developed.
– In October 2006, the X Prize Foundation, working in collaboration with the J. Craig
Venter Science Foundation, established the Archon X Prize for Genomics, intending to
award US $10 million to "the first Team that can build a device and use it to sequence 100
human genomes within 10 days or less, with an accuracy of no more than one error in every
1,000,000 bases sequenced, with sequences accurately covering at least 98% of the genome,
and at a recurring cost of no more than $1,000 per genome". An error rate of 1 in
1,000,000 bases, out of a total of approximately six billion bases in the human diploid
genome, would mean about 6,000 errors per genome. In clinical use, such as predictive
medicine currently 1,400 clinical single gene sequencing tests are over
– The prize was cancelled in August 2013.
– Methods are under development that aim to sequence the entire human genome for
$1000, to allow personal genomics.
– One of the most widely used new methods combines the pyrosequencing biochemical
reactions (invented by Nyrén and Ronaghi in 1996) with the massively parallel
microfluidics technology invented by the 454 Life Sciences company. This combined
technology is called “454 sequencing”.
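The error figure quoted above follows from a single division. A minimal sketch, where the function name `expected_errors` is an assumption made for illustration:

```python
def expected_errors(genome_bases, bases_per_error):
    """Expected number of miscalled bases at a rate of one error per N bases."""
    return genome_bases / bases_per_error

# Approximate size of the human diploid genome, as quoted in the text.
diploid_bases = 6_000_000_000

# One error in every 1,000,000 bases, as in the prize rules quoted above.
print(expected_errors(diploid_bases, 1_000_000))
```

Even at this accuracy a whole genome still carries thousands of expected errors, which matters when the sequence is used clinically.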
14. NEXT GENERATION SEQUENCING
Roche’s 454 FLX, Ion Torrent, Illumina, ABI’s SOLiD
• Array-based sequencing
• Sequence the full genome of an organism in a few days at a very low cost.
• Produce high-throughput data in the form of short reads.
15. Towards ‘Next Generation’ sequencing instruments
• Capacity greater than one gigabase per run
• Drastic decrease in costs per genome
• Sequencing of whole bacterial genomes in a single run
• Sequencing genomes of individuals
• Metagenomics: sequencing DNA extracted from environmental samples
• Looking for rare variants in a single amplified region, in tumors or viral infections
• Transcriptome sequencing: total cellular mRNA converted to cDNA
16. PYROSEQUENCING
• This technique is used to sequence DNA by means of chemiluminescent enzymatic
reactions.
Step 1: A single-stranded DNA template is prepared by alkali denaturation, ready for dNTPs
to be added.
Step 2: In DNA synthesis, a dNTP is attached to the 3’ end of the growing DNA strand.
DNA polymerase starts elongating by using dNTPs. The two phosphates on the end are
released as pyrophosphate (PPi).
17. Roche’s 454 FLX
• A library is prepared from the sample, and adapters are added to both ends of each fragment.
• Individual fragments are captured using the adapters.
18. • Fragments are amplified by PCR.
• The picotiter plate is loaded (sample loading takes 8 h).
• Sequencing is carried out and the data are analysed.
19. Step 3: ATP sulfurylase is normally used in sulfur assimilation: it converts ATP and inorganic
sulfate to adenosine 5’-phosphosulfate (APS) and PPi. In pyrosequencing the reaction runs in
reverse: the PPi released in Step 2 and APS are converted to ATP. Luciferase is the enzyme that causes
fireflies to glow. It uses luciferin and ATP as substrates, converting luciferin to oxyluciferin and
releasing visible light.
20. • The four dNTPs are added one at a time
• The amount of light released is proportional to the number of nucleotides
added to the new DNA strand. Thus, if the sequence has 2 A’s in a row,
both get added and twice as much light is released.
Step 4: After the reaction has completed, apyrase is added to destroy any
leftover dNTPs.
• The pyrosequencing machine cycles between the 4 dNTPs many times,
building up the complete sequence. About 300 bp of sequence is possible
(as compared to 800-1000 bp with Sanger sequencing).
Step 5: The light is detected with a charge-coupled device (CCD) camera.
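Because the light released is proportional to the number of nucleotides added, a sequence can be read from the light intensity recorded at each dNTP flow. The sketch below assumes intensities already normalised so that one incorporation is about 1.0. The flow order "TACG", the signal values and the function name `decode_flowgram` are made up for illustration.

```python
def decode_flowgram(flow_order, intensities):
    """Turn per-flow light intensities into the newly synthesized sequence.

    Each intensity is rounded to the nearest whole number of incorporated
    nucleotides (e.g. about 2.0 means two identical bases in a row).
    Intensities are assumed to be non-negative.
    """
    bases = []
    for i, signal in enumerate(intensities):
        nucleotide = flow_order[i % len(flow_order)]  # dNTP added in this flow
        count = int(signal + 0.5)                     # round to nearest integer
        bases.append(nucleotide * count)
    return "".join(bases)

flows = "TACG"                                  # hypothetical cyclic flow order
signals = [1.02, 0.05, 1.96, 0.97, 0.03, 1.10, 0.00, 0.08]
print(decode_flowgram(flows, signals))
```

Rounding becomes unreliable for long runs of the same base, since telling 7 from 8 incorporations needs a much finer intensity distinction than telling 1 from 2. This is the source of the homopolymer errors typical of this method.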
21. Illumina Sequencing
• Sample preparation: DNA is extracted and purified.
• Tagmentation: a transposome fragments the DNA and tags the fragments.
• Two different adapters are added, one on each end of the DNA, which is then
bound to a slide (flow cell) coated with the complementary sequences for each primer.
• This allows “bridge PCR”, producing a small spot of amplified DNA on the slide.
• The slide contains millions of individual DNA spots. The spots are visualized
during the sequencing run, using the fluorescence of the nucleotide being added.
22. Illumina Sequencing Chemistry
• Cluster generation: process where each
fragment is isothermally amplified.
• Reverse strands are cleaved and washed
away, leaving only the forward strands.
• This method uses the basic Sanger idea of
“sequencing by synthesis” of the second
strand of a DNA molecule. Starting with a
primer, new bases are added one at a time,
with fluorescent tags used to determine
which base was added.
• The fluorescent tags block the 3’-OH of the
new nucleotide, and so the next base can
only be added when the tag is removed.
• The cycle is repeated 50-100 times.
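Since one tagged base is added per cycle, base calling for a spot amounts to asking which fluorescent tag is brightest in each cycle. The sketch below assumes a four-colour readout with one intensity per base. The channel order, the intensity values and the function name `call_bases` are hypothetical.

```python
CHANNELS = "ACGT"  # assumed order of the four fluorescence channels

def call_bases(intensity_cycles):
    """Call one base per cycle from four-channel fluorescence intensities."""
    read = []
    for cycle in intensity_cycles:
        # The tag on the incorporated nucleotide gives the brightest channel.
        brightest = max(range(len(CHANNELS)), key=lambda ch: cycle[ch])
        read.append(CHANNELS[brightest])
    return "".join(read)

# Made-up intensities for one spot over three cycles (A, C, G, T channels).
cycles = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.0, 0.8, 0.2),
    (0.0, 0.1, 0.1, 0.7),
]
print(call_bases(cycles))
```

The blocked 3’-OH ensures that exactly one base is added per cycle, so a run of identical bases appears as several separate cycles rather than as one brighter signal.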
23. SOLiD Sequencing
1. Fragment library: two types of fragments (single and
mate-paired)
2. Ligation of adapters
3. Substrate preparation
4. Hybridization – clonal amplification
5. Emulsion PCR
6. Di-base probes (fluorescently labelled) are added
7. Fluorescence is measured
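The di-base probes mean that each measured colour reports on a pair of adjacent bases, not a single base. With four dyes and sixteen possible dinucleotides, four dinucleotides share each colour. One common way to describe this two-base encoding is as an XOR of 2-bit base codes, sketched below. The numeric colour labels 0-3 and the function names are assumptions made for illustration.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode_colors(seq):
    """Encode each adjacent base pair as one of four colours (0-3)."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Recover the bases from the colours, given the known first base."""
    bases = [first_base]
    for color in colors:
        # The previous base plus the colour determine the next base.
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)

colors = encode_colors("ATGGCA")
print(colors)
print(decode_colors("A", colors))
```

Decoding needs a known first base, and a single wrong colour changes every base decoded after it. For that reason colour reads are usually compared with a reference in colour space before being translated.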
24. Emulsion preparation:
Water + capture beads + enzyme + DNA fragments
+ synthetic oil are shaken vigorously. Water droplets
thus form around the beads, i.e. an emulsion.
• Each plate has 1.6 million wells.
• The plate is designed so that only one bead fits
in each well.
28. Ion Torrent sequencing
• The semiconductor chip contains millions of wells covered by millions of pixels.
• The chip captures chemical information from DNA sequencing and translates it
into digital information.
• DNA is cleaved into millions of fragments; each fragment is attached to its own
bead, and the wells are flooded with DNA nucleotides.
• Each time a nucleotide is incorporated (e.g. G pairing with C), a hydrogen ion is
released, which changes the pH of the solution in the well.
• The chemical change is read on the chip by an ion-sensitive layer beneath the well.
• If the nucleotide is not complementary to the template strand (e.g. G opposite T),
no ion is released.
• This process occurs simultaneously in millions of wells.
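The hydrogen-ion readout for a single well can be sketched as a small simulation: a nucleotide flow releases one ion per incorporated base and none when the flowed nucleotide is not complementary. The flow order "TACG", the toy template and the function name `simulate_ion_flows` are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def simulate_ion_flows(template, flow_order, n_flows):
    """Return the number of H+ ions released in one well for each flow.

    template is the bead-bound strand, written in the order the polymerase
    reads it.
    """
    position = 0
    signals = []
    for flow in range(n_flows):
        nucleotide = flow_order[flow % len(flow_order)]
        released = 0
        # Keep incorporating while the flowed nucleotide pairs with the template.
        while (position < len(template)
               and COMPLEMENT[template[position]] == nucleotide):
            released += 1
            position += 1
        signals.append(released)  # 0 means no pH change for this flow
    return signals

print(simulate_ion_flows("CCGT", "TACG", 8))
```

Two identical template bases in a row release two ions in the same flow, so the pH change is larger instead of being spread over two flows.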
32. PacBio SMRT sequencing
• Each SMRT (single molecule, real-time) cell has 10,000 zero-mode waveguides.
• Step 1: DNA polymerase is immobilized at the bottom of each waveguide.
• Step 2: Phospholinked nucleotides are added.
• Step 3: Each nucleotide is labelled with a different coloured fluorophore.
• Step 4: The base is detected from the light pulse produced on
incorporation, up to 1000-fold.
36.
| Instrument | PacBio | Ion Torrent | 454 | Illumina | SOLiD |
| --- | --- | --- | --- | --- | --- |
| Method | Single molecule in real time | Ion semiconductor | Pyrosequencing | Synthesis | Ligation |
| Read length | 3 kb | 200 bp | 700 bp | 50 to 250 bp | 50+35 or 50+50 bp |
| Error type | Indel | Indel | Indel | Substitution | A-T bias |
| Error % | 13 | ~1 | ~0.1 | ~0.1 | ~0.1 |
| Reads per run | 35,000-75,000 | Up to 4M | 1M | Up to 3.2G | 1.2 to 1.4G |
| Time/run | 30 min to 2 h | 2 h | 24 h | 1-10 days | 1 to 2 weeks |
| Cost/million bases in $ | 2 | 1 | 10 | 0.05 to 0.15 | 0.13 |
| Advantages | Longest read length & fast | Less expensive & fast | Long read size & fast | High sequence yield, cost, accuracy | Low cost per base |
| Disadvantages | Low yield at high accuracy; expensive | Homopolymer errors | Runs are expensive; homopolymer errors | Expensive | Slower than other methods, read lengths, longevity of platform |
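The cost-per-million-bases row can be turned into a rough cost for a whole human genome. The sketch below uses the 3.1 × 10^9 bp genome size given earlier and assumes 30-fold average coverage, a figure that is not from the slides. It ignores instrument, labour and analysis costs, and `run_cost_usd` is a hypothetical helper name.

```python
HUMAN_GENOME_BP = 3.1e9  # haploid human genome size quoted earlier

def run_cost_usd(cost_per_million_bases, coverage=30, genome_bp=HUMAN_GENOME_BP):
    """Rough cost of sequencing a genome to the given average coverage.

    coverage=30 is an assumed value, not a figure from the table.
    """
    megabases = genome_bp * coverage / 1e6
    return megabases * cost_per_million_bases

# Cost per million bases (USD), taken from the table above.
cost_per_mb = {
    "PacBio": 2,
    "Ion Torrent": 1,
    "454": 10,
    "Illumina (low)": 0.05,
    "Illumina (high)": 0.15,
    "SOLiD": 0.13,
}
for platform, cost in cost_per_mb.items():
    print(f"{platform}: ${run_cost_usd(cost):,.0f}")
```

On these assumptions the per-base price dominates the total, which is why platforms with lower per-base costs are preferred for whole genome sequencing even though their reads are short.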
37.
| Equipment | Applications |
| --- | --- |
| 454 | Whole genome sequencing, resequencing, metagenomics |
| Ion Torrent | Small de novo genome sequencing, amplicon sequencing, metagenomics, validation |
| Illumina | Small de novo genome sequencing, cytogenetic analysis, metagenomics, validation, transcriptome sequencing (RNA-Seq), whole exome sequencing, whole genome sequencing |
| SOLiD | Transcriptome sequencing (RNA-Seq), whole exome sequencing, whole genome sequencing |
| Pacific Biosciences | Small genomes, epigenomics |
38. Assembly Problems
• There are random mutations (either naturally occurring cell-to-cell variation or
generated by PCR or cloning).
• Sometimes the cloning vector itself gets sequenced.
• Most genomes contain multiple copies of many sequences.
• Getting rid of vector sequences is easy once the problem is recognized.
• Repeat-sequence DNA is very common in eukaryotes. High-quality sequencing is
helpful.
• Sequencing errors, bad data, random mutations, misreadings.
• Data are produced in the form of short reads.
• The short reads produced have low-quality bases and vector/adaptor contamination.
• Several genome assemblers are available, but their performance has to be compared
to find the best one.
• Quality control
• Patent and licensing restrictions
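Two of the problems listed above, adaptor contamination and low-quality bases, are usually handled by cleaning reads before assembly. The sketch below shows one simple approach, assuming Phred-scaled quality scores and exact adapter matches. The adapter string, the threshold of 20, the toy reads and both function names are assumptions for illustration; dedicated trimming tools also handle partial and mismatched adapters.

```python
def clip_adapter(read, adapter):
    """Remove the adapter and everything after it, if the adapter is present."""
    index = read.find(adapter)
    return read if index == -1 else read[:index]

def trim_low_quality_tail(read, qualities, min_quality=20):
    """Drop bases from the 3' end until one reaches min_quality (Phred scale)."""
    end = len(read)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return read[:end], qualities[:end]

# Toy read that runs into an adapter (hypothetical data).
adapter = "AGATCGGAAGAGC"
print(clip_adapter("ACGTTGCAAGGT" + adapter, adapter))

# Toy read whose last three bases have low quality scores.
quals = [38, 37, 36, 35, 34, 33, 30, 28, 25, 12, 8, 2]
print(trim_low_quality_tail("ACGTTGCAAGGT", quals))
```

Cleaning reads this way removes bases that would otherwise create false overlaps or block true ones during assembly.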
39. [Figure: genomic fragments are sequenced on 454 FLX, SOLiD, Illumina or Ion Torrent to give short reads, at low cost and in less time]
41. BENEFITS
Treatments based on genomics
Improve outcomes
Faster diagnosis
More precise prognosis
Effective therapy
Reduce healthcare costs
42. Applications
Oncology
Determine the preferred therapeutic agent for each tumor
Ascertain which patients are most likely to benefit from a given
therapy
Molecular Pathology
Disease-specific tests
Cost of a single test: $100 - $5,000
Individual test validation and performance
Medicine
• The human genomic sequence in public databases allows rapid
identification of disease genes by positional cloning.
• Inherited diseases can be identified.