Assembling the Tree of Life from public DNA sequence data


Slides for my contribution to the symposium "Beyond the Tree of Life", held 16-18 March 2016, at the Trippenhuis in Amsterdam, the Netherlands.

FROM PUBLIC DNA SEQUENCE DATA
PITFALLS, CHALLENGES AND SOLUTIONS
Rutger Vos, Naturalis Biodiversity Center
Twitter: @rvosa

Assembling a Tree of Life
If you want to build The Tree of Life you will need to build a system for building Trees of Life.
Such a system has many moving parts:
-data mining
-marker selection
-tree inference
-tree calibration
-tree grafting

Data mining
Keyword searches allow one to locate specific genes for specific taxa, assuming that gene names are standardized and applied correctly.
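
To make the keyword approach concrete, here is a minimal sketch that only builds a query string for NCBI's Entrez search; it does not contact any server. The `[Organism]` and `[Gene]` field tags are assumed to behave as in the NCBI nucleotide database, and the helper name `entrez_term` and the list of gene-name spellings are hypothetical. Listing several spellings is a workaround for exactly the assumption noted above, that names are standardized.

```python
def entrez_term(taxon: str, gene_names: list[str]) -> str:
    """Build an Entrez query for one taxon and any of several gene-name spellings."""
    if not gene_names:
        raise ValueError("need at least one gene name")
    # OR together the alternative spellings, since submitters label genes inconsistently
    genes = " OR ".join(f'"{name}"[Gene]' for name in gene_names)
    return f'"{taxon}"[Organism] AND ({genes})'


if __name__ == "__main__":
    # Alternative labels for cytochrome b; illustrative, not exhaustive
    print(entrez_term("Primates", ["cytb", "cob", "MT-CYB"]))
```

A search built this way still misses records whose gene field is empty or mislabeled, which is the pitfall the following slides address.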

Data mining
Similarity ("BLAST") searches allow one to locate sequences that are similar to a query sequence.

Pitfalls in data mining
• Bad gene naming conventions
• Incomplete knowledge of orthology
• Ambiguous taxonomy

Possible solutions
• TNRS helps resolve ambiguities in names (e.g. synonyms): http://taxosaurus.org
• Sequence clustering databases help avoid keyword searching and haphazard BLASTing, e.g.: http://phylota.net

Digression: orthology assignment
• Based on gene names?
  Hopeless: outside of very few markers, gene names are a mess
• Based on pairwise genome comparisons?
  Sort-of done for proteins (e.g. InParanoid) but not for non-coding
• Based on pairwise reciprocal best BLAST hits?
  Quick and cheap, but error prone depending on losses and gains, and database completeness
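
The reciprocal best hit idea can be sketched in a few lines. This is an illustration under stated assumptions, not the SUPERSMART implementation: hits are assumed to arrive as `(query, subject, bitscore)` tuples (for example parsed from tabular BLAST output), and the sequence identifiers and scores are made up.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Toy data: a2 hits b1 best, but b1 prefers a1, so a2 is not paired
    a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 90.0), ("a2", "b1", 150.0)]
    b_vs_a = [("b1", "a1", 198.0), ("b1", "a2", 149.0), ("b2", "a1", 88.0)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data `a2` is left unpaired. If the true ortholog of `a2` had been lost or was simply missing from the database, this outcome would be correct; if `b1` were instead a paralog that happened to score best, the `a1`/`b1` pair itself could be wrong. That is the error-proneness the slide refers to.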

Supermatrix packing
How to optimize combinations of markers to maximize taxon sampling and minimize sparseness?
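
One simple way to explore this trade-off is a greedy heuristic, sketched below. It is only an illustration of the question on the slide, not the method SUPERSMART uses: the function names, the marker names, the taxon labels, and the sparseness threshold are all hypothetical. Sparseness is taken here to be the fraction of empty cells in the taxon-by-marker matrix.

```python
def sparseness(markers, chosen):
    """Fraction of empty cells in the taxon x marker matrix for the chosen markers."""
    taxa = set().union(*(markers[m] for m in chosen))
    cells = len(taxa) * len(chosen)
    if cells == 0:
        return 0.0
    filled = sum(len(markers[m]) for m in chosen)
    return 1.0 - filled / cells


def greedy_pack(markers, max_sparseness):
    """Greedily add the marker that contributes most new taxa while sparseness stays acceptable.

    markers: dict mapping marker name -> set of taxa that have a sequence for it.
    """
    chosen = []
    remaining = set(markers)
    while remaining:
        covered = set().union(*(markers[m] for m in chosen))

        def gain(marker):
            # Prefer more new taxa; break ties by the lower resulting sparseness
            return (len(markers[marker] - covered), -sparseness(markers, chosen + [marker]))

        best = max(sorted(remaining), key=gain)
        if not markers[best] - covered:
            break  # no remaining marker adds a new taxon
        if chosen and sparseness(markers, chosen + [best]) > max_sparseness:
            break  # adding it would make the matrix too sparse
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    markers = {
        "coi": {"A", "B", "C", "E"},
        "cytb": {"A", "B", "C", "D"},
        "rag1": {"F"},
    }
    picked = greedy_pack(markers, max_sparseness=0.3)
    print(picked, round(sparseness(markers, picked), 2))
```

With the toy input, `rag1` is rejected because it would add one taxon at the cost of a half-empty matrix. A greedy pass like this is not guaranteed to find the best combination; it just makes the two competing objectives explicit.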

Tree inference
• The current gold standard in species trees from multilocus alignments: *BEAST. Very expensive.
• Cheaper, scalable Bayesian tree inference is provided by other tools, e.g. ExaBayes.
• Maybe the two can be combined? For example:
  • A cheap, large, taxonomically broad backbone
  • Within-clade relationships resolved more expensively

Tree calibration
• Fossils now available through web service of http://fossilcalibrations.org
• Scalable tree calibration tools now exist, e.g. treePL
• *BEAST has more sophisticated methods

Tree grafting
Pitfalls:
• What if your exemplar species aren't on either side of the root?
• Grafting could then easily result in negative branch lengths
• Also, node density effects will lead to younger nodes in the backbone
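
The negative branch length pitfall comes down to simple arithmetic on node ages, sketched below. This is one reading of the slide, not code from SUPERSMART: it assumes ages are measured back from the present in the same units (e.g. Ma), that the clade tree is attached below a parent node of the backbone, and that the ages in the example are made up.

```python
def stem_branch_length(backbone_parent_age: float, clade_root_age: float) -> float:
    """Length of the branch that would join a grafted clade to its backbone parent node."""
    return backbone_parent_age - clade_root_age


def can_graft(backbone_parent_age: float, clade_root_age: float) -> bool:
    """A graft is only sensible if the clade root is younger than its backbone parent."""
    return stem_branch_length(backbone_parent_age, clade_root_age) > 0


if __name__ == "__main__":
    # Exemplars span the clade root: the clade root is younger than the backbone parent
    print(stem_branch_length(40.0, 32.5), can_graft(40.0, 32.5))
    # Exemplars sit on one side of the root: the backbone places the attachment point
    # at a shallower node, so the separately dated clade root can predate its parent
    print(stem_branch_length(40.0, 46.0), can_graft(40.0, 46.0))
```

A check like this only detects the problem. Choosing exemplars from both sides of each clade's root is what avoids it.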

All in one: SUPERSMART
The SUPERSMART platform helps assemble high-quality marker sets and analyze them using configurable divide-and-conquer tree inference approaches.
The platform is available as a VM, a Docker container, or as an easy-to-install (using Puppet) software stack.

Conclusions
• More and more moving parts for a ToL moonshot are becoming available
• Workflows that result in high-quality estimates can be composed from these
• SUPERSMART uses a recursive workflow with different inference methods at different levels

Acknowledgements
• Alexandre Antonelli
• Hannes Hettling
• Mike Sanderson
• Bengt Oxelman
• Karin Nilsson
• Mats Töpel
• Hervé Sauquet
• Henrik Nilsson
• Daniele Silvestro
• Fabien Condamine
• Ruud Scharn

Questions?
• Thanks for listening!
• Contact me at @rvosa
• For more about our project: http://www.supersmart-project.org


2. Public DNA sequence data. A lot.

4. Want a moonshot? Build a rocket.


17. Results: Primates and Palms


Editor's notes

Hi. My name is Rutger Vos and I am a researcher at Naturalis Biodiversity Center, the natural history museum of the Netherlands. I don't know if there is any live tweeting of this event but in any case my handle is @rvosa.
These kinds of figures are probably very well known to all of you. There is a lot of DNA sequence data in public databases and it’s growing rapidly. Not quite as rapidly growing, but still impressive, are other databases in biology, such as fossil databases or taxonomic names architectures. Given all this data, shouldn't we be able to go big and…
…attempt a moonshot? An actual attempt to approximate the tree of life, somehow?
If you want to do a moonshot, you're going to have to build a rocket. And not just a rocket, but probably also machines to build rockets, and widgets to build those machines, etc. In other words, a lot of infrastructure. Here are some examples of infrastructure projects in phylogenetics.
