5. Assembling a Tree of Life
If you want to build the Tree of Life, you will need to build a system for building Trees of Life.
Such a system has many moving parts:
- data mining
- marker selection
- tree inference
- tree calibration
- tree grafting
6. Data mining
Keyword searches allow one to locate specific
genes for specific taxa, assuming that gene
names are standardized and applied correctly.
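As a rough illustration of such a keyword search, the sketch below builds a query URL for the NCBI E-utilities `esearch` endpoint, restricting the search by the `[Organism]` and `[Gene]` fields. The taxon, the gene symbol `cytb`, and the helper name `build_gene_query` are assumptions chosen for the example; the search only finds records whose submitters used exactly that gene symbol, which is the weakness noted above.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(taxon: str, gene: str, db: str = "nuccore", retmax: int = 20) -> str:
    """Build an esearch URL for one gene in one taxon (hypothetical helper)."""
    # Field tags restrict each keyword to the organism and gene-name fields.
    term = f'"{taxon}"[Organism] AND {gene}[Gene]'
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# Example values are illustrative only; no request is sent here.
url = build_gene_query("Puma concolor", "cytb")
print(url)
```

Records annotated as "CYTB" or "cytochrome b" may or may not be matched by this term, so in practice several spellings would have to be tried.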
8. Pitfalls in data mining
Bad gene naming conventions
Incomplete knowledge of orthology
Ambiguous taxonomy
9. Possible solutions
TNRS helps resolve ambiguities in names
(e.g. synonyms): http://taxosaurus.org
Sequence clustering databases help avoid
keyword searching and haphazard BLASTing,
e.g.:
http://phylota.net
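To make the name-resolution idea concrete, here is a minimal sketch of mapping name variants to one accepted name. It does not call TNRS or any real service: the tiny synonym table and the function `resolve_name` are assumptions for illustration, and a real resolver would consult curated taxonomic sources.

```python
# Toy synonym table (illustrative only): lower-cased variant -> accepted name.
SYNONYMS = {
    "felis concolor": "Puma concolor",  # older name for the same species
    "puma concolor": "Puma concolor",
}


def resolve_name(raw, synonyms=SYNONYMS):
    """Normalize a raw label and look up its accepted name, or return None."""
    # Underscores are common in sequence labels; collapse whitespace and case.
    key = " ".join(raw.replace("_", " ").split()).lower()
    return synonyms.get(key)


for label in ["Felis_concolor", "  puma   CONCOLOR ", "Unknown name"]:
    print(repr(label), "->", resolve_name(label))
```

Names that cannot be resolved come back as `None`, so they can be flagged for manual checking rather than silently treated as separate taxa.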
11. Digression: orthology assignment
Based on gene names?
Hopeless: outside of a very few markers, gene names are a mess
Based on pairwise genome comparisons?
Sort of done for proteins (e.g. InParanoid) but not for non-coding sequences
Based on pairwise reciprocal best BLAST hits?
Quick and cheap, but error-prone, depending on gene losses and gains and on database completeness
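The reciprocal-best-hit idea can be sketched in a few lines. The code below assumes the BLAST searches have already been run and parsed into dictionaries of scores; the sequence identifiers and bit scores are made-up toy values, and the function names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject; queries without hits are skipped."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy bit scores (illustrative only): genome A searched against B, and B against A.
a_vs_b = {
    "a1": {"b1": 250.0, "b2": 90.0},
    "a2": {"b1": 200.0, "b2": 120.0},
    "a3": {},  # no hit at all, e.g. because the database is incomplete
}
b_vs_a = {
    "b1": {"a1": 245.0, "a2": 190.0},
    "b2": {"a2": 118.0, "a1": 85.0},
}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data only `a1` and `b1` are paired. `b2` points to `a2`, but `a2` prefers `b1`, so that candidate pair is dropped; this is the kind of case where a duplication or loss can mislead the method.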
13. Tree inference
The current gold standard in species trees
from multilocus alignments: *BEAST. Very
expensive.
Cheaper, scalable Bayesian tree inference is
provided by other tools, e.g. ExaBayes.
Maybe the two can be combined? For
example:
A cheap, large, taxonomically broad backbone
Within-clade relationships resolved more
expensively
14. Tree calibration
Fossil calibrations are now available through the web service of http://fossilcalibrations.org
Scalable tree calibration tools now exist, e.g. treePL
*BEAST has more sophisticated methods
15. Tree grafting
Pitfalls:
• What if your exemplar species aren't on either side of the root?
• Grafting could then easily result in negative branch lengths
• Also, node density effects will lead to younger nodes in the backbone
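The first two pitfalls can be made concrete with a small sketch. It assumes a simplified model in which a dated clade tree is attached below a dated backbone node, so the connecting branch is the backbone node age minus the clade crown age. All names and ages (in Myr) are hypothetical, and this is not the SUPERSMART implementation.

```python
def exemplars_span_root(root_children, exemplars):
    """True if the exemplars fall in at least two child lineages of the clade root.

    root_children: list of sets of tip names, one set per child of the root.
    """
    chosen = set(exemplars)
    sides_hit = sum(1 for child_tips in root_children if child_tips & chosen)
    return sides_hit >= 2


def stem_length(backbone_node_age, clade_crown_age):
    """Length of the branch joining the clade root to its backbone attachment point."""
    return backbone_node_age - clade_crown_age


# Toy clade with two lineages descending from its root.
root_children = [{"A", "B"}, {"C", "D"}]

print(exemplars_span_root(root_children, ["A", "C"]))  # exemplars on both sides
print(exemplars_span_root(root_children, ["A", "B"]))  # both on one side

# If the backbone dates the attachment node younger than the clade's own crown age,
# the connecting branch comes out negative.
print(stem_length(15.0, 12.5))
print(stem_length(10.0, 12.5))
```

A check like `exemplars_span_root` could be run before grafting, so that clades whose exemplars sit on one side of the root are re-sampled rather than grafted with a negative branch.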
16. All in one: SUPERSMART
The SUPERSMART platform helps assemble high-quality marker sets and analyze them using configurable divide-and-conquer tree inference approaches.
The platform is available as a VM, a Docker container, or as an easy-to-install (using Puppet) software stack.
18. Conclusions
More and more moving parts for a ToL
moonshot are becoming available
Workflows that result in high-quality estimates
can be composed from these
SUPERSMART uses a recursive workflow with
different inference methods at different levels
19. Acknowledgements
Alexandre Antonelli
Hannes Hettling
Mike Sanderson
Bengt Oxelman
Karin Nilsson
Mats Töpel
Hervé Sauquet
Henrik Nilsson
Daniele Silvestro
Fabien Condamine
Ruud Scharn
20. Questions?
Thanks for listening!
Contact me at @rvosa
For more about our project:
http://www.supersmart-project.org
Editor's notes
Hi. My name is Rutger Vos and I am a researcher at Naturalis Biodiversity Center, the natural history museum of the Netherlands. I don't know if there is any live tweeting of this event but in any case my handle is @rvosa.
These kinds of figures are probably very well known to all of you. There is a lot of DNA sequence data in public databases and it is growing rapidly. Not growing quite as rapidly, but still impressive, are other databases in biology, such as fossil databases or taxonomic names architectures. Given all this data, shouldn't we be able to go big and…
…attempt a moonshot? An actual attempt to approximate the tree of life, somehow?
If you want to do a moonshot, you're going to have to build a rocket. And not just a rocket, but probably also machines to build rockets, and widgets to build those machines, etc. In other words, a lot of infrastructure. Here are some examples of infrastructure projects in phylogenetics.