You can’t even begin to understand biology, you can’t understand life, unless you understand what it’s all there for, how it arose - and that means evolution. — Richard Dawkins
Introduction
ggtree is a bioconductor package which facilitates the implements for viewing, annotating and manipulating phylogenetic trees.
In addition to Newick and Nexus, ggtree supports the file formats using these parser functions:
Some Basic Functions: - ggtree(tree_object): Viewing a phylogenetic tree - get.fields: get the annotation features - get.tree: onvert tree object to phylo (or multiPhylo for r8s) object - groupOTU: clustering related OTUs (from tips to their most recent common ancestor),monophyletic (clade), polyphyletic or paraphyletic - fortify: convert tree object to a data.frame
Annatation geometry objects: + geom_text for displaying the node (if available) and tip labels simultaneously + geom_tiplab for adding tip labels + geom_cladelabel for labelling selected clades + geom_hilight for highlighting selected clades + geom_range to indicate uncertainty of branch lengths + geom_strip for adding strip/bar to label associated taxa (with optional label) + geom_taxalink for connecting related taxa + geom_treescale for adding a legend of tree scale
ggtree provides write.jplace function to combine Newick tree file and user’s own data to a single jplace file that can be parsed and the data can be used to annotate the tree directly in ggtree.
## 'beast' S4 object that stored information of
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/BEAST/beast_mcc.tree'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 14 internal nodes.
##
## Tip labels:
## A_1995, B_1996, C_1995, D_1987, E_1996, F_1997, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
## 'height', 'height_0.95_HPD', 'height_median', 'height_range', 'length',
## 'length_0.95_HPD', 'length_median', 'length_range', 'posterior', 'rate',
## 'rate_0.95_HPD', 'rate_median', 'rate_range'.
# Tips: length_95%_HPD will be changed to length_0.95_HPD
So, we must intensely curious about what the beast looks like in the simplest ggtree function, the grammar of graphics on the tree, just simply tap the ggtree(beast), and ggtree, is a short cut to visualize a tree which equal to ggplot(beast, aes(x, y)) + geom_tree() + theme_tree(). By default, the tree is viewed in ladderize form, set the parameter ladderize = FALSE to disable it.
The branch.length is used to scale the edge, defulat is phylogram, set the parameter branch.length = "none" to only view the tree topology (cladogram)
geom_treescale() supports the following parameters: + x and y for tree scale position + width for the length of the tree scale + fontsize for the size of the text + linesize for the size of the line + offset for relative position of the line and the text + color for color of the tree scale
ggtree supports several layouts, including - rectangular(by default) - slanted - circular - fan - unrooted for Phylogram and Cladogram, unrooted , layout, time-scaled and two dimentional phylogenies.
Moreover,the heigh_0.95_HPD is meaningful since the branch is scaled by height(default, check using get.fields, it is the order of [‘height’,‘length’,‘rate’]). We can redefine the branch by specifing branch.length in ggtree function.
Finally lets convert ggtree object to our favourite data.frame in R.
beast_data <-fortify(beast)
# only show 10 colums coz the table was too enormous
knitr::kable(head(beast_data[,1:10], 10))
node
parent
branch.length
x
y
label
isTip
branch
angle
height
1
26
1.3203366
19.926089
13
A_1995
TRUE
19.265920
312
18
2
25
3.2619545
20.926089
12
B_1996
TRUE
19.295111
288
17
3
28
3.3119924
19.926089
7
C_1995
TRUE
18.270092
168
18
4
17
8.8216633
11.926089
1
D_1987
TRUE
7.515257
24
26
5
29
3.2710481
20.926089
9
E_1996
TRUE
19.290565
216
17
6
27
0.8311578
21.926089
14
F_1997
TRUE
21.510510
336
16
7
21
5.1939723
16.926089
6
G_1992
TRUE
14.329102
144
21
8
23
2.1125745
16.926089
10
H_1992
TRUE
15.869801
240
21
9
24
1.7572880
18.926089
11
I_1994
TRUE
18.047445
264
19
10
20
1.5614320
7.926089
5
J_1983
TRUE
7.145373
120
30
Tree import
nwk & nhx file
nwk <-system.file("extdata", "sample.nwk", package="treeio")
tree <-read.tree(nwk)
p <-ggtree(tree) +geom_tippoint(color="steelblue", alpha=1/4, size=10)
p
# %<% for update tree# rtree for Generating Random Trees
p %<%rtree(30)
## 'phylip' S4 object that stored information of
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/sample.phy'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## K, N, D, L, J, G, ...
##
## Unrooted; no branch lengths.
##
## with sequence alignment available (15 sequences of length 2148)
User can view sequence alignment with the tree via msaplot() function.
## 'jplace' S4 object that stored information of
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/sample.jplace'.
##
## ...@ tree:
## Phylogenetic tree with 13 tips and 12 internal nodes.
##
## Tip labels:
## A, B, C, D, E, F, ...
##
## Rooted; includes branch lengths.
##
## with the following features available:
## 'edge_num', 'likelihood', 'like_weight_ratio', 'distal_length',
## 'pendant_length'.
# get.placements method to access the placement
## get only best hit
get.placements(jp, by="best")
Softwares using rst: - BASEML : each ten bases are separated by one space - CODEML : each three bases (triplet) are separated by one space
## from BASEML
brstfile <-system.file("extdata/PAML_Baseml", "rst", package="treeio")
brst <-read.paml_rst(brstfile)
brst
## 'paml_rst' S4 object that stored information of
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/PAML_Baseml/rst'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## A, B, C, D, E, F, ...
## Node labels:
## 16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'marginal_subs', 'joint_subs', 'marginal_AA_subs', 'joint_AA_subs'.
## from CODEML
crstfile <-system.file("extdata/PAML_Codeml", "rst", package="treeio")
crst <-read.paml_rst(crstfile)
crst
## 'paml_rst' S4 object that stored information of
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/PAML_Codeml/rst'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## A, B, C, D, E, F, ...
## Node labels:
## 16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'marginal_subs', 'joint_subs', 'marginal_AA_subs', 'joint_AA_subs'.
# use theme_tree2() to display the tree scale by adding x axis
p <-ggtree(brst) +geom_text(aes(x=branch, label=marginal_AA_subs), vjust=-.5, color='brown') +theme_tree2()
p
The mlc file output by CODEML contains dN/dS estimation.
Note : / and * are not valid characters in names, they were changed to _vs_ and _x_ respectively.
So dN_vs_dS is dN/dS, N_x_dN is N*dN, and S_x_dS is S*dS.
## 'codeml_mlc' S4 object that stored information of
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/PAML_Codeml/mlc'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## A, B, C, D, E, F, ...
## Node labels:
## 16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 't', 'N', 'S', 'dN_vs_dS', 'dN', 'dS', 'N_x_dN', 'S_x_dS'.
Then, Combine them using read.codeml which is highly recommended, we can annotate dN/dS with the tree in rstfile and amino acid substitutions with the tree in mlcfile.
ml <-read.codeml(crstfile, mlcfile)
ml
## 'codeml' S4 object that stored information of
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/PAML_Codeml/rst' and
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/PAML_Codeml/mlc'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## A, B, C, D, E, F, ...
## Node labels:
## 16, 17, 18, 19, 20, 21, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'marginal_subs', 'joint_subs', 'marginal_AA_subs', 'joint_AA_subs', 't',
## 'N', 'S', 'dN_vs_dS', 'dN', 'dS', 'N_x_dN', 'S_x_dS'.
HYPHY tree
nwk <-system.file("extdata/HYPHY", "labelledtree.tree", package="treeio")
ancseq <-system.file("extdata/HYPHY", "ancseq.nex", package="treeio")
tipfas <-system.file("extdata", "pa.fas", package="treeio")
hy <-read.hyphy(nwk, ancseq, tipfas)
hy
## 'hyphy' S4 object that stored information of
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/HYPHY/labelledtree.tree',
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/HYPHY/ancseq.nex' and
## 'D:/Program Files/R-3.4.4/library/treeio/extdata/pa.fas'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## K, N, D, L, J, G, ...
## Node labels:
## Node1, Node2, Node3, Node4, Node5, Node12, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'subs', 'AA_subs'.
### r8s file r8s output contains 3 trees, namely TREE, RATO and PHYLO for time tree, rate tree and absolute substitution tree respectively. Of Course, This is a time-scaled tree by specifying the parameter, mrsd (most recent sampling date).
The data contains amino acid substitutions from parent node to child node and GC contents of each node.
tree <-system.file("extdata", "pa.nwk", package="treeio")
data <-read.csv(system.file("extdata", "pa_subs.csv", package="treeio"), stringsAsFactor=FALSE)
print(tree)
## 'jplace' S4 object that stored information of
## 'C:\Users\Lyole\AppData\Local\Temp\RtmpSK5Q4y\filef383a435e41'.
##
## ...@ tree:
## Phylogenetic tree with 15 tips and 13 internal nodes.
##
## Tip labels:
## K, N, D, L, J, G, ...
##
## Unrooted; includes branch lengths.
##
## with the following features available:
## 'label', 'subs', 'gc'.
Tree Manipulation
ggtree implement several functions to manipulate a phylogenetic tree.
taxa can be clustered together using groupClade or groupOTU functions
clades can be collapsed via collapse function
collapsed clade can be expanded by using expand function
clade can be re-scale to zoom in or zoom out by scaleClade function
selected clade can be rotated by 180 degree using rotate function
position of two selected clades (should share a same parent) can be exchanged by flip function
Node Manipulation
Get the internal node number using geom_text2 or get the internal node number by using MRCA()(most recent commond ancestor ) function, using viewClade to visualize a clade of a phylogenetic tree
And groupClade and groupOTU methods were designed for clustering clades or related OTUs. groupClade accepts an internal node or a vector of internal nodes to cluster clade/clades.
nwk <-system.file("extdata", "sample.nwk", package="treeio")
tree <-read.tree(nwk)
p <-ggtree(tree) +geom_text2(aes(subset=!isTip, label=node), hjust=-.3) +geom_tiplab()
multiplot(p, ggtree(tree) %>%groupClade(node=21) +aes(color=group, linetype=group), viewClade(p, node=21),ncol=3)
groupOTU accepts a vector of OTUs (taxa name) or a list of OTUs, related OTUs are not necessarily within a clade, they can be monophyletic (clade), polyphyletic or paraphyletic.
tree <-groupOTU(tree, c("D","E","F","G"))
ggtree(tree, aes(color=group)) +geom_tiplab()
tree <-groupClade(tree, c(21, 17))
p <-ggtree(tree, aes(color=group)) +scale_color_manual(values=c("black", "firebrick", "steelblue"))
p2 <-rotate(p, 21) %>%rotate(17)
multiplot(p, p2, ncol=2)
flip clade
multiplot(p, flip(p, 17, 21), ncol=2)
Two dimensional tree
ggtree implemented two dimensional tree. It accepts parameter yscale to scale the y-axis based on the selected tree attribute. The attribute should be numerical variable. If it is character/category variable, user should provides a name vector of mapping the variable to numeric by passing it to parameter yscale_mapping.
## One hundred bootstrap trees
btrees <-read.tree(system.file("extdata/RAxML", "RAxML_bootstrap.H3", package="treeio"))
# ggtree(btrees) + facet_wrap(~.id, ncol=10)
p <-ggtree(btrees, layout="rectangular", color="lightblue", alpha=.3)
best_tree <-read.tree(system.file("extdata/RAxML", "RAxML_bipartitionsBranchLabels.H3", package="treeio"))
df <-fortify(best_tree, branch.length ='none')
p+geom_tree(data=df, color='firebrick')
Tree Annotation
Most of the phylogenetic trees are scaled by evolutionary distance (substitution/site), in ggtree a phylogenetic tree can be re-scaled by any numerical variable inferred by evolutionary analysis (e.g. species divergence time, dN/dS, etc). Numerical and category variable can be used to color a phylogenetic tree.
# geom_strip : add a strip/bar to indicate the association with optional label ggtree(tree) +geom_tiplab() +geom_strip(61, 80, barsize=2, color='red') +geom_strip(53, 56, barsize=2, color='blue')
# geom_taxalink layer that allows drawing straight or curved lines between any of two nodes in the tree, allow it to represent evolutionary events by connecting taxaggtree(tree) +geom_tiplab() +geom_taxalink('t3', 't31') +geom_taxalink('t14', 't10', color='red', arrow=grid::arrow(length = grid::unit(0.02, "npc")))
ggtree provides a function, inset, for adding subplots to a phylogenetic tree. The input is a tree graphic object and a named list of ggplot graphic objects (can be any kind of charts), these objects should named by node numbers. To facilitate adding bar and pie charts (e.g. summarized stats of results from ancestral reconstruction) to phylogenetic tree, ggtree provides nodepie and nodebar functions to create a list of pie or bar charts.
set.seed(2015-12-31)
tr <-rtree(15)
p <-ggtree(tr)
a <-runif(14, 0, 0.33)
b <-runif(14, 0, 0.33)
c <-runif(14, 0, 0.33)
d <-1-a -b -c
dat <-data.frame(a=a, b=b, c=c, d=d)
## input data should have a column of `node` that store the node number
dat$node <-15+1:14
## cols parameter indicate which columns store stats (a, b, c and d in this example)
bars <-nodebar(dat, cols=1:4)
inset(p, bars)
The sizes of the insets can be ajusted by the paramter width and height.
inset(p, bars, width=.06, height=.1)
Users can set the color via the parameter color. The x position can be one of ‘node’ or ‘branch’ and can be adjusted by the parameter hjust and vjust for horizontal and vertical adjustment respecitvely.
Similarly, users can use nodepie function to generate a list of pie charts and place these charts to annotate corresponding nodes. Both nodebar and nodepie accepts parameter alpha to allow transparency.
# further annotate the tree with another layer of insets# p2 <- inset(p, bars2, x='branch', width=.06, vjust=-.4)# p2 <- inset(p2, pies, x='branch', vjust=.4)# bx2 <- lapply(bx, function(g) g+coord_flip())# inset(p2, bx2, width=.4, height=.06, vjust=.04, hjust=p2$data$x[1:15]-4) + xlim(NA, 4.5)
Customized Annotation
Phylo4d Example
nwk <-system.file("extdata", "sample.nwk", package="treeio")
tree <-read.tree(nwk)
p <-ggtree(tree)
dd <-data.frame(taxa = LETTERS[1:13],
place =c(rep("GZ", 5), rep("HK", 3), rep("CZ", 4), NA),
value =round(abs(rnorm(13, mean=70, sd=10)), digits=1))
## you don't need to order the data
## data was reshuffled just for demonstration
dd <-dd[sample(1:13, 13), ]
row.names(dd) <-NULL
taxa
place
value
D
GZ
63.7
B
GZ
78.5
J
CZ
85.5
A
GZ
59.8
E
GZ
58.6
I
CZ
50.9
H
HK
75.7
M
NA
61.0
L
CZ
70.8
F
HK
77.6
G
HK
78.4
K
CZ
72.7
C
GZ
58.0
We can imaging that the place column stores the location that we isolated the species and value column stores numerical values (e.g. bootstrap values).
We have demonstrated using the operator, %<%, to update a tree view with a new tree. Here, we will introduce another operator, %<+%, that attaches annotation data to a tree view. The only requirement of the input data is that its first column should be matched with the node/tip labels of the tree.
After attaching the annotation data to the tree by %<+%, all the columns in the data are visible to ggtree. As an example, here we attach the above annotation data to the tree view, p, and add a layer that showing the tip labels and colored them by the isolation site stored in place column.
p <-p %<+%dd +geom_tiplab(aes(color=place)) +geom_tippoint(aes(size=value, shape=place, color=place), alpha=0.25)
p+theme(legend.position="right")
Once the data was attached, it is always attached. So that we can add other layers to display these information easily.
phylo4d was defined in the phylobase package, which can be employed to integrate user’s data with phylogenetic tree. phylo4d was supported in ggtree and the data stored in the object can be used directly to annotate the tree.
The width parameter is to control the width of the heatmap. It supports another parameter offset for controlling the distance between the tree and the heatmap, for instance to allocate space for tip labels.
For time-scaled tree, as in this example, it’s more often to use x axis by using theme_tree2. But with this solution, the heatmap is just another layer and will change the x axis. To overcome this issue, we implemented scale_x_ggtree to set the x axis more reasonable.