Module 2: From sequences to trees

By the end of this video, you should understand how we use consensus genomes to build phylogenetic trees that show genetic divergence.

When first generated, consensus genomes from different samples will be of different lengths and will have different numbers of sites where the precise nucleotide is known.
To make sure that we are comparing the same sites in the genome across all of the sequences in our data set, we must perform a multiple sequence alignment.
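To make the point about lengths and known sites concrete, here is a minimal Python sketch. The sample names and sequences are invented toy data, and the "aligned" versions were aligned by hand for illustration; in practice a dedicated alignment program does this step. The helper names `count_known_sites` and `is_aligned` are hypothetical.

```python
# Toy consensus sequences (hypothetical data, not from the course).
raw = {
    "sample_1": "ACGTACGTAC",
    "sample_2": "ACGTNNGTA",   # N marks a site where the base is unknown
    "sample_3": "CGTACGTAC",
}

def count_known_sites(seq):
    """Count positions where the base is an unambiguous A, C, G, or T."""
    return sum(base in "ACGT" for base in seq.upper())

def is_aligned(seqs):
    """Aligned sequences all have the same length, so column i is the same site in each."""
    return len({len(s) for s in seqs.values()}) == 1

# The same toy sequences after alignment by hand; '-' marks a gap.
aligned = {
    "sample_1": "ACGTACGTAC",
    "sample_2": "ACGTNNGTA-",
    "sample_3": "-CGTACGTAC",
}

for name, seq in raw.items():
    print(name, "length:", len(seq), "known sites:", count_known_sites(seq))
print("raw sequences aligned?", is_aligned(raw))
print("gapped sequences aligned?", is_aligned(aligned))
```

Before alignment, position 1 of sample_3 corresponds to position 2 of sample_1, so comparing the raw strings column by column would compare different sites.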
Once all of your sequences are aligned, you can start to see patterns of shared and unique genetic changes. This pattern, of mutations shared between sequences and mutations unique to a single sequence, is the information that we encode in the phylogenetic tree.
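As a rough illustration of what "shared" and "unique" changes look like in aligned data, the sketch below compares each aligned sequence to a reference and tallies how many genomes carry each mutation. The reference, genome names, and sequences are made up, and comparing against a reference is an assumption of this example rather than a step described in the video.

```python
from collections import Counter

# Hypothetical aligned toy data; all sequences have the same length.
reference = "ACGTACGTAC"
aligned = {
    "genome_X": "ACGTACGTAC",
    "genome_Y": "ATGTACGTAC",
    "genome_Z": "ATGTACGAAC",
}

def call_mutations(ref, seq):
    """Return mutations such as 'C2T' (reference base, 1-based position, new base)."""
    muts = set()
    for pos, (r, s) in enumerate(zip(ref, seq), start=1):
        # Skip gaps and unknown bases; only compare sites known in both sequences.
        if r in "ACGT" and s in "ACGT" and r != s:
            muts.add(f"{r}{pos}{s}")
    return muts

mutations = {name: call_mutations(reference, seq) for name, seq in aligned.items()}

# Count how many genomes carry each mutation.
counts = Counter(m for muts in mutations.values() for m in muts)
shared = {m for m, n in counts.items() if n > 1}
unique = {m for m, n in counts.items() if n == 1}

print("shared:", shared)
print("unique:", unique)
```

In this toy data set, one mutation appears in two genomes and one appears in only a single genome, which is the kind of pattern a tree encodes.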
We build phylogenetic trees by hierarchically clustering sequences according to which mutations they share. When a mutation occurs on a branch, all the samples that descend from that branch will carry that mutation.
Mutations that occur earlier in the tree are inherited by more samples.
Mutations that occur on a terminal branch, that is, a branch leading to only one sample, are unique to that genome sequence.
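The inheritance rule described above can be sketched in a few lines of Python. The tree, genome names (P, Q, R), and mutation names below are hypothetical and are not the tree from the figure; the nested-dictionary layout and the `tip_mutations` helper are choices made for this example only.

```python
# A toy tree: each node lists the mutations that occurred on the branch leading to it.
# Internal nodes have "children"; tips (samples) have a "name".
tree = {
    "mutations": [],
    "children": [
        {"name": "P", "mutations": []},
        {
            "mutations": ["C250T"],            # internal branch: inherited by Q and R
            "children": [
                {"name": "Q", "mutations": []},
                {"name": "R", "mutations": ["G1200A"]},  # terminal branch: unique to R
            ],
        },
    ],
}

def tip_mutations(node, inherited=()):
    """Walk from the root to each tip, accumulating mutations along the path."""
    carried = tuple(inherited) + tuple(node["mutations"])
    if "children" not in node:
        return {node["name"]: set(carried)}
    result = {}
    for child in node["children"]:
        result.update(tip_mutations(child, carried))
    return result

by_tip = tip_mutations(tree)
for name in sorted(by_tip):
    print(name, sorted(by_tip[name]))
```

Here the mutation on the internal branch shows up in both descendants, while the mutation on the terminal branch shows up in one genome only.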

Given the tree in the figure below, which mutation is shared by genomes A, B, and C? Which mutation is unique to genome C?

As we discussed in the video, trees assume that evolution occurs clonally, that is, in a linear fashion in which mutations accrue across a genomic “background”. Given this principle, as well as the patterns of shared mutations that we can see in the tree above, which mutation occurred first: A101G or T8897C?
