Module 4: Estimating evolutionary rates (molecular clocks) from sequence data

By the end of this video, you should be familiar with how we can use sequences, together with information about when they were sampled, to infer a molecular clock that describes the evolutionary rate of a pathogen.

Take-home messages

We refer to data associated with a sequence as metadata. The important piece of metadata that we need for estimating evolutionary rates is the sample collection date.
You can calculate a root-to-tip distance for a sample by summing the branch lengths along the path from the root of the tree to that sample's tip.
The rate estimate is the slope of a regression line fitted to a scatter plot of root-to-tip distance (y axis) against sampling date (x axis).
The evolutionary rate represents the amount of genetic divergence we expect to accrue over a certain amount of time on average.
The terms molecular clock, evolutionary rate, and substitution rate are synonyms for this concept. They are distinct from the intrinsic mutation rate of a pathogen.
Estimates of molecular clocks are improved by serial sampling, which means that genomic sequence data are collected consistently over time, rather than in large
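
The root-to-tip calculation and the regression described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used by any particular tool: the tree, the sample names (A, B, C), the branch lengths, and the sampling dates are all made up, and the helper functions root_to_tip, decimal_year, and least_squares are hypothetical names introduced here. Branch lengths are assumed to be in substitutions per site, so the slope is in substitutions per site per year.

```python
from datetime import date

# Toy tree with made-up values: each node maps to (parent, branch length).
# Branch lengths are assumed to be in substitutions per site.
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 0.0002),
    "A": ("root", 0.0001),
    "B": ("n1", 0.0003),
    "C": ("n1", 0.0006),
}

# Made-up sample collection dates (the metadata needed for the clock).
SAMPLING_DATES = {
    "A": date(2020, 3, 1),
    "B": date(2020, 9, 1),
    "C": date(2021, 2, 1),
}


def root_to_tip(tree, tip):
    """Sum the branch lengths on the path from a tip back to the root."""
    total = 0.0
    node = tip
    while tree[node][0] is not None:
        parent, length = tree[node]
        total += length
        node = parent
    return total


def decimal_year(d):
    """Convert a calendar date to a decimal year, e.g. 2021-01-01 -> 2021.0."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


def least_squares(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


tips = sorted(SAMPLING_DATES)
xs = [decimal_year(SAMPLING_DATES[t]) for t in tips]  # sampling dates
ys = [root_to_tip(TREE, t) for t in tips]             # root-to-tip distances
slope, intercept = least_squares(xs, ys)

# The slope of the regression line is the evolutionary rate estimate.
print(f"rate estimate: {slope:.2e} substitutions per site per year")
```

Each sample contributes one point to the scatter plot: its sampling date on the x axis and its root-to-tip distance on the y axis. With only three made-up samples the fitted slope is not meaningful in itself; the sketch only shows where the number comes from.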

Questions

Using the tree below, calculate the root-to-tip distance of sample hCoV-19/Cameroon/CPC-21v-33021/2021.
Now, pretend that hCoV-19/Cameroon/CPC-21v-33021/2021 was sampled on February 1, 2021. What would be the x and y coordinates of this sample if you were going to plot it on a root-to-tip plot?
Below is a root-to-tip plot from a Nextstrain build of SARS-CoV-2 viruses. What does the regression line represent? When samples fall above the line, what does that mean? When samples fall below the line, what does that mean?