Chapter 4 Public Health Use Cases of Genomic Epidemiology
Allison Black
Washington State Department of Health, Seattle, Washington
Here we describe some of the key ways one can use genomic epidemiological analysis in public health investigations. For each of these areas, we provide concrete examples of the types of questions that fall within these topical areas, the fundamental theoretical principles that you will draw upon to investigate those questions, and different sampling schemes and analysis methods for investigating those questions. This chapter is pertinent to most readers, as it describes the public health utility of different genomic epidemiologic analyses. Additionally, readers engaged in building genomic epidemiological capacity or in performing genomic epidemiology themselves will benefit from the descriptions of how to design, analyse, and interpret various genomic epidemiological investigations.
4.1 Assessing Epidemiologic Linkage Between Cases
In public health, we often question whether there is a connection between cases presenting with the same infection. Often we’ll draw on an assessment of shared exposures, and geographic and temporal data, to determine whether infections are linked. Pathogen genomic data provide another data stream by which we can assess relationships between cases, as we’ll discuss here. Some examples of questions that you may ask when using pathogen genomic data to assess epidemiologic linkage include:
Both of these cases have similar reported exposures. Is that common exposure the likely source of both infections?
I’m observing multiple detections of disease in the same setting. Is this setting facilitating transmission, or are we simply detecting community-acquired infections in this setting due to enhanced screening or surveillance?
4.1.1 Fundamental Principles to Draw Upon
As a pathogen, let’s say a virus, circulates in a population, it infects different people. During those infections, the virus replicates, making errors as it copies its genome for packaging into progeny virions. While different pathogens make errors at different rates, and we may see substitutions in consensus genomes at different rates (see Chapter 2, Section 2.2), this principle means that pathogen genome sequences will accumulate changes over the course of an outbreak. This process also means that cases separated by a minimal amount of transmission will generally have more genetically similar infections, while cases that are separated by large amounts of transmission will typically have more genetically divergent infections.
4.1.2 How Should You Sample?
When you are looking at the pathogen genome sequences of two infections you are not extrapolating from those data to a broader understanding of the outbreak as a whole. Therefore, you can assess possible linkages between cases using sequences collected through either a targeted sampling scheme or a representative sampling scheme. That said, it remains important to include contextual sequence data in your analysis, as described in Chapter 3 and as shown in (Figure 4.1).
4.1.3 Tools and Approaches You Can Use to Explore the Question
To investigate questions of linkage you will need a way to compare and summarise the genetic similarity between cases of interest. There are multiple methods for summarizing genetic distances between samples, but we generally recommend using tree structures such as phylogenetic placements or phylogenetic trees (see Chapter 6 for more details) when assessing sequence similarity between samples.
While it is technically possible to build a tree structure with just your samples of interest, we caution against doing this, as the addition of other contextual sequences usually makes the relationships between your samples of interest more clear. Furthermore, in the case that your samples of interest are not genetically similar to each other, the presence of contextual sequences allows you to see other samples that they are closely-related to (Figure 4.1).
Both phylogenetic placements and phylogenetic trees will provide a tree structure for assessing similarity between samples of interest. However, during public health investigations, our questions about the genetic similarity of samples of interest are usually limited in scope and we need answers quickly. When rapid situational awareness is more important than a richer analysis, we recommend performing a phylogenetic placement. If after performing a phylogenetic placement you have further descriptive epidemiological questions about the samples, then we recommend following-up a phylogenetic placement with inferring a phylogenetic tree.
4.1.4 Caveats, Limitations, and Ways Things Go Wrong
As discussed in Chapter 2, genomic data are powerful for ruling linkage out, but less powerful for unequivocally ruling linkage in. Furthermore, except in rare cases, you cannot infer the directionality of transmission from sequence data alone when you have highly genetically similar consensus genomes. This principle is probably clearest when cases have identical genome sequences, since if all of the sequences are the same there’s no genomic signal for directionality. However, we caution that directionality is still challenging to infer accurately even when some genomes are slightly diverged. As an example of this issue, in Figure 4.2 we illustrate two different transmission patterns that yield the same phylogenetic tree topology. In the first scenario, we have a person-to-person transmission pattern, where directionality is from Persons 1 to 2 to 3 to 4. In the second scenario, we instead have a superspreading event, in which Person 1 directly infects Persons 2, 3, and 4. Despite having different transmission patterns, the genetic divergence trees are identical. This example shows how challenging it can be to disentangle genetic diversity accrued during person-to-person transmission from within-host diversity that is transmitted when an index case infects multiple secondary cases.
Additionally, in much the same way that contextual data may change your understanding of linkage between cases, lacking sequences from infections may also change your interpretation of linkage. When assessing linkage between cases, we recommend keeping your sampling frame in mind. If you’re assessing linkage between two samples collected through representative surveillance, then consider what proportion of cases you have sequence data for. If you sequence only a small proportion of cases, then there may be intervening cases along a transmission chain that go unsequenced, and therefore do not appear in a phylogenetic tree. In that case, genetically similar samples could be part of the same transmission chain, but not necessarily directly epidemiologically linked.
Finally, some molecular epidemiological methods suggest using Single Nucleotide Polymorphism (SNP) distances and thresholds for ruling cases into the same outbreak or categorizing them as belonging to different outbreaks. While we understand how thresholds can make decision making easier, we generally discourage using SNP thresholds. This is because using SNP thresholds reduces the granularity of the data; they make a binary assessment (in or out of outbreak) when in fact the data are continuous (0, 1, 2, 3 … SNPs separating two sequences). While reducing the data down to a single binary outcome does make assessment easier, it does so by hiding nuance and uncertainty, which is critical for making epidemiologic inferences and assessing your confidence in them. Furthermore, making genetic distance into a binary call results in a loss of information. Pathogens accrue mutations at specific evolutionary rates, thus the actual count of differences between two sequences provides a quantitative estimate of how related or unrelated two sequences are.
4.2 Exploring Relationships Between Cases of Interest and Other Sequenced Infections
In the previous use case, we showed how to think about using pathogen genomic data when investigating a hypothesis of linkage. This use case is similar, but more exploratory in nature, asking instead whether there are previously unrecognised relationships between other sequenced samples and a case (or cases) of interest. If the previous use case was hypothesis-evaluating, you can think of this use case as hypothesis-generating.
Examples of the kinds of questions you might ask that fall into this use case are:
Is my outbreak related to other outbreaks occurring in other geographic locations?
Are there any additional cases in the community that might actually be part of this outbreak?
4.2.1 Fundamental Principles to Draw Upon
Exploring relationships between cases of interest and other sequenced infections relies on the same principles as discussed when assessing linkage between cases. The reason that we have separated these concepts within this handbook is not because the principles underlying these concepts are different, but rather because the sequences you are considering in your analysis will likely come from different populations. In the previous use case, public health practitioners are seeking to use genomic epidemiology to understand epidemiologic relationships between individuals within their own community, or wherever they have jurisdiction. In contrast, in this use case we describe why and how to explore relationships between your samples of interest and sequenced infections from other communities and regions.
4.2.2 How Should You Sample?
Within this broad use case, a question of interest centers around exploring possible links between your own cases (or outbreaks) and cases or outbreaks occurring in other communities. Contextualizing your own transmission in this way brings up a key concept and challenge; if you are looking for linkage with transmission in other contexts, then you want access to sequences that accurately capture the full scope of pathogen diversity circulating in those other contexts. This means that while you can select your own cases of interest in a way that aligns with the question you’re investigating, you typically need contextual data from other jurisdictions that have been sampled representatively. When sampled representatively, contextual data provide a more accurate summary of the circulating pathogen diversity in those other areas. When available sequence data from other regions accurately summarises those outbreaks, you can have greater confidence that you are accurately capturing the true presence or absence of cross-jurisdictional links.
This concept brings forth a few important considerations. Firstly, the need for representatively sampled contextual data is one of the reasons why we need broad, baseline genomic surveillance programs. In such programs, sequencing is performed at random, without regard to specific clinical presentations, performance on diagnostic assays, or public health questions. Representative sequencing is done partially for the common good; everyone will need contextual data from other communities or populations at some point. This is the main reason why we encourage groups building genomic surveillance systems to perform some degree of representative sequencing beyond their targeted outbreak sequencing. Furthermore, this rationale is also why annotating those sequences as representatively sampled surveillance specimens is a critical part of genomic surveillance data management.
4.2.3 Tools and Approaches You Can Use to Explore the Question
To answer these types of questions, we recommend phylogenetic approaches that summarise the genetic relationships between samples on a tree structure. Currently, there are two primary ways to infer a tree structure that includes your cases of interest: phylogenetic placements and phylogenetic trees. While the output of these two types of analyses may look similar, they are performing different processes. Each of these tree structures, as well as tools for performing phylogenetic placements and inferring phylogenetic trees, can be found in Chapter 6.
As an additional layer to building a phylogenetic tree, genomic epidemiologists will frequently investigate cross-community outbreak linkage using phylogeography. Phylogeographic analyses take in geographic information about where sequences were sampled and probabilistically reconstruct where ancestral pathogens in the tree likely circulated. More information about phylogeography and implementing phylogeographic analyses is given in Chapter 6.
4.2.4 Caveats, Limitations, and Ways Things Go Wrong
While designing and maintaining representative sampling within your own jurisdiction can be challenging, influencing how sampling occurs externally is close to impossible. Variability in access to resources often leads to variability in which communities have pathogen genomic representation, which can directly impact your own inferences. While resourcing broad genomic surveillance programs at higher levels of jurisdictional authority (e.g., nationally) and clear annotation of representatively-collected data can mitigate this issue, the lack of control that you’ll generally have surrounding which contextual data actually exist publicly means that you should exercise some caution in interpreting your findings. This does not mean all is lost! For many investigations, genomic epidemiologists have simply analysed whichever sequences existed or could be generated. And despite such non-ideal sampling, those analyses have still contributed greatly to our understanding of a pathogen’s epidemiology. We simply recommend recognising that the data you have access to may be incomplete or biased, and that your interpretations may change upon the addition of more data.
As the amount of publicly-available sequence data increases, you may find yourself shifting from including all contextual sequences that you can find to having to choose which contextual data to include. This will be true in particular for phylogenetic tree-based analyses, where there is a practical limit on the number of sequences you can include. For guidance on how to choose which contextual data to include, please refer to Section 6.4 within Chapter 6.
4.3 Estimating the Start and Duration of an Outbreak
Once we have detected that an outbreak is occurring, we often want to know more about it. How did this outbreak begin, how long has it been going on for, and when will we be able to declare it over? Pathogen genomic data can help us to understand outbreaks, especially the timing of when they began, which can be challenging to determine from case surveillance data when rates of asymptomatic infection are high or when most cases are not readily captured by passive surveillance. The types of questions your might find yourself exploring within this use case include:
Our syndromic surveillance system isn’t as sensitive as we would like, and we aren’t sure how long we’ve had circulation of this pathogen in our community. When was this pathogen introduced into our community, and how long has it been circulating for?
We received reports of some initial cases of a disease, but we weren’t able to confirm an outbreak until later. Did this outbreak begin around the time of those initial case reports?
We detected an outbreak and we believe that we brought it under control, however we continue to have sporadic cases in the facility. Is the outbreak still ongoing?
4.3.1 Fundamental Principles to Draw Upon
Genomic data are useful for estimating the timing of epidemiologic events in a few different ways. Firstly, estimating the timing of disease circulation from genomic data can be more accurate when the disease in question causes large numbers of asymptomatic infections or mild cases that do not seek treatment. In these scenarios, surveillance systems may only start to record cases once a sufficient number of infections have occurred to produce a subset of more severe cases that seek care or diagnostic testing. In contrast, even when an infection is asymptomatic or mild, viral replication will occur, leading to mutations that may be carried forward during transmission. In this way, a record of the infection can be left in the pathogen genome even when the infection does not rise to a sufficiently symptomatic level to be a recorded case. Furthermore, since genomic data can enable you to resolve distinct but concurrent outbreaks, temporally resolved phylogenetic trees can provide cluster-specific estimates of timing. This feature is particularly useful once you have ongoing transmission, and the emergence of distinct clusters (such as the emergence of a new variant) may not be apparent within an epidemiologic curve.
In Chapter 2, we introduced molecular clocks, which represent the average rate at which a particular pathogen evolves. When we know on average the rate at which genetic diversity accumulates, we can take an observed amount of genetic diversity and ask: “How long would it have taken to accumulate this much diversity?” When we are looking at the genetic divergence between sequences sampled from the same outbreak or epidemic, calculating how much time was needed to generate that diversity provides an estimate of when that outbreak or epidemic likely started.
We generally investigate this type of question using a temporally resolved phylogenetic tree, and we frame the question as “What is the time to the most recent common ancestor of these sequences?” Or, “When did the most recent common ancestor of these sequences likely circulate?” When sequences are genetically similar, little evolutionary time has passed, and the ancestor from which those sequences are descended will be more recent. When sequences are very diverged, a large amount of evolutionary time has passed, and the ancestor of the sequences of interest will have existed further back in time.
Every internal node within a temporally resolved phylogenetic tree represents an ancestor and will have an attached date. So, which internal node should you look at? Typically, the ancestral sequence that you are looking for is the youngest, or least basal, internal node from which all sequences of interest descend. For example, if you are interested in estimating when an outbreak in a skilled nursing facility began, then the internal node that you would look for is the youngest node in the tree from which all skilled nursing facility-associated samples descend. Or if you are interested in when Zika virus likely arrived in the Americas, then you would look for the ancestral node from which all Zika virus genomes collected across all countries in the Americas descend.
4.3.2 How Should You Sample?
In order to estimate the molecular clock accurately, you will need genome sequences collected over longer time spans. The reason for this is fairly simple; it is very hard to accurately estimate the slope of a line when the only data points you have come from a single cross-sectional sample. In contrast, data points collected over time allow you to see much more clearly how genetic diversity and time correlate, thereby providing more consistent and accurate estimates of the evolutionary rate. Ideally, samples used in estimating the molecular clock should be sampled representatively. These serially sampled sequences can either be sequences you generated, or they can be publicly available contextual sequences.
Once you have your estimate of the evolutionary rate, the samples of interest (whose ancestor you would like to date) can be sampled either representatively or in a targeted fashion, in accordance to your question of interest. If you are interested in when a localised outbreak began, then you may want to intensively sample and sequence infections that occur within that particular facility or setting. However, when the event that you would like to date has yielded many sequenced samples, it is best to sample the sequences representatively. For example, if I wanted to estimate when SARS-CoV-2 was first introduced to California, I should select a representative sample of sequences from the entire state of California, and estimate when their ancestor likely circulated. While in theory I could perform targeted-sampling of the entire state of California and sequence every positive case, in reality that would be completely infeasible. Thus, to accurately estimate how much time was necessary for Californian viral diversity to accrue, I must have a representative sample of Californian SARS-CoV-2 diversity. If my sample is not representative, then the date that I infer will represent when the ancestor of the sample that I have circulated.
4.3.3 Tools and Approaches You Can Use to Explore the Question
Using the molecular clock to translate the branch lengths of trees from genetic divergence (evolutionary time) to calendar time is a fairly common method within genomic epidemiology. As such, there are various phylogenetic inference tools that you can use for building time trees. Given the rapid turnaround times that we typically desire in public health, Nextstrain analyses are currently the most common way of inferring time trees within public health contexts (see Chapter 6). Currently, phylogenetic placements do not allow you to make time trees.
4.3.4 Caveats, Limitations, and Ways Things Go Wrong
As discussed above, it is important that you always remember that you are inferring the timing of ancestors of your sampled sequences. Any pathogen genetic diversity that circulated in a population, but is not captured in your dataset, will not be included in your inferential procedure. This is why representative sampling of large populations is important for accurately estimating when transmission started.
As a concrete example, imagine that you want to estimate when SARS-CoV-2 first emerged into the world. But also imagine we’re back in October 2021, and the Delta lineage of SARS-CoV-2 has completely taken over. All other variants that previously circulated have largely gone extinct. All contemporaneously sampled viruses are Delta lineage.
When building your dataset for the analysis of when SARS-CoV-2 first emerged, you download a representative sample of all publicly-available SARS-CoV-2 genome sequences from the past two months. You see that the estimated age of the root of that tree seems to be in spring of 2021, but you know that SARS-CoV-2 was circulating for more than a year before that. What went wrong?
All of the sequences in your theoretical analysis are Delta-lineage descendents. In your large analysis, you have estimated when the ancestor of all of that Delta-lineage diversity likely circulated, but that ancestor isn’t the same as the ancestor of all SARS-CoV-2 diversity that ever existed. In order to estimate when SARS-CoV-2 originally emerged, you would need to ensure that you include sequences that represent the viral diversity that circulated before Delta took over.
An additional point to keep in mind is that molecular clocks vary depending on multiple factors (discussed in Chapter 2). Different datasets might give you slightly different estimates of the rate of molecular evolution, and subsequently slightly different estimates of when ancestral viruses circulated. This variability is an inherent part of these analyses, and so we recommend always providing confidence intervals around your date estimates. When you see this variability, do not panic! If you estimate a faster evolutionary rate in a particular dataset it is unlikely that the pathogen is now suddenly evolving more quickly. The more likely explanation is that you have dense genomic sampling over a short time frame, meaning that purifying selection has not purged deleterious mutations from the pathogen population. Since there hasn’t been enough time for purifying selection to act, you’ll see more diversity and estimate a faster evolutionary rate. This scenario occurred during the epidemic of Ebola Virus Disease in West Africa in 2013–201612.
4.4 Assessing How Demographic, Exposure, and Other Epidemiological Data Relate to a Genomically Defined Outbreak
Much of the richness of genomic epidemiological analysis comes not simply from the analysis of genomic data alone, but rather from the integration, and joint inference, of genomic and epidemiologic data together. This use case therefore describes the situation in which you seek to bring genomic and epidemiologic data together to understand how exposures or demographic information relate to genomically defined relationships between samples. Examples of the kinds of questions that you might ask within this use case are:
Are different lineages of this pathogen circulating in younger people versus older people?
Are different lineages of this pathogen circulating in vaccinated individuals versus unvaccinated individuals?
Does outbreak size or transmission intensity appear to correlate with circulation in particular populations of individuals?
4.4.1 Fundamental Principles to Draw Upon
Rather than relying on any specific principle within genomic epidemiology, this use case presents how one can bring inferences from genomic epidemiology together with surveillance or other epidemiological data. In doing these analyses, the user is looking at genomic relationships between samples, and overlaying those relationships with additional information about where cases lived or worked, their demographic information, possible exposure settings etc. This allows the epidemiologist to qualitatively assess potential relationships between an exposure or demographic variable and clustering patterns in the tree.
4.4.2 How Should You Sample?
You can bring together genomic and surveillance data across any kind of sample set, although you may find different utility in adding surveillance data to outbreaks sampled in a targeted way as compared to representatively-sampled surveillance samples. For example, if you are seeing multiple outbreaks across various skilled nursing facilities (SNFs) in your jurisdiction, you may wish to conduct dense, targeted sampling of cases among employees and residents of those various facilities. Then, you may wish to see how the facility that cases are associated with interacts with the genomic relationships you observe. Do individuals from a single SNF tend to cluster together in clades that are genetically diverged from cases in the other SNFs? Or do all the cases cluster together within a single clade, and adding data about which SNF a case came from shows that cases from all of the SNFs are highly intermingled? In this targeted-sampling example, you are interested in how an exposure of interest (SNF) is associated with close genetic relationships between infections.
In contrast, when joining epidemiologic data and representatively-sampled genomic data, you may be less interested in exposure-outcome associations, and more interested in how well you are capturing a representative cross-section of your population. Adding demographic data to a tree can help you qualitatively see trends such as whether you’re capturing pathogen genetic diversity sampled only from urban centers, or from rural areas as well, and whether you’re capturing cases from different age groups and racial, ethnic, or national-origin groups. Doing this kind of procedure allows you to investigate whether you are likely capturing the full breadth of circulating viral diversity, or whether there is a population that appears to systematically lack genomic surveillance representation.
4.4.3 Tools and Approaches You Can Use to Explore the Question
One of the most common approaches for bringing together surveillance data with genomic data in public health applications is to use the “metadata overlay” feature in Nextstrain. This feature allows the user to colour the tips of a Nextstrain tree visualised in Nextstrain Auspice (https://auspice.us/) according to additional variables specified in an external spreadsheet. One of the reasons why public health practitioners use this particular workflow is because it provides a way to join genomic data objects (the tree) with epidemiologic data, which often contains personally identifiable information (PII) or personal health information (PHI) that epidemiologists must keep secure. This need for storing PII/PHI on secured computational infrastructure usually precludes it from being incorporated into the tree directly, since bioinformaticians usually infer trees on scientific-computing infrastructure that is not authorised to store PII. In the case of the Nextstrain metadata overlay, the data table containing the surveillance data remains “client-side”, that is, the information never leaves the secure computer. Explicit instructions about how to format and use the Nextstrain metadata overlay are described in Chapter 6.
4.4.4 Caveats, Limitations, and Ways Things Go Wrong
What would an epidemiologic handbook be without at least one mention that correlation does not imply causation? In the case where you bring genomic and surveillance data together and look at how the exposure data relate to the phylogenetic patterns, you are in essence looking qualitatively for patterns of correlation between the surveillance data and the genomic data. To say that this is qualitative and correlative is not to undermine its utility; indeed, when brought together these data sources typically work synergistically to provide rich information regarding what transmission dynamics may be at play. However, as with any qualitative, observational analysis, the observed dynamics could be subject to confounding. As such, we recommend using this tool to derive quick situational awareness, and suggest that epidemiologists follow-up with more rigorous studies if the relationships observed warrant deeper investigation.