Chapter 1 The Value of Pathogen Genomics in Applied Epidemiology
Gytis Dudas
Vilnius University, Vilnius, Lithuania
Have you ever wondered “What would be the benefit of using pathogen genomic data in this investigation?” This chapter attempts to answer that question. We’ll describe the utility of pathogen genomic data for public health surveillance and for outbreak response. We distinguish between surveillance and outbreak response use cases for two reasons. Firstly, the turnaround times for which genomic data retain utility are different for surveillance versus outbreak response. Typically, outbreak response requires sequencing and genomic analysis to occur more rapidly. Secondly, the sampling frames that you use for surveillance are different from what you would use for outbreak response. We will elaborate on sampling frames in Chapter 3. We recognise that getting a genomic epidemiology program up and running can be slow and challenging. We therefore also discuss how genomic data can remain useful even when they are not available in real time. This chapter should be pertinent to most readers, since it provides a broad rationale for why genomic surveillance and genomic epidemiology help support applied epidemiology activities.
1.1 The Value of Genomic Epidemiology for Surveillance
Genomic data provide additional resolution to determine relationships between cases. From a surveillance standpoint, this ability to delineate clusters of related cases more sensitively allows public health practitioners to see separate pathogen lineages circulating within a particular area of interest. This capacity allows the epidemiologist to see distinct chains of transmission even when they circulate in a population concurrently, a finding that can be challenging to see in case surveillance data alone.
What is the benefit of detecting these distinct transmission chains? Firstly, this additional layer of resolution enables the public health practitioner to detect the emergence or introduction of new lineages into their community or surveillance area, and distinguish introduction events from endemic transmission within the surveillance area. Developing a more precise understanding of how the processes of introduction and endemic transmission drive incidence enables public health authorities to intervene in more precise and tailored ways. For example, if most cases within the surveillance area result from multiple introductions of distinct lineages that transmit only minimally after introduction, then policies that seek to reduce the frequency with which introductions occur are likely to be more effective in reducing case counts. In contrast, if most cases are attributable to a single circulating transmission chain, then public health policies focusing on reducing travel-acquired cases are likely to be less effective, and greater focus should be oriented towards interventions designed to interrupt transmission within the community.
Taking our ability to distinguish between distinct transmission chains a bit further, when we have genomic surveillance data collected over time, these data can help the epidemiologist to see how different clusters contribute to the overall disease burden (Figure 1.2). For example, perhaps you detect four circulating genotypes within your community. Despite the existence of these four transmission chains, the majority of the cases in your community can be attributed to the circulation of just one of those transmission chains. In this case, the public health practitioner can focus attention on understanding what risk factors or demographic profiles appear to be most associated with the primary circulating transmission chain, with the hope that a more detailed understanding of that particular transmission chain will underscore more tailored and effective interventions.
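To make the idea of genotype-specific burden concrete, here is a minimal sketch in Python. It assumes a hypothetical line list in which each sequenced case carries a reporting week and a genotype label; the weeks, the labels (A to D), and the function names are invented for illustration and do not come from any real surveillance system.

```python
from collections import Counter, defaultdict

# Hypothetical line list: one (ISO week, genotype) pair per sequenced case.
cases = [
    ("2024-W01", "A"), ("2024-W01", "A"), ("2024-W01", "B"),
    ("2024-W02", "A"), ("2024-W02", "A"), ("2024-W02", "A"), ("2024-W02", "C"),
    ("2024-W03", "A"), ("2024-W03", "A"), ("2024-W03", "D"), ("2024-W03", "A"),
]

def genotype_shares(records):
    """Return {genotype: fraction of all sequenced cases}."""
    counts = Counter(genotype for _, genotype in records)
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}

def weekly_counts(records):
    """Return {week: Counter of genotype counts in that week}."""
    by_week = defaultdict(Counter)
    for week, genotype in records:
        by_week[week][genotype] += 1
    return dict(by_week)

overall_shares = genotype_shares(cases)
dominant_genotype = max(overall_shares, key=overall_shares.get)
counts_by_week = weekly_counts(cases)

for week in sorted(counts_by_week):
    print(week, dict(counts_by_week[week]))
print("dominant genotype:", dominant_genotype)
```

In practice these shares describe sequenced cases only, so they approximate the true burden only as well as the sampling frame allows.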
Furthermore, evaluating the contribution of different pathogen genotypes to case counts over time can provide more precise situational awareness about how different variables shape the epidemiology of a pathogen. For example, we may be concerned about how travel and holiday behaviour contribute to cross-border pathogen spread. Genomic surveillance data can enable us to investigate that question, monitoring the prevalence of particular genotypes in different countries and observing changes in genotype-specific incidence rates following major travel periods. Such analyses enabled scientists to monitor the rise of the B.1.177 lineage of SARS-CoV-2 viruses in Europe during the summer of 2020, attributing the rising dominance of the variant to holiday travel, rather than to increased viral fitness or transmissibility [2].
While most mutations are simply typos in the genome and have little or no effect on the pathogen, some mutations can change the dynamics of transmission. Such lineages, which we may refer to as being “more fit”, do emerge. By monitoring genotype-specific incidence rates, we are well poised to detect an acceleration in the transmission of a particular lineage. Enabled with high-resolution descriptive epidemiological information, we can follow up with analytical epidemiological studies to test hypotheses about why a particular strain is rising in frequency. This process was critical in detecting the emergence of SARS-CoV-2 lineage B.1.1.7 (Alpha in the World Health Organization nomenclature) in the United Kingdom, and attributing the rise in frequency to the actual increased transmissibility of B.1.1.7 lineage viruses [3,4].
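One simple way to quantify whether a lineage is rising in frequency is to regress the log-odds of its frequency against time. The sketch below is an illustration only: the weekly counts are invented, the function names are hypothetical, and it is not the method used in the studies cited above. It also assumes that every week has at least one case of the focal lineage and at least one case of another lineage, since the log-odds is undefined otherwise.

```python
import math

# Hypothetical weekly data: (week index, cases of the focal lineage, all sequenced cases).
weekly = [(0, 2, 100), (1, 4, 100), (2, 8, 100), (3, 15, 100), (4, 27, 100)]

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))

def log_odds_slope(rows):
    """Ordinary least-squares slope of logit(lineage frequency) against week."""
    xs = [week for week, _, _ in rows]
    ys = [logit(k / n) for _, k, n in rows]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

slope = log_odds_slope(weekly)
print(f"weekly change in log-odds of lineage frequency: {slope:.2f}")
```

A positive slope only shows that the lineage is increasing relative to others. As the B.1.177 example illustrates, a rising frequency can reflect travel or repeated introductions rather than higher transmissibility, so this kind of descriptive signal is a prompt for follow-up analytical studies, not a conclusion.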
Finally, a change in commonly reported symptoms or illness severity associated with an increase in a particular genotype could indicate a change in the clinical presentation of illness, possibly associated with a genotypic change in the pathogen. Like standard surveillance data, genomic data are observational data, and therefore to infer causality between genotype and presentation you would need to draw upon other experimental approaches.
1.2 The Value of Genomic Epidemiology for Outbreak Response
Pathogen genome sequences can help us detect or rule out linkages between cases. Similarly to how specific case definitions improve the sensitivity of traditional epidemiologic study designs, genomic data can support outbreak response by improving one’s ability to accurately classify which cases form an outbreak cluster. Below are some toy examples to clarify how this works, and to demonstrate how this information is useful in guiding public health practice.
In the first toy example, imagine that three skilled nursing facilities (SNFs) in your community are all experiencing outbreaks of a particular illness. As spaces of congregate living that typically house more vulnerable people, such facilities may be more prone to a higher incidence of disease. Thus, you may ask yourself, are the outbreaks I see in these three facilities independent outbreaks, attributable simply to the higher risk setting? Or might these outbreaks be linked? Furthermore, are cases detected within each of the facilities linked? Or do some cases form a cluster, while others have simply been detected due to enhanced screening?
Figure 1.3 shows how genomic data can help elucidate some of these possibilities. From the surveillance data, we can see cases detected across all three SNFs. When we sequence the viruses from those infections, an interesting picture unfolds. Each SNF has a subset of cases with closely related genome sequences (indicated here by colour). Each SNF also has cases whose genome sequences are distinct from each other and from the related infections found across all of the SNFs. This theoretical picture helps us understand a few things. Firstly, these are not three independent outbreaks, but rather one outbreak across three facilities. This may cause us to ask whether residents are transferred between facilities regularly, or if staff members work at multiple facilities. Secondly, not all of the cases that were detected across the SNFs appear to form part of the outbreak. Rather, we have one primary cluster, along with detections of infections that were likely acquired in the community, separately from the outbreak.
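The logic of this toy example can be sketched in a few lines of Python. The sketch groups cases whose genome sequences differ by no more than a chosen number of sites, linking cases transitively (single-linkage clustering). Everything here is hypothetical: the ten-base "genomes", the case and facility names, and the threshold of two differences are invented for illustration. Appropriate thresholds are pathogen-specific, and real investigations typically rely on phylogenetic trees rather than a fixed distance cut-off.

```python
from itertools import combinations

# Hypothetical aligned genome fragments keyed by (case id, facility).
sequences = {
    ("case1", "SNF-A"): "ACGTACGTAC",
    ("case2", "SNF-A"): "ACGTACGTAT",
    ("case3", "SNF-B"): "ACGTACGTAC",
    ("case4", "SNF-C"): "ACGAACGTAC",
    ("case5", "SNF-B"): "TTGTCCGAAC",
    ("case6", "SNF-C"): "ACCTAGGTGG",
}

def hamming(a, b):
    """Number of differing sites between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def cluster(seqs, max_snps):
    """Single-linkage clusters: cases within max_snps differences are linked."""
    parent = {key: key for key in seqs}

    def find(key):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for a, b in combinations(seqs, 2):
        if hamming(seqs[a], seqs[b]) <= max_snps:
            parent[find(a)] = find(b)

    groups = {}
    for key in seqs:
        groups.setdefault(find(key), []).append(key)
    return sorted(groups.values(), key=len, reverse=True)

clusters = cluster(sequences, max_snps=2)  # threshold is an assumption
outbreak = clusters[0]
facilities_in_outbreak = sorted({facility for _, facility in outbreak})
print("largest cluster:", outbreak)
print("facilities involved:", facilities_in_outbreak)
```

With these invented data, the largest cluster spans all three facilities, while two cases remain unclustered, mirroring the pattern of one shared outbreak plus separately acquired community infections.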
Similarly, genomic data can distinguish cases that are not linked, even if they are detected within the same setting or facility, or infect individuals with similar demographic profiles. Distinguishing between unrelated and related cases within a single setting can enable epidemiologists to more clearly see what precise factor(s) is driving transmission.
Let us take another toy example, in this case a factory where you detected an elevated incidence of disease above baseline. The cases are distributed across the factory, among both managerial staff who primarily work in offices and individuals working on assembling products. All the cases have been detected around the same time period, meaning that the temporal pattern supports a single outbreak across the entire facility. In that case, what factor is facilitating transmission? How will you act in order to interrupt transmission?
Figure 1.4 shows how the addition of genomic data enables you to see that the cases among office-based managerial staff are unrelated to the infections occurring amongst workers on the assembly line. While all cases occurred among employees of the factory, infections among managerial staff are genetically distinct from cases that occurred among assembly staff. Therefore, managerial infections appear to have been acquired outside the workplace. In contrast, all cases among assembly staff are genetically related. Distinguishing the office staff cases from the assembly staff cases allows you to see that any transmission occurring within the factory appears to be specific to the assembly area of the factory. This improves the sensitivity with which you can detect the factors or behaviours that are facilitating transmission, hopefully making it easier to intervene.
1.3 The Value of Retrospective Data
In public health, we frequently want short turnaround times that allow us to learn and act as quickly as possible. However, initially you might have delays in generating and using pathogen genomic sequence data. This is completely normal and goes hand-in-hand with the challenges of building out new microbiological capacity within the public health lab, new data infrastructure for performing bioinformatic analysis and data linkage, and new inferential tools for analysing the data. While decreasing turnaround times will help you to utilise the data in real time to inform outbreak response, genomic data that are retrospective in nature are still useful!
For example, sequencing retrospective samples can increase or change our understanding of the dynamics and timing of outbreaks. You may find that sequencing older samples retrospectively identifies earlier detections of particular pathogen lineages, therefore suggesting that a lineage might be more prevalent in the present than assumed. Such updated information may improve models used for epidemiological forecasting, or can guide policies surrounding therapeutics administration, if it is known that different lineages show different susceptibilities to those therapeutics. Similarly, hypotheses about the role of particular events in initiating transmission or disseminating an outbreak might be altered or dismissed based on retrospective data. An early sequence might also become a “missing link” that adds weight towards a hypothesis that two distinguishable local transmission chains arose locally and were not the result of two separate introduction events. While it is rare for retrospective samples to be paradigm-shifting, we will look at two examples where retrospective sequencing and phylogenetic analysis changed our fundamental understanding of a disease’s epidemiology.
The human immunodeficiency virus type 1 (HIV-1) group M pandemic resulted from a spillover event from Central African chimpanzees, and the Democratic Republic of Congo (DRC) was likely a starting point. However, owing to the late identification of human immunodeficiency virus (HIV) as the causative agent, the earliest genetic sequences of HIV-1 came from samples collected in the 1980s, likely multiple decades after the initial spillover event from chimpanzees to humans. With many years of transmission and evolution occurring before viruses were ever sequenced, the genome sequences from the 1980s are already genetically diverse. Retrospective sequencing of HIV genomes from two archival clinical samples collected in Kinshasa, the capital of the DRC, in 1959 and 1960 was illuminating. The two samples belonged to different subtypes of HIV-1 group M and thus indicated that substantial genetic diversity of HIV-1 was already present in the DRC by 1960 (Figure 1.5). This finding helped to revise our understanding of the origins of the HIV-1 pandemic, pushing back our estimates of when HIV-1 group M viruses spilled over into the human population towards the turn of the twentieth century, a period of time involving rapid urbanisation in Central Africa.
A similar and even more exaggerated example comes from ancient hepatitis B virus samples. Hepatitis B virus (HBV) is a circular double-stranded DNA virus that infects hepatocytes and causes hepatitis. A number of genotypes of hepatitis B have been described, and genetic diversity tends to correspond to broad geographic regions. Since hepatitis B virus has a DNA genome and double-stranded DNA is relatively stable, it has been possible to extract DNA from well-preserved remains that are hundreds to thousands of years old, from all around the world. This work has allowed us to sequence ancient HBV genomes and compare them to contemporary sequences. One of the most surprising findings of these studies has been that all historical hepatitis B virus sequences to date clearly fall within known contemporaneous genotypic hepatitis B diversity (Figure 1.6), indicating that the diversification of hepatitis B virus predates any of the ancient infections.
In both the HIV-1 group M and hepatitis B virus studies, retrospective sequencing provided a lot of new information because no information existed prior to contemporary sequences. Though this information isn’t actionable for any response happening now, it does nonetheless paint a much clearer picture of when these viruses emerged, and puts constraints on hypotheses about past events. Even though we have explored examples with timescales on the order of decades to hundreds of years, the broader lessons would hold in a similar contemporary situation.
Even when data from an epidemic are plentiful, we may still have local blind spots. One example of this has been the Guinean portion of the 2014 West African Ebola virus epidemic, where a seemingly rare viral lineage (marked GN-1 in Figure 1.7) was detected by three different groups who were sequencing sporadically during the epidemic. Each time the lineage was detected, it seemed to circulate at such low prevalence that each set of investigators thought that the lineage would likely not persist beyond the study period. The persistence of this lineage is remarkable because it was only ever detected at low frequencies and primarily in Guinea, despite the presence of a fairly robust genomic surveillance system across the region. A potential hypothesis arising from these data is that this lineage actually circulated at higher frequencies, allowing the lineage to be sustained over long periods, but that it circulated in an area within Guinea where case detection or sequencing coverage was not as dense. When the lineage reached areas with good sequencing coverage, it was observed, but only rarely. If we assume that PCR testing was done at the source location and that those samples were somehow preserved, retrospective sequencing in regions of Guinea where sequencing was less intense could illuminate the circumstances that allowed this lineage to circulate for so long while only rarely being detected.
Ultimately, situations where retrospective sequencing can significantly alter our perspective will be more common in resource-limited settings, and we expect that retrospective sequencing will generally confirm transmission that we hypothesised was occurring.