Bulletin July 2020 Number 191

Whole genome sequencing with data sharing across a global community has been unprecedented and essential in informing the public health response during the pandemic.

Advancements in genomic capability over the past decade have resulted in significant improvements to the speed and reliability of whole genome sequencing (WGS). WGS can be expected to make important real-time contributions to scientific understanding and the public health response to the Coronavirus disease 2019 (COVID-19) pandemic, to a degree that was not feasible during the SARS epidemic (2003) or H1N1 influenza pandemic (2009). The COVID-19 Genomics UK Consortium (COG-UK) was launched on 23 March 2020 to deliver large-scale and rapid WGS for SARS-CoV-2 across the UK. In this article, we discuss the vital role that genomics has played so far and how WGS can be harnessed to shape the ongoing response to COVID-19.

Rapid sequencing of SARS-CoV-2 contributed to early development of molecular diagnostic assays

Cases of pneumonia of unknown aetiology in Wuhan City, Hubei Province, China, were reported in December 2019. The causative agent was identified as a novel coronavirus on 7 January 2020, later named Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The unprecedented effort of Chinese scientists led to rapid publication of the first SARS-CoV-2 genome1 on www.virological.org on 10 January 2020, followed quickly by a further five sequences deposited to www.GISAID.org (Global Initiative on Sharing All Influenza Data).2 GISAID is an open-access database that fully recognises the contribution of those depositing sequence data, which is important for encouraging the sharing of information.

From the earliest stages of the outbreak, the UK participated in a collaborative effort with laboratories across Europe that was instrumental in securing early molecular diagnostic capability.3 Three assays were selected, targeting the RdRp, E and N genes, based on good matching to the first six publicly available sequences. At this point, clinical isolates of the novel virus were not available to international public health bodies, and collective experience of working with SARS-CoV and Middle East Respiratory Syndrome Coronavirus (MERS-CoV) was critical in developing a diagnostic workflow. Detection of two UK cases on 31 January 2020 provided our first clinical material, facilitating further assay validation. These isolates were rapidly sequenced and the sequences deposited to GISAID on 3 February.

Global genomic data deposited to the public domain

As of 18 June 2020, 48,012 SARS-CoV-2 genomes have been deposited to GISAID, including 21,433 sequences from the UK. High volumes of genomic data are being rapidly generated. These data can provide important insight into the evolution and epidemiology of the pandemic, be used to evaluate control strategies (drugs, vaccines) and aid in the refinement of diagnostic assays.

SARS-CoV-2 is a large RNA virus (~30kb) with potential for genetic recombination, point mutation and some replication error correction. The mutational rate (approximately two nucleotide substitutions/month4) is lower than that of other RNA viruses (such as influenza) owing to an inherent capacity for proofreading. Virus nomenclature is not fully established yet and is likely to be further refined. Rambaut et al. recently proposed the naming of two major lineages (A and B) and several descendant lineages.5
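For comparison with rates quoted in per-site units, the figures above can be converted with simple arithmetic. The following is a rough sketch only: it treats the genome as exactly 30,000 nucleotides and the rate as exactly two substitutions per month, both of which are approximations taken from the text, and the function name is illustrative.

```python
# Back-of-envelope conversion of the quoted whole-genome rate into per-site units.
# Both constants are the approximate figures given in the text, not precise estimates.
GENOME_LENGTH_NT = 30_000      # "~30kb"
SUBSTITUTIONS_PER_MONTH = 2    # "approximately two nucleotide substitutions/month"


def per_site_per_year(subs_per_month: float, genome_length: int) -> float:
    """Convert a whole-genome monthly rate to substitutions per site per year."""
    return subs_per_month * 12 / genome_length


rate = per_site_per_year(SUBSTITUTIONS_PER_MONTH, GENOME_LENGTH_NT)
print(f"{rate:.1e} substitutions/site/year")
```

Under these assumptions, the virus accumulates about 24 substitutions across the whole genome in a year, which is the scale of change that lineage assignment and cluster analysis have to work with.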

Use of genomics to understand the origins and biological properties of SARS-CoV-2

Zoonotic transmission events resulted in human coronavirus outbreaks caused by SARS-CoV in 2003 and MERS-CoV in 2012. The resulting increased awareness of the pathogenic potential of coronaviruses led to an escalation of research into animal reservoir hosts, and viral properties that might govern their emergence and transmission in humans. Bats are a natural reservoir for a large number of coronaviruses, some of which may have potential for cross-species transmission.6,7 Indeed, serological evidence from areas of rural China suggests that human spillover events occur at low frequency.8 Other animal species can act as intermediate hosts enabling the species jump into humans, as observed with dromedary camels for MERS-CoV and palm civets for SARS-CoV.


This pre-existing genomic surveillance has proven vital to our understanding of SARS-CoV-2. Comparative analysis identified SARS-CoV-2 as a group 2B coronavirus. Similarities between the receptor binding domain (RBD) of SARS-CoV and SARS-CoV-2 indicate they use the same ACE2 receptor, a conclusion supported by structural and biochemical analyses.9,10 However, SARS-CoV-2 has a furin cleavage site in the spike glycoprotein that is not found in SARS-CoV.11 Another bat coronavirus (RmYN02) isolated in Yunnan province in 2019 has a similar cleavage site in the spike protein, providing evidence that such insertion events can happen in nature.12 Polybasic cleavage sites in avian influenza viruses were found to lead to a highly pathogenic phenotype.13 However, understanding the significance of this cleavage site for SARS-CoV-2 pathogenicity or transmissibility requires further work.


Figure 1 colour key: bat CoVs = orange and red; pangolin CoV = green; human SARS-CoV-2 = light blue; SARS-CoV = dark blue.

Phylogenetic analysis suggests that the virus originated in Wuhan in November/December 2019,4 although a preceding period of cryptic transmission in humans cannot yet be ruled out.14 High viral loads were reported in environmental samples from the Wuhan wet market, where a variety of animals were sold,14 but it is not yet known whether or which animal species initiated the first human infections. A bat coronavirus isolated in Yunnan province in 2013 (RaTG13) was found to have the greatest overall sequence similarity (~96%) to SARS-CoV-2, although there are differences in the RBD that indicate it is not the direct progenitor.11 Coronaviruses isolated from pangolins were also found to be genetically related to SARS-CoV-2, particularly in the RBD, but evidence is inadequate to support this species as the intermediate host (Figure 1).15 The cleavage site described above is notably absent from both the closest bat and pangolin viruses indicating that another, as yet unidentified, virus is the direct progenitor of SARS-CoV-2. Current scientific thinking is that SARS-CoV-2 originated from multiple naturally occurring recombination events among viruses present in bats and other wildlife. It is worth noting that our understanding of the range of coronaviruses in other species is limited owing to the lack of knowledge of sequence diversity. Many questions remain about the origins of SARS-CoV-2 and where and how the virus became adapted for efficient human-to-human transmission – genomic data will prove critical to unravelling this.


During the course of the pandemic, small numbers of SARS-CoV-2 infections have been detected in wild and domestic animals including cats, dogs, ferrets, mink, pangolins and big cats, some of which displayed respiratory symptoms. On several mink farms in the Netherlands, outbreaks of SARS-CoV-2 infection were reported, which were thought most likely to have resulted from reverse zoonosis events. Additionally, two possible cases of human infection acquired from mink have been reported by Dutch authorities. Pigs, chickens and ducks were not susceptible to experimental infection with SARS-CoV-2.16,17 The consequence of exposure of livestock and companion animals to infected humans is not known, i.e. which animal species can become infected (symptomatically or subclinically) and whether they can be infectious to other animals or humans. Monitoring animals for infection forms part of the risk assessment for new reservoirs and transmission chains.

Box 1: Public health, clinical and scientific questions to which whole genome sequencing may contribute


  • Periodic comparisons of primers/probe sequences against circulating lineages to identify mismatches that might affect the accuracy of molecular diagnostic assays; these can be further evaluated in the wet lab.
  • Investigate unexpected negative PCR results using WGS – for example, Artesi et al.23 recently described a low frequency variant that was associated with failure of the Roche Cobas E gene assay in a small number of cases.


  • Sequence virus from relevant body compartments across the spectrum of clinical presentations (e.g. respiratory/gastrointestinal/paediatric multisystem inflammatory syndrome) to assess for mutations associated with particular disease phenotypes.
  • Understand the distribution of clades between population subgroups (e.g. age, ethnicity, geography).
  • Assess for genomic features that may associate with disease outcome or transmissibility.

Therapeutics and vaccines

  • Antiviral susceptibility testing – a possible role for sequencing the polymerase or protease genes in patients treated unsuccessfully with inhibitors of these viral proteins (e.g. remdesivir)
  • Vaccine development − identification of conserved sequences within epitopes; investigation of vaccine escape mutants, e.g. S gene mutations

Transmission and virus evolution

  • Understand how UK strains fit in the global phylogenetic tree
  • Investigate whether UK outbreaks are due to external introductions or ongoing transmission within the community
  • Identify transmission hotspots or diminishing lineages
  • Understand nosocomial transmission within UK hospitals and care settings
  • Investigate epidemiological clusters and outbreaks

Virus biology

  • Direct RNA sequencing of transcriptomes, e.g. MinION.
  • Identification of genomic changes occurring during cell passage, e.g. deletions in the spike protein cleavage site described by Davidson et al.24
  • Use of reverse genetics and pseudotype viruses to assess the consequences of genomic changes (e.g. as described by Hoffmann et al.9)

Viral sequencing to characterise transmission chains, clusters and outbreaks

Sequencing can be valuable to complement epidemiological evidence of clusters and outbreaks. The first COVID-19 case in the USA was detected on 21 January in Washington, in a traveller from Wuhan. Five weeks later, a second positive individual in this region with no travel history was identified through testing of influenza surveillance samples. Sequencing revealed that the two isolates had identical genomes except for three mutations and both possessed a genetic variant present in only a small proportion (2 of 59 – 3.4%) of genomes published from China at the time. This indicated a low probability that the variant occurred by chance and, combined with the geographical proximity, was used as evidence to suggest that community transmission had occurred within the USA.18 Six of the first ten UK cases resulted from a cluster of cases linked to travel to a ski chalet in France19 and three clusters of local transmission in Singapore were identified early in the pandemic.20 Going forward, genomic data will be key to supporting cluster and outbreak investigations such as those described in care homes.21,22
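The two pieces of arithmetic behind the Washington example, the number of differences between two genomes and the background frequency of a shared variant, can be sketched as follows. This is a minimal illustration rather than the method used in the cited analysis: the sequences are short made-up strings, the function names are hypothetical, and real comparisons work on aligned full-length genomes with dedicated phylogenetic tools.

```python
def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count differing positions between two aligned, equal-length sequences.

    Positions where either sequence has an 'N' (unknown base) or '-' (gap)
    are skipped, since they carry no evidence of a real difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    uninformative = {"N", "-"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in uninformative and b not in uninformative
    )


def variant_frequency(carriers: int, total: int) -> float:
    """Proportion of sampled genomes that carry a given variant."""
    return carriers / total


# Toy aligned fragments (hypothetical, not real SARS-CoV-2 sequence).
first_case = "ACGTACGTACGT"
second_case = "ACGAACGTTCGA"
distance = snp_distance(first_case, second_case)

# Figures from the text: the shared variant was seen in 2 of 59 genomes from China.
background = variant_frequency(2, 59)

print(f"{distance} differences; background frequency {background:.1%}")
```

A small distance together with a rare shared variant supports, but does not prove, a direct epidemiological link; sampling gaps mean that intermediate cases may be missing from the data.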

Developing genomic capability across the UK

On 23 March 2020, the UK government announced a £20 million initiative to launch the COVID-19 Genomics UK Consortium (COG-UK). The consortium involves expertise from the four UK public health agencies, several academic institutions and collaborating organisations. Combined with epidemiological and clinical information, this large-scale WGS aims to contribute directly to informing interventions and UK policy decisions. As of 11 June 2020, 25,052 sequences have been uploaded. A preliminary analysis of SARS-CoV-2 importation into the UK identified at least 1,356 independently introduced lineages, mainly from European countries, followed by local transmission in the UK.22 Real-time sequence analysis has potential to inform a wide range of applications, detailed in Box 1.

Caveats to interpreting genomic data

Interpretation of the significance of genomic data has its limitations and caution should be taken to avoid overinterpretation of findings. First, the relatively low mutation rate of SARS-CoV-2 limits genetic diversity, so inferring transmission events from genomic data alone is not straightforward. Epidemiological metadata can add significant weight to genomic findings and may be essential in ruling possible sources in or out.


Second, nucleotide mismatches in primer/probe binding regions have been described25 but require further testing to assess for any effect on diagnostic assay performance and to understand the prevalence among UK isolates.
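An in silico screen of this kind can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the primer and genome fragments are made-up strings rather than any real assay or isolate, the function names are hypothetical, and only the forward strand is searched. It flags candidate mismatches for follow-up and says nothing about their effect on assay performance, which still requires laboratory testing.

```python
# Standard IUPAC nucleotide codes: each primer symbol maps to the bases it permits.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatch_positions(primer: str, target: str) -> list[int]:
    """Return 0-based primer positions where the target base is not permitted.

    An ambiguous base in the target (e.g. 'N') is counted as a mismatch,
    which is a deliberately conservative choice for a screening step.
    """
    return [
        i
        for i, (p, t) in enumerate(zip(primer.upper(), target.upper()))
        if t not in IUPAC[p]
    ]


def best_binding_site(primer: str, genome: str) -> tuple[int, list[int]]:
    """Slide the primer along the genome and return (start, mismatches)
    for the window with the fewest mismatches (first window wins ties)."""
    k = len(primer)
    if k == 0 or k > len(genome):
        raise ValueError("primer must be non-empty and no longer than the genome")
    best = None
    for start in range(len(genome) - k + 1):
        mismatches = mismatch_positions(primer, genome[start:start + k])
        if best is None or len(mismatches) < len(best[1]):
            best = (start, mismatches)
    return best


# Hypothetical primer containing one degenerate base (R = A or G).
primer = "ACGTRCA"
reference_fragment = "TTTACGTACATTT"   # toy sequence, perfect match
variant_fragment = "TTTACGTACGTTT"     # toy sequence, one substitution in the site

print(best_binding_site(primer, reference_fragment))
print(best_binding_site(primer, variant_fragment))
```

Run periodically against newly deposited genomes, a screen like this would list which isolates carry mismatches and where in the primer they fall; mismatches near the 3' end are generally the ones prioritised for wet-lab evaluation.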

Third, as an RNA virus adapting to humans as a novel host, SARS-CoV-2 can be expected to accumulate genetic change, some of which will confer selective advantage. Other changes may be deleterious or have no biological effect. For example, Korber et al.26 described the increasing frequency of a mutation (D614G) in the spike protein that the authors purported led to increased viral transmissibility. Functional assays are required before assertions can be made about the significance of viral mutations, deletions or insertions for viral transmissibility, disease outcome or therapeutic use.


From the earliest stages of the emergence of SARS-CoV-2, genomics has proven critical to the public health response. The timely generation of whole genome sequences by the global community has contributed to the development of diagnostics, the understanding of transmission dynamics and the analysis of the origins and evolution of the new virus. Rapid and open-access sharing of data has been unprecedented and has played a critical role in the response. In the months and years to come, for as long as SARS-CoV-2 continues to circulate among humans, genomics will play a vital role in shaping public health policy and aiding decision making.

References available.