Mutation Spectra in Streptococcus pneumoniae
By Bruhad Dave
An Introduction to Mutation Spectra
It is intuitive that an organism’s ecological, phenotypic, or epidemiological context exposes it to distinct mutagens, and might thus produce specific signatures and patterns of mutation: that organism’s mutational spectrum.
This idea is well-established in oncology. Cancer epidemiology studies have shown that a handful of genes, most prominently the human p53 gene, show patterns of mutation specific to the corresponding cancer types. Further, certain mutagens are associated with certain mutation types or patterns. For example, Pfeifer & Besaratinia’s 2009 review1 summarised findings from various studies on mouse models containing human p53 genes (Hupki mouse models).
Mutational spectra in the (human) p53 gene in Hupki mouse embryonic fibroblasts were enriched in G>T mutations when exposed to a tobacco-derived carcinogen (panel A above), and the overall pattern of mutation resembled that from lung-cancer patients who smoked. Similarly, exposure to a plant extract implicated in the etiology of urothelial cancers produced elevated A>T mutations, a hallmark of those cancers (panel B above), and exposure to a particulate air pollutant produced a mutational pattern identical to that seen in other systems exposed to that pollutant (panel C above).
Mutational Spectra in Microbes
In a 2021 paper2, Christopher Ruis and colleagues observed that Mycobacterium abscessus isolated from the lungs of cystic fibrosis sufferers showed spectra of mutations that were distinct from those of environmental isolates (Fig 2 below). In a subsequent work3, Ruis et al. calculated mutational spectra for data from a range of microbial samples, and attempted to associate them with specific mutagenic contexts. They found that distinct patterns of variation were associated with specific DNA-repair defects and ecological niches. To reconstruct these mutational spectra, Ruis and colleagues wrote a tool called MutTui, which they later used to show that mutation patterns in SARS-CoV-2 differed between virus lineages that replicated in the upper and lower respiratory tracts4.
Mutational Spectra in Pneumococcal Epidemiology
Given the evidence for associations between the ecological niche a microbe occupies and the patterns of mutation it accumulates, we reasoned that mutational spectra might likewise be correlated with epidemiological factors in Streptococcus pneumoniae. S. pneumoniae is a genetically diverse, pathogenic bacterium that can cause pneumonia and meningitis in individuals with weaker or weakened immune systems, such as children or the elderly, and immunodeficient or immunosuppressed individuals. One epidemiological feature of interest was carriage duration, the length of time that the bacterium resides in a given individual before it is transmitted to another. A 2017 study by Lees et al.5 estimated carriage duration for samples from the Maela dataset6, a densely sampled, longitudinal dataset derived from a camp for displaced persons in Thailand. That work showed that carriage duration is heritable, and that its variability is attributable to the pathogen’s genotype.
We used the carriage duration estimates from this study, and combined them with information about sample lineage derived from the Global Pneumococcal Sequencing (GPS) project. The GPS project uses PopPUNK, which applies a k-mer-based approach to calculate similarity between the input samples and assigns them to lineages, or clusters. In the GPS database these are referred to as GPS clusters (henceforth GPSCs).
In our workflow, we first obtained precalculated GPSC assignments for each Maela sample, and then aggregated the Maela dataset by GPSC. We then used MutTui to reconstruct mutational spectra for all the GPSCs in the data that contained more than 20 samples. As the 33 clusters we retained represent an intersection of the Maela data and the larger GPS dataset, each cluster is named maela_gpsc (e.g. maela_1 = Maela samples assigned to GPSC-1).
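The aggregation and size filter described above can be sketched in a few lines of Python. This is only an illustration, not the code we ran: the function name `group_samples_by_gpsc`, the input mapping of sample ID to GPSC number, and the toy data are all hypothetical.

```python
from collections import defaultdict


def group_samples_by_gpsc(assignments, min_samples=20):
    """Group sample IDs by GPSC and keep clusters with MORE than min_samples members.

    `assignments` is a hypothetical mapping of sample ID -> GPSC number.
    """
    clusters = defaultdict(list)
    for sample_id, gpsc in assignments.items():
        # Name each cluster maela_<gpsc>, following the naming used in the text.
        clusters[f"maela_{gpsc}"].append(sample_id)
    return {
        name: sorted(ids)
        for name, ids in clusters.items()
        if len(ids) > min_samples
    }


# Toy example with a lowered threshold so the filter is visible.
toy_assignments = {"s1": 1, "s2": 1, "s3": 1, "s4": 2}
kept = group_samples_by_gpsc(toy_assignments, min_samples=2)
print(kept)
```

With the toy threshold of 2, only the cluster holding three samples survives; the real analysis used the default threshold of 20.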
MutTui: An Overview
As noted above, MutTui reconstructs mutational spectra for microbial samples. The tool does so using a phylogenetic tree of each subset of the data (in our case, a tree for each GPSC), a variant alignment of all the samples in that subset (we used Gubbins to produce both the phylogenetic tree and the variant alignment), a reference genome (we used Streptococcus pneumoniae ATCC 700669; see reference 7), and a conversion file (this maps variants in the Gubbins VCF file back to their genomic position, and is produced using a call to MutTui convert-vcf).
The tool performs ancestral reconstruction on the phylogenetic tree using TreeTime, and then calculates a Single Base-Substitution (SBS) spectrum, and also a Double Base-Substitution (DBS) spectrum. Note that here, we only worked with SBS spectra. MutTui outputs include plots showing proportions of SBS and DBS, as well as csv files containing frequencies for each type of mutation. MutTui takes into account the context of each mutation, i.e. the nucleotides flanking each mutation site, in its outputs.
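To make the idea of a context-aware SBS spectrum concrete, here is a minimal sketch of how mutations might be counted by their flanking nucleotides. It is not MutTui's implementation: the function names, the `A[C>T]G` label format, and the choice to fold purine mutations onto the pyrimidine strand (a convention borrowed from cancer-signature work) are assumptions made for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def contextual_label(upstream, ancestral, derived, downstream):
    """Return a label such as A[C>T]G, folded onto the pyrimidine strand."""
    if ancestral in "AG":
        # Purine at the mutated site: describe the mutation on the opposite
        # strand, which swaps and complements the flanking bases.
        upstream, downstream = COMPLEMENT[downstream], COMPLEMENT[upstream]
        ancestral, derived = COMPLEMENT[ancestral], COMPLEMENT[derived]
    return f"{upstream}[{ancestral}>{derived}]{downstream}"


def sbs_spectrum(ancestral_seq, mutations):
    """Count mutations by context.

    `mutations` is a list of (0-based position, derived base) pairs on
    `ancestral_seq`. Sites without a flanking base on both sides are skipped.
    """
    counts = Counter()
    for pos, derived in mutations:
        if pos == 0 or pos == len(ancestral_seq) - 1:
            continue
        label = contextual_label(
            ancestral_seq[pos - 1], ancestral_seq[pos], derived, ancestral_seq[pos + 1]
        )
        counts[label] += 1
    return counts


toy_spectrum = sbs_spectrum("ACGTACGT", [(1, "T"), (2, "A"), (3, "C")])
print(toy_spectrum)
```

In the toy example, the C>T at position 1 and the G>A at position 2 both land in the A[C>T]G category once the purine mutation is folded, which is why the spectrum has only two distinct entries.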
Our initial trials with MutTui produced skewed SBS spectrum plots, but we discovered that rescaling the branches of the Gubbins phylogenetic tree (wherein branch lengths are in the units of substitutions per genome) by the length of the reference genome, so that the branches were represented in substitutions per site, fixed this issue. Chris confirmed that this step was necessary when working with Gubbins phylogenetic trees.
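The rescaling step amounts to dividing every branch length in the tree by the genome length. A minimal sketch, assuming a plain Newick string without quoted labels or comments, is shown below; the function name `rescale_newick` and the toy genome length of 1000 are hypothetical, and in practice a tree library would be a more robust choice than a regular expression.

```python
import re

# A branch length is the number that follows a colon in Newick format.
BRANCH_LENGTH = re.compile(r":(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def rescale_newick(newick, genome_length):
    """Convert branch lengths from substitutions per genome to substitutions per site."""

    def _rescale(match):
        return ":" + repr(float(match.group(1)) / genome_length)

    return BRANCH_LENGTH.sub(_rescale, newick)


toy_tree = "((A:10,B:20):5,C:40);"
rescaled = rescale_newick(toy_tree, genome_length=1000)
print(rescaled)
```

For a real run, `genome_length` would be the length of the reference genome used for the Gubbins alignment.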
An Initial Trial
While we were setting up our workflow, Chris was kind enough to send us mutational spectra that he had reconstructed for a subset of GPSCs. We used MutTui cluster to see how these GPSCs clustered. One thing we looked at was whether GPSC2, which contains a majority of the Serotype 1 samples in GPS (Serotype 1 being implicated as the most pathogenic), stood out from the rest, but this was not the case. The outlier GPSCs did contain samples from other pathogenic serotypes (namely 14 and 19A), but these serotypes are represented in multiple GPSCs, so this observation does not point to a strong correlation between mutation spectra and serotypes (as a proxy for pathogenicity).
Mutational Spectra and Carriage Duration
Having reconstructed spectra for the 33 Maela_GPSCs, we used UMAP (an algorithm for dimensionality reduction based on building neighbourhood graphs in the data) to create clustering visualizations of these spectra, and found that the scatterplot showed two distinct groupings. Colouring each point on this plot by the average duration of carriage of the corresponding Maela_GPSC showed that the two groups were not separated by carriage duration, and that carriage duration did not appear to drive the clustering. This suggests that mutation spectra might not be correlated with duration of carriage.
We also coloured this UMAP projection with a range of metadata: we converted categorical and binary data into 1s and 0s, and plotted averages for a quick, overall view. For example, for drug susceptibility, we assigned 1 to susceptible samples and 0 to both resistant ones and ones that had intermediate phenotypes; averaging over such data yields the fraction of susceptible samples in each cluster. However, no clear patterns emerged for any of the metadata we tested. As noted earlier, MutTui produces mutation spectra that account for mutation context, i.e. the nucleotides on either side of the variant site. We removed this context, summing all mutations of the same type, and ran UMAP on this simpler dataset, obtaining similar results.
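Removing the context is a simple aggregation. The sketch below assumes spectrum categories are labelled in the form `A[C>T]G`, with the substitution type between square brackets; the label format, the function name `collapse_context`, and the toy counts are assumptions for illustration.

```python
from collections import Counter


def collapse_context(spectrum):
    """Sum counts over flanking contexts, leaving one entry per substitution type."""
    collapsed = Counter()
    for label, count in spectrum.items():
        # Extract the substitution, e.g. "C>T", from a label like "A[C>T]G".
        substitution = label[label.index("[") + 1 : label.index("]")]
        collapsed[substitution] += count
    return dict(collapsed)


toy_contextual = {"A[C>T]G": 3, "T[C>T]A": 2, "G[T>C]A": 1}
toy_collapsed = collapse_context(toy_contextual)
print(toy_collapsed)
```

The collapsed spectrum has far fewer categories than the contextual one, which makes it a useful sanity check on whether any grouping depends on the flanking bases.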
Next, we performed a risk calculation for each substitution type, calculating a True/False value for:
$\frac{\mathrm{Proportion \ of \ a \ substitution \ in \ mutations \ in \ each \ cluster}}{\mathrm{Proportion \ of \ that \ substitution \ in \ all \ mutations \ in \ the \ data}} > 1$
for each substitution type. Here again, we found that the 33 Maela_GPSC clusters separated into two groups. However, we note that the two groups obtained in the UMAP projection of mutation spectra are not the same as the two obtained using this risk analysis. The grouping we observed with the risk analysis also did not seem to be correlated to mean carriage duration.
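The ratio above can be computed directly from per-cluster counts. In the sketch below, "all mutations in the data" is taken to mean the counts pooled across every cluster, which is an assumption of this example; the function name `enrichment_flags` and the toy counts are likewise hypothetical, and every cluster is assumed to contain at least one mutation.

```python
def enrichment_flags(cluster_counts):
    """Flag, per cluster and substitution type, whether the risk ratio exceeds 1.

    `cluster_counts` maps cluster name -> {substitution type: count}.
    """
    # Pool counts across clusters to get the overall proportions.
    total_by_substitution = {}
    for counts in cluster_counts.values():
        for substitution, n in counts.items():
            total_by_substitution[substitution] = (
                total_by_substitution.get(substitution, 0) + n
            )
    grand_total = sum(total_by_substitution.values())

    flags = {}
    for cluster, counts in cluster_counts.items():
        cluster_total = sum(counts.values())
        flags[cluster] = {}
        for substitution, pooled in total_by_substitution.items():
            in_cluster = counts.get(substitution, 0) / cluster_total
            overall = pooled / grand_total
            flags[cluster][substitution] = in_cluster / overall > 1
    return flags


toy_counts = {
    "maela_1": {"C>T": 8, "T>C": 2},
    "maela_2": {"C>T": 2, "T>C": 8},
}
toy_flags = enrichment_flags(toy_counts)
print(toy_flags)
```

Each cluster ends up with a vector of True/False values, one per substitution type, and clusters can then be grouped by comparing these vectors.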
We then calculated the mean spectrum for each of the two groups we observed in the UMAP, and compared them by subtracting the mean spectrum of group 2 (on the right of the plot, n=10) from that of group 1 (on the left, n=23). We observed that group 2 had a higher proportion of T>C mutations across all contexts.
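This comparison can be sketched as follows, treating each spectrum as a mapping from mutation category to proportion. The function names and toy values are hypothetical. With the subtraction ordered as group 1 minus group 2, a category that is more frequent in group 2 shows up as a negative difference.

```python
def mean_spectrum(spectra):
    """Average a list of spectra (category -> proportion), category by category."""
    categories = sorted({category for s in spectra for category in s})
    return {
        category: sum(s.get(category, 0.0) for s in spectra) / len(spectra)
        for category in categories
    }


def spectrum_difference(group1, group2):
    """Mean spectrum of group 1 minus mean spectrum of group 2."""
    mean1, mean2 = mean_spectrum(group1), mean_spectrum(group2)
    categories = sorted(set(mean1) | set(mean2))
    return {c: mean1.get(c, 0.0) - mean2.get(c, 0.0) for c in categories}


toy_group1 = [{"C>T": 0.75, "T>C": 0.25}, {"C>T": 0.5, "T>C": 0.5}]
toy_group2 = [{"C>T": 0.25, "T>C": 0.75}]
toy_difference = spectrum_difference(toy_group1, toy_group2)
print(toy_difference)
```

In the toy data, T>C is more common in group 2, so its difference is negative.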
Conclusion
In the brief, exploratory analysis we performed, we did not uncover any particularly striking correlations between the pattern of mutations for a group of related samples, and epidemiological information such as antimicrobial susceptibility or whether, for example, the samples in that group were collected mostly from infants.
However, more fine-grained investigation into associations between pneumococcal mutational spectra and phenotypes or ecology is certainly merited. One obvious extension would be to incorporate more types of metadata where it is available. As is often the case with open-ended analyses, there is a fair chance that potential associations turn up. The clear obstacle to this is that many interesting types of metadata might not be available for some or most of the input data, or indeed, not available at all.
Another potential way forward might be to decompose clusterwise mutation spectra into per-sample mutation spectra. This would be a useful way to increase the resolution of the analysis, and it could prove a good way to handle the fact that metadata is often collected on a per-sample basis, and can be binary or categorical rather than continuous. One part of this approach might be to start from raw sequencing reads (or better yet, variants called from reads) instead of assemblies, as this is likely to help create more complete sample-wise mutation spectra. In a similar vein, applying this sort of analysis to a larger dataset might also produce more robust spectra. We restricted our analysis to GPSCs represented by more than 20 Maela samples: it might be useful to increase the cutoff to 50, for example.
Finally, while UMAP and other clustering algorithms are extremely useful for quickly getting an idea about data like mutational spectra, it might be useful to apply other statistical approaches, including perhaps those from machine learning, to deconvolute relationships between a sample’s (or cluster’s) mutational spectrum and associated biological or ecological information. This might ultimately lead to predictive models whereby one might be able to estimate phenotypes, ecological niches, or active mutagenic forces more broadly, using patterns of mutation, something that would be quite useful in outbreak surveillance and the genomic epidemiology of emerging or nosocomial pathogens.
Thanks to Sophie Belman (Bentley Lab at the Sanger Institute) for initiating this project, Chris Ruis for his help and discussions regarding MutTui, and John Lees for guiding the project!
References and Links
1. Pfeifer, G. P. & Besaratinia, A. Mutational spectra of human cancer. Hum Genet 125, 493–506 (2009).
2. Ruis, C. et al. Dissemination of Mycobacterium abscessus via global transmission networks. Nature Microbiology 6, 1279–1288 (2021).
3. Ruis, C. et al. Mutational spectra analysis reveals bacterial niche and transmission routes. bioRxiv 2022.07.13.499881 (2022) doi:10.1101/2022.07.13.499881.
4. Ruis, C. et al. Mutational spectra distinguish SARS-CoV-2 replication niches. bioRxiv 2022.09.27.509649 (2022) doi:10.1101/2022.09.27.509649.
5. Lees, J. A. et al. Genome-wide identification of lineage and locus specific variation associated with pneumococcal carriage duration. eLife 6, e26255 (2017).
6. Chewapreecha, C. et al. Dense genomic sampling identifies highways of pneumococcal recombination. Nat Genet 46, 305–309 (2014).
7. Streptococcus pneumoniae ATCC 700669, EMBL Accession: FM211187.