graph-gene Caller (ggCaller), combines gene annotation and pangenome clustering steps within population-wide de Bruijn Graphs. Using population-frequency information, ggCaller improves consistency of gene annotations, leading to more accurate clustering and significantly reduced run-times versus linear-genome based annotation and pangenome analysis.
ggcaller is written by Sam Horsfield, and in collaboration with Nicholas Croucher.
Split k-mer analysis (SKA) can be used to produce alignments from closely related sequence assemblies or
reads, quickly (because it is alignment-free) and with minimal fuss (due to the interface).
This enables downstream analysis such as phylogenetics or sequence completeness.
In collaboration with Simon Harris.
A rust rewrite of pp-sketchlib particularly targeted at very large datasets by using a new file format. To find nearest neighbours or
perform subset queries, this code is an improvement over pp-sketchlib.
Library of sketching functions used to rapidly calculate core and accessory distances between bacterial genomes. Designed as a faster drop-in back-end for PopPUNK, you can also use to replace mash (100x speedup, further 50x with GPUs).
Tools for bacterial genomic epidemiology. Quickly find core and accessory distances between whole-genome sequences, and use these to find genetically clusters. New data can be rapidly ‘queried’ against existing clusters, giving consistent nomenclature.
In collaboration with Nicholas Croucher.
Poplation analysis PIPEline. Designed to be run downstream of PopPUNK, to obtain subclusters and visualisations.
A Snakemake pipeline that requires some config modifications to run.
STUdying Balancing Evolution (NFDS) To Investigate GEnome Replacement
This is an R package to simulate and fit a population genetic model with parameters for vaccination effectiveness, negative frequency-dependent selection (NFDS) and immigration. Based on the model of Corander et al, and implemented as an
odin.dust model.
A suite of R packages to assist in fast, reproducible state-space models, particularly compartmental models used in epidemiology.
In collaboration with Rich Fitzjohn and RESIDE at MRC GIDA
odin: a domain-specific language that makes compartmental models easy to write, extend, and are fast to run:
dust: parallelised code for random number generation for Monte Carlo models, used to actually run state space models forward in time
mcstate: statistical support which allows running, forecasting and inference from these models
Comprehensive pangenome-wide association studies in microbes. Find genetic variation that is linked to a phenotype or clone of interest.
In collaboration with Marco Galardini.
Packages to count and determine presence of non-redundant sequence elements. Use ‘caller’ both to determine what the are unitigs in a population, and to quickly determine presence/absence in a new population. The ‘counter’ is no longer supported, but is an alternative to the first purpose.