This repository contains the LIT cg/wgMLST schema of Legionella pneumophila, as well as all the resources needed to implement it locally for routine surveillance or research. All data available in this repository were generated in the frame of the study "Advancing Legionella pneumophila genomic surveillance with a high-resolution cg/wgMLST schema for outbreak detection and investigation" by Mixão et al. 2026.

Specifically, in this repository you will find:

The wgMLST schema (00_wgMLST_schema.zip)
This folder contains the wgMLST schema, created and prepared to be run with the chewBBACA allele caller, as well as the respective training file (*.trn). The schema is already populated with a large and global dataset of L. pneumophila (n=9572). The list of loci that constitute the static cgMLST schema is provided in the file cgMLST_Legionella_pneumophila_LIT.txt. An adapted version of this schema is available in chewie-NS. That version uses an alternative loci nomenclature, distinct from the "lpLIT" numbering used in this Zenodo repository. For this reason, the folder also includes the file chewie-NS_loci_correspondence.xlsx, which provides the correspondence between the loci names used in this Zenodo repository and those used in chewie-NS.

The allele matrix (01_chewBBACA_9572_genomes.zip)
This folder contains the wgMLST allele matrix obtained with chewBBACA for the L. pneumophila dataset that was used to populate the schema. The file Lp_alleles_missing_code_original.tsv is the original chewBBACA output, which is useful for evaluating why a given locus was not called in a given sample or group of samples. The file Lp_alleles_missing_code_0.tsv is the same matrix with missing data indicated by "0", which is the input required for clustering analysis (e.g. with ReporTree).
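The relationship between the two matrix files can be illustrated with a minimal Python sketch. It makes several assumptions that are not stated in this repository: that the first column holds the sample name, that non-called loci carry chewBBACA classification codes such as LNF or NIPH (the set below is assumed, not taken from this record), and that newly inferred alleles may carry an "INF-" prefix or a leading "*". The tiny matrix and the locus names are made up. To produce the real 0-coded file, use the chewBBACA command given in the usage steps; this sketch only shows the idea.

```python
import csv
import io
from collections import Counter

# Assumed chewBBACA non-allele classifications (not listed in this repository).
MISSING_CODES = {"LNF", "ASM", "ALM", "NIPH", "NIPHEM", "PLOT3", "PLOT5", "LOTSC", "PAMA"}


def mask_call(call: str) -> str:
    """Return the allele number as text, or "0" when the locus was not called."""
    call = call.strip()
    if call.startswith("INF-"):  # assumed prefix of a newly inferred allele
        call = call[len("INF-"):]
    call = call.lstrip("*")  # assumed marker of a locally added allele
    return call if call.isdigit() else "0"


def summarize_and_mask(handle):
    """Count missing-data codes per sample and build the 0-coded rows."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    reasons, masked = {}, [header]
    for row in reader:
        sample, calls = row[0], row[1:]
        reasons[sample] = Counter(c for c in calls if c in MISSING_CODES)
        masked.append([sample] + [mask_call(c) for c in calls])
    return reasons, masked


# Made-up matrix standing in for Lp_alleles_missing_code_original.tsv.
example = "FILE\tlocusA\tlocusB\tlocusC\ns1\t1\tINF-7\tLNF\ns2\tNIPH\t2\tLNF\n"
reasons, masked = summarize_and_mask(io.StringIO(example))
print(dict(reasons["s2"]))  # why loci were not called in sample s2
print(masked[1])            # 0-coded profile of sample s1
```

The per-sample counters correspond to the diagnostic use of the original matrix, while the masked rows correspond to the 0-coded matrix used for clustering.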
Disclaimer: This allele matrix comprises the allelic diversity of a large dataset that was compiled to create and populate the LIT schema. Genomic clustering based on this matrix alone does not imply epidemiological relatedness, transmission, or outbreak linkage. The authors are not responsible for the identification or interpretation of genetic clusters involving isolates included in this dataset, as any conclusions must be supported by appropriate epidemiological and contextual data, which is out of the scope of this study. This allele matrix is provided for research and methodological purposes only.

Example of a ReporTree output (02_example_ReporTree_output_9572_genomes.zip)
This folder contains the outputs of ReporTree runs for the L. pneumophila dataset that was used to populate the schema, using either the single-linkage hierarchical clustering (HC) or the GrapeTree MSTreeV2 (GT) algorithm. It therefore provides an example of how to run ReporTree and what to expect from this tool, and it also presents comprehensive information about the L. pneumophila dataset, integrating clinical, epidemiological and genomic data. Of note, to demonstrate the relevance of a dynamic wgMLST approach for L. pneumophila outbreak investigation, these ReporTree runs were performed in "zoom-clusters" mode, in which, for each cgMLST cluster, the schema is automatically extended with the cluster-specific accessory loci and an additional cluster-specific cgMLST analysis is performed, according to user-specified parameters.

How can I use this cgMLST schema to analyze my L. pneumophila collection?
1. Download the Zenodo repository and unzip the folders
2. Go to 00_wgMLST_schema.zip and unzip L_pneumophila_wgMLST_schema.zip
3. Run chewBBACA allele caller on your assemblies using the downloaded schema (the schema is already prepared to be run with chewBBACA)
   Command line example:
   chewie AlleleCall -i list_your_assemblies.txt --schema-directory L_pneumophila_wgMLST_schema/ --output-directory chewie
4. Replace missing data by "0", creating the results_alleles_missing_code_0.tsv file
   Command line example:
   chewie ExtractCgMLST -i chewie/results_alleles.tsv -o chewie --t 0
5. Create a metadata table with clinical and/or epidemiological information for each of your assemblies (metadata.tsv)
6. Run ReporTree providing the files results_alleles_missing_code_0.tsv (-a) and metadata.tsv (-m) as input.

ReporTree is a surveillance-oriented tool with many functionalities, including the smooth integration of genetic and clinical/epidemiological data. Among these, we highlight:
- the possibility of indicating the threshold levels at which you intend to determine the genetic clusters with the "--HC-threshold" or "-thr" arguments for HC and GT clustering, respectively
- the possibility to indicate the metadata columns for which you wish to obtain summary reports with cgMLST cluster characterization with the "--columns_summary_report" argument
- the possibility to perform a dynamic zoom-in analysis for the cgMLST clusters at a user-selected threshold level that harbor your samples of interest (e.g. new samples) with the "zoom-cluster-of-interest" mode OR for all clusters at a user-selected threshold with the "zoom-clusters" mode
- the possibility to indicate two metadata columns for which you intend to also generate summary reports with the "--metadata2report" argument

If you intend to start a local nomenclature system in a routine surveillance context, every time you run the tool you can indicate the file *partitions.tsv as "--nomenclature-file" and the cluster names will be maintained.
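Before running ReporTree (steps 5 and 6), it can help to confirm that every assembly in the allele matrix has a row in metadata.tsv and vice versa. The sketch below is only a convenience check under stated assumptions: both files are tab-separated with a header row and the sample name in the first column. The column names "ST" and "outbreak_number" are borrowed from the command line examples in this record, while the sample names, locus names and values are made up.

```python
import csv
import io


def first_column(handle):
    """Return the values of the first column of a TSV file, skipping the header."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # header row
    return [row[0] for row in reader if row]


def compare_samples(alleles_handle, metadata_handle):
    """Report samples that are present in only one of the two tables."""
    in_alleles = set(first_column(alleles_handle))
    in_metadata = set(first_column(metadata_handle))
    return sorted(in_alleles - in_metadata), sorted(in_metadata - in_alleles)


# Made-up stand-ins for results_alleles_missing_code_0.tsv and metadata.tsv.
alleles = "FILE\tlocusA\tlocusB\nsampleA\t1\t4\nsampleB\t2\t0\n"
metadata = "sample\tST\toutbreak_number\nsampleA\t1\tOB1\nsampleC\t23\t\n"

no_metadata, no_alleles = compare_samples(io.StringIO(alleles), io.StringIO(metadata))
print("Samples without metadata:", no_metadata)
print("Samples without allele profile:", no_alleles)
```

With real files, the two io.StringIO objects would be replaced by open(..., newline="") handles on the actual tables.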
You can also request a hierarchical nomenclature code to be attributed to each isolate, reflecting the respective genetic clustering at user-selected levels ("--nomenclature-code-levels").

Note: for the usage of this schema, we strongly advise you to only perform clustering with isolates with at least 95% of the cgMLST loci called (--loci-called).

Command line examples:
1. cgMLST command line
   python reportree.py -a results_alleles_missing_code_0.tsv -m metadata.tsv -out ReporTree -l cgMLST_Legionella_pneumophila_LIT.txt --loci-called 0.95 --analysis HC --HC-threshold single-100,single-50,single-20,single-10,single-6,single-1 --columns_summary_report "ST,outbreak_number"
2. routine cgMLST-based surveillance (maintaining cluster nomenclature)
   python reportree.py -a results_alleles_missing_code_0.tsv -m metadata.tsv -out ReporTree -l cgMLST_Legionella_pneumophila_LIT.txt --loci-called 0.95 --analysis HC --HC-threshold single-100,single-50,single-20,single-10,single-6,single-1 --columns_summary_report "ST,outbreak_number" --nomenclature-file partitions.tsv
3. routine cgMLST-based surveillance + dynamic zoom-in for clusters of interest
   python reportree.py -a results_alleles_missing_code_0.tsv -m metadata.tsv -out ReporTree -l cgMLST_Legionella_pneumophila_LIT.txt --loci-called 0.95 --analysis HC --HC-threshold single-100,single-50,single-20,single-10,single-6,single-1 --columns_summary_report "ST,outbreak_number" --nomenclature-file partitions.tsv --sample_of_interest "sampleA" --zoom-cluster-of-interest single-6 --site-inclusion 1.0
4. routine cgMLST-based surveillance + dynamic zoom-in for all clusters
   python reportree.py -a results_alleles_missing_code_0.tsv -m metadata.tsv -out ReporTree -l cgMLST_Legionella_pneumophila_LIT.txt --loci-called 0.95 --analysis HC --HC-threshold single-100,single-50,single-20,single-10,single-6,single-1 --columns_summary_report "ST,outbreak_number" --nomenclature-file partitions.tsv --zoom-all single-6 --site-inclusion 1.0

Citation
If you use this schema or the provided datasets in your work, please cite:
Verónica Mixão, Christophe Ginevra, Camille Jacqueline, Sophie Jarraud, Marco Gabrielli, João Paulo Gomes, Melisa J. Willby, Jennafer A. P. Hamlin, Vítor Borges (2026) Advancing Legionella pneumophila genomic surveillance with a high-resolution cg/wgMLST schema for outbreak detection and investigation. medRxiv. doi: https://doi.org/10.64898/2026.02.18.26346554.
This repository

Funding
VM contribution was funded by national funds through FCT - Foundation for Science and Technology, I.P., in the frame of Individual CEEC 2022.00851.CEECIND/CP1748/CT0001 (doi: 10.54499/2022.00851.CEECIND/CP1748/CT0001). VB contribution was supported by the European Union project "Sustainable use and integration of enhanced infrastructure into routine genome-based surveillance and outbreak investigation activities in Portugal" - GENEO [101113460] on behalf of the EU4H programme [EU4H-2022-DGA-MS-IBA-01-02], and by the DURABLE "Research Network against Epidemics" project. DURABLE is co-funded by The European Commission Union under the EU4Health Programme (EU4H) [101102733]. HCL and INSA are designated as European Reference Laboratory for Public Health on Legionella (EURL-PH-LEGI). This publication is funded by the EU4Health programme under grant agreement 101194818, as part of the project EURL-PH-LEGI. However, views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Health and Digital Executive Agency. Neither the European Union nor the granting authority can be held responsible.

CDC disclaimer: The use of trade names is for identification only. It does not constitute endorsement by the U.S. Department of Health and Human Services, the U.S. Public Health Service, or the Centers for Disease Control and Prevention. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
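Returning to the usage note on "--loci-called": the recommendation to cluster only isolates with at least 95% of the cgMLST loci called can be previewed before running ReporTree with a short Python sketch. ReporTree applies its own filter, and its exact calculation may differ from this one. The sketch assumes a 0-coded, tab-separated matrix with the sample name in the first column and a set of cgMLST locus names (for example, read from cgMLST_Legionella_pneumophila_LIT.txt, assuming one locus per line). The matrix, locus names and sample names below are made up.

```python
import csv
import io


def fraction_called(alleles_handle, cg_loci):
    """Return {sample: fraction of cgMLST loci with a call other than "0"}."""
    reader = csv.DictReader(alleles_handle, delimiter="\t")
    sample_col = reader.fieldnames[0]
    # Only loci that belong to the cgMLST list are considered.
    loci = [locus for locus in reader.fieldnames[1:] if locus in cg_loci]
    if not loci:
        raise ValueError("none of the cgMLST loci were found in the matrix header")
    result = {}
    for row in reader:
        called = sum(1 for locus in loci if row[locus] != "0")
        result[row[sample_col]] = called / len(loci)
    return result


# Made-up 0-coded matrix; locusD is treated as an accessory locus.
matrix = "FILE\tlocusA\tlocusB\tlocusC\tlocusD\ns1\t1\t2\t3\t0\ns2\t1\t0\t0\t5\n"
cg_loci = {"locusA", "locusB", "locusC"}

fractions = fraction_called(io.StringIO(matrix), cg_loci)
keep = [sample for sample, frac in fractions.items() if frac >= 0.95]
print(fractions)
print("Samples passing the 95% threshold:", keep)
```

In this toy example, s1 has all three cgMLST loci called and passes, while s2 has one of three called and would be excluded from clustering.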