grp-bork/gunc: v1.1.0

Authors

Luís Pedro Coelho · Queensland University of Technology

Abstract

v1.0.7

Summary
^^^^^^^

This release adds support for two new reference databases (ProGenomes 3, GTDB r214) and a custom database option. A new gunc check subcommand validates your environment before you submit a long job, and gunc rescore is introduced as a clearer name for gunc summarise. In addition, a test_data database type has been added; it comprises a minimal test set (sample, db, taxonomy) that can be used in CI/CD pipelines. A warning is now emitted when genomes have low reference representation scores. Packaging has been modernised to pyproject.toml and the CI pipeline updated.

Features
^^^^^^^^

* Added support for progenomes_3 and gtdb_214 reference databases.
* Added support for the test_data set, a minimal set of data that can be used in CI/CD pipelines.
* Added --custom_genome2taxonomy option to allow use of a custom reference database.
* Diamond version pinned to 2.1.24; enforced at startup with a clear error message. Set GUNC_SKIP_DIAMOND_VERSION_CHECK=1 to bypass.
* Added test_data option to gunc download_db (--db test_data): downloads a minimal diamond database and two test genomes (chimeric and clean) that can be used to verify a GUNC installation end-to-end.
* Added gunc rescore as the preferred name for the summarise subcommand; gunc summarise remains as a backward-compatible alias.
* Added gunc check subcommand to validate the environment (tool dependencies, database file, custom genome-to-taxonomy TSV format, output directory write access) without running the pipeline.
* All subcommands (run, plot, merge_checkm, summarise) now log the output file path on completion.
* The --file_suffix error message now suggests the correct flag usage when no files are found.
* Fixed metavar="\\b" hack in summarise argparse definitions; replaced with meaningful placeholders (FILE, DIR, FLOAT).
* Documentation: added gunc summarise section with worked example; fixed --file_suffix incorrectly listed as required; fixed --gunc_file help referencing gunc_scores.tsv (actual filename is GUNC.{db}.maxCSS_level.tsv); added --custom_genome2taxonomy file format spec; added output column definitions table; updated DB names to underscore convention throughout.

Bugfixes
^^^^^^^^

* Fixed summarise subcommand incorrectly marking all genomes as passing GUNC.
* Fixed pass.GUNC column being silently converted to strings in output TSV; summarise now uses proper NaN detection instead of string comparison.
* Fixed summarise not rescoring genomes with boolean False in pass.GUNC; previously only the string "False" was matched, so boolean values (the normal case) were silently skipped.
* Fixed genome identity corruption in split_diamond_output when contig names contain /; now uses rsplit to always extract the genome name from the last path segment.
* Fixed DB detection logic duplicated across three code paths with subtly different ordering; extracted into a single detect_db_from_filename() function.
* Fixed prodigal() leaving partial output files on disk when gene calling fails; partial files are now removed so the caller's size check correctly excludes failed genomes.
* Fixed extract_node_data() in visualisation missing colour entries for class and order tax levels, causing a KeyError when non-default --tax_levels are used.
* Extracted plot=True path from chim_score() into a dedicated get_base_data_for_plotting() function; chim_score() now has a single consistent return type.
* Fixed empty diamond output files not being named correctly when a genome fails to map (thanks to @pamelaferretti).
* Fixed edge case where the contamination score was incorrectly calculated when the contamination portion was NaN.
* Fixed crash when no genes were called or mapped to the reference database.
* Fixed shell injection risk in get_record_count_in_fasta.

Other
^^^^^

* Removed versioneer; version is now statically set.
* Fixed 8 flake8 errors: import ordering in get_scores.py and visualisation.py, trailing whitespace in gunc.py, spurious f-string prefixes in gunc_database.py.
* Extracted CSS_CHIMERIC_THRESHOLD = 0.45 and TAX_LEVELS as named constants in get_scores.py; replaced all three scattered hardcoded copies of the threshold and tax level list across gunc.py, checkm_merge.py, and visualisation.py.
* Fixed all sys.exit(string) calls in visualisation.py and get_scores.py to use logger.error() + sys.exit(1), consistent with the rest of the codebase; added module-level logger to get_scores.py.
* Fixed add_empty_diamond_output() using print() for progress output; now uses logger.info().
* Fixed check_diamond_version() using shell=True; now uses list-form subprocess call.
* Added guard against empty gunc_output list before pd.concat() in run_gunc() to give a clear error instead of a cryptic ValueError.
* Reference data files renamed to reflect database version (e.g. genome2taxonomy_pg2.1ref.tsv).
* Documentation updated: diamond version, all four database options, --custom_genome2taxonomy flag.
* Migrated packaging from setup.py + setup.cfg + MANIFEST.in + requirements.txt to a single pyproject.toml (PEP 621); fixed package_data paths, license field (GPLv3), dropped universal=1, and added minimum version pins for numpy (>=1.20), scipy (>=1.7), and plotly (>=5.0).
* Replaced all from module import * in test files with explicit named imports; marked network-dependent tests in test_gunc_database.py with @pytest.mark.integration; added conftest.py registering the integration marker.
* Added tests for summarise(), get_scores_using_supplied_cont_cutoff(), read_genome2taxonomy_reference() (all 4 DBs + custom + unknown), split_diamond_output() round-trip, and detect_db_from_filename().

New Contributors
^^^^^^^^^^^^^^^^

* @pamelaferretti made their first contribution in https://github.com/grp-bork/gunc/pull/53

Full Changelog: https://github.com/grp-bork/gunc/compare/v1.0.6...v1.1.0
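The pinned Diamond check described in the release notes (version 2.1.24, enforced at startup, bypassed with GUNC_SKIP_DIAMOND_VERSION_CHECK=1, and run through a list-form subprocess call rather than shell=True) could look roughly like the sketch below. This is not GUNC's actual implementation: only the function name check_diamond_version(), the version number, and the environment variable come from the notes. The helper parse_diamond_version(), the injectable `run` parameter, the error wording, and the assumption that `diamond --version` prints a line containing an X.Y.Z version are all illustrative.

```python
import os
import re
import subprocess
from typing import Callable, Optional

# Version and variable name are taken from the release notes;
# everything else in this sketch is an assumption.
PINNED_DIAMOND_VERSION = "2.1.24"
SKIP_ENV_VAR = "GUNC_SKIP_DIAMOND_VERSION_CHECK"


def parse_diamond_version(output: str) -> Optional[str]:
    """Return the first X.Y.Z version string found in `output`, or None."""
    match = re.search(r"\b(\d+\.\d+\.\d+)\b", output)
    return match.group(1) if match else None


def check_diamond_version(
    expected: str = PINNED_DIAMOND_VERSION,
    run: Callable = subprocess.run,  # hypothetical hook so the check can be tested
) -> bool:
    """Return True if the check passes or is bypassed; raise RuntimeError otherwise."""
    if os.environ.get(SKIP_ENV_VAR) == "1":
        return True  # explicit bypass requested by the user

    try:
        # List-form arguments: no shell is spawned, so no command string
        # is ever assembled from external input.
        result = run(
            ["diamond", "--version"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        raise RuntimeError(
            "Could not run 'diamond --version'; is diamond on PATH?"
        ) from exc

    found = parse_diamond_version(result.stdout)
    if found != expected:
        raise RuntimeError(
            f"diamond {expected} is required but "
            f"{found or 'an unknown version'} was found. "
            f"Set {SKIP_ENV_VAR}=1 to bypass this check."
        )
    return True
```

Calling check_diamond_version() once at program start gives a single early failure with an actionable message, instead of a less obvious failure partway through a long run.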

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.18982033
