Under REACH, mutagenicity assessment relies on in vitro gene mutation studies in bacteria, and on in vitro chromosome aberration/micronucleus and gene mutation studies in mammalian cells, before in vivo testing is conducted if necessary. This study examined whether the inherent correlation between different assays could be leveraged to develop multitask in silico models, and whether their predictive performance was significantly better than that of models built for each individual assay (single task). To that end, a significant effort was made to compile as extensive a genotoxicity dataset as possible. The compiled dataset comprised over 12,000 substances, including algorithmically curated REACH data and information from several public sources. A range of models was then investigated, from traditional machine learning techniques using chemical fingerprints to deep learning methods using graphs for molecular structure representation. Sixty-four deep learning models for gene mutation, chromosomal aberration and micronucleus assays in mammalian cells, as well as gene mutation in bacteria, were developed. Deep learning methods achieved cross-validation test balanced accuracy that was on average 4% higher than traditional machine learning, with the improvement reaching 8% for gene mutation detection for specific bacterial strains and metabolic activation. Deep learning methods exhibited cross-validation test balanced accuracy ranging from 72% for in vitro assays in mammalian cells to over 93% for gene mutation detection in specific bacterial strains and metabolic activation. Genotoxicity information was also retrieved from ToxValDB and other literature sources and similarly curated to construct external (hold-out) test sets for a stringent assessment of the models’ generalised performance. 
External test set balanced accuracy ranged from 64% to 78% depending on the endpoint and deep learning architecture, for test sets with at least 200 positives and 200 negatives. Multitask models had on average 8% higher cross-validation test balanced accuracy than single task models for gene mutation in bacteria, across all model architectures, when modelled at the strain/metabolic activation level during cross-validation, but the two were comparable when assay outcomes were aggregated and during external validation. Graph and fingerprint-based deep learning methods performed comparably, with the former being marginally better for the assays with the largest training sets. The dimensionality-reduced molecular embeddings from graph neural network models were analysed to assess their ability to distinguish positives from negatives and to cluster structures that trigger known genotoxicity alerts. The models were also used to identify structural moieties linked to predicted negative genotoxicity in bacteria and positive genotoxicity in mammalian cells.
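All performance figures above are reported as balanced accuracy. As a reminder of what that metric measures, the following minimal sketch computes it as the mean of sensitivity and specificity for a binary assay outcome. The function name `balanced_accuracy` and the toy labels are illustrative assumptions and do not come from the study's own code or data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels.

    Assumed encoding (illustrative): 1 = genotoxic (positive), 0 = negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("balanced accuracy needs at least one positive and one negative")

    sensitivity = tp / positives  # fraction of positives correctly identified
    specificity = tn / negatives  # fraction of negatives correctly identified
    return (sensitivity + specificity) / 2


# Toy imbalanced example: 2 positives, 8 negatives, one positive missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, which illustrates why the metric is informative when positives and negatives are unevenly represented, as is common in genotoxicity datasets.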