Advances in QSAR are led by two core paradigms: 1) descriptor engineering, where complex fixed-length vectors are generated for compounds and conventional ML methods are applied to those representations, and 2) raw chemical inputs (e.g., SMILES, 2D graphs) provided to deep learning neural network models, which construct their own internal representations of molecules and learn iteratively over them. Here we present the Tsetlin Machine (TM), which combines the accuracy and ease of use of existing rule-based QSAR ML methods (e.g., RF and XGBoost) with the iterative learning of NN algorithms and intrinsic interpretability. The TM uses teams of finite-state automata which capture frequent patterns as propositional logic (clauses) via reinforcement learning. The benchmarking pipeline presented here demonstrates that TM-QSAR coupled with ECFP4 descriptors frequently performs better than existing rule-based QSAR methods on ROC-AUC, PRC-AUC and PPV, with a high capacity for inter-scaffold generalisation. However, because TM-QSAR operates on binary inputs, its performance is currently limited when discretised continuous descriptors are used. TM-QSAR demonstrated particularly strong classification scores for MOR (ROC-AUC = 0.87, PRC-AUC = 0.77) and CYPA4 (ROC-AUC = 0.92, PRC-AUC = 0.63) when compared to RF and XGBoost. Using the TM in combination with substructural fingerprint descriptors allows an interpretability suite to be extracted directly from the clauses. Here we detail molecule property maps (TM-MPM), which show atom-wise TM-QSAR bioactivity contributions for single molecules, and closed-form WAC scores (Weights Activations Clauses), which give descriptor-wise contributions to regions of predicted chemical space. These methods show strong alignment of TM-QSAR interpretations with known ligand-protein interactions of the MOR target and give non-linear, conditional interpretations for greater predicted bioactivity. Given this combination of accuracy, computational efficiency and interpretability, we provide a basis for TM-QSAR to be explored as a standard methodology in virtual screening toolkits.
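To make the clause idea concrete, the following minimal sketch shows how a set of weighted conjunctive clauses could score a binary fingerprint at inference time. It is an illustration under stated assumptions, not the authors' implementation: the names `clause_fires`, `class_score` and `predict_active`, the 8-bit toy fingerprints, the decision threshold of zero, and the hand-picked clauses and weights are all hypothetical, and no learning step is shown.

```python
from typing import List, Sequence, Tuple

# A literal is (bit_index, negated); a clause is a conjunction (AND) of literals.
# These type names are illustrative assumptions, not part of any TM library.
Literal = Tuple[int, bool]
Clause = List[Literal]


def clause_fires(clause: Clause, fingerprint: Sequence[int]) -> bool:
    """Return True only if every literal in the clause is satisfied."""
    for bit_index, negated in clause:
        bit_is_set = fingerprint[bit_index] == 1
        # A plain literal needs the bit set; a negated literal needs it unset.
        if bit_is_set == negated:
            return False
    return True


def class_score(clauses: Sequence[Clause], weights: Sequence[int],
                fingerprint: Sequence[int]) -> int:
    """Sum the signed weights of all clauses that fire on this fingerprint."""
    return sum(w for c, w in zip(clauses, weights) if clause_fires(c, fingerprint))


def predict_active(clauses: Sequence[Clause], weights: Sequence[int],
                   fingerprint: Sequence[int]) -> bool:
    """Classify as active when the weighted vote is positive (assumed threshold)."""
    return class_score(clauses, weights, fingerprint) > 0


# Hand-picked toy clauses over an 8-bit fingerprint (hypothetical values).
toy_clauses: List[Clause] = [
    [(0, False), (3, False)],  # bit 0 AND bit 3
    [(1, False), (5, True)],   # bit 1 AND NOT bit 5
    [(6, False)],              # bit 6
]
toy_weights = [2, 1, -3]       # positive votes for "active", negative against

molecule_a = [1, 0, 0, 1, 0, 0, 0, 0]
molecule_b = [0, 1, 0, 0, 0, 1, 1, 0]

score_a = class_score(toy_clauses, toy_weights, molecule_a)
score_b = class_score(toy_clauses, toy_weights, molecule_b)
```

Because each clause is an explicit AND over fingerprint bits, the clauses that fire for a given molecule can be read directly as conditional rules over substructure bits, which is the property that clause-based interpretation relies on.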