The Tsetlin Machine: A “Third Way” in QSAR Modelling

2026 · 0 citations · Journal Article · Green Open Access

Authors

Paul F. A. Clarke · University of Agder

Ivan Čmelo · National Computational Infrastructure

Runar Helin · University of Agder

Mayur Kishor Shende · University of Agder

Ole‐Christoffer Granmo · University of Agder

Darren Fayne · International Life Sciences Institute

Abstract

Advances in QSAR are led by two core paradigms: 1) descriptor engineering, where complex fixed-length vectors are generated for compounds and conventional ML methods are applied to those representations, and 2) raw chemical inputs (e.g., SMILES, 2D graphs) provided to deep learning neural network models, which construct their own internal representations of molecules and learn iteratively over them. Here we present the Tsetlin Machine (TM), which combines the accuracy and ease of use of existing rule-based QSAR ML methods (e.g., RF and XGBoost) with the iterative learning of NN algorithms and intrinsic interpretability. The TM uses teams of finite-state automata which capture frequent patterns as propositional logic (clauses) via reinforcement learning. The benchmarking pipeline presented here demonstrates that TM-QSAR coupled with ECFP4 descriptors frequently performs better than existing rule-based QSAR methods on ROC-AUC, PRC-AUC and PPV, with a high capacity for inter-scaffold generalisation. However, due to the binary nature of TM-QSAR, its performance is currently limited when discretised continuous descriptors are used. TM-QSAR demonstrated particularly impressive classification scores for MOR (ROC-AUC = 0.87, PRC-AUC = 0.77) and CYPA4 (ROC-AUC = 0.92, PRC-AUC = 0.63) when compared to RF and XGBoost. Using TM in combination with substructural fingerprinting descriptors allows for an interpretability suite which can be extracted directly from clauses. Here we detail molecule property maps (TM-MPM) to view atom-wise TM-QSAR bioactivity contributions for single molecules, and closed-form WAC scores (Weights Activations Clauses) for descriptor-wise contributions to regions of predicted chemical space. These methods show strong alignment of TM-QSAR interpretations with known ligand-protein interactions of the MOR target and give non-linear, conditional interpretations for greater predicted bioactivity. Given this combination of accuracy, computational efficiency and interpretability, we provide a basis for TM-QSAR to be explored as a standard methodology in virtual screening toolkits.
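
To make the clause idea in the abstract concrete, the following is a minimal sketch of how a generic Tsetlin Machine could score a binary fingerprint at inference time. It is not the authors' TM-QSAR implementation. The names `clause_output`, `class_sum` and `predict`, the toy 8-bit vector and the hand-written clauses are all hypothetical, and the automata-based learning step and any clause weighting are omitted.

```python
from typing import List, Tuple

# A clause is a conjunction of literals. Each literal is (bit_index, negated):
# (3, False) requires fingerprint bit 3 to be set,
# (7, True) requires fingerprint bit 7 to be unset.
Literal = Tuple[int, bool]
Clause = List[Literal]


def clause_output(clause: Clause, bits: List[int]) -> int:
    """Return 1 if every literal in the clause holds for the bit vector, else 0."""
    return int(all((bits[i] == 0) if negated else (bits[i] == 1)
                   for i, negated in clause))


def class_sum(positive: List[Clause], negative: List[Clause], bits: List[int]) -> int:
    """Votes from positive-polarity clauses minus votes from negative-polarity ones."""
    votes_for = sum(clause_output(c, bits) for c in positive)
    votes_against = sum(clause_output(c, bits) for c in negative)
    return votes_for - votes_against


def predict(positive: List[Clause], negative: List[Clause], bits: List[int]) -> int:
    """Predict the active class (1) when the vote sum is non-negative."""
    return int(class_sum(positive, negative, bits) >= 0)


# Toy 8-bit "fingerprint" (illustrative values, not real ECFP4 output).
fingerprint = [1, 0, 1, 1, 0, 0, 0, 1]

# Hand-written clauses standing in for what the automata would learn.
positive_clauses = [[(0, False), (2, False)], [(3, False), (4, True)]]
negative_clauses = [[(1, False)], [(7, False), (5, True)]]

score = class_sum(positive_clauses, negative_clauses, fingerprint)
label = predict(positive_clauses, negative_clauses, fingerprint)
```

Because each clause is a plain AND over fingerprint bits and their negations, a firing clause can be read directly as a rule such as "bit 3 present and bit 4 absent". This readability is the property that clause-based interpretation relies on.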

Topics & Keywords

Digital Image Processing Techniques · Optimization and Search Problems · Fuzzy and Soft Set Theory

Publication Details

Published in: ChemRxiv

DOI: 10.26434/chemrxiv-2026-dsz7v/v2

Field-Weighted Citation Impact: 0.00
