Lipophilicity is a fundamental physicochemical property that strongly influences drug behavior, including solubility, permeability, metabolism, distribution, protein binding, and excretion. Accurate prediction of this property is therefore critical for the discovery and development of new drug candidates. The classical metric for lipophilicity is logP, the logarithm of the partition coefficient of a compound between n-octanol and water. Recently, graph-based deep learning methods have gained considerable attention and demonstrated strong performance across diverse drug discovery tasks, from molecular property prediction to virtual screening. These models learn informative representations directly from molecular graphs in an end-to-end manner, without the need for handcrafted descriptors. In this work, we propose GraphormerLogP, a logP prediction approach based on fine-tuning the pre-trained GraphormerMapper model. The model was evaluated on two datasets: the first, named GLP, was compiled by us from publicly available sources and contains 42 006 unique SMILES-logP pairs; the second consists of 13 688 molecules and is used for benchmarking. Our comparative analysis against state-of-the-art models (Random Forest, Chemprop, CheMeleon, StructGNN, and Attentive FP) demonstrates that GraphormerLogP consistently achieves competitive or superior predictive accuracy on both datasets, with mean absolute error (MAE) values of 0.251 and 0.269, respectively. The GLP dataset is available in the GitHub repository https://github.com/cimm-kzn/GraphormerLogP/tree/main/data.

Scientific contribution
This paper presents two key scientific contributions. First, we have collected and carefully curated a large and diverse dataset of molecules with measured logP values, comprising over 42 000 compounds. Second, we propose a Graphormer-based model with a task-specific fine-tuning architecture for logP prediction, tailored to leverage representations learned from reaction data. This model demonstrates high performance in benchmark studies on both established literature data and the newly compiled dataset.
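
For readers unfamiliar with the reported metric, the mean absolute error is the average of the absolute differences between measured and predicted logP values, expressed in log units. The following is a minimal sketch of that calculation, not the evaluation code used in this work; the function name `mean_absolute_error` and the example values are assumptions made purely for illustration and are not taken from the GLP or benchmark datasets.

```python
from statistics import fmean


def mean_absolute_error(measured, predicted):
    """Average of |measured - predicted| over paired logP values."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    if not measured:
        raise ValueError("at least one pair is required")
    return fmean(abs(m - p) for m, p in zip(measured, predicted))


# Hypothetical values for illustration only (not from GLP or the benchmark set).
measured_logp = [1.50, 2.00, -0.50, 3.25]
predicted_logp = [1.75, 1.50, -0.25, 3.25]

mae = mean_absolute_error(measured_logp, predicted_logp)
print(f"MAE = {mae:.3f} log units")  # prints "MAE = 0.250 log units"
```

On this scale, an MAE of about 0.25 means that predictions deviate from the measured logP by roughly a quarter of a log unit on average.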