This study aims to evaluate the effect of latent diffusion models on molecular representation learning from the perspective of generalization performance in molecular property prediction. To this end, we formulate a deep generative model for molecular representation learning based on a latent diffusion-based prior distribution, and introduce a methodology for evaluating the generalization of learned molecular representations using the widely applicable information criterion (WAIC) and the widely applicable Bayesian information criterion (WBIC). Furthermore, we propose an analysis framework based on smoothness and multimodality to analyze the factors underlying generalization in molecular representations. We construct the graph latent diffusion autoencoder (Graph LDA), a deep molecular generative model that combines a transformer-based graph variational autoencoder with a latent-diffusion-based prior distribution and learns graph-level molecular representations in an unsupervised manner. We compare the generalization performance of Graph LDA with that of other molecular representation learning models using WBIC and WAIC across multiple molecular properties, including HOMO energy, solubility, and biological activities. The results demonstrate that molecular representations learned by different models exhibit distinct generalization behaviors, and that representations learned by Graph LDA, which uses a latent diffusion-based prior, consistently show improved generalization in molecular property prediction. Using our proposed framework, we empirically demonstrate that the superior generalization performance of Graph LDA is attributable to the smoothness and multimodality of its learned molecular latent representation. These findings provide a principled understanding of the role of latent diffusion-based molecular representation learning in improving generalization performance.
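To make the evaluation criterion concrete: WAIC is a standard quantity computed from the pointwise log-likelihoods of posterior samples (Watanabe's formulation, as popularized in Bayesian model evaluation). The sketch below is not the paper's implementation; it is a minimal, generic computation assuming a `log_lik` array of per-sample, per-datapoint log-likelihoods is available from the fitted property-prediction model.

```python
import numpy as np

def waic(log_lik):
    """Widely applicable information criterion (lower is better).

    log_lik: array of shape (S, N) holding log p(y_i | theta_s)
    for S posterior samples and N data points.
    """
    S = log_lik.shape[0]
    # Log pointwise predictive density: log of the posterior-mean likelihood,
    # computed stably in log space via logaddexp.
    lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(S))
    # Effective number of parameters: pointwise posterior variance
    # of the log-likelihood, summed over data points.
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    # Deviance-scale WAIC.
    return -2.0 * (lppd - p_waic)
```

WBIC is computed differently (as the posterior mean of the full-data negative log-likelihood at inverse temperature 1/log n) and requires a tempered posterior, so it is not shown here.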
Scientific contribution: This work systematically analyzes the effect of latent diffusion-based priors in molecular representation learning from the perspective of generalization performance in molecular property prediction. Through generalization evaluation using WBIC and WAIC, together with an analysis framework for molecular representations, it is empirically demonstrated that latent diffusion-based priors help deep generative models extract smooth and multimodal latent representations, which in turn lead to enhanced generalization performance of molecular representations. These findings offer a principled guideline for developing molecular representation learning models with high generalization.