AI-driven predictive analytics is transforming clinical pharmacology by enhancing precision and integrating high-dimensional data. Insights from a recent AI in Clinical Pharmacology meeting, held in April 2025, underscored one critical challenge among others: the lack of robust, standardized benchmarking datasets and evaluation tasks that reflect the complexities of real-world clinical data. This perspective addresses this challenge and proposes a roadmap for developing robust datasets and metrics to advance the use of AI in pharmacometrics and systems pharmacology.

AI offers transformative potential for clinical pharmacology, particularly through its applications in predictive modeling [1]. These applications promise to enhance the accuracy of drug response predictions, optimize clinical trial design, and support individualized treatment decisions [2]. Although AI stands to improve many aspects of clinical pharmacology, including predictive modeling as well as research and operational efficiency, this perspective focuses primarily on predictive modeling applications and the challenges related to standardization.

A key advantage of AI, specifically machine learning (ML), in clinical pharmacology is its ability to train predictive models on high-dimensional data, such as medical imaging and multi-omics data collected from patients during clinical trials [2, 3]. This capability, often missing in traditional statistical and mechanistic approaches, can enhance the accuracy of treatment response predictions. Furthermore, AI/ML enables the use of high-dimensional and complex real-world data, such as information from wearable devices, enhancing our understanding of a drug's effectiveness in specific disease conditions [2].

Realizing these benefits requires addressing critical challenges, particularly the lack of widely accepted standards and reference datasets for evaluating newly proposed algorithms [4]. In this regard, restricted access to realistic clinical data poses a significant barrier. These challenges undermine confidence in newly proposed model architectures and, especially, in their broader application. In addition, while regulatory bodies have acknowledged the potential of AI applications, adequate validation methods remain a key requirement for acceptance. Appropriate evaluation and benchmarking of AI algorithms are therefore essential for determining which AI approaches can reliably contribute to clinical decision making. This perspective outlines the current challenges and advocates for a database of comprehensive and realistic benchmarking datasets. It describes the advantages of such a database and its impact on diverse stakeholders, emphasizing the importance of interdisciplinary collaboration to fully harness AI's potential in the field.

As clinical pharmacology practitioners implementing AI methodologies, we consistently encounter four fundamental challenges when evaluating approaches for specific applications. First, methodological papers often use diverse performance metrics tailored to different modeling objectives, such as interpolation, extrapolation, or synthetic subject simulation, making meaningful comparisons difficult. Second, we must assess whether the approach taken to validate each new model architecture offers a correct and reliable representation of model performance.
Critically, there remains a significant gap between performance on synthetic or simplified data and performance on real-world clinical datasets, whose complexity and variability often far exceed those of synthetic data. This gap diminishes trust in new AI methodologies, especially given the practice of "self-benchmarking", in which models are assessed on proprietary, often synthetically generated datasets. These datasets tend to gloss over common clinical data challenges, such as missingness, irregular sampling and dosing, and outliers, which can lead to overly optimistic assessments of model performance. As a result, when tested on realistic data, which is often unavailable to external developers, model performance frequently falls short. Figure 1 highlights the key differences between synthetic and real data. We follow the definition of synthetic data given in the US Food and Drug Administration (FDA) glossary on Digital Health and Artificial Intelligence (https://www.fda.gov/science-research/artificial-intelligence-and-medical-products/fda-digital-health-and-artificial-intelligence-glossary-educational-resource) and conceptualized in a recent literature review by Pasculli et al. [5].

The third fundamental challenge concerns discrepancies between the model described in a publication and the one that was actually implemented. This is exacerbated by the unavailability of model code, or by issues with the reproducibility of results based on the shared code, as exemplified by Chung et al. for the blood coagulation network [6]. Such issues make meaningful comparisons difficult and hinder building upon previous work in the field.

The final fundamental challenge is that evaluation protocols diverge across modeling paradigms. In data-driven ML workflows, it is standard to use nested cross-validation followed by a hold-out test set, with performance metrics (e.g., RMSE, AUROC) used to evaluate how well the model predicts unseen (i.e., out-of-sample) data. In contrast, traditional non-linear mixed effects (NLME) and ML-augmented NLME models typically rely on in-sample graphical and simulation-based diagnostics (e.g., goodness-of-fit plots, VPC/pcVPC, NPDE), all computed on the same dataset used for model estimation. In this context, in-sample refers to diagnostics performed on the same data used to build the model, whereas out-of-sample refers to evaluating model performance on new, unseen data not used during model training. To ensure methodologically comparable assessments of model performance, we advocate out-of-sample evaluation for all frameworks and the exclusive reporting of metrics calculated on unseen data. Evaluation on unseen data, especially for data-driven frameworks, is necessary because it can expose overfitting and indicate whether the model generalizes beyond its training dataset. Performance on out-of-distribution data should likewise be an important aspect of model evaluation. Only under the same evaluation methodology can ML, NLME, and ML-augmented NLME approaches be compared rigorously and fairly.
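To make the out-of-sample workflow concrete, the following is a minimal sketch using scikit-learn on simulated placeholder data. The data, the model, and the hyperparameter grid are our own illustrative assumptions, not part of any proposed benchmark: nested cross-validation tunes and evaluates on the training portion, and a single held-out test set provides the final out-of-sample metric.

```python
# Minimal sketch of out-of-sample evaluation: nested cross-validation for
# hyperparameter selection, then one final assessment on a held-out test set.
# X and y are simulated placeholders; in practice X would hold covariates and
# y an observed PK/PD endpoint.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))               # hypothetical covariates
y = 2.0 * X[:, 0] + rng.normal(size=300)     # hypothetical endpoint

# Hold out a test set that is never touched during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    cv=inner_cv,
    scoring="neg_root_mean_squared_error",
)
nested_rmse = -cross_val_score(search, X_train, y_train, cv=outer_cv,
                               scoring="neg_root_mean_squared_error")
print(f"Nested CV RMSE: {nested_rmse.mean():.2f} +/- {nested_rmse.std():.2f}")

# Final model refit on all training data; performance is reported exclusively
# on the unseen test set, as advocated above.
search.fit(X_train, y_train)
rmse_test = mean_squared_error(y_test, search.predict(X_test)) ** 0.5
print(f"Held-out test RMSE: {rmse_test:.2f}")
```

The same protocol, replacing the regressor with an NLME or ML-augmented NLME workflow, would put all three paradigms on an equal evaluative footing.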
Difficulties in assessing the suitability of different methodologies are highly consequential in clinical pharmacology. This issue is intensified by the fast-paced environment typical of clinical projects, where the analyst's primary responsibility is often to deliver timely answers. Meeting strict deadlines incentivizes analysts to select conservative, well-established methodologies. Any additional uncertainty in the performance of a new technique pushes decisions away from potentially superior but less familiar solutions and toward simpler, more predictable methods. This preference for tried-and-tested approaches inadvertently limits innovation and slows the adoption of advanced methodologies. While we recognize and support a certain conservatism in a discipline where patient safety is paramount, our goal is to mitigate unnecessary uncertainty. Doing so will enable analysts to confidently explore more advanced methodologies and probe their strengths and weaknesses, which could ultimately enhance outcomes for specific projects and accelerate progress within the broader clinical pharmacology field.

Recently, Sale and Liang proposed an annual benchmarking exercise for machine learning in pharmacometrics [7]. While our proposal shares similar goals, it extends their framework by broadening the scope beyond PK/PD model selection and periodic assessments. Our vision is to establish a repository of tasks and datasets that can also be used during the development of novel methods, much as ImageNet is used in computer vision [8]. These approaches could work complementarily to create a comprehensive ecosystem for evaluating AI applications in clinical pharmacology.

Integrating AI into clinical pharmacology necessitates clearly defined approaches for evaluating and benchmarking newly proposed models against established state-of-the-art techniques. Effective validation methodologies involve defining specific and meaningful evaluation criteria tailored to clinical applications. To test algorithms in a standardized manner, we see the need to build a repository of realistic, public benchmarking datasets crafted with careful attention to real-world scenarios. Because data privacy restricts the sharing of real clinical data, incorporating synthetic data that emulates different clinically relevant scenarios would be highly valuable. Equally important is a detailed and accurate description of the methodologies used: proposed methodologies should be described thoroughly enough to allow others to replicate them and obtain comparable outcomes. The public release of model code and evaluation pipelines can further support this effort. Establishing such methodologies allows for more consistent and reliable AI models and assessments, providing a foundation for clinical adoption.

We therefore propose a publicly available repository of benchmarking datasets. These datasets will focus on, but not be limited to, typical tasks in pharmacometrics, such as population PK and PK/PD analysis. The repository would include both real and synthetic datasets. For synthetic datasets, we call for careful creation and curation so that they mirror realistic clinical data more closely, including missingness, inconsistencies, and other real-world features. For instance, these datasets should incorporate irregular sampling intervals, missed doses, and data outliers. Evaluation metrics for these benchmarking datasets can be standardized so that newer AI models can be compared objectively. For example, in a benchmarking PK dataset, ML models could be evaluated on capturing crucial aspects such as Cmax (the maximum concentration of a drug in the body), covariate relationships, or specific cut-offs, allowing practitioners to make informed decisions when selecting or investing in models based on their alignment with specific clinical questions.
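As a concrete illustration of both points, the sketch below simulates concentration-time profiles from a one-compartment oral-absorption model, injects the kinds of artifacts a realistic benchmark should contain (irregular sampling, missing observations, missed doses, and gross outliers), and scores a naive reference prediction on Cmax recovery. All parameter values, artifact rates, and the metric itself are illustrative assumptions on our part, not a proposed standard.

```python
# Sketch: build a synthetic PK dataset with realistic artifacts, then score a
# prediction on a standardized Cmax-recovery metric. Values are illustrative.
import numpy as np

rng = np.random.default_rng(42)
dose, ka, V = 100.0, 1.2, 30.0  # one-compartment oral model (assumed units)

observed = {}
for sid in range(50):
    ke = rng.lognormal(np.log(0.15), 0.3)         # between-subject variability
    f = 0.0 if rng.random() < 0.05 else 1.0       # ~5% of subjects miss the dose
    # Irregular sampling: jitter nominal times, then drop some visits entirely.
    t = np.sort(np.array([0.5, 1, 2, 4, 8, 12, 24]) * rng.uniform(0.8, 1.2, 7))
    t = t[rng.random(t.size) > 0.15]              # ~15% missing observations
    c = f * dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))
    c *= np.exp(rng.normal(0.0, 0.2, c.size))     # multiplicative residual error
    mask = rng.random(c.size) < 0.03              # ~3% gross outliers
    c[mask] *= rng.uniform(3.0, 10.0, mask.sum())
    observed[sid] = (t, c)

def cmax_relative_error(observed, predicted):
    """Mean relative Cmax error across subjects; `predicted` maps subject id
    to concentrations aligned with that subject's observed sampling times."""
    errs = [abs(predicted[s].max() - c.max()) / c.max()
            for s, (t, c) in observed.items() if c.size and c.max() > 0]
    return float(np.mean(errs))

# Example: a naive baseline predicting the population-typical profile for all.
typical = {s: dose * ka / (V * (ka - 0.15)) * (np.exp(-0.15 * t) - np.exp(-ka * t))
           for s, (t, _) in observed.items()}
print(f"Cmax relative error of typical-profile baseline: "
      f"{cmax_relative_error(observed, typical):.2%}")
```

In an actual repository, artifact rates, random seeds, and metric definitions would be fixed and documented so that every submitted model is scored on identical data in an identical way.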
By establishing gold-standard benchmarks, models can gain credibility and acceptance not only with regulatory bodies but also for internal decision making. This is fundamental not only for accuracy but also for fostering trust in AI-based methods within the clinical domain. Just as essential is the ability to reproduce and scrutinize modeling pipelines, an aspect that benefits greatly from shared resources and transparent workflows embedded within collaborative research efforts.

Key future opportunities lie in fostering interdisciplinary collaborations to address these challenges, encouraging a culture of openness in both data and methodology, and advancing evaluation frameworks that reflect the intricacies of real-world clinical contexts. By seizing these opportunities, the industry can drive AI innovations that significantly improve patient outcomes and advance pharmacometrics.

This perspective highlights the importance of standardized benchmarking datasets and evaluation tasks for ML frameworks introduced in the field. Stakeholders are urged to collaboratively define key evaluation tasks and establish comprehensive benchmarking datasets that reflect real-world challenges. Such an initiative would not only provide the scientific community with methodological standardization but also enhance the reproducibility of ML frameworks, promote wider adoption of these methods, and strengthen the translational relevance of research outputs. By creating standardized benchmarking datasets similar to those pioneered by ImageNet in computer vision [8], MedSegBench in medical imaging [9], or benchmarks for deep-learning-based protein function prediction [10], our initiative aims to enable robust evaluation of AI methodologies in clinical pharmacology. This approach can foster significant advancements and enhance the credibility of AI-driven solutions, paralleling the transformative impact ImageNet had on AI development for computer vision. Table 1 highlights the strategic impacts of this initiative.

Current practices for assessing AI methodologies in clinical pharmacology vary significantly across studies, hindering objective comparisons. Researchers frequently employ not only different evaluation metrics but also distinct synthetic datasets, each with its own characteristics and limitations. This fragmentation in evaluation practices makes it challenging to objectively determine the relative effectiveness or reliability of different methodologies. As a result, pharmacometricians, researchers, and regulators face difficulties in confidently identifying the most suitable modeling approaches for specific clinical scenarios.

As we move forward, we invite collaboration from all stakeholders to create a comprehensive framework serving the diverse needs of the clinical pharmacology community. By integrating complementary approaches, we can accelerate the adoption of AI methodologies that truly enhance clinical practice. As leaders and practitioners in the field of clinical pharmacology, we must come together to embrace these challenges as opportunities for growth and transformation.
We call upon researchers, industry experts, and stakeholders to collaborate in establishing standardized evaluation metrics and comprehensive benchmarking datasets that reflect real-world clinical complexities. In support of this effort, we have formed a working group dedicated to moving this initiative forward. The group aims to provide channels for discussing and implementing key evaluation approaches, defining requirements for synthetic benchmarking datasets, and ultimately determining how AI technologies should be leveraged to provide value in clinical pharmacology. To learn more about the working group and to express your interest in participating, please visit http://bit.ly/46WzPU8.

By joining forces, we can reshape the future of clinical pharmacology, providing stakeholders with improved means to assess AI frameworks and thereby select the best ones to enhance the accuracy of treatment response predictions. The time to act is now: let us lead the charge toward a future where AI-driven methodologies are integral to predictive modeling in clinical pharmacology.

ChatGPT was used to edit language to enhance the readability of the manuscript; all authors verified that the content is correct and carry full responsibility for this work.

All authors working in industry were employees, and may additionally be shareholders, of their respective companies at the time of writing. The remaining authors declare no conflicts of interest.
Published in: CPT: Pharmacometrics & Systems Pharmacology, Volume 15, Issue 1, Article e70155. DOI: 10.1002/psp4.70155