Missing data in patient-reported outcome (PRO) databases is a pervasive challenge, particularly in psychiatry and psychosocial rehabilitation. Incomplete data may reduce the generalizability of findings and introduce selection bias. It may also signal loss of access to care, potentially hindering recovery and rehabilitation efforts. Proactively anticipating missingness can help mitigate this issue by identifying individuals at risk of missing values, enabling services to take timely and targeted actions in response. However, the predictability of incomplete PRO data remains underexplored. We used data from the French multicentric psychosocial rehabilitation database REHABase, focusing our analysis on patients with schizophrenia. We developed an ensemble machine learning model to predict the occurrence of missing data across six PROs, incorporating treatment center affiliation, sociodemographic features and clinical predictors. To ensure interpretability, we used Shapley values to quantify the contribution of each predictor to missing data patterns. Our sample comprised N = 2,363 participants. Mean areas under the receiver operating characteristic curve (AUC), measured on the holdout test observations, ranged from 0.73 to 0.78 across the six PRO scales, indicating good predictive performance of our ensemble model. Treatment center affiliation emerged as a critical predictor of missing data. The ten most influential patient-level predictors were: being a disabled worker beneficiary; educational attainment; housing status; duration of illness; antipsychotic medication; origin of the referrer; number of suicide attempts; comorbid addiction; having a forensic history; and sex. We also identified the direction of each contribution, distinguishing positive effects (increased likelihood of missing values) from negative effects (decreased likelihood).
To our knowledge, this work represents the first predictive analytics framework for the occurrence of missing PRO data in psychosocial rehabilitation. Our ensemble algorithm holds dual potential: improving data collection strategies and informing targeted interventions to enhance patient engagement and retention. By proactively identifying at-risk individuals and refining study designs, our model could also indirectly support better functional recovery outcomes for patients with schizophrenia.
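The two evaluation ideas named above, AUC on held-out observations and Shapley values for predictor contributions, can be illustrated with a minimal Python sketch. It is not the study's ensemble model or its data: `toy_risk`, the feature names `center_b` and `forensic_history`, and all coefficients are hypothetical, chosen only to keep the arithmetic easy to follow. The Shapley values are computed by exact enumeration over feature coalitions, which is feasible only for a handful of features; the abstract does not state which estimation method the authors used.

```python
from itertools import combinations
from math import factorial


def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one observation `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        z = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(z)

    phi = {}
    for k in names:
        others = [m for m in names if m != k]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {k}) - value(s))
        phi[k] = total
    return phi


def toy_risk(z):
    """Hypothetical risk that a PRO scale is missing (illustration only)."""
    return (0.2
            + 0.3 * z["center_b"]
            + 0.1 * z["forensic_history"]
            - 0.05 * z["center_b"] * z["forensic_history"])


# Hypothetical held-out labels (1 = PRO missing) and predicted risks.
holdout_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

patient = {"center_b": 1, "forensic_history": 1}
reference = {"center_b": 0, "forensic_history": 0}
contributions = shapley_values(toy_risk, patient, reference)
```

A positive value in `contributions` means the feature raises the predicted likelihood of missing data for this patient, and a negative value means it lowers it. The values sum to the difference between the patient's prediction and the baseline prediction, which is what makes the decomposition readable as per-predictor contributions.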