Machine learning models for mental health often rely on self-reported data, which can vary widely in quality. When a system's performance is measured against professionally assigned labels, noise in the self-reported labels forces errors that standard evaluation metrics do not account for. This study addresses the challenge of self-reporting label noise by evaluating depression prediction models under different noise scenarios using probabilistic performance bounds. We introduce two self-reporting noise models: one representing low-noise conditions derived from controlled test-retest settings, and another simulating high-noise conditions by incorporating lower test-retest reliability. This approach allows us to assess how different levels of label noise affect model evaluation.

The study uses multiple speech datasets labeled with PHQ-8 scores, collected through different methods (human-to-device and human-to-human interactions) across varied demographic groups. The models integrate natural language processing (NLP) and acoustic features to predict depression severity. Performance evaluation employs both regression and classification metrics, with probabilistic bounds estimating the best- and worst-case scenarios influenced by dataset characteristics and label noise. We explore how these bounds shift under different noise levels and across subgroups defined by age and gender, highlighting the need for fair and consistent performance in diverse populations. This work is accompanied by a Python library developed to facilitate this evaluation framework, offering tools for estimating performance bounds and visualizing results.

In the talk, I will discuss the implications of using noise models for model evaluation, the effect of label noise on predictive accuracy, and methods for assessing model fairness.
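The core idea of bounding a metric under label noise can be sketched roughly as follows. This is a minimal illustration, not the accompanying library's actual API: the mapping from test-retest reliability to a label-noise standard deviation, the function name `mae_bounds`, and all parameter values are assumptions made for this sketch.

```python
import numpy as np

def mae_bounds(preds, labels, reliability, sigma=4.0, n_draws=500, seed=0):
    """Estimate best/worst-case MAE over simulated label-noise draws.

    Illustrative assumption: test-retest reliability r is mapped to a
    label-noise standard deviation of sigma * sqrt(1 - r), and noisy
    labels are clipped to the PHQ-8 range [0, 24].
    """
    rng = np.random.default_rng(seed)
    noise_std = sigma * np.sqrt(1.0 - reliability)
    maes = []
    for _ in range(n_draws):
        noisy = np.clip(labels + rng.normal(0.0, noise_std, labels.shape), 0, 24)
        maes.append(float(np.mean(np.abs(preds - noisy))))
    return min(maes), max(maes)

# Hypothetical data: "true" PHQ-8 scores and imperfect model predictions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 25, size=200).astype(float)
preds = np.clip(labels + rng.normal(0.0, 2.0, size=200), 0, 24)

low_best, low_worst = mae_bounds(preds, labels, reliability=0.9)    # low noise
high_best, high_worst = mae_bounds(preds, labels, reliability=0.6)  # high noise
print(f"low noise:  MAE in [{low_best:.2f}, {low_worst:.2f}]")
print(f"high noise: MAE in [{high_best:.2f}, {high_worst:.2f}]")
```

Under this toy setup, the higher-noise scenario yields a wider and higher band of achievable MAE values, which is the qualitative effect the probabilistic bounds are meant to expose.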
This work underscores the critical importance of accounting for label reliability in mental health applications and provides a path forward for improving the robustness and fairness of machine learning models in this field.