Background
Social media provides timely proxy signals of mental health, but reliable tweet-level classification of depression subtypes remains challenging due to short, noisy text, overlapping symptomatology, and labeling bias. Large language models (LLMs) are increasingly used in mental health for tasks such as symptom extraction, risk screening, and triage, yet their reliability for fine-grained depression subtype classification from brief social media posts remains underexplored.

Objective
We benchmarked few-shot, prompt-only LLMs against parameter-efficient fine-tuned encoders for identifying depression subtypes in posts on X (formerly Twitter).

Methods
We used a curated dataset of 14,983 English-language tweets stratified into six clinically grounded categories: five depression subtypes (postpartum, major, bipolar, psychotic, atypical) and a no-depression class. We compared (i) instruction-tuned causal LLMs in a few-shot setting and (ii) supervised fine-tuning of transformer encoders (e.g., RoBERTa, DeBERTa, BERTweet) under identical splits and metrics. The primary evaluation metric was macro-F1, with accuracy, precision, and recall as secondary metrics. We also report per-class precision, recall, and F1 scores, along with confusion matrices, for the best-performing model from each model family.

Results
Few-shot LLMs achieved macro-F1 = 0.73–0.77 (best: Llama-3-8B, accuracy 0.75). Fine-tuned encoders consistently outperformed prompt-only models, reaching macro-F1 = 0.94–0.96 (best: RoBERTa-large, accuracy 0.954). Relative improvements were largest for the clinically challenging classes: fine-tuning increased F1 for the postpartum and psychotic subtypes to ≈0.99 (substantially above few-shot) and boosted major-depression recall from ≈0.53–0.60 to ≈0.95–0.97. Error analyses showed that prompt-only models frequently misclassified major and atypical depression as bipolar, patterns substantially reduced by fine-tuning.
Conclusions
For tweet-level depression subtyping, task-specific adaptation via fine-tuning yields substantially higher and more stable performance than few-shot prompting, particularly for nuanced, clinically anchored classes. These findings support fine-tuned encoders as strong, compute-efficient baselines for depression subtype classification from social media.
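The primary metric above, macro-F1, averages per-class F1 with equal weight, so minority subtypes count as much as the no-depression class. A minimal pure-Python sketch of that computation (the three-label toy example is illustrative only, not drawn from the paper's data):

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, one-vs-rest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred, labels):
    # Unweighted mean of per-class F1: each class contributes equally,
    # regardless of how many tweets it has.
    return sum(per_class_prf(y_true, y_pred, lab)[2] for lab in labels) / len(labels)

# Illustrative labels and predictions (hypothetical, not from the dataset):
labels = ["major", "bipolar", "none"]
y_true = ["major", "major", "bipolar", "none"]
y_pred = ["major", "bipolar", "bipolar", "none"]
print(round(macro_f1(y_true, y_pred, labels), 4))  # prints 0.7778
```

This matches `sklearn.metrics.f1_score(..., average="macro")`; the hand-rolled version is shown only to make the equal-weight averaging explicit.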