Peripheral blood smear review remains pivotal to haematologic diagnosis. Morphologic abnormalities, including blasts, dysplastic neutrophils, nucleated red blood cells, and abnormal platelets, remain central to diagnosis under both the World Health Organization1 and the International Consensus Classification frameworks.2 While automated analysers have improved workflow and standardisation in cell counting, they often fall short in detecting key morphologic features. These diagnostic gaps make smear interpretation by trained personnel indispensable in both routine and urgent haematology. However, access to expert haematological review is variable, and manual interpretation is time-intensive and subject to interobserver variability. This has led to increasing interest in artificial intelligence (AI) systems that can support or automate morphologic assessment by learning from large sets of labelled smear images. Recent advances in machine learning have enabled models capable of classifying peripheral blood cells from digital images with high accuracy. Convolutional neural networks and transformer-based architectures have shown promise in this domain, achieving expert-level performance when trained on large, curated datasets.3 However, these supervised models rely on thousands of annotated examples per cell type and often require fine-tuning before deployment. In most clinical settings, such resources are not readily available, and differences in staining, imaging, and labelling practices further limit generalisability across institutions. As a result, the practical application of supervised AI models remains dependent on large amounts of annotated data in addition to high-performance computing. To address these limitations, recent studies have explored models that require far less annotated data to perform diagnostic tasks.
One such approach is few-shot learning, in which the model is shown a small number of labelled examples at the time of evaluation rather than being retrained. This mirrors how clinicians often reason: by comparing current findings to a mental archive of similar cases rather than relearning each diagnosis from first principles. Early applications of this method in pathology and radiology have shown that such models can perform well even with limited supervision, including recent work demonstrating label-efficient classification in general oncology.4, 5 However, this strategy has not been evaluated in haematology, where morphologic subtleties often distinguish closely related cell types. Recent reviews have highlighted the ongoing transition of haematopathology from conventional light microscopy toward digital and computationally enabled workflows, in which peripheral blood films are increasingly analysed as structured image data rather than solely through manual review. In this evolving framework, cell-level morphologic interpretation is being explored not only for automated classification but also as a foundation for diagnostic and clinically meaningful inference from routine blood smears.6 Against this background, there is growing interest in approaches that can operate with minimal labelled data and without task-specific retraining, particularly in settings where expert morphologic review and large annotated datasets are not readily available. In this study, we evaluate the performance of GPT-4o, a vision-language model capable of interpreting both images and text, for the classification of peripheral blood cells using a few-shot prompting strategy.
We benchmark its performance against two supervised models, ResNet-50 and Vision Transformer (ViT), trained on the BloodMNIST dataset, which includes 17,092 labelled smear images across eight common cell types.7 Our objective is to assess whether a model like GPT-4o, which requires no additional training and relies solely on contextual examples, can achieve clinically meaningful accuracy in morphologic classification. Such models, if effective, could support haematology workflows in settings where labelled data or trained personnel are limited. The peripheral blood smears of the BloodMNIST dataset contain the following eight morphologic classes: neutrophil, eosinophil, basophil, lymphocyte, monocyte, immature granulocyte, erythroblast, and platelet. The BloodMNIST images originate from the MedMNIST v2 collection7 and were derived from digitised peripheral blood smears obtained using standard Wright-Giemsa staining and brightfield microscopy. Images were captured using high-resolution laboratory slide scanners and digital microscopy platforms under controlled illumination and magnification, and subsequently cropped into single-cell fields of view. These images represent a wide range of staining intensities, cell sizes, and morphologic variation typical of routine haematology laboratory workflows. A fixed 192-image subset (24 per class) was reserved for both supervised and GPT-4o in-context evaluation. For supervised training, we fine-tuned ResNet-50 and Vision Transformer (ViT-B/16) models on annotated images using cross-entropy loss and the Adam optimiser with a learning rate of 1e-4, training for 10 epochs with a batch size of 16 and applying early stopping when validation loss did not improve. GPT-4o was evaluated using a few-shot in-context learning (ICL) approach via the OpenAI Python API. In each shot setting (0-, 3-, and 5-shot), GPT-4o was presented with n support examples per class as image-label chat messages.
The final prompt included an unlabelled query image and an instruction to classify it among the eight cell types. No model fine-tuning or parameter updates were performed. Model performance was assessed using accuracy and 95% confidence intervals (CIs) estimated via 10,000 bootstrap resamples. Figure 1A presents the overall workflow, illustrating dataset preparation, learning strategies, and evaluation metrics for both the supervised models and GPT-4o ICL. GPT-4o achieved an overall accuracy of 0.31 (95% CI: 0.26–0.35) in the 0-shot setting, which improved progressively with the addition of labelled examples, reaching 0.41 (CI: 0.37–0.44) at 3-shot and 0.50 (CI: 0.46–0.53) at 5-shot, as shown in Figure 1B. In contrast, the supervised models demonstrated superior performance, with ResNet-50 achieving an accuracy of 0.96 (CI: 0.94–0.98) and ViT achieving 0.95 (CI: 0.92–0.98) after training for 10 epochs. Analysis of the confusion matrices revealed class-specific recall trends for GPT-4o. In the 0-shot setting (left side of Figure 1C), the model exhibited modest recall across classes, with the highest recall observed for platelets at 0.42 and eosinophils at 0.38. Neutrophils and monocytes were more prone to misclassification, reflecting the difficulty of cytologic differentiation without contextual examples. With 5-shot prompting (right side of Figure 1C), noticeable improvements were seen across most classes: neutrophils and eosinophils both reached a recall of 0.50, platelets improved to 0.67, and monocytes increased from 0.33 to 0.50. Despite these gains, however, GPT-4o's few-shot accuracy remained substantially lower than that of the fully supervised models, whose confusion matrices show near-perfect classification (Figure 1D). These findings suggest that while few-shot ICL can enhance performance with limited examples, it does not yet match the effectiveness of task-specific supervised training.
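The few-shot prompt assembly and bootstrap evaluation described above can be sketched as follows. This is an illustrative sketch, not the study's code: `build_few_shot_messages` and `bootstrap_accuracy_ci` are hypothetical helper names, the message format follows the OpenAI chat API's base64 image convention, and the live API call is shown only as a comment so the sketch runs without a key.

```python
import base64
import random

CLASSES = ["neutrophil", "eosinophil", "basophil", "lymphocyte",
           "monocyte", "immature granulocyte", "erythroblast", "platelet"]

def encode_image(path: str) -> str:
    """Base64-encode an image file for embedding in a chat message."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def build_few_shot_messages(support, query_path):
    """Assemble image-label chat messages for n-shot prompting.
    `support` is a list of (image_path, label) pairs; an empty list gives 0-shot."""
    messages = [{"role": "system",
                 "content": "Classify the final peripheral blood cell image as one of: "
                            + ", ".join(CLASSES) + ". Answer with the class name only."}]
    for path, label in support:
        messages.append({"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"}}]})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": [
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{encode_image(query_path)}"}}]})
    return messages

# The actual query would then be, e.g.:
#   client = openai.OpenAI()
#   reply = client.chat.completions.create(model="gpt-4o", messages=messages)

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=10_000, alpha=0.05, seed=0):
    """Point accuracy with a percentile bootstrap CI over paired labels."""
    rng = random.Random(seed)
    n = len(y_true)
    point = sum(t == p for t, p in zip(y_true, y_pred)) / n
    accs = sorted(
        sum(y_true[i] == y_pred[i] for i in (rng.randrange(n) for _ in range(n))) / n
        for _ in range(n_boot))
    lo = accs[int(alpha / 2 * n_boot)]
    hi = accs[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi
```

Each shot setting simply changes how many support pairs per class are passed in; the same bootstrap routine is applied to the supervised models' predictions for comparability.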
However, the consistent improvement with increasing shot numbers underscores the potential utility of ICL in cytologic classification tasks, particularly in settings where annotated data are scarce or costly to obtain. It should also be noted that training the supervised baselines required substantial resources: we fine-tuned ResNet-50 and ViT on an NVIDIA A100 GPU across multiple epochs, with thousands of annotated images per class, representing significant annotation and compute costs. By contrast, GPT-4o required no retraining, although each evaluation involved repeated API calls that accumulate financial cost, particularly when testing multiple shot conditions with bootstrapping. Even though GPT-4o did not reach the accuracy of fully supervised deep learning models, its operational characteristics represent a fundamentally different paradigm. Unlike ResNet and ViT, GPT-4o requires no pretraining on domain-specific images, no local GPU infrastructure, and no labelled training dataset; instead, it can be deployed immediately using only a small number of reference examples. This makes ICL particularly attractive for resource-limited laboratories, global health settings, and small hospitals that lack the technical or financial capacity to build and maintain supervised AI pipelines. As multimodal foundation models continue to improve, this paradigm may enable rapid, scalable access to AI-assisted haematologic interpretation without the barriers of traditional model development. Future work should expand evaluation to larger and more diverse smear datasets and include comparisons with junior laboratory staff or trainees to provide a practical anchor for performance thresholds. It will also be important to benchmark GPT-4o against models developed specifically for medical imaging, such as MedCLIP, BioViL, or other pathology foundation models, and to explore hybrid workflows that combine supervised training with in-context learning for practical deployment.
Conceptualisation: Mobina Shrestha and Vishal Mandal. Methods: Mobina Shrestha, Salina Dahal and Vishal Mandal. Formal analysis: Mobina Shrestha, Salina Dahal and Vishal Mandal. Data analysis: Mobina Shrestha. Figures and visualisation: Mobina Shrestha. Original paper writing: Mobina Shrestha. Paper revision and edits: Vishal Mandal, Amir Babu Shrestha and Salina Dahal. All authors have read and approved the final manuscript. Not Applicable. The authors declare no conflicts of interest. This study was conducted using publicly available, de-identified datasets and did not involve identifiable patient data. As such, institutional review board approval and informed consent were not required. The data that support the findings of this study are available on request from the corresponding author.