Machine learning for Covid-19 diagnosis from blood tests is a topical problem. Most studies of it focus on comparing the efficiency of various algorithms. However, the first and often most critical part of machine learning is preparing a relevant, correct dataset of sufficient size for developing models that generalize. This study demonstrates that models based on some publicly available datasets generalize poorly, which makes them futile in practice even when they are developed with the best algorithms and achieve high metrics. Therefore, another dataset is proposed and its features are discussed. Because the data are imbalanced, the dataset is split into training and testing sets by stratification. Machine learning models are then developed on the proposed dataset using various algorithms. Results on the testing set show that the best models (Gradient Boosting Classifier with the imbalance-correction methods SMOTE and ADASYN, TensorFlow, and Gene Expression Programming) handle negative Covid-19 diagnosis well, with high precision and high recall. For positive Covid-19 diagnosis, however, the results are mixed. The TensorFlow and Gene Expression Programming models have high precision but relatively low recall for the positive class: they do not detect Covid-19 reliably enough, but a positive prediction is highly reliable when they make one. The Gradient Boosting Classifier models have neither sufficiently high precision nor sufficiently high recall for the positive class. New challenges for machine-learning-based Covid-19 diagnosis from blood tests are identified for future work.
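To make the precision and recall reading above concrete, the following is a minimal sketch of how per-class precision and recall can be computed on an imbalanced test set. The labels are made up for illustration and are not the study's data or results; the function name `precision_recall` and the 10/90 class split are assumptions introduced here.

```python
from collections import Counter


def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for `label` treated as the target class."""
    counts = Counter(zip(y_true, y_pred))
    # True positives: actual and predicted class are both `label`.
    tp = sum(n for (t, p), n in counts.items() if t == label and p == label)
    # False positives: predicted `label`, but the actual class differs.
    fp = sum(n for (t, p), n in counts.items() if t != label and p == label)
    # False negatives: actual `label`, but predicted as something else.
    fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical imbalanced test set: 1 = positive, 0 = negative.
y_true = [1] * 10 + [0] * 90
# Hypothetical model output: flags only 4 of the 10 positives, no false alarms.
y_pred = [1] * 4 + [0] * 6 + [0] * 90

pos_precision, pos_recall = precision_recall(y_true, y_pred, 1)
neg_precision, neg_recall = precision_recall(y_true, y_pred, 0)

print(f"positive: precision={pos_precision:.2f}, recall={pos_recall:.2f}")
print(f"negative: precision={neg_precision:.2f}, recall={neg_recall:.2f}")
```

In this toy case every positive prediction is correct (high precision), yet most positive cases are missed (low recall), while the negative class scores well on both. This is the pattern the abstract describes for the TensorFlow and Gene Expression Programming models.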
Published in: 2022 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT)
Volume 4, pp. 721-727