Machine Learning Techniques for Cleaning Raw Data

2026 · 0 citations · Journal Article

Authors

Jaspreet Kaur · Chhattisgarh Swami Vivekanand Technical University

Neelabh Sao · Chhattisgarh Swami Vivekanand Technical University

Abstract

Machine learning (ML) systems typically rely on high-quality, well-structured datasets for effective performance. However, data collected from real-world environments is often incomplete, noisy, inconsistent, and heterogeneous. Such imperfections negatively impact model accuracy, generalization ability, and reliability, particularly in dynamic and large-scale applications. This study reviews existing research on data cleaning techniques and their role in improving machine learning outcomes. It examines how different types of data imperfections arise and how they influence various stages of the ML pipeline. The review also evaluates current approaches for handling issues such as missing values, outliers, and data inconsistencies, along with their limitations in real-world scenarios. Furthermore, it highlights the need for more adaptive and scalable solutions that integrate data cleaning within the learning process. The study concludes that data quality should be treated as a critical factor throughout the ML lifecycle rather than as a standalone pre-processing step.

Keywords: Data Cleaning, Machine Learning, Data Pre-processing, Outlier Detection, Imputation
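
As an illustration of two of the issues the abstract names (missing values and outliers), the following is a minimal sketch using only the Python standard library. It is not taken from the paper: the function names, the sample readings, the choice of median imputation, and the 1.5 × IQR threshold (Tukey's rule) are all assumptions made for this example.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values.

    Hypothetical helper for illustration; the median is used because it is
    less sensitive to extreme readings than the mean.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule).

    Hypothetical helper for illustration; k=1.5 is a conventional default,
    not a value prescribed by the source.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


# Made-up sensor-style readings with two gaps and one implausible spike.
raw = [12.0, None, 11.5, 13.0, 250.0, 12.5, None, 11.0]

filled = impute_median(raw)        # gaps filled with the observed median
flags = iqr_outlier_mask(filled)   # True where a reading looks anomalous
cleaned = [v for v, is_outlier in zip(filled, flags) if not is_outlier]
```

Here the two steps run once, in a fixed order, before any model sees the data. That is the standalone pre-processing pattern the abstract argues is insufficient on its own; an adaptive pipeline would revisit such decisions as the data and model change.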

Topics & Keywords

Data Quality and Management · Benford’s Law and Fraud Detection · Data Analysis with R

UN Sustainable Development Goals

Industry, innovation and infrastructure

Publication Details

Published in: International Scientific Journal of Engineering and Management

Volume 05, Issue 03, pp. 1-9

DOI: 10.55041/isjem05905

Field-Weighted Citation Impact: 0.00