Machine learning (ML) systems typically rely on high-quality, well-structured datasets for effective performance. However, data collected from real-world environments is often incomplete, noisy, inconsistent, and heterogeneous. Such imperfections negatively impact model accuracy, generalization ability, and reliability, particularly in dynamic and large-scale applications. This study reviews existing research on data cleaning techniques and their role in improving machine learning outcomes. It examines how different types of data imperfections arise and how they influence various stages of the ML pipeline. The review also evaluates current approaches for handling issues such as missing values, outliers, and data inconsistencies, along with their limitations in real-world scenarios. Furthermore, it highlights the need for more adaptive and scalable solutions that integrate data cleaning within the learning process. The study concludes that data quality should be treated as a critical factor throughout the ML lifecycle rather than as a standalone pre-processing step.
Keywords: Data Cleaning, Machine Learning, Data Pre-processing, Outlier Detection, Imputation
Published in: International Scientific Journal of Engineering and Management
Volume 05, Issue 03, pp. 1-9
DOI: 10.55041/isjem05905