A COMBINED APPROACH TO DETECT PHISHING SITES USING ENSEMBLE MACHINE LEARNING MODELS

2026 · 0 citations · Journal Article · Hybrid Open Access

Authors

O. Moiseienko · Ivano-Frankivsk National Technical University of Oil and Gas

Vira Harasymiv · Ivano-Frankivsk National Technical University of Oil and Gas

Abstract

Phishing attacks remain one of the most prevalent and damaging threats to modern information systems, targeting users through deceptive websites designed to steal sensitive credentials and financial data. The continuous evolution of phishing techniques significantly reduces the effectiveness of traditional blacklist-based and rule-based detection methods, especially when dealing with previously unseen attacks. As a result, machine learning-based approaches have become increasingly important; however, many existing solutions rely on complex models, large volumes of labeled data, or computationally expensive feature sets, which limits their applicability in real-world and resource-constrained environments. This paper proposes a combined approach for phishing website detection based on open data that integrates statistical analysis of URL characteristics with ensemble machine learning techniques. The study focuses on lightweight, interpretable features extracted directly from URL strings, including both standard attributes and newly introduced statistical indicators such as the ratio of letters to digits, URL entropy, density of special characters, subdomain length characteristics, and the number of repeated symbols. These features aim to capture structural patterns commonly associated with phishing websites while maintaining low computational complexity. Three classical machine learning models (Logistic Regression, Random Forest, and Naive Bayes) are first evaluated individually using the publicly available Phishing Websites Dataset from Kaggle. The dataset is preprocessed with normalization and class rebalancing to mitigate class imbalance. Experimental results show that Random Forest achieves the highest accuracy and discriminative capability, while Logistic Regression provides stable and interpretable performance, and Naive Bayes demonstrates high precision but limited recall.
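The statistical URL indicators named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the exact feature definitions are not given here, so the function names (`shannon_entropy`, `url_features`), the +1 smoothing in the letter-to-digit ratio, the "last two host labels form the registered domain" rule, and the reading of "repeated symbols" as adjacent identical characters are all assumptions made for illustration.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def url_features(url):
    """Compute lightweight statistical features from a raw URL string.

    All definitions below are illustrative assumptions, not the paper's.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".") if host else []
    # Assumption: the last two labels are the registered domain (naive;
    # this is wrong for suffixes such as "co.uk").
    subdomains = labels[:-2]

    letters = sum(ch.isalpha() for ch in url)
    digits = sum(ch.isdigit() for ch in url)
    special = sum(not ch.isalnum() for ch in url)
    # Assumption: "repeated symbols" counts adjacent identical characters.
    repeated = sum(1 for a, b in zip(url, url[1:]) if a == b)

    return {
        # +1 in the denominator avoids division by zero (assumed smoothing).
        "letter_digit_ratio": letters / (digits + 1),
        "entropy": shannon_entropy(url),
        "special_char_density": special / len(url) if url else 0.0,
        "subdomain_length": sum(len(s) for s in subdomains),
        "repeated_symbols": repeated,
    }


if __name__ == "__main__":
    # Hypothetical example URL, not taken from the dataset.
    print(url_features("http://login.secure.example.com/a11"))
```

Each feature is a single pass over the URL string, which is consistent with the low computational cost the abstract emphasizes; a production version would likely use a public-suffix list instead of the two-label rule.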
To improve robustness and balance between precision and recall, a combined ensemble model based on a soft Voting Classifier is developed. The ensemble leverages the complementary strengths of the selected base classifiers through weighted probabilistic voting. Experimental evaluation demonstrates that the proposed combined model achieves a more balanced classification performance, with an F1-score of approximately 0.95 and a ROC-AUC value close to 0.99. While the ensemble slightly underperforms the best individual model in terms of raw accuracy, it significantly reduces variability and improves the trade-off between false positives and false negatives. The obtained results confirm that incorporating simple yet informative statistical URL features and combining heterogeneous machine learning models can effectively enhance phishing detection performance without increasing system complexity. The proposed approach is well suited for practical deployment in security monitoring systems and SIEM platforms, where interpretability, stability, and computational efficiency are critical. Future research will focus on adapting the method to streaming data scenarios and evaluating its resilience to concept drift in evolving phishing campaigns.
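The weighted probabilistic (soft) voting described in the abstract can be sketched in a few lines. This is an illustration of the general technique rather than the authors' code: the helper names `soft_vote` and `predict`, the weights, and the per-model probabilities are hypothetical, since the abstract does not report the actual values. scikit-learn's `VotingClassifier` with `voting="soft"` performs the same weighted averaging of predicted class probabilities.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class-probability vectors.

    probabilities: one [p_class0, p_class1, ...] list per base model.
    weights: one non-negative weight per base model.
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probabilities[0])
    return [
        sum(w * p[k] for p, w in zip(probabilities, weights)) / total
        for k in range(n_classes)
    ]


def predict(probabilities, weights):
    """Return the index of the class with the highest averaged probability."""
    averaged = soft_vote(probabilities, weights)
    return max(range(len(averaged)), key=averaged.__getitem__)


if __name__ == "__main__":
    # Hypothetical outputs for one URL, ordered [legitimate, phishing].
    model_outputs = [
        [0.40, 0.60],  # Logistic Regression (illustrative values)
        [0.20, 0.80],  # Random Forest (illustrative values)
        [0.70, 0.30],  # Naive Bayes (illustrative values)
    ]
    weights = [1, 2, 1]  # assumed weights; not reported in the abstract
    print(soft_vote(model_outputs, weights))
    print(predict(model_outputs, weights))
```

Because the vote averages probabilities rather than hard labels, a confident model can outweigh two uncertain ones, which is one plausible reason the ensemble balances precision and recall better than any single base classifier.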

Topics & Keywords

Spam and Phishing Detection · Cybercrime and Law Enforcement Studies · Misinformation and Its Impacts

UN Sustainable Development Goals

Reduced inequalities

Publication Details

Published in: Municipal economy of cities

DOI: 10.33042/3083-6727-2026-1-196-13-21

Field-Weighted Citation Impact: 0.00