Phishing attacks have evolved significantly, employing visual mimicry, semantic deception, and network-level manipulation to bypass traditional detection systems. Conventional approaches based on URL blacklists or single-modal feature analysis often fail against zero-day and dynamically generated phishing pages. This paper presents a multi-modal phishing detection framework that integrates URL lexical features, HTML/DOM structural attributes, visual cues, semantic content, and network-based indicators. Structured features are processed using a stacked ensemble model comprising Logistic Regression, LightGBM, and Linear SVM classifiers. Webpage screenshots are analyzed using a fine-tuned EfficientNet-B0 model to extract visual embeddings, while semantic representations are generated using DeBERTa-v3 Base to identify deceptive language patterns. These heterogeneous features are fused through dense neural layers to produce a final phishing probability score. The system incorporates cost-sensitive learning to address class imbalance and integrates explainability mechanisms, including Grad-CAM visualization and DOM-level feature highlighting. The proposed architecture aims to deliver a scalable, adaptive, and interpretable solution for detecting modern phishing attacks across multiple content modalities.
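To make the fusion and cost-sensitive steps concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the stacked ensemble contributes a single probability, and that the visual and semantic encoders each contribute a fixed-length embedding. The toy embedding sizes, the hidden width, the random untrained weights, and the names `fuse`, `weighted_bce`, and `pos_weight` are all illustrative assumptions; the abstract does not specify the fusion dimensions or the exact form of the cost-sensitive loss.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained, illustrative)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, bias, activation=None):
    """Fully connected layer: y = activation(W x + b)."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    if activation == "relu":
        out = [max(0.0, v) for v in out]
    return out


def fuse(structured_score, visual_emb, text_emb, params):
    """Concatenate the three modality outputs and map them to a phishing probability."""
    x = [structured_score] + list(visual_emb) + list(text_emb)
    hidden = dense(x, *params["hidden"], activation="relu")
    (logit,) = dense(hidden, *params["out"])
    return sigmoid(logit)


def weighted_bce(p, y, pos_weight):
    """Binary cross-entropy in which errors on the positive (phishing) class cost more.

    pos_weight is a hypothetical hyperparameter, e.g. the ratio of benign to
    phishing samples in the training set.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))


# Toy dimensions (assumed): 1 ensemble score + 4 visual + 4 semantic features.
rng = random.Random(0)
params = {
    "hidden": init_layer(1 + 4 + 4, 8, rng),
    "out": init_layer(8, 1, rng),
}

visual_emb = [rng.random() for _ in range(4)]   # stand-in for a screenshot embedding
text_emb = [rng.random() for _ in range(4)]     # stand-in for a text embedding
p = fuse(0.9, visual_emb, text_emb, params)
loss = weighted_bce(p, y=1, pos_weight=5.0)
```

In this sketch a missed phishing page (`y = 1`) is penalized `pos_weight` times more heavily than a false alarm on a benign page, which is one common way to counter class imbalance. In a trained system the weights would be learned by minimizing this loss over labeled pages rather than drawn at random.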