The Portable Document Format (PDF) is widely used for exchanging and storing documents, images, and other types of data. Its popularity stems from its ability to preserve a document's original layout, fonts, and graphics, which makes it a common choice for sharing sensitive information such as financial reports, legal documents, and confidential data. This widespread adoption has also made PDFs an attractive target for attackers, who exploit vulnerabilities in these documents to spread malware. Several solutions have been proposed to identify and mitigate threats embedded in PDF files, including signature-based detection and behavioral analysis, but these methods are often insufficient for detecting PDF-based threats. In this paper, we propose an approach that monitors incoming PDFs at the network entry point to identify patterns and anomalies indicative of malicious files. We use an ensemble Machine Learning-based detection system built on Random Forest, Support Vector Machine (SVM), and Gradient Boosting, which analyzes PDF features such as file size, metadata size, and the obj and JavaScript keywords. We evaluate the system on a separate dataset, where it achieves an accuracy of up to 92%. We demonstrate the model's explainability by creating a visualization that interprets its decisions. Finally, we integrate the resulting ML model as a new plugin in the Snort IDS detection engine, extending its traditional rule-based detection with ML-based analysis.
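To make the feature set concrete, the following is a minimal sketch of how such features might be read from the raw bytes of a PDF. It is not the extraction pipeline used in this work: the function name `extract_features`, the dictionary keys, the choice to count keyword occurrences, and the use of the XMP packet as a proxy for metadata size are all assumptions made for illustration.

```python
import re

# Hypothetical patterns: how each feature is measured is an assumption of this sketch.
# Indirect object headers such as "12 0 obj".
OBJ_RE = re.compile(rb"\b\d+\s+\d+\s+obj\b")
# JavaScript action keys, "/JavaScript" or "/JS".
JS_RE = re.compile(rb"/(?:JavaScript|JS)\b")
# XMP metadata packets, used here as a stand-in for "metadata size".
META_RE = re.compile(rb"<\?xpacket begin.*?<\?xpacket end[^>]*\?>", re.DOTALL)


def extract_features(pdf_bytes: bytes) -> dict:
    """Return a small dictionary of static features from raw PDF bytes."""
    metadata_size = sum(len(m.group(0)) for m in META_RE.finditer(pdf_bytes))
    return {
        "file_size": len(pdf_bytes),
        "obj_count": len(OBJ_RE.findall(pdf_bytes)),
        "javascript_count": len(JS_RE.findall(pdf_bytes)),
        "metadata_size": metadata_size,
    }


# A tiny hand-written PDF fragment (not a valid document) for demonstration.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"%%EOF\n"
)
features = extract_features(sample)
print(features)
```

A feature vector of this kind could then be passed to the ensemble classifier. Byte-level matching of this sort misses keywords hidden inside compressed or encoded streams, so a production extractor would need to decode object streams first.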