Machine learning for automatic malware representation and analysis
Malware analysis is an important research direction in the cybersecurity community. In particular, machine learning algorithms show highly promising results for the automatic analysis of malware. This thesis explores and proposes applications of machine learning to a diverse range of malware analysis tasks, including automatic malware representation, malware feature extraction, malware detection and malware family identification. We focus on the static analysis of malware samples, specifically the byte-level n-grams of binary files for Google's Android and Microsoft Windows platforms. The malware samples used in this thesis are all real-world samples collected from the wild. The methods are evaluated with a range of measurements to assess their effectiveness, efficiency and resiliency. We apply a random projection technique to the byte n-gram term-frequency (tf) representation and show that the new representation is equivalent to the first layer of a particular neural network (NN). This NN has an analytic solution with a non-iterative learning algorithm. We continue the exploration of malware representation and extend static malware detection beyond byte-level n-grams and the detection of important strings. To this end, we develop a model that embeds the semantic similarity of byte-level codes into a feature vector using embedding techniques. In this research path, we first propose a model (Byte2vec) capable of both feature representation and feature selection on binary files for malware detection. We show that the distance between a feature vector and its corresponding context vector provides a useful measure for ranking features. Additionally, we utilize a deep auto-encoder (AE) and show that the AE is capable of automatically learning a reasonable measure of semantic similarity within malware families. Finally, this thesis proposes a scheme that is suitable for both malware detection and malware family identification.
In short, the byte n-grams (2 ≤ n ≤ 4) of both the classes.dex and AndroidManifest.xml binary files, extracted from Android apps, are pruned and selected for the classification phase. The scheme is based on the gradient boosting algorithm and outperforms a wide and diverse range of state-of-the-art methods. The scheme's F1-scores on the Drebin, DexShare and AMD datasets are 99.1%, 98.87% and 99.62%, respectively. On average, the False Negative Rate is 2.1% on the PRAGuard dataset, in which seven different obfuscation techniques are implemented. In addition to fast run-time performance and resiliency against obfuscated malware, experiments show that the model performs well on five zero-day families, reaching an AUC of 99.78%.
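As a rough illustration of the representation described above, the following minimal sketch counts byte n-grams for 2 ≤ n ≤ 4 and applies a random projection to the resulting term-frequency vector. It is not the thesis's implementation: the function names `byte_ngram_tf` and `random_projection`, the Gaussian projection matrix, the output dimension and the sample bytes are all assumptions made for the example, and the pruning and selection steps are omitted.

```python
from collections import Counter

import numpy as np


def byte_ngram_tf(data: bytes, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count every byte n-gram with n_min <= n <= n_max in a binary blob."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the bytes; empty if data is shorter than n.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def random_projection(tf_vector: np.ndarray, out_dim: int, seed: int = 0) -> np.ndarray:
    """Project a tf vector to out_dim dimensions with a Gaussian random matrix.

    Assumption: a dense Gaussian matrix is used here purely for illustration;
    the abstract does not say which projection matrix the thesis uses.
    """
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((tf_vector.shape[0], out_dim)) / np.sqrt(out_dim)
    return tf_vector @ projection


# Hypothetical input: a few bytes standing in for the contents of a binary file.
sample = bytes([0x64, 0x65, 0x78, 0x0A, 0x64, 0x65])

counts = byte_ngram_tf(sample)
vocab = sorted(counts)  # fixed n-gram ordering so vector positions are stable
tf = np.array([counts[gram] for gram in vocab], dtype=float)
projected = random_projection(tf, out_dim=4)
```

In practice the vocabulary would be built once over a training corpus and reused for every sample, so that all tf vectors share the same dimensions before projection and classification.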