<p>Malware analysis is an important research direction in the cybersecurity community. In particular, machine learning algorithms show highly promising results for the automatic analysis of malware. This thesis explores and proposes the application of machine learning to a diverse range of malware analysis tasks, such as automatic malware representation, malware feature extraction, malware detection and malware family identification. We focus on the static analysis of malware samples, specifically the byte-level n-grams of binary files for Google's Android and Microsoft Windows platforms. The malware samples used in this thesis are all real-world samples collected from the wild. The methods are evaluated with a range of measurements to assess their effectiveness, efficiency and resiliency. We apply a random projection technique to the byte n-gram term-frequency (tf) representation and show that the new representation is equivalent to the first layer of a particular neural network (NN). This NN has an analytic solution with a non-iterative learning algorithm. We continue the exploration of malware representation and extend static malware detection beyond byte-level n-grams and the detection of important strings. To this end, we develop a model that embeds the semantic similarity of byte-level codes into a feature vector using embedding techniques. In this research path, we first propose a model (Byte2vec) capable of binary file feature representation and feature selection for malware detection. We show that the distance between a feature vector and its corresponding context vector provides a useful measure for ranking features. Additionally, we utilize a deep Auto-encoder (AE) and show that the AE is capable of automatically learning a reasonable measure of semantic similarity within malware families. Finally, this thesis proposes a scheme that is suitable for both malware detection and malware family identification. 
In short, the byte n-grams (2 ≤ n ≤ 4) of both the classes.dex and AndroidManifest.xml binary files, extracted from Android apps, are pruned and selected for the classification phase. The scheme is based on the gradient boosting algorithm and outperforms a wide and diverse range of state-of-the-art methods. The scheme's F1-score for the Drebin, DexShare and AMD datasets is 99.1%, 98.87% and 99.62% respectively. On average, the False Negative Rate is 2.1% for the PRAGuard dataset, in which seven different obfuscation techniques are implemented. In addition to fast run-time performance and resiliency against obfuscated malware, experiments show that the model performs well on five zero-day families, achieving 99.78% AUC.</p>
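As a rough illustration of the byte n-gram term-frequency idea mentioned in the abstract, the sketch below counts every contiguous byte n-gram with 2 ≤ n ≤ 4 in a byte string. It is a minimal example under stated assumptions, not the thesis's implementation: the function name `byte_ngram_tf` and the sample bytes are hypothetical, and the pruning, feature selection and gradient boosting stages are not shown.

```python
from collections import Counter


def byte_ngram_tf(data: bytes, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count every contiguous byte n-gram with n_min <= n <= n_max.

    Hypothetical helper for illustration only; the name and defaults
    are assumptions, not taken from the thesis.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the raw bytes.
        # If the input is shorter than n, the range is empty.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


# Made-up 12-byte sample; a real input would be the raw contents of a
# file such as classes.dex or AndroidManifest.xml.
sample = b"dex\n035\x00dex\n"
tf = byte_ngram_tf(sample)

# Show the most frequent n-grams and their raw counts.
for ngram, count in tf.most_common(3):
    print(ngram, count)
```

In practice the number of distinct n-grams grows quickly with file size, which is one reason a pruning or selection step before classification is useful.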
Table of Contents
1 Introduction -- 2 Background and Benchmarks -- 3 Malytics: A Malware Detection Scheme -- 4 Malware Representation and Feature Selection using an Unsupervised Neural network -- 5 MIFIBoost: Automatic Byte N-gram Feature Re-ranker for Android Malware Detection -- 6 Conclusion and Future Work -- References
Notes
In partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science
Awarding Institution
Macquarie University
Degree Type
Thesis PhD
Degree
Thesis (PhD), Department of Computing, Faculty of Science and Engineering, Macquarie University
Department, Centre or School
Department of Computing
Year of Award
2020
Principal Supervisor
Len Hamey
Additional Supervisor 1
Vijay Varadharajan
Additional Supervisor 2
Shiping Chen
Rights
Copyright: The Author
Copyright disclaimer: https://www.mq.edu.au/copyright-disclaimer