Macquarie University

Machine learning for automatic malware representation and analysis

posted on 2022-11-16, 03:02 authored by Mahmood Yousefiazar

Malware analysis is an important research direction in the cybersecurity community. In particular, machine learning algorithms show highly promising results for the automatic analysis of malware. This thesis explores and proposes applications of machine learning to a diverse range of malware analysis tasks, including automatic malware representation, malware feature extraction, malware detection and malware family identification. We focus on the static analysis of malware samples, specifically the byte-level n-grams of binary files for Google's Android and Microsoft Windows platforms. The malware samples used in this thesis are all real-world samples collected from the wild. The methods are evaluated with a range of measurements to assess their effectiveness, efficiency and resiliency. We apply a random projection technique to the byte n-gram term-frequency (tf) representation and show that the new representation is equivalent to the first layer of a particular neural network (NN). This NN has an analytic solution with a non-iterative learning algorithm. We continue the exploration of malware representation and extend static malware detection beyond byte-level n-grams and the detection of important strings. To this end, we develop a model that embeds the semantic similarity of byte-level codes into a feature vector using embedding techniques. In this research path, we first propose a model (Byte2vec) capable of both binary file feature representation and feature selection for malware detection. We show that the distance between a feature vector and its corresponding context vector provides a useful measure for ranking features. Additionally, we use a deep auto-encoder (AE) and show that the AE is capable of automatically learning a reasonable measure of semantic similarity within malware families. Finally, this thesis proposes a scheme that is suitable for both malware detection and malware family identification.
In short, byte n-grams with 2 ≤ n ≤ 4 from both the classes.dex and AndroidManifest.xml binary files, extracted from Android apps, are pruned and selected for the classification phase. The scheme is based on the gradient boosting algorithm and outperforms a wide and diverse range of state-of-the-art methods. The scheme's F1-score on the Drebin, DexShare and AMD datasets is 99.1%, 98.87% and 99.62% respectively. On average, the False Negative Rate is 2.1% on the PRAGuard dataset, in which seven different obfuscation techniques are implemented. In addition to fast run-time performance and resiliency against obfuscated malware, experiments show that the model performs very effectively on five zero-day families, with 99.78% AUC.
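
As an illustration of the representation the abstract describes, the following minimal Python sketch counts byte n-grams with 2 ≤ n ≤ 4 and applies a random projection to the resulting term-frequency vector. It is not the thesis implementation: the function names, the signed ±1 projection weights, the output dimension and the seeding scheme are all assumptions made for this example, since the abstract does not specify them.

```python
import random
from collections import Counter


def byte_ngram_tf(data: bytes, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count every byte n-gram with n_min <= n <= n_max (raw term frequency)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the byte string.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def random_projection(tf: Counter, dim: int = 8, seed: int = 0) -> list:
    """Project a sparse tf vector onto `dim` values using random +/-1 weights.

    Assumption for this sketch: each n-gram gets a reproducible weight row
    derived from the seed and the n-gram bytes, so the (very large)
    projection matrix never has to be stored explicitly.
    """
    projected = [0.0] * dim
    for gram, count in tf.items():
        rng = random.Random(f"{seed}:{gram.hex()}")
        for j in range(dim):
            projected[j] += count * rng.choice((-1.0, 1.0))
    return projected


if __name__ == "__main__":
    sample = b"abab"  # stand-in for the bytes of a binary file
    tf = byte_ngram_tf(sample)
    print(tf[b"ab"], len(random_projection(tf)))
```

In practice the input would be the raw bytes of a file such as classes.dex, and the projected vector would feed a downstream classifier; those steps are outside the scope of this sketch.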


Table of Contents

1 Introduction -- 2 Background and Benchmarks -- 3 Malytics: A Malware Detection Scheme -- 4 Malware Representation and Feature Selection using an Unsupervised Neural Network -- 5 MIFIBoost: Automatic Byte N-gram Feature Re-ranker for Android Malware Detection -- 6 Conclusion and Future Work -- References


In partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Awarding Institution

Macquarie University

Degree Type

Thesis PhD


Thesis (PhD), Department of Computing, Faculty of Science and Engineering, Macquarie University

Department, Centre or School

Department of Computing

Year of Award


Principal Supervisor

Len Hamey

Additional Supervisor 1

Vijay Varadharajan

Additional Supervisor 2

Shiping Chen


Copyright: The Author




201 pages