Machine learning for automatic malware representation and analysis

Yousefiazar, Mahmood

doi:10.25949/21561864.v1

thesis

posted on 2022-11-16, 03:02 authored by Mahmood Yousefiazar

Malware analysis is an important research direction in the cybersecurity community. In particular, machine learning algorithms show highly promising results for the automatic analysis of malware. This thesis explores and proposes the application of machine learning to a diverse range of malware analysis tasks, such as automatic malware representation, malware feature extraction, malware detection and malware family identification. We focus on the static analysis of malware samples, specifically the byte-level n-grams of binary files for Google's Android and Microsoft Windows platforms. The malware samples used in this thesis are all real-world samples collected from the wild. The methods are evaluated with a range of measurements to assess their effectiveness, efficiency and resiliency. We apply a random projection technique to the byte n-gram term-frequency (tf) representation and show that the new representation is equivalent to the first layer of a particular neural network (NN). This NN has an analytic solution with a non-iterative learning algorithm. We continue the exploration of malware representation and extend static malware detection beyond byte-level n-grams and the detection of important strings. To this end, we develop a model that embeds the semantic similarity of byte-level codes into a feature vector using embedding techniques. In this research path, we first propose a model (Byte2vec) capable of both binary file feature representation and feature selection for malware detection. We show that the distance between a feature vector and its corresponding context vector provides a useful measure for ranking features. Additionally, we utilize a deep auto-encoder (AE) and show that the AE is capable of automatically learning a reasonable measure of semantic similarity within malware families. Finally, this thesis proposes a scheme that is suitable for both malware detection and malware family identification.
In short, the byte n-grams (2 ≤ n ≤ 4) of both the classes.dex and AndroidManifest.xml binary files, extracted from Android apps, are pruned and selected for the classification phase. The scheme is based on the gradient boosting algorithm and outperforms a wide and diverse range of state-of-the-art methods. The scheme's F1-score for the Drebin, DexShare and AMD datasets is 99.1%, 98.87% and 99.62% respectively. On average, the False Negative Rate is 2.1% for the PRAGuard dataset, in which seven different obfuscation techniques are implemented. In addition to fast run-time performance and resiliency against obfuscated malware, experiments show that the model performs very well on five zero-day families, achieving 99.78% AUC.
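
To make the byte n-gram term-frequency representation and the random projection step more concrete, the following Python sketch may help. It is not the thesis's implementation: the function names `byte_ngram_tf` and `random_projection`, the Gaussian projection weights, the seed handling and the output dimension of 8 are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def byte_ngram_tf(data: bytes, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count every byte n-gram with n_min <= n <= n_max in `data`.

    Hypothetical helper: the 2..4 range follows the abstract, everything
    else is an illustrative assumption.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the raw bytes.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def random_projection(tf: Counter, dim: int = 8, seed: int = 0) -> list:
    """Project a sparse tf vector onto `dim` dimensions.

    Assumption: each n-gram owns one row of an implicit Gaussian matrix.
    The row is regenerated on demand from (seed, n-gram bytes), so the
    full matrix never has to be stored.
    """
    out = [0.0] * dim
    for gram, count in tf.items():
        rng = random.Random(seed.to_bytes(4, "big") + gram)
        for j in range(dim):
            out[j] += count * rng.gauss(0.0, 1.0)
    return out


if __name__ == "__main__":
    sample = b"abcab"  # stand-in for the bytes of a binary file
    tf = byte_ngram_tf(sample)
    print(tf[b"ab"], len(random_projection(tf)))
```

In practice the input would be the raw bytes of a file such as classes.dex, and the projected vector would feed a downstream classifier. The sketch only shows how the fixed-length representation could be formed.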

Table of Contents

1 Introduction -- 2 Background and Benchmarks -- 3 Malytics: A Malware Detection Scheme -- 4 Malware Representation and Feature Selection using an Unsupervised Neural network -- 5 MIFIBoost: Automatic Byte N-gram Feature Re-ranker for Android Malware Detection -- 6 Conclusion and Future Work -- References

Notes

In partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Awarding Institution

Macquarie University

Degree Type

Thesis PhD

Degree

Thesis (PhD), Department of Computing, Faculty of Science and Engineering, Macquarie University

Department, Centre or School

Department of Computing

Year of Award

2020

Principal Supervisor

Len Hamey

Additional Supervisor 1

Vijay Varadharajan

Additional Supervisor 2

Shiping Chen

Rights

Copyright: The Author Copyright disclaimer: https://www.mq.edu.au/copyright-disclaimer

Language

English

Extent

201 pages

Keywords

Machine Learning, Feature learning, Malware Analysis, Android malware, Windows Malware

Licence

In Copyright
