Malware Detection Using Ensemble N-gram Opcode Sequences
DOI:
https://doi.org/10.3991/ijim.v15i24.25401
Keywords:
Malware Detection, N-Gram, Opcode, Machine Learning, Ensemble, Grid Search
Abstract
Conventional approaches to tackling malware attacks have proven ineffective at detecting never-before-seen (zero-day) malware. Research has shown, however, that zero-day malicious files are mostly semantics-preserving variants of existing malware, generated via obfuscation methods. In this paper we propose and evaluate a machine-learning-based malware detection model that uses an ensemble approach. In our ensemble strategy, multiple feature sets generated from different n-gram sizes of opcode sequences are each used to train a model with the same classifier. The predictions of the models trained on these feature sets are weighted and averaged to reach a final verdict on whether a binary file is malicious or benign. To obtain the optimal weight combination for the ensemble feature sets, we applied a grid search over a set of pre-defined weights in the range 0 to 1. On a balanced dataset of 2000 samples, an ensemble of n-gram opcode sequences of sizes 1 and 2, with respective weights 0.3 and 0.7, yielded the best detection accuracy of 98.1% using a random forest (RF) classifier. An ensemble of n-gram sizes 2 and 3 achieved the best precision, 99.7%, with a weight of 0.5 for each model.
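The weighting scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 decision threshold, the normalisation by the sum of the weights, the 0.1 grid step, and the toy probabilities are all assumptions made for illustration. The per-model probabilities stand in for the output of a classifier (such as a random forest) trained separately on each n-gram feature set.

```python
from collections import Counter
from itertools import product


def opcode_ngrams(opcodes, n):
    """Count contiguous n-grams in a list of opcode mnemonics."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def weighted_vote(probs, weights):
    """Weighted average of per-model 'malicious' probabilities for one file.

    Assumption: the average is normalised by the sum of the weights.
    """
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)


def accuracy(model_probs, labels, weights, threshold=0.5):
    """model_probs[k][i] is model k's malicious probability for sample i."""
    correct = 0
    for i, label in enumerate(labels):
        score = weighted_vote([m[i] for m in model_probs], weights)
        correct += int((score >= threshold) == bool(label))
    return correct / len(labels)


def grid_search_weights(model_probs, labels, grid):
    """Try every weight combination from the grid; keep the most accurate."""
    best_weights, best_acc = None, -1.0
    for weights in product(grid, repeat=len(model_probs)):
        if sum(weights) == 0:  # an all-zero weighting is undefined
            continue
        acc = accuracy(model_probs, labels, weights)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# Toy data (hypothetical): probabilities from a 1-gram and a 2-gram model.
probs_1gram = [0.9, 0.4, 0.2, 0.6]
probs_2gram = [0.8, 0.7, 0.1, 0.3]
labels = [1, 1, 0, 0]  # 1 = malicious, 0 = benign
grid = [i / 10 for i in range(11)]  # pre-defined weights from 0 to 1

features = opcode_ngrams(["push", "mov", "push", "mov"], 2)
best_weights, best_acc = grid_search_weights([probs_1gram, probs_2gram], labels, grid)
print(features, best_weights, best_acc)
```

In practice the weights would be selected on held-out validation data rather than on the samples used to report the final accuracy, so that the chosen combination is not tuned to the test set.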
Published
2021-12-21
How to Cite
Yeboah, P. N., Amuquandoh, S. K., & Musah, H. B. B. (2021). Malware Detection Using Ensemble N-gram Opcode Sequences. International Journal of Interactive Mobile Technologies (iJIM), 15(24), pp. 19–31. https://doi.org/10.3991/ijim.v15i24.25401
Issue
Section
Papers
License
Copyright (c) 2021 Paul Ntim Yeboah, Stephen Kweku Amuquandoh, Haruna Balle Baz Musah
This work is licensed under a Creative Commons Attribution 4.0 International License.