Malware Detection Using Ensemble N-gram Opcode Sequences

Conventional approaches to malware detection have proven ineffective against never-before-seen (zero-day) malware. Research has shown, however, that zero-day malicious files are mostly semantic-preserving variants of existing malware generated through obfuscation. In this paper we propose and evaluate a machine learning based malware detection model that uses an ensemble approach. In our ensemble strategy, multiple feature sets generated from different n-gram sizes of opcode sequences are each used to train a single classifier. The model's predictions on the individual feature sets are weighted and averaged to reach a final verdict on whether a binary file is malicious or benign. To obtain the optimal weight combination for the ensemble feature sets, we applied a grid search over a set of pre-defined weights in the range 0 to 1. On a balanced dataset of 2000 samples, an ensemble of n-gram opcode sequences with n = 1 and n = 2 and respective weights 0.3 and 0.7 yielded the best detection accuracy of 98.1% with a random forest (RF) classifier. The ensemble of n-gram sizes 2 and 3 obtained the best precision of 99.7%, using a weight of 0.5 for both models.


Introduction
The surge in malware attacks has become a major threat to internet security. The proliferation of malware attacks can be attributed to the high profits derived from these illicit breaches [1,2]. A cyber threat report by SonicWall [3] shows that, across the millions of detection engines deployed worldwide, a total of 9.9 billion malware attacks were recorded in 2019, with over 440,000 malware variants. In 2020 SonicWall reported a total of 5.6 billion malware attacks, a decline from the previous year but still a substantial volume. This persistent threat calls for more sophisticated solutions. The signature-based method has been the conventional approach to malware detection. With this approach, malware footprints such as byte sequences, hashes or anomalies are precomputed and stored in a repository against which suspicious files are later queried.
Signature-based detection methods face two major drawbacks. First, manual examination of executables and the need to regularly update detection engines have become unrealistic given the volume of malware released each day. Second, malware creators employ obfuscation techniques to generate semantic-preserving malware variants that easily circumvent detection [4,5]. As described in [5,6], obfuscation by metamorphism hinders detection by traditional anti-malware through strategies including variable renaming (changing the names of variables), dead-code insertion (inserting sequences such as no-operation (NOP) instructions) and code transposition (rearranging the order of instructions).
Recent breakthroughs in machine learning (ML) in areas such as natural language processing and computer vision have inspired researchers to explore data-driven approach in malware detection. Reports have shown that machine learning based malware detection models could be a better alternative to the traditional methods considering promising results reported in literature including [7,8,9,10]. Such machine learning techniques mostly employ static and dynamic analysis to obtain useful discriminative features to build models to detect malware. In the static analysis, binary file components like raw hexadecimal bytes, operation codes (opcodes), text strings and control flow graph are extracted without the binary being executed [7,8,10,11,12]. On the other hand, the dynamic approach executes the binary in a controlled environment to collect features like API call traces, network-related information, memory and register usage, etc. [13,14,15]. While static analysis could be undermined by obfuscation, dynamic analysis is proven to be resilient against heavily packed malware but could be time consuming.
Various machine learning models with static opcode as input features for malware detection have been proposed [7,8,16]. Santos et al. [7] were one of the first to propose the use of opcode sequences for malware detection. In their approach they performed feature selection with information gain and obtained top 1000 n-gram opcode sequences and applied various machine learning classifiers to achieve a performance of 95.9% accuracy with SVM on n-gram of size 2. Quite recently, Manavi et al. [16] demonstrated an image processing method for malware detection using opcode graph and ensemble of SVM and k-nearest neighbour (KNN) models.
In this research, we contribute to the existing literature on malware detection with a weighted average ensemble model that combines natural language processing (NLP) techniques and classical machine learning algorithms. Our major contribution lies in how the ensemble strategy is implemented. Unlike other ensemble techniques such as [17,18], where models of multiple machine learning classifiers are trained on the same feature set, our proposed ensemble strategy trains a single classifier on multiple feature sets obtained from different n-gram sizes of opcode sequences. We employ n-grams, an NLP technique [19], to reinforce contextual meaning in opcode sequences. The main contributions of this research are:
• To propose a machine learning based malware detection model consisting of an ensemble of n-gram opcode sequences.
• To evaluate the proposed model and find the optimal n sizes of ensemble n-gram opcode sequences that produce the best detection accuracy.

Related work
For decades, anti-malware creators have relied on signature-based methods, which have recently been proven ineffective against zero-day malware. Breakthroughs in machine learning, however, have inspired researchers to explore data-driven approaches to malware detection and classification [7,8,9,10]. Applying ML to malware detection requires discriminative and representative features to be obtained from binary files. Static and dynamic analysis have become the de facto methods for generating representations of executables. The former extracts binary file components, including opcodes, text strings and the control flow graph, without executing the binary [8,10,11,20], whereas the latter executes the binary to extract features such as API call traces, network-related information, and memory and register usage [13,14,20]. While both modes of feature generation are recommended, static analysis has the limitation of being susceptible to obfuscation. Dynamic analysis, on the other hand, can be time consuming but does better against obfuscation, since the binary is executed for behavioural analysis.
In a malware detection and classification study, Fuyong et al. [11] employed static features comprising the raw bytes of binary files and generated n-grams of the byte code, which served as the basis for comparing the similarity between a test sample and malware or benign files. Vinayakumar et al. [13] proposed malware detection frameworks using statically and dynamically extracted features. In their work, different static features, including raw byte histograms, byte entropy histograms and strings, were used to train classical machine learning and deep neural network models. They obtained 97.0% and 98.9% as best accuracies using random forest (RF) and a deep neural network (DNN) respectively. In addition, they proposed a dynamic malware detection approach in which a Cuckoo sandbox was employed to collect machine activity data from executed binary samples, and the extracted features were used to train shallow and deep machine learning models. A behaviour-based malware detection model presented by Galal et al. [14] employed API sequences invoked by executed binaries and trained classification algorithms including RF, decision tree (DT) and support vector machine (SVM), which resulted in accuracies of 96.89%, 96.14% and 94.8% respectively.
There have been several proposals for malware detection with ML methods using static instruction sequence representations [7,8,16,21]. The work by Santos et al. [7] has shown that machine learning based malware detection models could be used as a complement to signature-based engines. In their work, instruction codes were proposed as a representation of executable binaries for the detection of malware. The authors of [8] also applied different classical ML models to malware categorization using n-gram opcode sequences and recorded an f-measure of 98% as their best result using SVM. Similarly, Ni et al. [10] and Lu [21] have shown promising results of 99.26% and 97.87% accuracy respectively on malware classification with deep learning models trained on opcode sequences.

Proposed methodology
The methodology of the proposed ensemble model is divided into three main parts: feature extraction, feature selection and the ensemble classification model. In the feature extraction phase, we disassemble the binary file into assembly code (Bin2Asm) and extract discriminative components of the disassembled output that serve as indicators of maliciousness. We extract multiple sets of features using n-grams with different n sizes. Afterwards, we reduce the dimensions of the extracted features by selecting the n-gram features that carry the most information about whether a binary is malware or goodware. Finally, the selected features are used to build an ensemble classifier, in which model predictions on the trained feature sets are weighted and averaged to detect malware. This procedure is shown in Figure 1.

Feature extraction
Static and dynamic analyses are the major techniques that can be leveraged to extract features to represent binary files. In this paper we represent binaries by static opcodes, which we obtain from executables with a disassembler. Each file sample is converted into its assembly version using the Radare2 disassembler, and all the opcodes present in the disassembled file are extracted as a single sequence. We then generate n-grams from the opcode sequences by sliding a window of size n across each sequence, as shown in Figure 2. Applying different n-gram sizes to static and dynamic feature representations of binaries has yielded remarkable improvements in the performance of ML-based malware detection models [9,22,23]. Kang et al. [8], for example, achieved an f-measure of 98% with their Android malware detection model using n-gram opcodes of size 4. In the experiments of Moskovitch et al. [24], n-grams of size 2 outperformed all other n sizes on malware detection with opcode representations. These results inspired the proposed malware detection model using ensemble n-gram opcode sequences. In this research, we consider n-grams of sizes ranging from 1 to 4 and generate multiple feature sets of opcode sequences for the different n sizes. In Table 1, we show statistics on the number of unique n-gram opcodes obtained from our dataset, which has a total of 2000 benign and malware samples. Our findings agree with other research [7,8,25] in that the number of unique n-grams grows as n increases. Since machine learning classifiers require numerical feature representations, we vectorize each sample's n-gram opcode sequences using term frequency-inverse document frequency (TF-IDF) [26,27].
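The sliding-window step described above can be illustrated with a minimal sketch. The function name `opcode_ngrams` and the sample opcode sequence are hypothetical and chosen only for illustration; they are not taken from the dataset or from the authors' implementation.

```python
from collections import Counter

def opcode_ngrams(opcodes, n):
    """Slide a window of size n over the opcode sequence, one step at a time."""
    return [" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

# Hypothetical opcode sequence, as it might appear after disassembly.
sample = ["push", "mov", "call", "mov", "call", "ret"]

bigrams = opcode_ngrams(sample, 2)
counts = Counter(bigrams)  # occurrences of each 2-gram in this sample
print(bigrams)
print(counts.most_common(1))
```

A sequence of length L yields L - n + 1 n-grams, so larger n gives slightly fewer tokens per sample while the number of possible distinct n-grams grows.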
TF-IDF first builds a dictionary of unique n-gram opcode sequences. The term frequency (TF) then measures how often each unique n-gram opcode occurs within a given sample, and the inverse document frequency (IDF) measures the importance of that n-gram opcode on the basis of how often it occurs across the entire corpus. Let D = {d_1, d_2, d_3, …, d_n} be a set of documents for n disassembled file samples, and let d = {t_1, t_2, t_3, …, t_m} be the output of a disassembled sample, where m is the number of n-grams in d. The term frequency TF(t, d), shown in equation (1), computes the frequency of occurrence of t (i.e., an n-gram opcode) in d. In its standard form, with f_{t,d} denoting the number of occurrences of t in d:

$$\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \qquad (1)$$

Using equation (2), the importance of n-gram opcode t across the entire set of documents D (in our case a corpus of binary samples) is derived from the inverse document frequency IDF(t, D):

$$\mathrm{IDF}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|} \qquad (2)$$

Finally, the importance of n-gram opcode t to a disassembled output d in the corpus of samples D is obtained using TF-IDF, which is the product of TF(t, d) and IDF(t, D).
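As a worked illustration of the TF-IDF weighting described above, the following sketch computes the score for a toy corpus. It assumes the textbook definitions (relative term frequency and an unsmoothed logarithmic IDF); the exact variant used in the experiments is not stated here, and library implementations such as scikit-learn's `TfidfVectorizer` apply IDF smoothing and vector normalisation by default, so their values will differ. The corpus contents are hypothetical.

```python
import math
from collections import Counter

def tf(term, doc):
    """Relative frequency of term within one document (a list of n-grams)."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Unsmoothed IDF; assumes the term occurs in at least one document."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Hypothetical corpus: each document is the 2-gram sequence of one sample.
corpus = [
    ["push mov", "mov call", "call ret"],
    ["push mov", "mov xor", "xor ret"],
    ["push mov", "mov call", "call mov", "mov call"],
]

# "push mov" occurs in every sample, so it carries no weight.
print(tf_idf("push mov", corpus[0], corpus))
# "mov call" occurs in two of three samples and twice in the third one.
print(tf_idf("mov call", corpus[2], corpus))
```

The example shows the intended effect: an n-gram that appears in every sample is scored as uninformative, while one that is frequent in a sample but rarer across the corpus receives a higher weight.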

Feature selection
As the statistics in Table 1 show, our dataset produces a very large number of unique n-gram opcodes, especially for n-gram sizes of 2 and above. This large vocabulary carries over into the vector representations obtained from the TF-IDF model. Training a machine learning model on the entire vocabulary may not be ideal, since not all n-gram opcodes carry substantial information for the detection of malware. We therefore apply feature selection to reduce the dimensions of the TF-IDF vector representations and remove less important n-gram opcode features.
To select optimal features, we use information gain [28] to filter out less useful features and keep the top 1000 most informative ones. For n = 1, however, the total number of unique n-gram opcodes is less than 1000, so we skipped feature selection for the single-opcode feature set. We measure information gain for the n-gram opcode features by entropy [29], which measures the level of uncertainty in a random variable. With entropy, we can quantify the reduction in uncertainty about a class variable (in our case malware or goodware) given a feature variable (i.e., an n-gram opcode).
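The entropy-based ranking can be sketched as follows. This is a minimal illustration that assumes each feature has been reduced to a discrete value (here, presence or absence of an n-gram in a sample); how the continuous TF-IDF values were handled in the experiments is not specified, so the discretisation is an assumption. Information gain is computed as the class entropy minus the class entropy conditioned on the feature, and all names and toy values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

# Toy data: 1 = malware, 0 = goodware; features mark presence of an n-gram.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "xor jmp": [1, 1, 1, 0, 0, 0],   # separates the classes perfectly
    "push mov": [1, 0, 1, 0, 1, 0],  # nearly uninformative
}

scores = {name: information_gain(col, labels) for name, col in features.items()}
top_k = sorted(scores, key=scores.get, reverse=True)[:1]  # keep the k best features
print(scores, top_k)
```

In the setting described above, the same ranking would be taken over the full vocabulary with k = 1000.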
Information gain (IG), shown in equation (3), measures the reduction in entropy of a class variable N that results from the information provided by a feature variable M about N.

$$IG(N, M) = H(N) - H(N \mid M) \qquad (3)$$

Here, H(N) is the entropy of the class variable N and H(N | M) is the entropy of class variable N after observing variable M. For two given feature variables M and Q, feature M is regarded as a more useful indicator than feature Q for a class G if IG(G, M) > IG(G, Q).

Ensemble classification model
Research has shown that the results of single-classifier ML-based models can be improved using a hybrid scheme [17,18,30]. This hybrid approach, also known as ensemble learning, enhances the results of a single model by fusing numerous classifiers. Common ensemble strategies consist of multiple classifiers trained on either a single feature set or multiple feature sets. The benefit of multiple feature sets is that the ensemble model can build a more in-depth discriminative characterization of the samples. In this study, we employ an ensemble strategy in which multiple feature sets generated from different n-gram sizes of opcode sequences are each used to train a single classifier.
As shown in equation (4), the individual predictions of the single classifier LM trained on the n_i-gram and n_j-gram opcode feature sets are weighted and averaged to produce a final predicted class ŷ.

$$\hat{y} = \arg\max \frac{W_{n_i}\, LM_{n_i} + W_{n_j}\, LM_{n_j}}{W_{n_i} + W_{n_j}} \qquad (4)$$
Here, the models LM_ni and LM_nj each predict a vector of probabilities, with one probability per class (i.e., malware or benignware). The predicted probabilities of each model are weighted with weights W obtained from a grid search over a pre-defined range between 0 and 1. The argmax function [31] is then applied to the weighted average of the predictions to obtain the index of the largest vector entry, which is the final predicted class (i.e., 0 = benignware and 1 = malware).
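The fusion step can be sketched in a few lines. The sketch assumes the weighted average is normalised by the sum of the two weights, which is one common reading of a weighted average; because argmax is unchanged by a positive scaling factor, the predicted class is the same under other normalisations. The function name and probability values are hypothetical.

```python
def weighted_average_ensemble(probs_i, probs_j, w_i, w_j):
    """Fuse two class-probability vectors and return (predicted_class, fused).

    Assumes w_i + w_j > 0. Index 0 = benignware, index 1 = malware.
    """
    fused = [(w_i * p + w_j * q) / (w_i + w_j) for p, q in zip(probs_i, probs_j)]
    predicted = max(range(len(fused)), key=fused.__getitem__)  # argmax
    return predicted, fused

# Hypothetical predictions for one sample from the 1-gram and 2-gram models.
p_unigram = [0.60, 0.40]  # leans benign
p_bigram = [0.35, 0.65]   # leans malware

label, fused = weighted_average_ensemble(p_unigram, p_bigram, w_i=0.3, w_j=0.7)
print(label, fused)
```

In this toy case the two models disagree, and the larger weight on the 2-gram model decides the final verdict.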
We trained and tested the proposed ensemble model using three different machine learning classification algorithms: support vector machine (SVM) [32], random forest (RF) [33] and K-nearest neighbour (KNN) [34]. We used the scikit-learn [35] implementations of these classifiers.

Dataset generation
Our dataset consists of a balanced set of 2000 malware and benign Windows portable executable (PE) files obtained from a research project by Tuan et al. [36]. The original dataset contains 1000 goodware and 8970 malware samples, where the malware set comprises 5 different malware categories. All the malware samples were collected from a combination of virusshare [37] and malicia-project [38], and the benign binaries were downloaded from [39]. The dataset was verified with VirusTotal [40]. To generate a balanced dataset, the malware set was downsampled by randomly selecting 200 samples from each malware category to compose the 1000 malware samples; no sampling was applied to the benign samples.
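The per-category downsampling can be sketched as below. The category names, category sizes and sample identifiers are placeholders, since the actual composition of the five malware categories is not given here; only the selection of 200 samples per category follows the description above.

```python
import random

def balanced_downsample(samples_by_category, per_category, seed=0):
    """Randomly pick the same number of samples from every category."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    selected = []
    for category in sorted(samples_by_category):
        selected.extend(rng.sample(samples_by_category[category], per_category))
    return selected

# Placeholder data: five hypothetical categories with 300 sample IDs each.
malware = {
    f"category_{c}": [f"category_{c}/sample_{i}" for i in range(300)]
    for c in range(5)
}

subset = balanced_downsample(malware, per_category=200)
print(len(subset))
```

Sampling without replacement inside each category keeps the selected set free of duplicates and gives every category equal weight in the final malware set.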

Evaluation metrics
The efficacy of our proposed ensemble malware detection model was measured on the basis of total accuracy (shown in equation (5)), which is the number of correct predictions divided by all predictions made and precision (shown in equation (6)), which is the proportion of positive predictions correctly identified.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (6)$$
Where, true positive (TP) is the number of malware samples correctly identified, true negative (TN) is the total benign samples correctly classified, false positive (FP) is the number of benign binaries wrongly identified as malware and false negative (FN) is the number of malware misclassified as legitimate.
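Both metrics follow directly from the four confusion-matrix counts defined above. The counts in this sketch are invented for illustration and are not results from the experiments.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Share of samples flagged as malware that really are malware."""
    return tp / (tp + fp)

# Hypothetical counts for one test fold of 400 samples.
tp, tn, fp, fn = 190, 198, 2, 10

print(f"accuracy  = {accuracy(tp, tn, fp, fn):.4f}")
print(f"precision = {precision(tp, fp):.4f}")
```

With few false positives and more false negatives, as in these made-up counts, precision comes out higher than accuracy, because false negatives lower accuracy but do not enter the precision formula.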

Experimental results and discussion
For our ensemble model, we trained a single classifier using two separate feature sets obtained from different n-gram sizes (n_i-gram and n_j-gram) of opcode sequences, where n_i and n_j could be any pair of sizes in the range 1 to 4. To derive the optimal weight combination W_ni and W_nj for the ensemble n-gram opcode feature sets, we applied a grid search over a set of pre-defined weights in the range 0 to 1.
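A grid search over weight pairs can be sketched as follows. The grid step of 0.1 is an assumption, as the pre-defined weight values are not listed here, and the probability vectors and labels are toy values. The sketch scores each weight pair by the accuracy of the fused predictions and keeps the best pair.

```python
from itertools import product

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def ensemble_accuracy(probs_i, probs_j, labels, w_i, w_j):
    """Accuracy of the weighted fusion of two models' class probabilities.

    The weighted sum is left unnormalised: dividing by a positive weight
    sum would not change which class has the largest score.
    """
    hits = 0
    for p, q, y in zip(probs_i, probs_j, labels):
        fused = [w_i * a + w_j * b for a, b in zip(p, q)]
        hits += argmax(fused) == y
    return hits / len(labels)

def grid_search_weights(probs_i, probs_j, labels, grid):
    """Return (best_accuracy, (w_i, w_j)) over all weight pairs in the grid."""
    return max(
        (ensemble_accuracy(probs_i, probs_j, labels, w_i, w_j), (w_i, w_j))
        for w_i, w_j in product(grid, repeat=2)
    )

# Toy validation data: per-sample [P(benign), P(malware)] from two models.
probs_i = [[0.60, 0.40], [0.45, 0.55], [0.70, 0.30], [0.20, 0.80]]
probs_j = [[0.30, 0.70], [0.20, 0.80], [0.60, 0.40], [0.55, 0.45]]
labels = [1, 1, 0, 1]

grid = [step / 10 for step in range(11)]  # assumed grid: 0.0, 0.1, ..., 1.0
best_acc, best_weights = grid_search_weights(probs_i, probs_j, labels, grid)
print(best_acc, best_weights)
```

In this toy example each model alone misclassifies one sample, while a suitable weighting of both classifies all four correctly, which is the effect the weight search is meant to find.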
To evaluate the performance of the proposed ensemble model, we adopted K-fold cross-validation, which partitions the dataset into K subsets and performs K different rounds of learning and testing. We performed cross-validation with K = 5 for all ensemble models. The accuracy (acc) and precision (precc) scores of the individual rounds were averaged to obtain the global performance of each tested ensemble model.
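The partitioning behind K-fold cross-validation can be sketched with plain index arithmetic. This sketch shuffles the sample indices once and deals them into K folds; whether the folds in the experiments were shuffled or stratified by class is not stated, so that choice is an assumption.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=2000, k=5))
for round_no, (train, test) in enumerate(splits, start=1):
    print(round_no, len(train), len(test))
```

Each sample is used for testing in exactly one round and for training in the other four, and the per-round scores are then averaged as described above.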
As shown in Table 2, the overall best accuracy was obtained by RF trained with the Gini criterion, which yielded an accuracy of 98.1% using ensemble n-gram opcode sizes 1 and 2 with weights 0.3 and 0.7 respectively. The overall best precision was 99.7%, obtained with RF-Gini using n-gram sizes 2 and 3. The other classifiers also obtained accuracies greater than 95%. SVM with an RBF kernel yielded its best accuracy of 97% using ensemble n-gram sizes 1 and 3 with weights 0.6 and 0.4 respectively, and its best precision of 96.7% using n-gram sizes 1 and 2 with respective weights 0.6 and 0.4. Similarly, ensemble models trained with KNN with k = 5 neighbours recorded a best accuracy of 98% and a precision of 98.4% using n-gram sizes 1 and 2 with respective weights 0.4 and 0.6. In general, our model recorded higher precision scores than accuracy scores. The higher precision indicates that samples flagged as malware were rarely benign; that is, the model has a minimal false positive alarm rate. The relatively lower accuracy scores, on the other hand, can be attributed to false negatives (i.e., malicious samples misclassified as legitimate).
Finally, in Figure 3, we show that the overall best performing model in terms of accuracy obtained a mean area under the curve (AUC) score of 1. To put the performance of the ensemble models in context, we compare their results to classifiers trained on individual n-gram feature sets. We found that the ensemble models outperformed models trained on individual n-gram opcode sequences. As shown in Table 3, the best accuracy for models trained on a single feature set was obtained with RF, which yielded an accuracy of 97.6% and a corresponding precision of 98.1% using n-grams of size 2. Although the individual models achieved promising results, the ensemble models evaluated earlier (see the bold RF, SVM and KNN results in Table 2) produced slightly higher detection accuracies.

Conclusion
In this paper, we evaluated ensemble of n-gram opcode sequences for malware detection. From our experiments, we found that ensemble of multiple feature sets of n-gram opcode sequences yielded higher detection results compared to classification models trained on individual n-gram opcode sequences. The best ensemble models were able to detect malware with an accuracy of 98.1% and 99.7% in terms of precision.
In future work, we wish to explore other representation learning methods for malware detection. In particular, we want to concentrate on deep learning models that automate feature extraction and learn from raw static representations of binaries.