Intelligent Security Schema for SMS Spam Message Based on Machine Learning Algorithms

Abstract: SMS spam messages represent one of the most serious threats to current traditional networks. These messages have been particularly prevalent overseas and are harmful to various types of devices. The filtering scheme employed in conventional systems is unable to detect a large number of these messages. To resolve this issue, a new intelligent security system is proposed to reduce the number of spam messages. It can detect novel spam messages that have a direct and negative impact on networks. The proposed system relies heavily on machine learning to examine various types of messages. The primary achievement of our study is an increase in the accuracy ratio together with a reduction in the number of false alarms. The experimental results show that our system achieves strong results and detects a large number of spam messages.


Introduction
Security systems are considered one of the most important issues in the scientific research area [1]. Many modern applications suffer from a lack of protection techniques, which exposes them to various attacks. Short Message Service (SMS), or mobile text messaging, is a communication service component of phone, web, or mobile communication systems that uses standardized communication protocols for the exchange of short text messages between fixed-line or mobile phone devices [2] [3]. Mobile text messages are used for communication between cell phone users when voice communication is undesirable or impossible. However, some of the text messages that are forwarded to the user's device are bothersome and unwanted, and these are called SMS spam.
Users store personal and confidential information on their smartphones, such as contact lists, numbers, passwords, and credit card information. Thus, using SMS spam, hackers can attack users' devices and exploit this information. Privacy invasion and unauthorized access to sensitive information are the main problems arising from spam messages. The privacy of the user is violated by individuals commonly known as spammers, who use various unethical practices to access user data stored on smartphones without the user's knowledge [4].
Spam messages are unwanted but often unavoidable. SMS spam can be undesired emails delivered as text messages across mobile devices [5]. Some businesses use these messages to promote and advertise their products in order to increase their audience. Besides promoting services or products, SMS spam can threaten users' privacy and can lead to identity theft and fraud through attacks carried out via spam text messages [6]. Spam messages originate from all regions worldwide; however, China is a major source of these messages compared with other countries [7].
Recently, the popularity of SMS has increased due to the growth of different communities of mobile users, which has given rise to various techniques and tools for spamming mobile phones in order to maximize the desired result. Problems associated with SMS spam have inspired researchers to present different techniques for the effective detection and prevention of spam SMS. The general phases of SMS spam filtering are outlined in the following figure. In the detection of spam messages, the SMS datasets available for training and testing are still limited by their small size. Moreover, the number of features that can be used for spam message detection is low because text messages are short. Different machine learning algorithms and techniques have been used to sort through spam messages and filter them. The objective of this paper is to present a system that resolves the problems associated with spam message detection. The proposed system relies on the random forest algorithm and the decision tree as detection classifiers.
The rest of this paper is organized as follows: Section 2 presents related works. The methodology of the proposed approach is described in Section 3. The experimental results are given in Section 4. Finally, Section 5 presents conclusions and future directions.

Related works
Many studies on classifying and detecting SMS spam have been presented, and a variety of related issues have been discussed. This section summarizes the earlier research related to this field.
In [4], the authors present a review that compares the performance of machine learning algorithms. The review shows that the support vector machine (SVM) and Naive Bayes perform efficiently. Another review of SMS spam classification and filtering is presented in [8]. This review focuses on expanding the number of features used for SMS classification and considers how the number of selected features affects the accuracy rate. The authors aimed to contribute to determining spam's impact level or risk.
Machine learning has been widely used in the SMS spam classification field, and many works have been presented. In [9], a survey is presented on the performance of SVM in identifying and filtering spam messages. In [10], various methods are used to analyze SMS spam, and a new pre-processing technique is used to obtain an actual dataset of SMS spam. Different algorithms have been applied to this dataset to find one that achieves both high recall and high accuracy. The results show that the Random Forest algorithm is capable of classifying a new dataset into ham and spam. To filter SMS spam, the Random Forest algorithm and Term Frequency-Inverse Document Frequency (TF-IDF) are used in [7]. The experimental results show that the Random Forest algorithm achieves effective performance, with an accuracy of 97.50%.
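For readers unfamiliar with the TF-IDF weighting mentioned above, the following is a minimal sketch of one common variant (term frequency normalized by message length, inverse document frequency as the natural log of N/df). It is not the implementation used in [7]; the function name `tfidf` and the toy messages are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many messages each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy messages, not taken from any dataset in the paper.
messages = [
    "free prize call now",
    "call me later",
    "free entry win prize now",
]
vectors = tfidf(messages)
print(vectors[1])
```

Terms that occur in few messages receive larger weights than terms spread across many messages, which is why this weighting is popular for short texts such as SMS.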
In [11], different machine learning models, such as LightGBM, XGBoost, and Bernoulli Naive Bayes, are used to achieve greater speed and efficiency in the classification of SMS spam with low latency. The results show that Bernoulli Naive Bayes, followed by LightGBM with the TF-IDF matrix, achieved the highest accuracies: 96.5% in 0.157 seconds and 95.4% in 1.708 seconds, respectively. A Recurrent Neural Network with SVM is used in [12] to detect bot spam emails using a spam dataset. The results show that the presented solution achieves better performance in the detection of spam emails, reaching 98.7%. Machine learning techniques are used in [13] to develop a system for spam filtering and identification of legitimate emails. The results show that the classification system achieves 92% and 94% correct spam identification, with few false positives, at 1.0%.
A framework for spam detection and risk estimation based on data stream clustering and classification is presented in [14]. The authors used the Multinomial Naive Bayes and K-nearest neighbor algorithms for classification. In addition, the K-means algorithm is used in the clustering phase for SMS spam detection. For system evaluation, several metrics are used to assess the performance of the classification and clustering methods. The WEKA text technique is used as a means of spam message classification/filtering in [15]. Different algorithms are used for SMS dataset classification, and metrics such as accuracy, error rate, and time are computed to select the best one.
In [16], Bayesian filtering techniques used for email spam blocking are analyzed to determine to what extent they can be applied to SMS. The authors proposed two SMS spam test sets of significant size, with some specific words, for use with machine learning algorithms. The results show that Bayesian filtering techniques can be efficiently transferred from email to SMS spam. SMS spam filtering and a review of recent research in the area are presented in [17]. That paper also studies data collection, analyzes a large corpus of SMS spam, and provides results. In [18], the Naive Bayes algorithm was used to propose a spam classification model for mobile devices. This system aims to correctly filter incoming SMS received by users. The proposed model obtained efficient and dependable results.
A new public, large, real, and non-encoded SMS spam collection dataset was used in [3], together with a comprehensive analysis. The performance of several machine learning techniques is compared. A novel machine learning system for SMS spam message detection is proposed in [19]. The proposed system relies on feature extraction and decision making. The results show that it achieves high detection rates in terms of classification accuracy and F-measure compared with other proposed approaches.
In contrast to the systems reviewed above, our paper presents a detection system that uses the Random Forest algorithm and a decision tree as classifiers to solve the problems associated with spam message detection.

SMS spam security
Threats to SMS security include message disclosure, man-in-the-middle attacks, SMS viruses, and SMS spamming. In SMS spamming, SMS is used as a marketing channel, and many people are annoyed by receiving SMS spam. The availability of bulk SMS broadcasting makes it easy for virtually anyone to send out mass SMS messages [20]. Spam SMS detection is relied upon to solve this problem.
The most prevalent malware uses email to spread and uses passwords to log in to a system and steal confidential data. The top 10 malicious programs spread by email are listed in [21]. Spam is one type of threat that can exploit vulnerabilities and lead to a hazardous situation when the probability of the potential risk occurring is high [22]. SMS spam management relies on three phases: spam detection, classification, and severity-level determination [23]. SMS spam detection requires the following risk management processes: risk identification, risk assessment, risk response, and risk monitoring. Spam management involves three main processes: spam classification, spam clustering, and determination of the spam's severity level [24].

Methodology
In this paper, the main objective of the proposed work is to classify SMS messages as either spam or ham (legitimate). The proposed system starts with the collection of a dataset generated from real-world messages and ends with a classification decision on whether the behavior is normal or abnormal. The proposed work includes the process presented in Figure 2 below.

Fig. 2. Processes of SMS spam classification
In this paper, two machine learning techniques are used to distinguish spam from ham (legitimate) messages. These techniques play an important role in providing a secure environment for users who participate in various networks.

Dataset Source
The SMS spam dataset used in this research was obtained from Kaggle, a machine learning repository [15]. It contains two columns, v1 and v2, and 5572 instances. The v2 column holds the input messages, each of which is either ham (legitimate) or spam. The predicted label (v1) has two classes: 0 = ham and 1 = spam. In the data, 4900 instances are ham and 672 are spam. The dataset is presented in Table 1 [24]. Table 1 shows the main contents of the messages that were sent and received between users.
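The label encoding and class balance described above can be sketched as follows. This is an illustration only: it assumes that v1 holds the class label as the strings "ham" and "spam" and that v2 holds the message text (the usual layout of this Kaggle file, but an assumption here), and the three rows are hypothetical stand-ins for the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in rows in the assumed v1 (label) / v2 (text) layout;
# the real dataset would be read from a file on disk instead.
sample_csv = io.StringIO(
    "v1,v2\n"
    "ham,See you at lunch\n"
    "spam,WIN a free prize now\n"
    "ham,Running late today\n"
)

LABEL_TO_INT = {"ham": 0, "spam": 1}  # 0 = ham, 1 = spam, as in the text

texts, labels = [], []
for row in csv.DictReader(sample_csv):
    texts.append(row["v2"])
    labels.append(LABEL_TO_INT[row["v1"]])

print(Counter(labels))

# Class balance reported in the paper.
n_ham, n_spam = 4900, 672
total = n_ham + n_spam
# Accuracy of a trivial classifier that always predicts "ham".
majority_baseline = n_ham / total
print(total, round(majority_baseline, 4))
```

With the reported counts, a classifier that always predicts ham would already be right about 87.9% of the time, so accuracy figures on this dataset are best read against that baseline and alongside recall.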

SMS message spam classification
The random forest (RF) algorithm and the decision tree are machine learning algorithms that handle large datasets with various feature types, such as numerical, binary, and categorical [25]. In this system, RF and decision trees are used to classify SMS messages efficiently as spam or ham. The system combines a set of decision trees to reduce overfitting in the training phase. In the RF algorithm, each tree operates on randomly chosen attributes and can therefore produce a prediction that differs from those of the other trees [26]. As a result, each tree achieves a different level of performance, and the outputs of all trees are averaged to produce the final result. The processes of RF are presented in Figure 3.
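The idea described above (many trees, each trained on randomly chosen attributes, with their outputs combined) can be illustrated with a deliberately small sketch. It is not the authors' implementation: each "tree" is simplified to a depth-1 decision stump, aggregation is assumed to be a majority vote, and the feature names, toy data, and hyperparameters (`n_trees`, `n_features`, `seed`) are hypothetical.

```python
import random
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the votes."""
    return Counter(votes).most_common(1)[0][0]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(X, y, features):
    """Fit a depth-1 tree using the best of the given candidate features."""
    fallback = majority_vote(y)
    best = None
    for f in features:
        left = [label for row, label in zip(X, y) if row[f] == 0]
        right = [label for row, label in zip(X, y) if row[f] == 1]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or impurity < best[0]:
            best = (
                impurity,
                f,
                majority_vote(left) if left else fallback,
                majority_vote(right) if right else fallback,
            )
    _, feature, label_if_0, label_if_1 = best
    return feature, label_if_0, label_if_1

def fit_forest(X, y, n_trees=25, n_features=2, seed=0):
    """Train stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # with replacement
        features = rng.sample(range(n_cols), n_features)       # random attributes
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], features))
    return forest

def predict(forest, row):
    """Each stump votes; the majority label is the forest's prediction."""
    votes = [label_if_1 if row[feature] == 1 else label_if_0
             for feature, label_if_0, label_if_1 in forest]
    return majority_vote(votes)

# Hypothetical binary word-presence features: [free, prize, call, lunch]
X = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 0],
     [0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

forest = fit_forest(X, y)
print(predict(forest, [1, 1, 0, 0]))
```

Because every stump sees a different bootstrap sample and a different attribute subset, individual stumps disagree, and the vote smooths out their individual errors. A full random forest grows deeper trees in the same way.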

Performance measurement
In this paper, several metrics are used to measure the efficiency of the proposed system. The metrics, which are accuracy, the confusion matrix, and recall, can be calculated as follows [27]:
\[ \text{Accuracy} = \frac{\text{Number of correctly classified patterns}}{\text{Total number of patterns}} \]
To measure and evaluate the proposed system's performance, a confusion matrix is calculated, which involves four categories: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The four categories are defined as follows [27]:
• True positive (TP) is the number of spam messages correctly predicted as spam.
• False positive (FP) is the number of ham messages wrongly predicted as spam.
• True negative (TN) is the number of ham messages correctly predicted as ham.
• False negative (FN) is the number of spam messages wrongly predicted as ham.
The last metric used in this paper is recall, which is calculated as follows:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
All these metrics are used in this paper to test and evaluate the performance of the proposed security system.
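The metrics of this section can be computed from predicted and true labels as in the minimal sketch below. It assumes that spam (label 1) is the positive class and uses the standard definitions of accuracy and recall in terms of TP, FP, TN, and FN. The function names and the toy labels are hypothetical and are not results from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` (spam) as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # spam correctly flagged
            else:
                fp += 1  # ham wrongly flagged as spam
        else:
            if truth == positive:
                fn += 1  # spam missed
            else:
                tn += 1  # ham correctly passed
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Fraction of all messages that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    """Fraction of actual spam messages that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical labels for illustration only (1 = spam, 0 = ham).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)
print(accuracy(tp, fp, tn, fn), recall(tp, fn))
```

On these toy labels the counts are TP = 2, FP = 1, TN = 4, FN = 1, which gives an accuracy of 0.75 and a recall of 2/3.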

Experimental results and discussion
To compare the performance of the algorithms used in this experiment, this paper reports performance evaluation measures such as accuracy, precision, recall, F1 score, support, and time. Table 2 shows the performance evaluation of the random forest algorithm, and Table 3 shows the efficiency of the security system based on the decision tree. As presented in Tables 2 and 3, both the random forest and the decision tree achieved high accuracy in the classification of SMS spam, and the random forest reached an accuracy rate of 98.2%. Table 4 compares our proposed model with earlier models that use the decision tree and random forest algorithms. According to Table 4, our RF-based proposal is more accurate than the others.
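Precision and the F1 score are named among the evaluation measures but are not defined in the text. Assuming the standard definitions are intended, with spam as the positive class and TP, FP, and FN as defined in the performance measurement section, they can be written as:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```

Support, in the usual reporting convention, is simply the number of true instances of each class in the evaluated data.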

Conclusion and future directions
Nowadays, SMS spam detection is a major challenge due to the increase in the use of text messaging. In this paper, a technique for SMS spam detection based on the random forest and decision tree algorithms is proposed. The dataset used in this work consists of 4900 ham (legitimate) instances and 672 spam instances. The experimental results show that the classification system using the random forest algorithm gives the best results, with a 98.2% accuracy rate. In future work, other machine learning methods will be employed to achieve an improved accuracy rate in the SMS spam classification field.