Efficient Detection of Phishing Websites Using Multilayer Perceptron

Phishing is a type of Internet fraud that aims to acquire the credentials of users via scam websites. This paper presents an approach that uses a multilayer perceptron neural network to detect scam URLs. The proposed system improves the accuracy of scam detection, achieving an accuracy of 98.5%.
Keywords—Multilayer Perceptron (MLP), Activation function, Semantic attack, Phishing.


Introduction
In recent years, cyber-attacks have become increasingly common. Attackers use the computer as a tool, as a target, or sometimes both. A cyber-attack is an intrusion by computer hackers who use one or more computers against single or multiple computers or against infrastructure. A cyber-attack deliberately destroys computers, steals information, or uses a compromised computer as a starting point for other threats [1,2].
Cyber-attacks are classified mainly into two categories. The first category is syntactic attacks, which are grouped under the name "malicious software" or "malware" and include viruses, Trojan horses, and worms. As soon as the malicious software is inserted into a computer, the computer system starts performing undesired functions [3]. The second category is semantic attacks, where the attackers collect the victim's information through websites or links that look like trusted websites in order to acquire his/her username, password, and credit card information [4]. Table 1 shows some types of semantic attacks and a brief description of each [5].

Table 1. Types of semantic attacks
[...]: The attacker uses a dictionary in an attempt to guess the password.
Denial-of-Service Attack: The attack focuses on the interruption of a network service.
Backdoor: Any secret method of bypassing normal authentication or security controls.
Eavesdropping: Listening to a private conversation.
Spoofing: Falsifying data.
Privilege Escalation: An attacker is able to fool the system into giving him/her access to restricted data.
Phishing: The attacker uses email, websites, and URLs to obtain usernames, passwords, and credit card details directly from users.
Nowadays, most Internet users face website phishing daily through different channels, such as email, SMS, or instant messages, in which attackers employ social engineering techniques against unsuspecting users. Phishing emails are designed to sound as if they were sent from a legitimate corporation or a recognized individual. Such emails also aim to get the victim to follow a link that leads to a fake website that claims to be legitimate. The victim may then be asked to enter confidential information, such as usernames, passwords, and credit card details [6,7].
Phishing websites are now a significant issue, not only because of the rise in the number of such websites but also because of clever tactics used to develop these websites, so that even users with good experience with cybersecurity and the Web could be fooled [8].
The rest of this paper is organized as follows. Section II discusses previous anti-phishing techniques. Preliminaries are discussed in Section III. The novelty of the proposed approach is discussed in Section IV. Section V describes the simulation results of the proposed work. Finally, concluding remarks and future work are offered in Section VI.

Prior Works
Phishing protection methods are classified into two main categories: denunciation platforms and heuristics-based solutions. Denunciation platforms are built by developers and periodically provide the web browser with an updated blacklist [9]. Google developed Safe Browsing, which operates in the Safari, Chrome, and Firefox browsers. Microsoft maintains SmartScreen, which operates in Internet Explorer and Edge. The main drawback of the blacklist model is the time needed to recognize phishing sites: sometimes a site is listed on the same day (zero-day) and sometimes it takes months, which leaves enough time to defraud multiple victims. The second category is heuristics-based solutions, where heuristic algorithms study the URL features and predict whether the URL is trusted or malicious [10].
Dhamija et al. introduced Dynamic Security Skins, which employ a shared secret image that enables the server to verify its identity to the user [11].
Beatson et al. proposed the Trusted Credentials Area (TCA), a form of third-party certification against phishing [12].
Other authentication techniques are used to protect the user against phishing. These techniques deploy user authentication, email authentication, and server authentication. AOL introduced Passcode as a user authentication measure against password phishing, in which each passcode expires after 60 seconds [11]. Microsoft employed Sender ID to address the domain spoofing problem [12]. Jakobsson introduced another model, the phishing graph, to visualize the flow of information in a phishing attack; the graph makes it possible to understand and analyze the attack [13].
Silva et al. used a logistic regression classifier to analyze the features of URLs and identify phishing URLs. Jain and Gupta employed the K-means algorithm to predict the similarity of suspicious pages. Other prediction algorithms used the content or information on the suspicious page [14]. Aburrous implemented hashing to identify malicious sites by verifying the CSS formatting as well as the JavaScript or HTML [15].
Afroz et al. integrated the potential of whitelisting strategies to prevent new or planned phishing scams with the ability of blacklisting and heuristic approaches to alert clients to harmful sites [16,26].

Preliminaries
In this research, the proposed model is evaluated using the selected dataset. This section describes the dataset and the Neural Network used in the proposed model.

Dataset
The dataset used in this article was collected from PhishTank, MillerSmiles, and Google search (26/0/2015). The dataset contains 2456 instances and 30 attributes. The 30 attributes are distributed over four feature categories: address bar, abnormal, HTML and JavaScript, and domain [17,27,28].
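To make the address bar category more concrete, the following minimal sketch derives two such attributes from a raw URL. The names Prefix_Suffix and having_Sub_Domain are taken from the attribute list reported later in the paper, but the rules, the dot-count thresholds, and the 1 / 0 / -1 encoding used here are illustrative assumptions, not the dataset's documented definitions.

```python
from urllib.parse import urlparse


def prefix_suffix(url: str) -> int:
    """Assumed rule: a hyphen in the host name is treated as suspicious (-1)."""
    # urlparse only fills in hostname when the URL carries a scheme such as http://
    host = urlparse(url).hostname or ""
    return -1 if "-" in host else 1


def having_sub_domain(url: str) -> int:
    """Assumed rule: more dots in the host name means more sub-domains."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # ignore a leading "www." label
    dots = host.count(".")
    if dots <= 1:
        return 1   # e.g. example.com
    if dots == 2:
        return 0   # e.g. mail.example.com
    return -1      # e.g. a.b.c.example.com
```

For example, under these assumed rules `prefix_suffix("http://secure-login.example.com/")` evaluates to -1, while `having_sub_domain("https://www.example.com")` evaluates to 1.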

Multilayer perceptron (MLP)
A multilayer perceptron is a class of feed-forward artificial neural network (FFNN) that consists of more than two layers: the first layer is the input layer, the last one is the output layer, and the layer(s) between them are called hidden layer(s). As the number of layers increases, the time complexity increases. Each neuron receives inputs (x1, x2, ..., xn) and a bias (b). Each input is multiplied by its weight (w), and the output (y) is then computed by the activation function (φ), as in (1).

$$y = \varphi\left(\sum_{i=1}^{n} w_i x_i + b\right) \tag{1}$$
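The neuron computation described above (weighted inputs plus a bias, passed through an activation function) can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the sigmoid activation and the helper names `neuron`, `layer`, and `mlp_forward` are assumptions introduced here.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation; assumed here, the paper does not fix the choice."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, activation=sigmoid):
    """Output of one neuron: activation(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)


def layer(x, weights, biases, activation=sigmoid):
    """Apply every neuron of one layer to the same input vector."""
    return [neuron(x, w, b, activation) for w, b in zip(weights, biases)]


def mlp_forward(x, layers):
    """Feed the input through each (weights, biases) pair in order."""
    for weights, biases in layers:
        x = layer(x, weights, biases)
    return x
```

Each additional hidden layer adds one more pass through `layer`, which is why the cost of a forward pass grows with the number of layers.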

Proposed Model
Fig 1 shows the system flow diagram for recognizing a URL. The proposed system reads the URL and then decomposes it into features according to the dataset components. Next, the model applies the single attribute evaluator, ranks the link's features, and eliminates the irrelevant attributes based on that ranking. The next step is to combine attributes and apply the search strategy to remove redundant data and keep the highly correlated attributes. Finally, the system decides whether the link is harmful or not.
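The ranking-and-elimination step can be illustrated with a small sketch. The paper does not name the single attribute evaluator it uses, so information gain is assumed here as one common choice. The function names and the threshold parameter are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy after splitting on one attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(rows, labels, names):
    """Score every attribute on its own and sort from best to worst."""
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((name, information_gain(column, labels)))
    return sorted(scores, key=lambda s: s[1], reverse=True)


def select_attributes(ranked, threshold):
    """Keep only attributes whose score exceeds the (assumed) threshold."""
    return [name for name, score in ranked if score > threshold]
```

Because each attribute is scored independently of the others, this step can discard irrelevant attributes but cannot detect redundancy between attributes; that is the job of the subsequent combination and search step.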

Experimental Work
The proposed model is evaluated using Weka 3.6 and Python. Table 2 shows the experimental parameters: the learning rate, the number of epochs (the number of passes through the data), the number of hidden layers, the batch size, and the momentum.
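To show how the learning rate, the momentum, and the number of epochs interact, the sketch below applies a classical momentum update to a toy one-parameter problem. It is an illustration only: the update rule is the standard textbook form, the function names are hypothetical, and the numeric values in the usage note are placeholders, not the settings reported in Table 2.

```python
def sgd_momentum_step(weights, grads, velocity, learning_rate, momentum):
    """One update: v <- momentum * v - learning_rate * grad; w <- w + v."""
    new_velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def minimize_quadratic(epochs, learning_rate, momentum, start=0.0, target=3.0):
    """Toy problem: minimize (w - target)^2, one update per epoch."""
    weights, velocity = [start], [0.0]
    for _ in range(epochs):
        grads = [2.0 * (weights[0] - target)]  # derivative of (w - target)^2
        weights, velocity = sgd_momentum_step(
            weights, grads, velocity, learning_rate, momentum)
    return weights[0]
```

With placeholder settings such as `minimize_quadratic(200, 0.1, 0.2)`, the weight approaches the target value of 3.0; a larger learning rate takes bigger steps per epoch, and the momentum term carries part of the previous step into the next one.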

Discussion of Results
This section describes the results. The confusion matrix is shown in Table 3. The accuracy and F-measure of the proposed model are evaluated. After applying the single attribute evaluator, the system retained only 10 attributes (class included). These attributes are as follows: Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Request_URL, URL_of_Anchor, Links_in_tags, SFH, web_traffic, and Google_Index. Fig 2 and Fig 3 show the structure of the MLP network and the heat map of the correlation factor with the result (Safe, Unsafe).
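The correlation values behind a heat map of this kind, and one simple way to drop redundant attributes, can be sketched as follows. The paper does not state which correlation measure or search strategy it uses, so the Pearson coefficient, the greedy filter, and the `max_pairwise` cut-off are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def correlation_with_class(columns, labels):
    """Correlation of every attribute column with the class column."""
    return {name: pearson(col, labels) for name, col in columns.items()}


def drop_redundant(columns, labels, max_pairwise=0.9):
    """Greedy filter: visit attributes from most to least class-correlated
    and keep one only if it is not too correlated with any kept attribute."""
    relevance = correlation_with_class(columns, labels)
    order = sorted(columns, key=lambda n: abs(relevance[n]), reverse=True)
    kept = []
    for name in order:
        if all(abs(pearson(columns[name], columns[k])) < max_pairwise
               for k in kept):
            kept.append(name)
    return kept
```

Here `columns` maps an attribute name to its list of values, and `labels` holds the class value (Safe or Unsafe, encoded numerically) for the same instances.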

Fig. 2. Heat Map After Applying Single Attribute Evaluator
The proposed model applies attribute combination to minimize the number of highly correlated attributes so that the accuracy of the system increases, as shown in Fig 3. Based on the proposed model, the system uses a confusion matrix to evaluate its performance according to accuracy and F-measure.
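Accuracy and F-measure can be computed from the four cells of a binary confusion matrix. The sketch below uses the standard definitions, assumed here to match the ones intended by the paper, where tp, tn, fp, and fn denote true positives, true negatives, false positives, and false negatives.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

As a purely hypothetical example (these counts are not the values of Table 3), tp=40, tn=50, fp=5, fn=5 gives an accuracy of 0.9 and an F-measure of 40/45.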
iJIM - Vol. 14, No. 11, 2020
Table 4 shows a comparison between different machine learning algorithms for detecting phishing URLs and the corresponding accuracy of each. The proposed algorithm raises the accuracy to 98.5%.
Table 4. The accuracy of different algorithms to detect phishing URLs

Conclusion
The proposed model introduces a new phishing detection approach using a multilayer perceptron neural network. The model applies two processing steps, single attribute evaluation and attribute combination, to achieve a high accuracy of 98.5% where the