Naïve Bayes Classifier for Journal Quartile Classification

Classification is the process of distinguishing data classes in order to estimate the class of an object whose label is unknown. One popular method for classifying data is the Naïve Bayes Classifier, an approach based on Bayes' theorem that combines prior knowledge with new knowledge. The advantages of this method are its simple algorithm and high accuracy. This study shows the ability of the Naïve Bayes Classifier to classify the quality of a journal, commonly called its Quartile. The study uses a dataset of 1491 instances. The results show an accuracy of 71.60% and an error rate of 28.40%.


Introduction
Articles are one of the reference sources used in writing scientific papers. An article is a type of scientific work produced from research studies or literature studies. Scientific articles are divided into research and non-research articles. Research articles are sourced from research reports, while non-research articles contain the authors' thoughts, arguments, and opinions, supported by scientific sources. In general, a scientific article consists of a title, author line, abstract, keywords, body text, and literature review.
Articles are published in journals. A journal is a scientific publication that contains articles in a particular field of science, and its role in the preparation of scientific work is to provide reference material. Journals differ in quality, which is expressed as a Quartile: Q1, Q2, Q3, Q4, or NQ, with Q1 being the highest. The Scimago Journal and Country Rank provides rankings for many types of journals, but the values in the journal ranking system are still unbalanced [1]. The problem is that Quartile categories are not the same across fields of study: for example, the Quartile of a journal in field X could be lower than its Quartile in another field. To address this problem, a data processing method is required, namely classification. Classification is a method for finding a model or function that explains and distinguishes concepts or data classes. The model is derived from the analysis of a set of training data and can be used to predict the class label of an object whose label is unknown [2].
Naïve Bayes is one of the data classification algorithms and belongs to the top 10 algorithms in data mining [3]. The Naïve Bayes algorithm is a simple probabilistic classifier. It calculates a set of probabilities by counting the frequency and combination of values in a given data set [4]. The probability of each feature value is one member of this set and is obtained by counting how often that value occurs within each class of the training data set. The training data set is the subset used to train the classification algorithm. The training process uses known values to predict unknown values [5].
Comparative studies of classification methods have reported that the Naïve Bayes Classifier achieves a better area under the curve (AUC) than the LogR and LC algorithms [6]. Other studies show that the Naïve Bayes Classifier has the best accuracy compared with the Lazy-IBK, Zero-R, and Decision Tree-J48 algorithms [7]. The Naïve Bayes Classifier has also proven effective in many practical applications, including text classification, medical diagnosis, and system performance management [8] [9].
This study uses the Naïve Bayes Classifier algorithm to process numerical data. The data used are journal ranking data obtained from the Scimago Journal and Country Rank. The aim of this research is to classify journal quality into the labels Q1, Q2, Q3, Q4, and NQ with optimal accuracy.

Dataset
The dataset was taken from the Journal Rankings in the Scimago Journal and Country Rank on November 5, 2018. It is restricted to Computer Science as the subject area and includes only instances of type journal. The dataset consists of 1491 instances and 10 attributes, shown in Table I.
The classification output is determined by a class label attribute. In this dataset, the class label attribute is SJR Best Quartile, shown in Table II.
The proportion of each class member as output is shown in Table III.

Data Preprocessing
Before the main classification process, the dataset needs to be prepared. In the data above, the number of instances per class label is unbalanced: class Q1 has a larger share of the samples than classes Q2, Q3, Q4, and NQ. This study uses the undersampling technique, which balances the classes by reducing each one to the size of the minority class. In RapidMiner, undersampling can be applied with the Sample operator. The result is shown in Table IV. Before undersampling, Q1 is the majority class with 407 instances, while NQ is the minority class with 50 instances. After undersampling, each of the five classes contains 50 instances, or 20% of the data.
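The undersampling step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the RapidMiner Sample operator itself: the function name `undersample`, the `seed` parameter, and the toy class counts are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_of, seed=0):
    """Randomly keep the same number of rows per class (the minority class size)."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    # Size of the smallest class decides how many rows every class keeps.
    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], n_min))
    return balanced


# Hypothetical toy data: (journal id, quartile label); counts are illustrative only.
toy = [(i, "Q1") for i in range(8)] + [(i, "Q2") for i in range(8, 13)] + [(i, "NQ") for i in range(13, 15)]
balanced = undersample(toy, label_of=lambda row: row[1])
counts = Counter(label for _, label in balanced)
print(counts)
```

Every class ends up with as many rows as the smallest class, which mirrors the reduction of all five labels to 50 instances each.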

Naïve Bayes Algorithm
Naïve Bayes is a classifier based on probability and statistical methods proposed by the British scientist Reverend Thomas Bayes [10]. Naïve Bayes often works much better in complex real-world situations than might be expected [11].
Naïve Bayes is a popular model in machine learning applications because of its simplicity: it allows all attributes to contribute equally to the final decision. This simplicity translates into computational efficiency, which makes the Naïve Bayes technique attractive and suitable for various fields [12]. The Naïve Bayes Classifier rests on three elements: the prior, the posterior, and the class conditional probability [13].
The advantages of the Naïve Bayes algorithm, as shown in [10] and [14]-[16], are as follows:
• Small training data
• Simple computing
• Easy to implement
• Time efficiency
• Can handle big data
• Can handle incomplete data (missing values)
• Not sensitive to irrelevant features
• Not sensitive to data noise
The Bayes Theorem formula is as follows [16]:

$$P(Q \mid X) = \frac{P(X \mid Q)\, P(Q)}{P(X)} \tag{1}$$

With:
$X$: data with unknown class
$Q$: the hypothesis that $X$ belongs to a specific class
$P(Q \mid X)$: probability of hypothesis $Q$ given $X$ (posterior probability)
$P(Q)$: probability of hypothesis $Q$ (prior probability)
$P(X \mid Q)$: probability of $X$ given hypothesis $Q$
$P(X)$: probability of $X$

To explain the Naïve Bayes theorem, note that the classification process requires several clues to determine the class of the sample being analyzed. Therefore, the Bayes theorem above is adjusted as follows:

$$P(Q \mid X_1, \ldots, X_n) = \frac{P(Q)\, P(X_1, \ldots, X_n \mid Q)}{P(X_1, \ldots, X_n)} \tag{2}$$

where $Q$ represents the class and $X_1, \ldots, X_n$ represent the characteristics (clues) needed for the classification process. The formula states that the probability that a sample with certain characteristics belongs to class $Q$ (the posterior) is the probability of class $Q$ before the sample is observed (the prior), multiplied by the probability of the sample characteristics appearing in class $Q$ (the likelihood), divided by the probability of the sample characteristics appearing globally (the evidence). Therefore, the formula above can also be written simply as follows:

$$\text{Posterior} = \frac{\text{Prior} \times \text{Likelihood}}{\text{Evidence}}$$

The evidence is fixed for every class within one sample. The posterior of each class is compared with the posteriors of the other classes to determine the class of the sample being classified. Because the evidence is fixed, the Bayes formula can be explained further by expanding $P(Q \mid X_1, \ldots, X_n)$, up to that fixed factor, with the multiplication rule as follows:

$$\begin{aligned}
P(Q \mid X_1, \ldots, X_n) &\propto P(Q)\, P(X_1, \ldots, X_n \mid Q) \\
&= P(Q)\, P(X_1 \mid Q)\, P(X_2, \ldots, X_n \mid Q, X_1) \\
&= P(Q)\, P(X_1 \mid Q)\, P(X_2 \mid Q, X_1) \cdots P(X_n \mid Q, X_1, \ldots, X_{n-1})
\end{aligned} \tag{3}$$

The expansion produces more and more complex conditioning factors, which are almost impossible to analyze one by one, so the calculation becomes difficult. This is where the very strong (naïve) independence assumption is used: given the class, each of the clues $X_1, \ldots, X_n$ is independent of the others. With this assumption, the following holds:

$$P(X_i \mid Q, X_j) = P(X_i \mid Q), \quad i \neq j \tag{4}$$

From the above equation it can be concluded that the naïve independence assumption makes the conditional probabilities simpler, so that the calculation becomes feasible. The expansion of $P(Q \mid X_1, \ldots, X_n)$ can then be simplified to:

$$P(Q \mid X_1, \ldots, X_n) \propto P(Q)\, P(X_1 \mid Q)\, P(X_2 \mid Q) \cdots P(X_n \mid Q) = P(Q) \prod_{i=1}^{n} P(X_i \mid Q) \tag{5}$$

The equation above is the Naïve Bayes model used in the classification process. Numerical data can be handled with the Gaussian probability density function, which represents the distribution of the known data [17]. The Gaussian density formula is shown in equation (6).

$$P(X_i = x_i \mid Q = q_j) = \frac{1}{\sqrt{2\pi}\, \sigma_{ij}} \exp\!\left(-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right) \tag{6}$$

where $\mu_{ij}$ and $\sigma_{ij}$ are the mean and standard deviation of attribute $X_i$ over the training instances of class $q_j$.
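As a rough illustration of how the Naïve Bayes model and the Gaussian density fit together for numerical attributes, the sketch below estimates a prior, a mean, and a standard deviation per class and attribute, and then picks the class with the largest posterior score. It is a minimal sketch, not the RapidMiner implementation used in this study: the function names, the use of the sample standard deviation, the small floor on the standard deviation, the log-space scoring, and the toy data are all assumptions of the example.

```python
import math
from collections import defaultdict


def fit(X, y):
    """Estimate the prior and the per-attribute mean and std for each class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Sample standard deviation; the 1e-9 floor avoids division by zero (assumption).
        stds = [
            max(math.sqrt(sum((v - m) ** 2 for v in col) / max(n - 1, 1)), 1e-9)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(y), means, stds)
    return model


def log_gauss(x, mean, std):
    """Logarithm of the Gaussian density."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def predict(model, row):
    """Return the class maximizing log prior + sum of log likelihoods."""
    scores = {
        label: math.log(prior) + sum(log_gauss(x, m, s) for x, m, s in zip(row, means, stds))
        for label, (prior, means, stds) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy data with two numerical attributes and two quartile labels.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = ["Q4", "Q4", "Q4", "Q1", "Q1", "Q1"]
model = fit(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [5.1, 6.1]))
```

Working with logarithms turns the product of likelihoods into a sum, which avoids numerical underflow when many attributes are multiplied. The evidence term is omitted because it is the same for every class.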

Confusion Matrix
A confusion matrix contains information about the predicted and actual classifications produced by a classification system. System performance is generally evaluated using the data in this matrix [18]. From the confusion matrix, the accuracy, error rate, precision, and recall can be computed. The confusion matrix is shown in Table V. These values are calculated from TP, FN, FP, and TN. TP (True Positive) is the number of positive instances that the system correctly classifies as positive. FN (False Negative) is the number of positive instances that the system incorrectly classifies as negative.
FP (False Positive) is the number of negative instances that the system incorrectly classifies as positive.
TN (True Negative) is the number of negative instances that the system correctly classifies as negative.
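The four counts and the measures derived from them can be sketched as follows, treating one class as positive and all others as negative. The function names and the toy labels are assumptions made for illustration. The formulas are the standard ones: accuracy = (TP + TN) / total, error rate = 1 − accuracy, precision = TP / (TP + FP), and recall = TP / (TP + FN).

```python
def binary_counts(actual, predicted, positive):
    """Count TP, FN, FP, TN with `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # positive instance predicted as positive
        elif a == positive:
            fn += 1  # positive instance predicted as negative
        elif p == positive:
            fp += 1  # negative instance predicted as positive
        else:
            tn += 1  # negative instance predicted as negative
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Derive accuracy, error rate, precision, and recall from the counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels for six journals, with Q1 as the positive class.
actual = ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"]
predicted = ["Q1", "Q2", "Q2", "Q1", "Q1", "Q2"]
counts = binary_counts(actual, predicted, positive="Q1")
result = metrics(*counts)
print(counts, result)
```

For a five-class problem such as the Quartile labels, the same counts can be taken once per class, each time treating that class as positive.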

Result and Discussion
Cross-validation with k = 5 folds gave the best accuracy for the Naïve Bayes Classifier, reaching 71.60%. Table VI shows the classification results, consisting of the accuracy, error, precision, and recall of the Naïve Bayes Classifier. The performance of the Naïve Bayes Classifier is acceptable, since the accuracy is above 50%. This shows that the Naïve Bayes Classifier algorithm can be used to classify the quality of journals while requiring only a small amount of training data to estimate the parameters needed in the classification process. According to Syarifah and Muslim (2015), the Naïve Bayes Classifier has several advantages, such as fast calculation, a simple algorithm, and high accuracy [19]. However, according to Muhammad (2017), the probabilities produced by the Naïve Bayes Classifier algorithm cannot measure the accuracy of a prediction. In addition, the Naïve Bayes Classifier has weaknesses in attribute selection that can affect its accuracy [10]. Therefore, the researchers recommend combining the Naïve Bayes Classifier algorithm with an optimization method to increase its accuracy.
In future work, the researchers will combine the Naïve Bayes Classifier algorithm with optimization methods such as Particle Swarm Optimization or Genetic Algorithms. The researchers will also use other classification algorithms, such as Support Vector Machine and k-Nearest Neighbor. Preprocessing techniques such as imputation can also be studied by randomly removing data and comparing the accuracy obtained with imputation against the accuracy obtained without it.
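The 5-fold cross-validation scheme can be sketched as an index split: the data are shuffled once, divided into k folds, and each fold serves once as the test set while the remaining folds form the training set. This is a minimal sketch under assumptions, not the RapidMiner Cross Validation operator: the function name, the plain (non-stratified) shuffling, and the `seed` parameter are illustrative, and the size of 250 corresponds to five classes of 50 instances after undersampling.

```python
import random


def k_fold_splits(n, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs over n shuffled instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        splits.append((train, test))
    return splits


# 250 instances: five classes of 50 instances each after undersampling.
splits = k_fold_splits(250, k=5)
for train, test in splits:
    print(len(train), len(test))
```

The reported accuracy would then be the average of the accuracies measured on the k test folds.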

Conclusion
Based on the results and discussion in this study, the data are classified into several labels, namely Q1, Q2, Q3, Q4, and NQ. The variables used in this study are H index, SJR, Total Docs. (2017), Total Docs. (3years), Total Refs, Total Cites (3years), Citable Docs. (3years), Cites / Doc. (2years), and Ref. / Doc. Classifying journal quality can make it easier for people to choose quality journals. The researchers also conclude that the Naïve Bayes Classifier algorithm is able to classify the quality of journals, even though the accuracy is not yet optimal. For better accuracy, journal quartile classification using the Naïve Bayes Classifier algorithm needs to be optimized with other algorithms.