Students' Orientation Using Machine Learning and Big Data

Abstract: Orienting students in public institutions toward an academic path or specialization that suits them is important, because it allows them to continue their studies smoothly throughout their school career. We therefore decided to make the orientation process automatic and individual, relying on an information system built on Big Data technology that processes the information collected for each student (grades and number of absences in each subject, as well as the student's preferences). We then used machine learning algorithms to assign the appropriate specialization to each student. In this paper, we compare the accuracy and execution time of four algorithms (Naïve Bayes, SVM, Random Forest Tree, and Neural Network) and find that Naïve Bayes is the best suited to this system.


Introduction
Student orientation is an important and difficult process, because any decision that concerns a person's future is complicated. It depends primarily on the learner's grades and secondarily on the learner's wishes: a student who likes a specialty but lacks the abilities it requires will not be able to keep pace with its study program. This is why educational orientation is considered a crucial step in every student's curriculum, especially for high school students. Unfortunately, secondary school students often face an orientation problem: at that stage they cannot yet decide on their orientation choices, so not everyone is directed to the orientation they prefer. This causes a feeling of injustice, which may result in dropping out of school [1].
Multiple factors influence the orientation of students. First, orientation is driven mainly by social factors, which do not reflect academic results and are far from taking into account the characteristics of each student. Second, families know or hear that the possibilities for further studies, access to higher education, and professional integration differ depending on the specialty of the baccalaureate obtained, especially when the labor market is tight. To address this problem, we propose a system that helps students choose their orientation. Considering the number of students and the need for timely processing, we decided to use Big Data.
In the literature, several related works have compared general machine learning algorithms. Sunita B. Aher and Lobo L.M.R.J. [2] compared seven classification algorithms (Naive Bayes, Simple Cart, ZeroR, J48, Decision Table, ADTree, and Random Forest) using Weka and found that ADTree works best for a Moodle database. Seyed Reza Pakiz and Abolfazl Gandomi [3] compared four classification algorithms implemented with the MapReduce model against traditional models and concluded that the MapReduce-based classification algorithms work better on large datasets. Similarly, Wael Etaiwi, Mariam Biltawi, and Ghazi Naymat [4] compared two machine learning classifiers, Naïve Bayes and Support Vector Machine (SVM), using MLlib, the machine learning library of Apache Spark, and concluded that Naïve Bayes is more powerful than SVM for Big Data. In another work, Amine Rghioui, Jaime Lloret, and Abedlmajid Oumnad [5] compared J48, Bayes Net, ZeroR, and Naïve Bayes on healthcare data and concluded that J48 outperforms the other classifiers, with 99.21% accuracy. In another study [6], F. Ouatik, M. Erritali, F. Ouatik, and M. Jourhmane compared Naive Bayes, Neural Networks, and k-nearest neighbors in terms of classification accuracy and speed-up, and found that Naive Bayes works best. However, in e-learning [7][8][9], and in student orientation specifically, there are no studies that could facilitate this process. For this reason, we use Big Data to help students in their orientation.

Big Data
The term Big Data describes collections of very large volumes of data, both structured and unstructured, that can be processed and exploited to generate intelligible and relevant information [10].
Big Data is characterized by the "3V" rule: Volume (Big Data refers to very large volumes of raw data), Variety (a Big Data set is typically composed of heterogeneous data, structured or not), and Velocity (Big Data is generated at high speed or even continuously, which also means that it must be processed quickly, even in real time).

Hadoop
Hadoop can be considered a scalable data processing system for the storage and batch processing of very large amounts of data. Its principle is based on multi-node distributed processing, which drastically increases the computing and storage capacities available to process very large amounts of data [11]. Hadoop's business benefits are numerous. With this software framework, it is possible to store and process vast amounts of data quickly. Given the increase in the volume of data and its diversification, mainly related to social networks and the Internet of Things, this is a significant advantage. The distributed computing model of Hadoop makes it possible to process Big Data quickly: the greater the number of computing nodes used, the higher the processing power. The processed data and applications are also protected against hardware failures. If a node fails, its tasks are redirected to other nodes to ensure that the distributed computation does not fail, and multiple copies of all data are stored automatically.
Unlike with traditional relational databases, the data does not need to be processed before it is stored. You can store as much data as you want and decide later how to use it. This includes unstructured data such as text, images, and videos.
Because the framework is open source, it is free, and it relies on standard machines to store large amounts of data. Finally, the system can be adapted to support more data simply by adding nodes. The required administration is minimal.

MapReduce
MapReduce [12] is an architectural design pattern for software development, invented by Google, in which parallel, and often distributed, computations are carried out on potentially very large datasets. MapReduce runs on large clusters of machines and is highly scalable. The framework is useful for novice developers because its library routines can be used to create parallel programs without worrying about intra-cluster communication, task monitoring, or error handling, so programmers with no experience in parallel and distributed systems can easily use the resources of a large distributed system. It operates in parallel on massive clusters to distribute the input data and merge the results. Tasks can be spread across any number of servers, so the way a job is written does not depend on the size of the cluster. This is why MapReduce and Hadoop simplify software development. MapReduce can be implemented in several programming languages, including C, C++, C#, Java, Ruby, Perl, and Python. Programmers can use MapReduce libraries, in particular those based on Java 8, to create tasks without worrying about communication or coordination between nodes. Figure 1 shows a representation of MapReduce.
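As a rough illustration of the map and reduce steps described above, the following single-process Python sketch computes the mean grade per subject. It is not Hadoop code: the record layout, the sample values, and the function names (map_step, shuffle, reduce_step, run_job) are hypothetical and serve only to show the shape of the model.

```python
from collections import defaultdict

# Hypothetical records (student_id, subject, grade); not data from the study.
RECORDS = [
    ("s1", "math", 14.0),
    ("s1", "physics", 12.0),
    ("s2", "math", 10.0),
    ("s2", "physics", 16.0),
]


def map_step(record):
    """Emit one (key, value) pair per input record: (subject, grade)."""
    _student_id, subject, grade = record
    yield subject, grade


def shuffle(pairs):
    """Group the values that share a key, as a framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Collapse all values of one key into a single result: the mean grade."""
    return key, sum(values) / len(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence in a single process."""
    mapped = [pair for record in records for pair in map_step(record)]
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(RECORDS))  # {'math': 12.0, 'physics': 14.0}
```

In a real cluster, the map calls and the reduce calls would run on different nodes, and the grouping in between would be handled by the framework rather than by a local dictionary.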

Fig. 1. Representation of MapReduce
This figure shows how MapReduce works, with a map step and a reduce step.

HDFS (Hadoop Distributed File System)
HDFS [13] is the Hadoop component responsible for storing data in a Hadoop cluster. HDFS is designed to run on commodity hardware and to tolerate faults: each piece of data is stored in several places and can still be retrieved when a node fails. This replication also protects against potential corruption of the data. HDFS stands out from a typical file system for the following main reasons:
• HDFS is optimized to maximize data throughput. The size of a data block is 64 MB in HDFS, against 512 bytes to 4 KB in most traditional file systems, which reduces the seek time. (The block size can, however, be increased to 128 MB or 256 MB as needed.)
• HDFS follows a Write Once Read Many (WORM) model: a file is written once and then accessed many times.
• HDFS provides a block replication system with a configurable number of replicas (3 by default). During the writing phase, each block of the file is replicated to separate nodes in the cluster, which helps to ensure reliability and availability when reading data. If a block is unavailable on one node, copies of that block are available on other nodes.
• HDFS relies on the native file system of the OS to present a unified storage system built on a set of heterogeneous disks and file systems.
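The effect of the block size and replication factor mentioned above can be worked out with simple arithmetic. The sketch below is only an estimate under stated assumptions: the helper name hdfs_footprint is hypothetical, sizes are in megabytes, and the last block is assumed to occupy only the space of the data it actually holds.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Estimate block count, replica count, and raw storage for one file.

    Defaults follow the values quoted in the text (64 MB blocks, 3 replicas).
    """
    blocks = math.ceil(file_size_mb / block_size_mb)  # last block may be partial
    replicas = blocks * replication                   # block copies in the cluster
    raw_storage_mb = file_size_mb * replication       # assumes no padding of blocks
    return blocks, replicas, raw_storage_mb


if __name__ == "__main__":
    print(hdfs_footprint(1000))                     # (16, 48, 3000)
    print(hdfs_footprint(1000, block_size_mb=128))  # (8, 24, 3000)
```

Doubling the block size halves the number of blocks to track, while the raw storage stays the same because it depends only on the replication factor.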

Classification
This data analysis method brings together supervised learning algorithms adapted to qualitative data. The objective is to learn (in other words, to find) the relation that links a qualitative variable of interest to the other observed variables, possibly for the purpose of prediction [14]. We use classification when the variable of interest is qualitative, i.e. when it takes its values in a space that has no natural metric. For example, we can try to predict the literary genre of a book: this variable is discrete ("detective", "science fiction", etc.), there is no ordering relation between the genres, and it is difficult to define a distance between them. Among the simplest classification algorithms are logistic regression and k-nearest neighbors; others include neural networks, support vector machines, mixture models, the Bayesian classifier, Random Forest Tree, and OneR.

Naïve Bayes
Naive Bayes [15], commonly used in machine learning, is a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that share a common principle: each feature used for classification is assumed to be independent of the value of any other feature, given the class. Although the concept is relatively simple, Naive Bayes can often outperform more complex algorithms and is extremely useful in common applications such as spam detection and document classification. These algorithms allow us to predict the probability of an event based on the conditions we know for the event in question. The name comes from Bayes' theorem and from this "naive" independence assumption.
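To make the independence principle concrete, here is a minimal sketch of a categorical Naive Bayes classifier with Laplace smoothing. It is not the Weka implementation used in this work; the feature layout (grade band, stated preference), the sample rows, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and, per class, the values of each feature."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] = number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values_seen = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
            values_seen[i].add(value)
    return class_counts, feature_counts, values_seen


def predict(model, row):
    """Return the class maximizing log P(class) + sum_i log P(x_i | class)."""
    class_counts, feature_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        for i, value in enumerate(row):
            # Laplace smoothing avoids zero probabilities for unseen pairs.
            numerator = feature_counts[label][i][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (math grade band, stated preference).
ROWS = [("high", "science"), ("high", "science"), ("high", "arts"),
        ("low", "arts"), ("low", "arts"), ("low", "science")]
LABELS = ["science", "science", "science", "arts", "arts", "arts"]

if __name__ == "__main__":
    model = train(ROWS, LABELS)
    print(predict(model, ("high", "science")))  # science
    print(predict(model, ("low", "arts")))      # arts
```

Each feature contributes its own factor to the score, which is exactly the independence assumption: the classifier never looks at combinations of feature values.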

SVM
Support Vector Machines (SVMs) [16] are a family of machine learning algorithms that solve classification, regression, and anomaly detection problems. They are known for their solid theoretical guarantees, their great flexibility, and their ease of use, even without extensive knowledge of data mining. Their principle is simple: they separate the data into classes using a boundary that is as "simple" as possible, chosen so that the distance between the groups of data and the boundary that separates them is maximal. This distance is called the "margin", which is why SVMs are also described as "wide margin separators"; the "support vectors" are the data points closest to the boundary.
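The notions of margin and support vectors can be illustrated without training anything. The sketch below does not implement SVM training: it takes a given linear boundary w·x + b = 0 and reports its margin and the closest points. The weights, bias, and sample points are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_support_vectors(w, b, points, labels, tol=1e-9):
    """Return the smallest label-signed distance and the points achieving it.

    Labels are +1 or -1; a positive value means the point is on its own side.
    """
    distances = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    margin = min(distances)
    support = [x for x, d in zip(points, distances) if abs(d - margin) <= tol]
    return margin, support


# Hypothetical, linearly separable data and a hand-picked boundary.
POINTS = [(0, 0), (1, 1), (2, 2), (3, 3)]
LABELS = [-1, -1, 1, 1]
W, B = (1, 1), -3

if __name__ == "__main__":
    margin, support = margin_and_support_vectors(W, B, POINTS, LABELS)
    print(round(margin, 4), support)  # 0.7071 [(1, 1), (2, 2)]
```

An SVM solver searches for the w and b that make this margin as large as possible; here the boundary is simply given so that the quantities being maximized are visible.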

Random forest tree
Random forest tree [17] is a supervised machine learning method that can perform both classification and regression tasks. Its principle is to use many decision trees, each constructed from a different subsample of the training set; during the construction of each tree, the decision at a node is made using a randomly drawn subset of the variables. All the decision trees produced are then used to make the prediction, by majority vote for classification (the predicted variable is categorical) or by averaging for regression (the predicted variable is numeric).
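The two ingredients described above, bootstrap subsamples and random variable selection followed by a majority vote, can be sketched in a few lines. This is a deliberately simplified illustration and not a full random forest: each "tree" is a one-level stump that splits a single randomly chosen feature at its mean, whereas a real implementation grows deeper trees and redraws the variable subset at every node. The data and function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most frequent label (ties go to the first one encountered)."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature):
    """One-level tree: split one feature at its mean, vote on each side."""
    threshold = sum(row[feature] for row in rows) / len(rows)
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    fallback = majority(labels)  # used when one side of the split is empty
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample with a randomly drawn feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw len(rows) indices with replacement.
        indices = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in indices],
                                  [labels[i] for i in indices], feature))
    return forest


def predict_forest(forest, row):
    """Classification by majority vote over all trees."""
    return majority(predict_stump(stump, row) for stump in forest)


# Hypothetical grades: (math, literature).
ROWS = [(16, 9), (15, 11), (14, 8), (17, 10), (8, 15), (9, 16), (7, 14), (10, 17)]
LABELS = ["science"] * 4 + ["arts"] * 4

if __name__ == "__main__":
    forest = train_forest(ROWS, LABELS)
    print(predict_forest(forest, (16, 9)))
```

For regression, the final vote would be replaced by the average of the values predicted by the individual trees.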

Neural network
A neural network [18] is a computational model whose design is very loosely inspired by the functioning of biological neurons (human or not). Neural networks are generally optimized by statistical learning methods. Thanks to their capacity for classification and generalization, they are used for tasks such as the automatic classification of postal codes or deciding on a stock purchase according to the evolution of prices. On the one hand, they belong to the family of statistical methods, which they enrich with a set of paradigms for generating vast, flexible, and partially structured function spaces. On the other hand, they belong to the family of artificial intelligence methods, which they enrich by making it possible to take decisions based more on perception than on formal logical reasoning.
Many tools implement these algorithms. In this work, we used Weka to compare the performance of these classifiers, based on accuracy and execution time, in order to choose the best one.
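As a minimal sketch of the computational model itself, the following code evaluates a small fully connected network with sigmoid activations. It shows only the forward pass, not the learning procedure, and the layer sizes and weights are hypothetical placeholders rather than trained values.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: sigmoid(W x + b), one row of W per neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """Feed the inputs through each (weights, biases) layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation


# Hypothetical untrained network: 2 inputs, 2 hidden neurons, 1 output.
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]

if __name__ == "__main__":
    print(forward([0.8, 0.3], LAYERS))
```

Training consists of adjusting the weights and biases so that the outputs match the known classes of the training examples.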

Weka
Weka (Waikato Environment for Knowledge Analysis) [19] is a set of tools for manipulating and analyzing data files. It implements many machine learning algorithms, including decision trees and neural networks, and is written in Java. It mainly consists of:
• Java classes for loading and manipulating data.
• Classes for the main supervised and unsupervised classification algorithms.
• Attribute selection tools and statistics on these attributes.
• Classes for visualizing the results.
The large data volumes associated with Big Data quickly lead to memory saturation problems when using data mining software. Weka offers a set of techniques and architectures to work around these limits and handle Big Data successfully, such as MOA [20], an open-source framework that contains a set of learning algorithms and evaluation tools. MOA is used for mining big data streams and supports bidirectional interaction with Weka.

Results
As Figure 2 shows, the Naïve Bayes classifier is the most accurate, with an accuracy of 92.10%, followed by Neural Network with 90.37%, SVM with 88.13%, and Random Forest with 86.22%. Figure 3 illustrates the data processing time for the classification algorithms. Naïve Bayes was thus the most accurate of all the classifiers used in this article, and its execution time is also adequate for this use.
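The two criteria used in this comparison can be computed along the following lines. This is a generic sketch, not the Weka procedure that produced the figures: the helper names are hypothetical, and the timing covers prediction only, whereas a full measurement may also include the time needed to build the model.

```python
import time


def accuracy(predicted, actual):
    """Percentage of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def evaluate(predict_fn, rows, actual):
    """Return (accuracy in percent, elapsed seconds) for one classifier."""
    start = time.perf_counter()
    predicted = [predict_fn(row) for row in rows]
    elapsed = time.perf_counter() - start
    return accuracy(predicted, actual), elapsed


if __name__ == "__main__":
    # Hypothetical rule standing in for a trained classifier.
    def rule(row):
        return "science" if row[0] >= 12 else "arts"

    rows = [(16, 9), (8, 15), (13, 14), (9, 10)]
    actual = ["science", "arts", "arts", "arts"]
    score, seconds = evaluate(rule, rows, actual)
    print(f"accuracy = {score:.2f}%, time = {seconds:.6f} s")
```

Running the same evaluation for each classifier on the same test set gives directly comparable accuracy and timing values.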

Conclusion
In this study, we compared four classification algorithms to find the right algorithm for student orientation, using student grades and the number of absences in each subject. The four classification algorithms are Neural Network, Naïve Bayes, SVM, and Random Forest Tree. We used Weka with the MOA package to test the results. The tests show that Naïve Bayes is the best suited to student orientation.