Dengue Risk Mapping from Geospatial Data Using GIS and Data Mining Techniques

Dengue fever is a major public health problem and has been epidemic in Thailand for a long time, so there is a need to find ways to prevent the disease. This research aimed to explore the important factors of dengue fever, to study the factors affecting dengue hemorrhagic fever in Surat Thani Province, and to map potential outbreaks of dengue fever. Data on Rainfall, Digital Elevation Model (DEM), Land Use and Land Cover (LULC), Population Density, and Patients in Surat Thani Province were collected and analyzed using data mining techniques with 3 algorithms: Random Forest, J48, and Random Tree. Random Forest gave the best result, with an accuracy of 96.7 percent, followed by J48 with an accuracy of 95.9 percent and Random Tree with an accuracy of 93.5 percent. The results were then displayed in ArcGIS to show the risk points, which were compared with previously produced risk areas. The highest-risk areas were in Mueang District, Kanchanadit District, and Don Sak District, corresponding to the information obtained from the Public Health Office and the risk map created from the patient information.
Keywords: Dengue risk mapping, Geospatial data, Data mining, GIS.


Introduction
Dengue fever is a disease caused by the dengue virus. Around 2.5 billion people worldwide are at risk of dengue infection [1][2][3][4]. Outbreaks can occur in the rainy season and are spread by Aedes (striped) mosquitoes, which are active in the daytime in densely populated homes and schools. Areas with water sources harbor Aedes mosquitoes because the water there is clear. After rain, the temperature and humidity are suitable for breeding; in other seasons, the prevalence of Aedes mosquitoes decreases slightly [5]. Most cases occur in tropical countries [6][7]. Dengue infection can result in anything from very mild illness to death. Symptoms often begin with headache, muscle pain, and bone pain, and the illness can be divided into 3 phases: the fever phase, the shock phase, and the recovery phase [8][9].
From 1953 to 1964, dengue fever spread through many parts of Southeast Asia and the Asia-Pacific, namely the Philippines, Thailand, Vietnam, Singapore, and Kolkata.

Materials and Methods

Study area
Surat Thani is a province in the upper southern region of Thailand, located on the eastern side of the region. It has the largest area in the south and the 6th largest in Thailand, and it ranks 59th in the country by population density. The terrain is varied, including flat terrain, coastline, plateau, and mountains. Thailand has reported dengue fever for over 60 years. Dengue fever is currently found in every province and district of the country, and the spread of the disease has changed over time. Surat Thani ranks first in the number of dengue patients: from 1 January to 29 September 2018, a total of 939 patients were reported, more than in the previous year, and the number is likely to increase in 2019. The 2019 dengue fever forecasting report by the Department of Disease Control (including the Bureau of Communicable Diseases by Insects, the Bureau of Epidemiology, the Offices of Disease Prevention and Control 1-12, and the Urban Disease Prevention and Control Institutions) stated that the southern region is a risk area for a dengue fever epidemic. The provinces expected to have the most severe outbreaks are Nakhon Si Thammarat, Krabi, and Surat Thani. In Surat Thani, the number of patients is expected to reach 1,140 [12]. Figure 1 shows the boundaries of the study area, which covers every district in Surat Thani Province, consisting of 131 sub-districts in total. Table 1 lists the abbreviations used throughout this paper; each abbreviation is also defined in the text.

Data collection
Relevant factors were studied from research both in Thailand and abroad and then compared to see which factors are appropriate and commonly used in analyses for dengue fever mapping. Table 2 compares the factors Season, Rainfall, Temperature, Humidity, DEM, Slope, LULC, River, Water Source, Gross Domestic Product (GDP), Economy, and Population Density. The factors with the three highest total scores (6, 5, and 4) were chosen, which gives 6 of the 12 factors: Rainfall, Temperature, Humidity, DEM, LULC, and Population Density.
The factor data were collected by requesting information from various agencies, and the data used are for Surat Thani Province in 2018. They include both spatial and quantitative data. Rainfall, Temperature, and Humidity data were obtained from the Meteorological Department [22]. DEM and LULC data were obtained from the Land Development Department [22]. Population Density data were obtained from the National Statistical Office [23], and Patient Data were obtained from the Surat Thani Provincial Health Office [24].

Data preprocessing
The factor data were derived from various sources and were in different units. To ensure homogeneity, each factor was prepared by first rating its raw value into discrete levels, as in Table 3. Sampling was uniform, with 1000 instances (points). For training and prediction in this research, K-fold cross-validation was employed to evaluate the merit of each configuration. The data were divided equally into K groups. In each round, one group was held out for testing, while the remaining K-1 groups were used for training; the predictions on the held-out group were compared with the actual values, and the prediction error was calculated. To minimize data-dependent biases, the process was repeated K times, each time holding out a different group. In this research, K was set to 10, and 10-fold cross-validation was used to assess all of the data mining algorithms in the experiments and to guard against over-fitting.
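As an illustration of the splitting scheme, the following is a minimal sketch of standard K-fold cross-validation in plain Python, in which each fold serves exactly once as the held-out test set. It is not the WEKA procedure used in this research; the function name `k_fold_indices` and the fixed seed are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]  # the single held-out fold for this round
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 1000 sample points and K = 10, as in the text.
splits = list(k_fold_indices(1000, k=10))
```

With 1000 points and K = 10, every round trains on 900 points and tests on the remaining 100, and each point is tested exactly once.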

Spatial analysis
ArcGIS is used for data pre-processing, spatial analysis, and visualization [27]. We start from the selected factors, namely Rainfall, Temperature, Humidity, DEM, LULC, Population Density, and the number of Patients. Each factor is converted to the Interval-Ratio Level, that is, its data are divided into groups or categories that indicate the level of difference between them, based on a comparison of the characteristics of each factor. Once the information for all factors has been obtained, 1000 points are used as representatives of each area to create a map for dengue fever monitoring. More than 1000 points have also been used in testing; when analyzing data for real situations, the size of the dataset can be increased without decreasing the accuracy. The class used to check the correctness of the model is the patient data, converted to the Interval-Ratio Level to show which areas are at risk of disease outbreaks. The patient data are divided into 4 classes, shown as 4 layers on the map: Level 4, high risk, is the red area; Level 3, moderate risk, is the orange area; Level 2, mild risk, is the light green area; and Level 1, low risk, is the dark green area.
Rainfall: The rainfall data used in this research are the monthly rainfall from May to September 2018. Surat Thani has rain all year round, which causes puddles and flooding, so rainfall is a very important factor in the analysis. The data come from a total of 9 stations: Chaiya, Bannasan, Kanchanadit, Phunphin, Tha-Chana, Phanom, Khun-Thale, Surat Thani Rubber Research Center Station, and Surat Thani City Station. All station data were interpolated using the IDW (Inverse Distance Weighting) technique, which estimates the value of each cell from the sample points that can affect it, with the influence of each point decreasing with distance [28][29]. In Figure 2, the red area is the area with the highest amount of rainfall. IDW was selected for rainfall estimation because this method provides the least average error, especially for monthly and annual rainfall estimation. In addition, this research found that the IDW method distinguishes the rainfall data for each area best.
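The idea behind IDW can be sketched in a few lines of plain Python. This is a simplified illustration, not the ArcGIS implementation used in this research: the power parameter of 2, the station coordinates, and the rainfall values are hypothetical.

```python
import math

def idw(stations, x, y, power=2.0):
    """Inverse Distance Weighting estimate at (x, y).

    stations is a list of (sx, sy, value) tuples. Each station's weight is
    1 / distance**power, so nearer stations influence the estimate more.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # the query point lies exactly on a station
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical station positions (arbitrary units) and rainfall values (mm).
stations = [(0.0, 0.0, 100.0), (10.0, 0.0, 200.0), (0.0, 10.0, 300.0)]
estimate = idw(stations, 5.0, 0.0)
```

The estimate always lies between the smallest and largest station values, and it approaches a station's own value as the query point moves toward that station.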

Fig. 2. Rainfall-Level Map in Surat Thani Province
DEM: The DEM helps identify the source of the mosquito problem (i.e. vector breeding areas). It was found that more than ninety percent of the case samples were in the 'High' and 'Very High' categories [30]. In this research, altitude is divided into 10 levels, with the height of the area starting at 200 meters above mean sea level. The terrain rises gradually from the west side of the area toward the east side, which looks like a ridge.

Fig. 3. DEM-Level Map in Surat Thani Province
Population Density: Population density is a measure of the number of people in a given area, depending on the sample chosen to be surveyed. We use the population density of residential areas, with data from the National Statistical Office, as shown in Figure 4.
LULC: Land use and land cover data in Surat Thani Province are classified into 6 groups, as in Figure 5: the green area is other utility space (O), the light blue area is water sources (W), the yellow area is agriculture (A), the pink area is mixed use whose type cannot be distinguished (M), the red area is urban community (U), and the blue area is forest.

Fig. 5. LULC Map in Surat Thani Province
Patient Data: The patient data used in this research comprised district-level patient data for 2018, with a total of 939 patients in that year. The five areas with the most dengue patients were Ma Kham Tia with 96 cases, Bang Kung with 32 cases, Talad with 29 cases, Tha Thong Mai with 27 cases, and Ban Na San with 22 cases. The data are divided into 4 groups, as in Figure 6: dark green is the group of 0-5 patients, light green is the group of 6-16 patients, orange is the group of 17-37 patients, and red is the group of 38-102 patients.
This research uses data mining techniques to analyze the accuracy of the data because they are fast and can easily uncover relationships hidden in a dataset [31][32]. It does not use the Weighted Linear Combination (WLC) method to analyze the suitability of the factors, because WLC is slow and involves multiple steps: the scoring criteria obtained from classifying each factor (Rating) must be combined with factor weights obtained from experts. That is, the WLC method multiplies the rating of each factor by its weight and sums the results, and the sum is then divided into classes to determine the level of suitability. With this method, if a factor is changed or a new factor is added, all factors must be considered anew by the experts. Instead, we propose an analysis using data mining techniques on the Interval-Ratio Level data shown in Table 3: the Rainfall, DEM, LULC, Population Density, and Patients factors. After all factors have been divided into Interval-Ratio Levels, we export the data as .pdf files for the data analysis process, in order to see the risk areas at the randomly sampled points. We then verify the accuracy of the analysis using data mining techniques.
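The grouping of patient counts into the four classes given above (0-5, 6-16, 17-37, and 38-102 patients) can be expressed as a small lookup. The sketch below is illustrative only: the function name `patient_level` is hypothetical, and assigning counts above 102 to the highest level is an assumption, since the text does not define that case.

```python
from bisect import bisect_left

# Upper bounds of the first three patient-count groups from the text:
# 0-5 (Level 1), 6-16 (Level 2), 17-37 (Level 3); 38 and above is Level 4.
UPPER_BOUNDS = [5, 16, 37]

def patient_level(count):
    """Map a patient count to its risk level (1 = lowest, 4 = highest)."""
    if count < 0:
        raise ValueError("patient count cannot be negative")
    # bisect_left returns how many upper bounds lie strictly below the count.
    return bisect_left(UPPER_BOUNDS, count) + 1

# Case counts reported in the text for the five most affected areas.
cases = {"Ma Kham Tia": 96, "Bang Kung": 32, "Talad": 29,
         "Tha Thong Mai": 27, "Ban Na San": 22}
levels = {area: patient_level(count) for area, count in cases.items()}
```

Under this grouping, Ma Kham Tia falls in the highest class, while the other four areas fall in the 17-37 class.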

Data mining
The learning and testing process was performed using WEKA, although other tools such as Python, Scikit-Learn, RStudio, and RapidMiner could also be used for the analysis. The data mining process consists of sub-workflows that turn raw data into knowledge. The steps are shown in Figure 7: Data Cleaning is a procedure for eliminating unrelated data, while Data Integration is the process of combining data from multiple sources into one dataset. Data Transformation converts the data into a form suitable for use, while Data Reduction reduces the complexity of the data [31][32]. We chose decision tree algorithms because they are widely known, suitable for solving complex problems, and easy to understand; the three used here are Random Forest, J48, and Random Tree. Random Forest was chosen because it is popular, performs very well, and is accurate in classification tasks. It even outperforms counterparts such as discriminant analysis, neural networks, and support vector machines [33]. J48 has been applied in fields unrelated to health, for example in credit analysis, where it was compared with other algorithms and gave better results than Naïve Bayes and PART [34], so it was chosen for this health research. Random Tree is rarely used in such analyses, so it is of interest to see whether its results differ from those of the more popular algorithms and whether it gives good results in health research.
All 3 algorithms use supervised learning. This is a relatively uncomplicated learning method that builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets, while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. A leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data [35][36][37]. The data have to be divided into two parts: training data and testing data. The training data guide the machine in learning from the data to be analyzed. The testing data are then supplied to see how correct the predictions are and how reliable the models are for predicting which areas are at risk.
We consider the patient data together with the other 4 factors, which are Rainfall, DEM, LULC, and Population Density. Once the results of the analysis are available, Pattern Evaluation is the process of evaluating the patterns obtained from data mining, and Knowledge Representation is the process of presenting the knowledge that has been discovered [31][32].

Fig. 7. Workflow for Data Mining Technique
Random forest: Random forest is a popular machine learning procedure which can be used to develop prediction models [38]. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set [39]. The input of each tree is sampled data from the original dataset. In addition, a subset of features is randomly selected from the optional features to grow the tree at each node. Each tree is grown without pruning. Essentially, a random forest enables a large number of weak or weakly-correlated classifiers to form a strong classifier [40][41].
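The two ingredients described above, bootstrap sampling of the training data and taking the mode of the individual predictions, can be sketched in plain Python. To keep the example short, each "tree" below is only a one-level tree (a stump) on one randomly chosen feature rather than a fully grown, unpruned tree, so this is a simplification and not WEKA's Random Forest. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def majority(labels):
    """Return the most common label (the mode of the votes)."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, rng):
    """Fit a one-level tree on a single randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    labels_by_value = defaultdict(list)
    for row, label in zip(rows, labels):
        labels_by_value[row[feature]].append(label)
    table = {value: majority(ls) for value, ls in labels_by_value.items()}
    return feature, table, majority(labels)

def predict_stump(stump, row):
    feature, table, default = stump
    # Fall back to the stump's overall majority for unseen feature values.
    return table.get(row[feature], default)

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) indices with replacement.
        picks = [rng.randrange(len(rows)) for _ in rows]
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], rng))
    return forest

def predict_forest(forest, row):
    """Each stump votes; the forest outputs the mode of the votes."""
    return majority([predict_stump(stump, row) for stump in forest])

# Hypothetical toy data: (rainfall level, population density level) -> risk.
rows = [(1, 1), (1, 1), (1, 1), (1, 1), (3, 3), (3, 3), (3, 3), (3, 3)]
labels = ["low", "low", "low", "low", "high", "high", "high", "high"]
forest = train_forest(rows, labels)
```

An individual stump trained on an unlucky bootstrap sample may misclassify a point, but the vote across many stumps is far more stable, which is the sense in which many weak classifiers form a strong one.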
Similarly, the expected prediction error of $\varphi_{\mathcal{L}}$ at $X = x$ can be expressed as
$$\mathrm{Err}(\varphi_{\mathcal{L}}(x)) = \mathbb{E}_{Y|X=x}\{L(Y, \varphi_{\mathcal{L}}(x))\}$$
where $\varphi_{\mathcal{L}}$ denotes the model learned from the training set $\mathcal{L}$ and $L$ is the loss function. In regression, for the squared error loss, this latter form of the expected prediction error additively decomposes into bias and variance terms, which together constitute a very useful framework for diagnosing the prediction error of a model. In classification, a similar decomposition is more difficult to obtain for the zero-one loss. Yet, the concepts of bias and variance can be transposed in several ways to classification, thereby providing comparable frameworks for studying the prediction errors of classifiers [42].
In the proposed model, Random Forest consisted of four parameters, i.e., numClasses, maxDepth, numFeatures and Iteration, as shown in Table 4.
Random tree: Random Trees is a supervised classifier; it is an ensemble learning algorithm that generates lots of individual learners. It employs a bagging idea to construct a random set of data for constructing a decision tree. In a standard tree, every node is split using the best split among all variables, and the ensemble outputs the class label that received the majority of "votes" [43]. This method is called Random Trees because you are actually classifying the dataset a number of times based on a random sub-selection of training pixels, thus resulting in many decision trees. To make a final decision, each tree has a vote. This process works to mitigate over-fitting. Random Trees is a supervised machine learning classifier based on constructing a multitude of decision trees, then choosing random subsets of variables for each tree [44].
Decision tree (J48): Classification is the process of building a model of classes from a set of records that contain class labels. The decision tree algorithm finds out how the attribute vector behaves for a number of instances, and the classes of newly generated instances are then determined on the basis of the training instances. The algorithm generates rules for predicting the target variable, and the tree makes the critical distribution of the data easy to understand [45]. The process uses entropy, which is a measure of the disorder of the data. For a set of instances $\vec{y}$ in which $y_j$ denotes the subset belonging to class $j$, entropy is calculated by:
$$\mathrm{Entropy}(\vec{y}) = -\sum_{j=1}^{n} \frac{|y_j|}{|\vec{y}|} \log \frac{|y_j|}{|\vec{y}|}$$
The objective is to maximize the Gain, dividing by the overall entropy due to splitting argument $\vec{y}$ by value $j$.
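The entropy-based split criterion can be made concrete with a short sketch. J48 is WEKA's implementation of C4.5, which ranks candidate attributes by gain ratio, that is, information gain divided by the entropy of the split itself. The code below is a plain-Python illustration of these standard definitions with base-2 logarithms, not WEKA's code; the function names and the toy attribute values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting labels on an attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(values)
    if split_info == 0.0:
        return 0.0  # the attribute has a single value and cannot split
    return information_gain(values, labels) / split_info

# Hypothetical attribute values for four sample points.
risk = ["low", "low", "high", "high"]
lulc = ["A", "A", "U", "U"]      # separates the two classes perfectly
rainfall = [1, 2, 1, 2]          # carries no information about the class
```

Here the hypothetical `lulc` attribute yields a gain of 1 bit and would be chosen for the split, while `rainfall` yields a gain of 0.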

Accuracy assessment
This paper employed standard accuracy metrics: Accuracy, Kappa, and Root Mean Square Error (RMSE). In particular, RMSE was defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - x_i'\right)^2}$$
where $n$ was the number of data samples, and $x_i$ and $x_i'$ were the actual and predicted values, respectively.
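As a small worked illustration of two of these metrics, the sketch below computes RMSE from the definition above and accuracy as the fraction of matching labels. The function names are hypothetical, and the class labels in the example are placeholders; only the counts (967 correct out of 1000 points) are taken from the Random Forest result reported in this paper.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def accuracy(actual, predicted):
    """Fraction of samples whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# 1000 points with 967 classified correctly and 33 incorrectly.
actual = [1] * 1000
predicted = [1] * 967 + [2] * 33
rf_accuracy = accuracy(actual, predicted)
```

With these counts, the computed accuracy is 0.967, that is, 96.7 percent.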

Result
The application of various techniques to study the factors affecting the outbreak of dengue fever in Surat Thani Province can be divided into 2 main parts: Spatial Analysis and Data Mining Analysis. A variety of data for the year 2018, namely Rainfall, DEM, LULC, Population Density, and Patients, were compiled and converted to the Interval-Ratio Level to determine the relationship between the factors. The data were then analyzed with the data mining techniques Random Forest, Random Tree, and J48 to find the most suitable model for creating dengue risk maps. Figures 8, 9, and 10 show the risk points on the Surat Thani map, which divides the risk areas for dengue fever into 4 levels: no risk (green), low risk (yellow), moderate risk (orange), and high risk (red). Because the models obtained do not differ much, their risk points are similar.
The results can be explained as follows. High-risk areas are widely distributed in Surat Thani, including Makham Tia Sub-district, an area of high elevation with a large urban area and many people. Moderate-risk areas are widely distributed in Kanchanadit District, including Pa Ron, Chang Sai, and Krut Sub-districts, and are also found in Phunphin District, including Tha Kham. These two areas are also at high elevation. Most of the land in use is urban community, agricultural, or forest area, which is consistent with the data we studied, because more space is used for living and agriculture. Such use increases the amount of waste and of equipment for daily use. The usual breeding places are roof gutters, flower pots, flower pot plates, and roadside drains. Breeding places may also be in unexpected locations such as plant axils, tree holes, air-conditioners, canvas sheets, and discarded receptacles. All of these create a risk of a dengue fever outbreak [46].
Low-risk and no-risk areas are not far above sea level. Most of these areas are agricultural areas, forests, and water resources, meaning that population density is not high.
Figure 11 shows the model check for Random Forest, indicating which points on the map the model classified correctly and which incorrectly: of 1000 points, 967 were correct and 33 were errors. The points where the model made the most errors are in areas with agriculture, urban community, and water sources, at more than 821 meters above sea level, with 6-102 patients and high population density. Compared with the model results in Figure 8, the error points are classified as no-risk or low-risk areas, whereas this information indicates that they should be moderate-risk or high-risk areas.
Figure 12 shows the same check for Random Tree: of 1000 points, 959 were correct and 41 were errors. The points where the model made the most errors are in areas with agriculture and water sources, at more than 821 meters above sea level, with 6-102 patients and high population density. Compared with the model results in Figure 9, the error points are classified as no-risk or low-risk areas, whereas this information indicates that they should be moderate-risk or high-risk areas.
Figure 13 shows the same check for J48: of 1000 points, 935 were correct and 65 were errors. The points where the model made the most errors are in agricultural areas at more than 821 meters above sea level, with 6-102 patients and high population density. Compared with the model results in Figure 10, the error points are classified as no-risk or low-risk areas, whereas this information indicates that they should be moderate-risk or high-risk areas.
Table 5 and Figure 14 compare the algorithms on the basis of Accuracy, Root Mean Square Error (RMSE), and the Kappa statistic. The values for all 3 models are consistent and point in the same direction. Random Forest differs noticeably in accuracy from the other algorithms: it has the highest accuracy rate, 96.7%, followed by J48 with an accuracy rate of 95.9% and Random Tree with an accuracy rate of 93.5%.
High accuracy should be accompanied by a small RMSE. RMSE is a statistical measure of the magnitude of a varying quantity; it can be calculated for any series of values or any continuously fluctuating function, and it is used to compare the prediction accuracy of the models. The model with the lowest RMSE is the best model [47]. Similarly, the Kappa statistic is used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent agreement calculation, as κ takes into account the possibility of agreement occurring by chance.
Additionally, this research evaluated the accuracy of the model as the number of data points used in learning and testing increased. Accuracy rose and error fell as the number of data points increased. The Random Forest algorithm was chosen for this assessment because of its consistently favorable performance, and the assessment used its RMSE and Accuracy measures, which are plotted against the number of data points in Figure 15. The graphs show that RMSE and Accuracy improved significantly up to approximately the 2000th point. After that, they began to converge until approximately the 4000th point, beyond which no improvement was noticed. This result serves as a preliminary guideline for training the Random Forest model. Accuracy assessment for other parameter settings can follow suit for a new set of areas that may differ from Surat Thani Province in temperature, rainfall, and geospatial characteristics. To assess the accuracy of the model, 10-fold cross-validation was used. The 10-fold cross-validation results corresponded well with the above hypothesis: learning based on factors with characteristics similar to those of the validation data results in better accuracy.
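The chance correction that distinguishes the Kappa statistic from simple percent agreement can be illustrated with a short sketch of Cohen's kappa, treating the actual and predicted classes as the two raters. This is a plain-Python illustration of the standard formula, not the WEKA computation; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(actual, predicted):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Agreement expected if both sides labeled independently at random
    # while keeping their own class frequencies.
    expected = sum(actual_counts[c] * predicted_counts[c]
                   for c in actual_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one class occurs")
    return (observed - expected) / (1.0 - expected)

truth = ["low", "low", "high", "high"]
perfect = cohen_kappa(truth, ["low", "low", "high", "high"])
chance = cohen_kappa(truth, ["low", "high", "low", "high"])
```

In the second case, half of the labels agree, yet kappa is 0 because that level of agreement is exactly what chance alone would produce.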

Conclusion
This research analyzed the importance of the factors affecting the outbreak of dengue fever in Surat Thani Province using GIS and data mining techniques on the data of each factor, and the results are displayed in the form of geographic information data. The analysis is able to determine which factors are relevant and at what level. When the data are analyzed with decision trees and interpreted together with the resulting maps, the most influential factor is LULC, in particular where the area is an urban community area. Such areas are prone to dengue outbreaks because of the large number of people, which is consistent with the data for Population Density and the number of existing dengue patients. These two factors are the second-most important factors. Next is the DEM: if the altitude is high relative to sea level, there is a greater risk of an outbreak. Finally, the level of rainfall is a factor; since Surat Thani is a province with rain all year round, the amount of rainfall does not differ much between areas. Temperature and humidity were the first factors eliminated during the data mining analysis: removing them did not change the reliability of the model, so they were dropped. The models, in order of accuracy, are Random Forest (96.7%), J48 (95.9%), and Random Tree (93.5%). The three models do not differ much, and all of them are reliable because each is more than 90% accurate. We can either use all 3 models or choose the most accurate model, Random Forest, to support campaign planning decisions aimed at reducing the number of dengue fever outbreaks in Surat Thani Province.
One suggestion from this research is that the results could be presented using different models in order to see the comparative errors more clearly. For future research, other factors related to cases of dengue fever in an outbreak should be studied. For example, a container factor could be added, detailing at the level of habitat, egg laying, and attractiveness what causes mosquitoes to breed in a particular place or container.