An Efficient Covid19 Epidemic Analysis and Prediction Model Using Machine Learning Algorithms

Since 2019, the whole world has been experiencing a novel infection, COVID-19, caused by a coronavirus. The main concern about this disease is the absence of a proven, effective medicine. The World Health Organization (WHO) proposed several precautionary measures to manage the spread of the illness and to reduce infection, thereby decreasing the number of cases. In this paper, we analyzed the COVID-19 dataset available on Kaggle. Previous contributions from several researchers doing comparable work covered a limited number of days. Our paper uses COVID-19 data up to May 2021. The numbers of confirmed cases, recovered cases, and death cases are considered for analysis. The cases are analyzed on a daily and weekly basis to gain insight into the dataset. After extensive analysis, we propose machine learning regressors for COVID-19 prediction. We applied linear regression, polynomial regression, a Decision Tree Regressor, and a Random Forest Regressor. Decision Tree and Random Forest gave an r-squared value of 0.99. We also predicted future cases with these four algorithms, and the polynomial regression technique predicted future cases best. This prediction can help in taking preventive measures to control COVID-19 in the near future. All the experiments were conducted in Python.
Keywords: Covid19, Kaggle, machine learning, regression


Introduction
The novel coronavirus disease 2019 (COVID-19) pandemic began in Wuhan, China, in December 2019 and is a serious, widespread clinical problem throughout the world. Coronaviruses are a large family of viruses that cause illnesses ranging from the common cold to more severe diseases, for instance, Middle East Respiratory Syndrome coronavirus (MERS-CoV) and Severe Acute Respiratory Syndrome coronavirus (SARS-CoV). COVID-19 is caused by a new member of the coronavirus family, found in Wuhan in 2019. Studies show that SARS-CoV was transmitted to humans from civets and MERS-CoV from dromedary camels. COVID-19 is believed to have been contracted by humans from bats. The disease spreads very quickly from one person to another. Some studies also report that the virus can be transmitted through the air. Some variants of the coronavirus also affect animals [1]. Although most countries suffered greatly in the first wave of COVID-19, the transmission rate and death rate of the second wave are far more dangerous than those of the first. New coronavirus variants (such as the delta variant) have emerged and are troubling the world.

Literature review
Machine learning (ML) and deep learning play a vital role in the health sector [2]. Applying ML models for disease prediction is not new, and several authors have applied ML models to COVID-19. Degadwala S. et al. [3] applied Convolutional Neural Networks to classify COVID-19 cases. They collected chest X-ray images of 1560 COVID-19 patients and achieved an accuracy of 90% with their model. Prathyusha K. et al. [4] applied various machine learning regression algorithms, such as linear and polynomial regressors, and achieved the best results with the polynomial regression technique. A. Lakshmanarao et al. [5] applied various regression techniques for analyzing and predicting the disease and achieved good results with linear regression. Sumayh S. Aljameel et al. [6] applied three classification algorithms: random forest, Gradient Boosting, and Logistic Regression. Because their dataset was unbalanced, they first applied the SMOTE sampling technique and then achieved an accuracy of 99% with random forest classification. Yazeed Zoabi et al. [7] applied a machine learning model for predicting COVID-19 with eight binary features and achieved good accuracy. S. Dhamodharavadhani et al. [8] applied a neural-network-based method for predicting the mortality rate of the disease and achieved good results. Archana Kalidindi et al. [9] applied deep learning for classification of the human brain and achieved good results. Malki Z. et al. [10] proposed a machine-learning-oriented COVID-19 prediction model and predicted that the pandemic would decline in September 2021. They applied four classifiers, namely K-NN, support vector machine, random forest, and decision trees. Nishitha et al. [11] applied machine learning techniques to chest images and achieved good results. Sanjay Kumar et al. [12] applied machine learning for analyzing the vaccination process in India. Barmparis

Proposed methodology
The proposed methodology is depicted in Figure 1. First, we collected a dataset from Kaggle. We then analyzed the dataset to find daily and weekly cases country-wise. Finally, we applied regression techniques and Holt's method to predict future trends.

Dataset
The COVID-19 dataset was collected from Kaggle [15]. The dataset contains the features "country", "State", "Date", "Confirmed cases", "Recovered cases", and "Death cases". The data are arranged by date. From them, we extracted the numbers of confirmed, death, and recovered cases, as well as active and closed cases (shown in Table 1).
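The listed features do not include active or closed cases, so these have to be derived from the other columns. The following is a minimal sketch, assuming the common convention that closed cases are recoveries plus deaths and active cases are confirmed cases minus closed cases (the paper does not state its formula). The `DailyRecord` type and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DailyRecord:
    """One row of the date-wise data (hypothetical container, cumulative counts)."""
    date: str
    confirmed: int
    recovered: int
    deaths: int


def closed_cases(record: DailyRecord) -> int:
    # Assumption: a case is closed once the patient has recovered or died.
    return record.recovered + record.deaths


def active_cases(record: DailyRecord) -> int:
    # Assumption: every confirmed case that is not closed is still active.
    return record.confirmed - closed_cases(record)


# Illustrative numbers only, not taken from the Kaggle dataset.
sample = DailyRecord(date="2020-03-01", confirmed=1000, recovered=300, deaths=50)
print(closed_cases(sample), active_cases(sample))
```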

Analysis of active cases
The number of active cases has increased drastically since the first few cases were identified. The distribution plot for active cases is shown in Figure 2.
From Figure 2, it is observed that active COVID-19 cases keep increasing. From January 2020 to May 2021, the cases increased drastically even though precautions were taken.

Analysis of weekly cases
The weekly progress of cases is depicted in Figure 3 (red: death cases, blue: confirmed cases, green: recovered cases).

Fig. 3. Weekly progress of cases worldwide
From Figure 3, it is observed that the number of death cases is proportional to the total number of cases. Total confirmed cases are very high around the 20th week (counting from January 2020). After that, confirmed and recovered cases follow the same trend.
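The weekly view can be derived from the daily cumulative series. The sketch below is one way to do it, assuming the daily values are cumulative counts and that weeks are consecutive 7-day blocks starting at the first record. The function names and sample values are hypothetical.

```python
def weekly_cumulative(daily_cumulative, days_per_week=7):
    """Return the cumulative count on the last day of each (possibly partial) week."""
    n = len(daily_cumulative)
    return [
        daily_cumulative[min(start + days_per_week, n) - 1]
        for start in range(0, n, days_per_week)
    ]


def weekly_increase(daily_cumulative, days_per_week=7):
    """Return the number of new cases per week.

    Assumption: the count before the first record is zero, so the first
    week's increase equals its cumulative value.
    """
    weekly = weekly_cumulative(daily_cumulative, days_per_week)
    return [weekly[0]] + [later - earlier for earlier, later in zip(weekly, weekly[1:])]


# Illustrative cumulative counts for ten days (not real data).
daily = [1, 2, 4, 7, 11, 16, 22, 29, 37, 46]
print(weekly_cumulative(daily))
print(weekly_increase(daily))
```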

Analyzing mortality rate
The mortality data of the top 10 countries are shown in Figure 4 (until May 29th, 2021). The top three countries in confirmed cases are the US, India, and Brazil. The top three countries in death cases are the US, Brazil, and India. So, these three countries are facing a health emergency with COVID-19.
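The ranking behind such a figure can be sketched as follows. The example assumes that the mortality rate is deaths divided by confirmed cases, expressed as a percentage (the paper does not give its definition). The country labels and totals are placeholders, not values from the dataset.

```python
def mortality_rate(deaths, confirmed):
    """Deaths per 100 confirmed cases (assumed definition)."""
    return 100.0 * deaths / confirmed if confirmed else 0.0


def top_countries(totals, key, n=10):
    """Return the n country names with the largest totals[country][key]."""
    return sorted(totals, key=lambda country: totals[country][key], reverse=True)[:n]


# Placeholder totals for three unnamed countries (illustrative only).
totals = {
    "A": {"confirmed": 1000, "deaths": 20},
    "B": {"confirmed": 500, "deaths": 25},
    "C": {"confirmed": 2000, "deaths": 10},
}

for name in top_countries(totals, "confirmed", n=3):
    rate = mortality_rate(totals[name]["deaths"], totals[name]["confirmed"])
    print(name, totals[name]["confirmed"], f"{rate:.2f}%")
```

Ranking by confirmed cases and ranking by deaths can give different orders, which is consistent with the two different top-three lists reported above.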

Experimentation and results
Predicting the number of cases is a regression problem. We applied several ML regression algorithms to the dataset to predict COVID-19 cases. The day number is the independent variable and the number of cases is the dependent variable. Of the regression models we tried, only four performed well for this prediction task.
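Using the day number as the independent variable requires mapping each calendar date to an index. The sketch below shows one such mapping. It assumes that day 1 is 22 January 2020, a start date the paper does not state but which is consistent with the 494 days it reports up to 29-5-2021. The names `FIRST_DAY` and `day_number` are hypothetical.

```python
from datetime import date

# Assumption: the series starts on 22 January 2020 (not stated in the paper).
FIRST_DAY = date(2020, 1, 22)


def day_number(d, first_day=FIRST_DAY):
    """Return the 1-based index of date d counted from first_day."""
    return (d - first_day).days + 1


last_observed = day_number(date(2021, 5, 29))  # last date in the dataset
forecast_day = day_number(date(2021, 7, 1))    # date used for the future comparison
print(last_observed, forecast_day)  # 494 527
```

Under this assumption, predicting the cases on 1-7-2021 means evaluating each fitted model 33 days beyond the last observed day.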

Linear regression
Linear regression is a basic algorithm in which the output variable is predicted as a linear function of the input variable. Here, the day number is the input variable and the number of cases is the output variable. The dataset contains 494 samples (494 days). The dataset is divided into training and testing sets in a 70%:30% split: the training set contains 345 days of cases and the testing set contains 149 days of cases. We then applied linear regression and achieved an r-squared value of 0.90. Figure 5 shows the test set results after applying linear regression.
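The procedure described above (split, fit, score) can be sketched in plain Python. This is an illustrative sketch, not the authors' code: it assumes a random split (the paper does not say how the split was drawn), uses the closed-form least-squares line, and runs on a synthetic series standing in for the Kaggle data. All function names are hypothetical.

```python
import math
import random


def train_test_split(x, y, test_fraction=0.3, seed=0):
    """Shuffle indices and hold out ceil(test_fraction * n) samples for testing."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * len(x))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([x[i] for i in train_idx], [x[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for the 494-day series (illustrative, not real case counts).
days = list(range(1, 495))
cases = [1000 * d + 20 * d * d for d in days]

x_train, x_test, y_train, y_test = train_test_split(days, cases)
slope, intercept = fit_line(x_train, y_train)
predictions = [slope * d + intercept for d in x_test]
print(len(x_train), len(x_test), round(r_squared(y_test, predictions), 3))
```

With 494 samples and a 30% test fraction, this split yields 345 training days and 149 testing days, matching the sizes reported above.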

Polynomial regression
In polynomial regression, an nth-degree polynomial relation is established between the independent and dependent variables. As COVID-19 cases do not increase linearly, polynomial regression is applicable for predicting cases. We applied polynomial regression with several degrees and found the best solution with a degree of 4. Figure 6 shows the test set results after applying polynomial regression. With polynomial regression, an r-squared value of 0.98 is achieved.
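Trying several degrees and comparing them on held-out data can be sketched with NumPy. The example below is a sketch under assumptions: it uses `numpy.polynomial.Polynomial.fit` rather than whatever library the authors used, a random 345/149 split, and a synthetic quartic series in place of the Kaggle data, so the scores it prints say nothing about the paper's results.

```python
import numpy as np
from numpy.polynomial import Polynomial


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for the 494-day series (illustrative quartic trend).
days = np.arange(1, 495, dtype=float)
cases = 2e-3 * days**4 + 5.0 * days**2 + 100.0

# Random 70%:30% split (assumed; 149 test days, 345 training days).
rng = np.random.default_rng(0)
order = rng.permutation(len(days))
test_idx, train_idx = order[:149], order[149:]

# Fit one polynomial per candidate degree and score it on the held-out days.
models, scores = {}, {}
for degree in range(1, 7):
    models[degree] = Polynomial.fit(days[train_idx], cases[train_idx], degree)
    scores[degree] = r_squared(cases[test_idx], models[degree](days[test_idx]))

for degree, score in scores.items():
    print(f"degree {degree}: test R^2 = {score:.4f}")

# Evaluate the degree-4 model beyond the observed range (day index chosen for illustration).
forecast = models[4](527.0)
print(f"degree-4 forecast: {forecast:.0f}")
```

Unlike a piecewise-constant model, a fitted polynomial keeps following its trend outside the training range. That makes extrapolation possible, though high-degree fits can also diverge quickly there.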

Decision tree regression
Decision tree regression is a tree-based ML model. For a given input, the value stored at the leaf node that the input reaches is returned as the prediction. We applied decision tree regression with entropy and achieved an r-squared value of 0.99.
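To make the leaf-value mechanism concrete, here is a minimal single-feature regression tree in plain Python. It is a sketch, not the authors' model: splits are chosen by squared-error reduction (entropy is defined for class labels, so this substitution is an assumption of the sketch), and the data are synthetic.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(points, min_leaf=1):
    """Build a regression tree from (day, cases) pairs sorted by distinct day."""
    ys = [y for _, y in points]
    if len(points) < 2 * min_leaf or len(set(ys)) == 1:
        return {"value": sum(ys) / len(ys)}  # leaf: mean of its samples
    # Pick the split position that minimizes the children's total squared error.
    best = min(range(min_leaf, len(points) - min_leaf + 1),
               key=lambda i: sse(ys[:i]) + sse(ys[i:]))
    threshold = (points[best - 1][0] + points[best][0]) / 2
    return {"threshold": threshold,
            "left": fit_tree(points[:best], min_leaf),
            "right": fit_tree(points[best:], min_leaf)}


def predict(tree, day):
    """Walk from the root to a leaf and return the value stored there."""
    while "value" not in tree:
        tree = tree["left"] if day <= tree["threshold"] else tree["right"]
    return tree["value"]


# Synthetic increasing series for 30 days (illustrative only).
points = [(d, 10 * d * d) for d in range(1, 31)]
tree = fit_tree(points)

print(predict(tree, 15))  # a day inside the training range
print(predict(tree, 30))  # the last training day
print(predict(tree, 60))  # a day beyond the training range
```

Because every prediction is a stored leaf value, any day after the last training day falls into the same rightmost leaf. The tree can therefore fit observed days very closely while returning a constant for all future days.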

Random forest regression
Random forest is an ensemble model that combines the predictions of several decision trees. We applied random forest with 60 decision trees and achieved an r-squared value of 0.99.
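The combination step can be sketched as bagging: each tree is fitted to a bootstrap resample of the training data, and the forest averages the trees' predictions. The example below is a simplified illustration in plain Python with a single feature (so there is no feature subsampling). The depth limit, function names, and synthetic data are assumptions; only the count of 60 trees comes from the text.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(points, max_depth=4):
    """Fit a small 1-D regression tree; returns a function day -> prediction."""
    ys = [y for _, y in points]
    mean = sum(ys) / len(ys)
    xs = sorted({x for x, _ in points})
    if max_depth == 0 or len(xs) < 2:
        return lambda day: mean  # leaf: mean of its samples

    def cost(threshold):
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        return sse(left) + sse(right)

    # Candidate thresholds are midpoints between consecutive distinct days.
    threshold = min(((a + b) / 2 for a, b in zip(xs, xs[1:])), key=cost)
    left = fit_tree([p for p in points if p[0] <= threshold], max_depth - 1)
    right = fit_tree([p for p in points if p[0] > threshold], max_depth - 1)
    return lambda day: left(day) if day <= threshold else right(day)


def fit_forest(points, n_trees=60, seed=0):
    """Average n_trees trees, each fitted to a bootstrap resample of points."""
    rng = random.Random(seed)
    trees = [fit_tree([rng.choice(points) for _ in points]) for _ in range(n_trees)]
    return lambda day: sum(tree(day) for tree in trees) / len(trees)


# Synthetic increasing series for 30 days (illustrative only).
points = [(d, 10 * d * d) for d in range(1, 31)]
forest = fit_forest(points, n_trees=60)

print(forest(15), forest(30), forest(100))
```

Averaging smooths the individual trees, but the ensemble inherits their piecewise-constant behaviour: every day beyond the training range receives the same prediction, which never exceeds the largest training value.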

Comparison of regression algorithms
We applied four regression methods, namely linear regression, polynomial regression, decision tree regression (DTR), and random forest (RF) regression, to three targets: confirmed cases, recovered cases, and death cases. The performance results are given in Table 2. From Table 2, it is observed that Random Forest and Decision Tree performed well for COVID-19 prediction. Although all four algorithms gave good r-squared values, we further compared the regressors on future confirmed cases. The dataset contains the number of confirmed cases up to 29-5-2021, so we predicted the number of cases on 1-7-2021 (the date of writing) and checked which algorithm performed best. For this, we collected the actual number of confirmed cases on that date from [16]. The comparison of the four algorithms for predicting the number of cases on 1-7-2021 is shown in Table 3. From Table 3, it is observed that polynomial regression performed well for future predictions. Although DTR and RF gave good r-squared values, they were unable to predict future cases.

Conclusion
In this paper, we collected a COVID-19 dataset from Kaggle and analyzed the numbers of confirmed, recovered, and death cases on a daily and weekly basis. We then applied four regression algorithms to the dataset and achieved a good r-squared value of 0.99 with decision tree and random forest. Finally, we tried to predict the number of future cases with all four algorithms. Polynomial regression achieved good results when predicting future cases.