Analysis of Time-Series Trends and ARIMA Models to Forecast COVID-19 Cases

COVID-19, caused by a novel coronavirus, originated in Wuhan, China. It turned into a pandemic resulting in a large number of deaths and loss of livelihood. It is vital to determine how the number of cases propagates so that future pandemics can be tackled scientifically. The pandemic can be controlled systematically using efficient health care systems. However, it is difficult to predict the propagation of the pandemic over a long period because of various factors. In this paper, an analysis is made for short periods using statistical tools such as fitting a probability distribution and its probability density function. Forecasting of COVID-19 cases is done using time series trend analysis and ARIMA models. Tests of hypothesis for the difference of means and of standard deviations between the actual and forecasted values, at the 99% confidence level, showed no significant difference between them.


Introduction
The COVID-19 pandemic originated in Wuhan, China, and has caused heavy loss of life, lockdowns and loss of livelihood. Data sets for this pandemic are available on the official website of Johns Hopkins University. The data set for India is considered for statistical analysis, to predict the propagation of the disease and help control it scientifically. The propagation must be modeled scientifically so that policy makers and the healthcare community can prepare for future consequences and help control the problem.

Dataset
For this analysis, the data set for India for the period from 1 April 2020 to 15 June 2020 was obtained from the official website of Johns Hopkins University (https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html). From this, the data from 1 April 2020 to 31 May 2020 are analyzed statistically to predict the number of cases for the period from 1 June to 15 June 2020, and the predictions are compared with the actual data.
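The split between the fitting period and the comparison period can be sketched as follows. This is a minimal illustration that assumes the daily counts are already held in a hypothetical `cases_by_day` mapping from dates to case counts; it does not reproduce the format of the Johns Hopkins data.

```python
from datetime import date, timedelta

# Periods stated in the Dataset section.
TRAIN_START, TRAIN_END = date(2020, 4, 1), date(2020, 5, 31)
TEST_START, TEST_END = date(2020, 6, 1), date(2020, 6, 15)


def days_between(start, end):
    """Return every calendar day from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def split_cases(cases_by_day):
    """Split a {date: case_count} mapping into (train, test) lists.

    `cases_by_day` is a hypothetical in-memory structure used only for
    this sketch; it is not the format of the original download.
    """
    train = [cases_by_day[d] for d in days_between(TRAIN_START, TRAIN_END)]
    test = [cases_by_day[d] for d in days_between(TEST_START, TEST_END)]
    return train, test
```

The fitting period covers 61 days and the comparison period 15 days, so the forecast is evaluated on data that played no part in fitting.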

Model development
The data for India for the above period are considered. A probability distribution is fitted to the data, with the best fit chosen by the Kolmogorov-Smirnov ranking test. Time series trend analysis is used to fit candidate trend models and compare them by their MAPE, MAD and MSD values. The preferred model is the one with the lowest values of these measures. ARIMA modeling is an efficient method for forecasting a time series. The ARIMA procedure analyzes and forecasts equally spaced univariate time series data, transfer function data, and intervention data.
www.psychologyandeducation.net

Measures of Accuracy
The Mean Absolute Percentage Error (MAPE) expresses the accuracy of the fitted time series values as a percentage. The Mean Absolute Deviation (MAD) expresses the accuracy of the fitted values in the same units as the data. The Mean Squared Deviation (MSD) is the average squared deviation of the fitted values from the actual values, so it penalizes large errors more heavily than MAD.
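The three measures, and their use in choosing between trend models, can be sketched as follows. This is an illustration under stated assumptions, not the software used in the paper: the series is synthetic, the two candidate models (linear and exponential growth) are chosen only for the example, and all function names are hypothetical.

```python
import math


def mape(actual, fitted):
    """Mean Absolute Percentage Error in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, fitted)) / len(actual)


def mad(actual, fitted):
    """Mean Absolute Deviation, in the units of the data."""
    return sum(abs(a - f) for a, f in zip(actual, fitted)) / len(actual)


def msd(actual, fitted):
    """Mean Squared Deviation of fitted values from actual values."""
    return sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual)


def fit_line(t, y):
    """Ordinary least squares fit of y = b0 + b1 * t; returns (b0, b1)."""
    n = len(t)
    t_bar, y_bar = sum(t) / n, sum(y) / n
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    b1 = sxy / sxx
    return y_bar - b1 * t_bar, b1


def linear_trend(t, y):
    """Fitted values of the linear trend model."""
    b0, b1 = fit_line(t, y)
    return [b0 + b1 * ti for ti in t]


def exponential_trend(t, y):
    """Fitted values of y = b0 * b1**t, via a line fit to log(y); y must be positive."""
    c0, c1 = fit_line(t, [math.log(yi) for yi in y])
    return [math.exp(c0 + c1 * ti) for ti in t]


# Synthetic series for illustration only (10% growth per step), not the paper's data.
t = list(range(1, 11))
y = [100.0 * 1.1 ** ti for ti in t]
scores = {
    "linear": mape(y, linear_trend(t, y)),
    "exponential": mape(y, exponential_trend(t, y)),
}
best_model = min(scores, key=scores.get)  # model with the lowest MAPE
```

The same comparison can be repeated with `mad` and `msd`; a model is preferred when it gives the lowest values on all three measures.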

Probability Distribution
The given data are input into MathWave software to find the best-fit distribution and its parameters. It was found that the Johnson SB distribution was the best fit by the Kolmogorov-Smirnov and χ² ranking tests (Table 1). The probability distribution function for the Johnson SB distribution follows from the transformation $Z = \gamma + \delta \ln\left(\frac{x - \xi}{\xi + \lambda - x}\right)$, for $\xi < x < \xi + \lambda$, where Z is a standard normal random variable, γ and δ are the shape parameters, λ is a scale parameter and ξ is a location parameter. The parameters for this data set are given in Table 2. A prediction can be attempted for future cases by applying the parameters obtained. Figure 2 gives the probability distribution function.
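The Johnson SB density and the Kolmogorov-Smirnov distance used for ranking can be sketched in a few lines. This is a minimal illustration assuming the standard four-parameter form (shape parameters `gamma` and `delta`, scale `lam`, location `xi`); the function names are hypothetical, and the fitted values from Table 2 are not reproduced here.

```python
import math
from statistics import NormalDist


def johnson_sb_z(x, gamma, delta, lam, xi):
    """Map x in (xi, xi + lam) onto the standard normal scale."""
    u = (x - xi) / lam
    return gamma + delta * math.log(u / (1.0 - u))


def johnson_sb_pdf(x, gamma, delta, lam, xi):
    """Johnson SB probability density; zero outside (xi, xi + lam)."""
    if not xi < x < xi + lam:
        return 0.0
    u = (x - xi) / lam
    z = johnson_sb_z(x, gamma, delta, lam, xi)
    scale = delta / (lam * math.sqrt(2.0 * math.pi) * u * (1.0 - u))
    return scale * math.exp(-0.5 * z * z)


def johnson_sb_cdf(x, gamma, delta, lam, xi):
    """Johnson SB cumulative distribution: standard normal CDF of the transformed value."""
    if x <= xi:
        return 0.0
    if x >= xi + lam:
        return 1.0
    return NormalDist().cdf(johnson_sb_z(x, gamma, delta, lam, xi))


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance between a sample and a fitted CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))
```

A smaller `ks_statistic` value means the fitted distribution follows the empirical distribution more closely, which is the basis on which candidate distributions are ranked.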

ARIMA Method
The data for the months of April and May were normalized using the Box-Cox plot method.

Figure 3. Box-Cox Plot
In the Box-Cox plot, a rounded value of λ = 1 was obtained after two transformations, as shown in Figure 3. Table 5 shows the accuracy measures for the ARIMA models. The model with AR(2) and MA(2) terms gave the lowest values of the accuracy measures. The Modified Box-Pierce (Ljung-Box) χ² statistic was analyzed for the model with two regular differences. Table 6 shows that for each of the lags considered (12, 24, 36 and 48), the p-value, computed from the chi-square statistic and its degrees of freedom (DF), is greater than the significance level of 0.05, which indicates no significant autocorrelation remaining in the residuals.
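The preprocessing and diagnostic steps named here (Box-Cox transformation, regular differencing, and the Ljung-Box statistic) can be sketched with the standard library alone. This is an illustration, not the software used in the paper: the function names are hypothetical, the ARIMA fit itself is not shown, and the degrees of freedom passed to `chi2_sf` are an assumption (commonly the lag minus the number of estimated AR and MA parameters).

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of positive values; lam = 0 means the natural log."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def difference(series, d=1):
    """Apply d regular (lag-1) differences to the series."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return list(series)


def ljung_box_q(residuals, lag):
    """Ljung-Box Q statistic over autocorrelations at lags 1..lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    denom = sum(e * e for e in dev)
    total = 0.0
    for k in range(1, lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


def chi2_sf(x, df):
    """P(X > x) for a chi-square variable, via the incomplete gamma series."""
    if x <= 0:
        return 1.0
    a, h = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while abs(term) > 1e-15 * abs(total) and n < 10000:
        n += 1
        term *= h / (a + n)
        total += term
    lower = total * math.exp(-h + a * math.log(h) - math.lgamma(a))
    return max(0.0, 1.0 - lower)
```

With λ = 1 the Box-Cox transform only shifts the data by one unit, so the differenced series is unchanged by it. A p-value from `chi2_sf(ljung_box_q(residuals, lag), df)` above 0.05 is read as no evidence of remaining autocorrelation at that lag.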

Test of Hypothesis
A test of hypothesis was conducted for the two samples, actual and forecasted. It was found that there was no significant difference between their means. When it is not practical to compare predicted and actual values one by one over a long prediction horizon, this method can be applied to infer that the means of the actual and predicted values are not significantly different. An average prediction may be sufficient.
The two samples, consisting of actual and predicted values, are tested for a significant difference between means and between standard deviations using tests of hypothesis.
H0: There is no significant difference between the two means. H1: The two means differ significantly.
The null hypothesis is not rejected. It is evident from Figure 6 and Figure 7 that there is no significant difference between the two means or between the two standard deviations at the 99% confidence level.
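The comparison of means can be sketched as a two-sample t-test. The paper does not state which test statistic its software used, so this sketch assumes a pooled-variance t-test; the function names are hypothetical, and the critical value must be supplied by the reader (for two samples of 15 values, 28 degrees of freedom, the two-sided 1% critical value is approximately 2.763).

```python
import math
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, df) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


def means_differ(sample_a, sample_b, t_critical):
    """True if H0 (equal means) is rejected at the level implied by t_critical."""
    t, _ = pooled_t_statistic(sample_a, sample_b)
    return abs(t) > t_critical
```

When `means_differ(actual, predicted, 2.763)` returns `False` for the 15 actual and 15 predicted values, the null hypothesis of equal means is not rejected at the 99% confidence level.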

Figure: 2-Sample Standard Deviation Test for Actual cases and Predicted va, Summary Report (distribution of data; compares the spread of the samples).

Conclusion
Data on the number of COVID-19 cases in India for the months of April and May were taken from the official website of Johns Hopkins University for time series trend analysis and ARIMA forecasting. A probability distribution was fitted and the best-fit parameters were found using the Kolmogorov-Smirnov ranking method. Time series trend analysis and ARIMA methods were applied to forecast data for the first half of June using the best-fit parameters. The actual and predicted values were compared using tests of hypothesis for a significant difference in mean and standard deviation, and no significant difference was found.