Introduction

This is not a game: predicting the number of deaths or related quantities due to Covid-19 should not be addressed without considering serious ethical concerns. We do not have the ambition to tell you how the virus will be spreading out in the future, we want just try to understand the trend across Italian regions. Take this analysis critically. We agree with the statistician George Box, who said: ‘All the models are wrong, but some are useful’. Hopefully, some of them may be useful, and this is our hope for this work.

We do not want to play with the life.

Main aims

The main aims of this analysis are the following ones:

  • study the trend of both total hospitalized people in each region and intensive care units (ICU);

  • provide 3/4 days-forward predictions immediately available for the regional health care systems to ease the patients’ allocation;

  • evaluate the effect of the following factors/covariates on the spreading outbreak of the disease: time, lockdown measures adopted by the Italian Government (see here for more details https://en.wikipedia.org/wiki/2020_Italy_coronavirus_lockdown), number of medical swabs, regional membership;

  • evaluate the appropriateness of the policies adopted by the Italian Government to alleviate the disease’s spreading outbreak.

Methodology

We develop Bayesian hierarchical Poisson regression models for the number of total hospitalized and ICU, by including the covariates listed above and considering possible interactions between the lockdown measures and the exposition times. Model comparison has been fulfilled by use of predictive information criteria based on leave-one-out cross-validation.
Out-of-sample predictions (95\(\%\) credible intervals) for the next days have been obtained by simulating from the posterior predictive distribution. Predictions for the future medical swabs have been generated through local polynomial regression (LOESS).

Moreover, we consider distinct starting points of exposition for each region, by acknowledging for the distinct intensities registered in each region:

  • February 24th: Piemonte, Valle d’ Aosta, Abruzzo, P.A. Bolzano, Emilia Romagna, Friuli Venezia Giulia, Lazio, Liguria, Lombardia, Marche, P.A. Trento, Toscana, Umbria, Veneto;

  • February 25th: Sicilia;

  • February 27th: Campania, Puglia, Basilicata, Molise, Calabria;

  • March 3rd: Sardegna.

The fit of the model is performed using the Stan Ecosystem softwares (see https://mc-stan.org/) enjoying Hamiltonian Monte Carlo sampling, an effective extension of the Markov Chain Monte Carlo methods. We used the statistical software R 3.6.3.

This is a dynamic modelling framework: with the same code, we can fit every day the models and update the predictions for the upcoming days.

Limitations of this work

  1. As mentioned above, we feel that the number of hospitalized people and ICU are the most effective and reliable data in this moment, being likely less associated than the number of the new positive cases with the medical swabs. Moreover, they can be fruitfully used to predict patients’ allocation according to the strength of the health care systems. Though, they offer a systematic lagged screenshot of the pandemic trend, if compared with the number of new cases.

  2. The policies for the swabs are not open-access and clearly defined in the different regions and. Thus, underlying on the swabs covariates for the out-of-sample days may yield very unstable predictions. For such a reason, we want to invite the readers to carefully analyze and criticize the assumption of exponentially increasing swabs.

  3. The use of the Poisson distribution could not take care of possible under/over-dispersion in the national data. Moreover, the Poisson regression model depending only on time covariate implies a monotonic trend, and this may be realistic only for the first period of outbreak.

  4. In this first step, we are not accounting for the regions’ populations. Though, we feel this is an important feature. We are going to include them as further covariates.

Data

We analyze the data freely available at “https://github.com/pcm-dpc/COVID-19/tree/master/dati-regioni”, in which the date, the total number of positive people, hospitalized people, intensive care and other relevant quantities are provided for each day and each Italian region from February 24th.

# loading packages
library(rstanarm)
library(ggplot2)
library(bayesplot)
library(tidyverse)
library(lme4)
library(arm)
library(loo)
library(kableExtra)

# daily data
regioni <- read.csv("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv")
head(regioni)
##                  data stato codice_regione denominazione_regione      lat
## 1 2020-02-24T18:00:00   ITA             13               Abruzzo 42.35122
## 2 2020-02-24T18:00:00   ITA             17            Basilicata 40.63947
## 3 2020-02-24T18:00:00   ITA              4          P.A. Bolzano 46.49933
## 4 2020-02-24T18:00:00   ITA             18              Calabria 38.90598
## 5 2020-02-24T18:00:00   ITA             15              Campania 40.83957
## 6 2020-02-24T18:00:00   ITA              8        Emilia-Romagna 44.49437
##       long ricoverati_con_sintomi terapia_intensiva totale_ospedalizzati
## 1 13.39844                      0                 0                    0
## 2 15.80515                      0                 0                    0
## 3 11.35662                      0                 0                    0
## 4 16.59440                      0                 0                    0
## 5 14.25085                      0                 0                    0
## 6 11.34172                     10                 2                   12
##   isolamento_domiciliare totale_positivi variazione_totale_positivi
## 1                      0               0                          0
## 2                      0               0                          0
## 3                      0               0                          0
## 4                      0               0                          0
## 5                      0               0                          0
## 6                      6              18                          0
##   nuovi_positivi dimessi_guariti deceduti totale_casi tamponi note_it note_en
## 1              0               0        0           0       5                
## 2              0               0        0           0       0                
## 3              0               0        0           0       1                
## 4              0               0        0           0       1                
## 5              0               0        0           0      10                
## 6             18               0        0          18     148
n_osp_giornalieri <- length(regioni$totale_ospedalizzati)

# storing variables
## number of distinc times
l_times <- length(unique(regioni$data))
times <- 1:l_times
## number of regions
n_reg <- length(unique(regioni$denominazione_regione))
n <- dim(regioni)[1]
## lag hospitalized
osp_lag <- c()
initialize <- rep(0, n_reg)
osp_lag <- initialize
for (j in (n_reg+1):n){
  osp_lag[j] <- regioni$totale_ospedalizzati[j-n_reg]
}
icu_lag <- c()
initialize <- rep(0, n_reg)
icu_lag <- initialize
for (j in (n_reg+1):n){
  icu_lag[j] <- regioni$terapia_intensiva[j-n_reg]
}
## recodifying distinct times
times <- as.factor(regioni$data)
levels(times) <- c(1:l_times)
times <- as.numeric(times)
regioni$codice_regione <- as.factor(regioni$codice_regione)
## tamponi_ritardati
tamp_lag <- c()
initialize <- rep(0, n_reg)
tamp_lag <- initialize
for (j in (n_reg+1):n){
  tamp_lag[j] <- regioni$tamponi[j-n_reg]
}
# if launched before 0.00 AM
yesterday_day <- substr(format(Sys.Date(), format="%d%b-%Y"),2,5)
exact_day <- substr(format(Sys.Date()+1, format="%d%b-%Y"),2,5)

yesterday_day_long <- Sys.Date()
exact_day_long <- Sys.Date()+1

Model’s choice for the number of total hospitalized people

We start investigating the relationship between the number of hospitalized people in each region and some covariates. We will end up to select the best model in terms of both Bayesian and classical goodness measures of fit, based on leave-one-out cross-validation and deviance information criteria, respectively. Fit of Bayesian models is performed using Hamiltonian Monte Carlo sampling underlying the rstanarm package, belonging to the Stan Ecosystem. Unless otherwise stated, we use weakly-informative prior distributions for the regression coefficients.

Here we depict the observed total hospitalized and ICU collected until today in each Italian region (black points):



Poisson hierarchical regression

We move into a hierarchical modeling scenario, by adopting the following model for the total hospitalized number of people in each region, \(Y_{ij}, \ i =1,\ldots,n; \ j=1,\ldots, \mathsf{n\_reg}\):

\[\begin{align} Y_{ij} \sim & \mathsf{Poisson}(\lambda_{ij})\\ \lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i)\\ \alpha_j \sim & \mathsf{Normal}(\mu_{\alpha}, \sigma^2_{\alpha}), \end{align}\]

where \(\mu\) is the global intercept and \(\alpha_{j(i)}\) is the random intercept for the region \(j(i)\). \(\mu_{\alpha}\) and \(\sigma^2_{\alpha}\) are the hyperparameters for the random intercepts prior. Analogously as what we have done for the previous four models, we fit distinct hierarchical models, by first adding the squared times and the medical swabs. We estimate the hierarchical models by using the function glmer in the lme4 package, and the function stan_glmer in the rstanarm package.

As clearly emerges from the previous reports, hierarchical models drammatically improve the fit, with the latter being preferable to the model without medical swabs.

However, estimated curves and predictions for Marche and Friuli Venezia Giulia were not very good: for such a reason, we choose to fit them separately, through a GLM.

Furthermore, we extend our hierarchical models by adopting the following choices:

  • consider squared and cubic effects in the times, \(\mathsf{times}^2, \ \mathsf{times}^3\);

  • consider the \(\mathsf{swabs}\) and the lagged swabs, \(\mathsf{lag\_swabs}\);

  • consider the lagged hospitalized, \(\mathsf{lag\_hosp}\);

  • add the \(\mathsf{lockdown}\) covariate, a numerical non-decreasing function which should try to mimic the lockdown measures adopted by the Italian Government; the function value is zero before March 8th, and then assumes increasing values, with upper bound 1;

  • cut the interval for which each region is observed, in order to acknowledge for the different starting points of Covid-19 registered in each region;

  • evaluate a possible interaction between lockdown measures and exposition times;

  • considering a further lockdown dummy variable, \(\mathsf{reg\_eff}\) taking value 1 for some regions (Puglia, Marche, Lazio, Campania, Calabria, Basilicata), and zero for the others.

Then, the linear predictor for the completed model is then defined as:

\[\begin{align} \lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i + \beta_3 \mathsf{times}^3_i + \beta_4\mathsf{lockdown}_i + \\ & \beta_5 \mathsf{swabs}_i + \beta_6 \mathsf{lag\_swabs}_i + \beta_7 \mathsf{lag\_hosp}_i + \beta_8 \mathsf{times}_i\times \mathsf{lockdown}_i +\\ & \beta_9 \mathsf{times}^2_i\times \mathsf{lockdown}_i + \beta_{10} \mathsf{times}^3_i\times \mathsf{lockdown}_i + \beta_{11}\mathsf{reg\_eff}_i). \end{align}\]
resp <- regioni$totale_ospedalizzati 
n.iter <- 3000
## data redefinition
regioni_c <- as_tibble(cbind(times, osp_lag, icu_lag, regioni, tamp_lag))
regioni_ristr <- cbind(times, osp_lag, icu_lag, regioni, tamp_lag)
regioni_tibble <- as_tibble(regioni_ristr)
regioni_ristr<- regioni_tibble %>%
  filter( (times > 3 & (denominazione_regione =="Campania"|
      denominazione_regione =="Puglia" |
      denominazione_regione =="Basilicata"|
      denominazione_regione =="Molise" |
      denominazione_regione =="Calabria")) |
  (times > 1 & denominazione_regione == "Sicilia")|
  (times > 8 & denominazione_regione == "Sardegna")|
   (times >0 & (denominazione_regione =="Piemonte"|
      denominazione_regione =="Valle d'Aosta" |
      denominazione_regione =="Abruzzo"|
      denominazione_regione =="P.A. Bolzano" |
      denominazione_regione =="Emilia-Romagna"|
      denominazione_regione =="Friuli Venezia Giulia"|
          denominazione_regione =="Lazio" |
          denominazione_regione =="Liguria"|
          denominazione_regione =="Lombardia" |
          #denominazione_regione =="Marche"|
          denominazione_regione =="P.A. Trento"|
      denominazione_regione =="Toscana"|
      denominazione_regione =="Umbria"|
      denominazione_regione =="Veneto")))
# lockdown variable
 lockdown <- c()
 eff<-function(xx) { vec= rep(0, length(xx))
for (i in 1:length(xx)) {
   if (xx[i] <= 15)
   {vec[i]=xx[i]^2/225}
   else
   {vec[i]=1}}
vec
 }
# defined to be zero before march 8
lockdown <- c(rep(0, 14), eff(1:(l_times-14)))
resp <- regioni_ristr$totale_ospedalizzati
times <- regioni_ristr$times
lockdown_rep <- c()
for (j in 1: length(resp)){
   lockdown_rep[j] <- lockdown[times[j]]
}
n_reg <- length(unique(regioni_ristr$denominazione_regione))
times_scaled <- times - mean(times)
dummy_reg <- c(0, 0, 0, 1, 0, 0, 0, 0, 0,
  0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0)
dummy_reg_rep <- c()
veri_codici <- unique(regioni_ristr$codice_regione)

for (j in 1: dim(regioni_ristr)[1]){
  
  dummy_reg_rep[j] <- dummy_reg[
    (1:(n_reg))[regioni_ristr$codice_regione[j]] ]
}

# glm model for marche
regioni_marche <- regioni_c%>%
  filter(denominazione_regione=="Marche")
resp_marche <- regioni_marche$totale_ospedalizzati
times_marche <- c(1:l_times)
tamp_lag_marche <- tamp_lag[regioni$denominazione_regione==
    "Marche"]
osp_lag_marche <- osp_lag[regioni$denominazione_regione==
    "Marche"]
tamponi_marche <- regioni$tamponi[regioni$denominazione_regione=="Marche"]

mod.glm.marche <- stan_glm( resp_marche ~  
    #as.factor(dummy_reg_rep) +
    lockdown*I(times_marche) + 
    lockdown*I(times_marche^2) + 
    lockdown*I(times_marche^3) + 
    tamp_lag_marche  + osp_lag_marche,  data = regioni_marche, 
  family = poisson, iter = n.iter, cores = 4)


# glm model for fvg
regioni_fvg <- regioni_c%>%
  filter(denominazione_regione=="Friuli Venezia Giulia")
resp_fvg <- regioni_fvg$totale_ospedalizzati
times_fvg <- c(1:l_times)
tamp_lag_fvg <- tamp_lag[regioni$denominazione_regione==
    "Friuli Venezia Giulia"]
osp_lag_fvg <- osp_lag[regioni$denominazione_regione==
    "Friuli Venezia Giulia"]
tamponi_fvg <- regioni$tamponi[regioni$denominazione_regione=="Friuli Venezia Giulia"]

mod.glm.fvg <- stan_glm( resp_fvg ~  
    #as.factor(dummy_reg_rep) +
    lockdown*I(times_fvg) + 
    lockdown*I(times_fvg^2) + 
    lockdown*I(times_fvg^3) + 
    tamp_lag_fvg  + osp_lag_fvg,  data = regioni_fvg, 
  family = poisson, iter = n.iter, cores = 4)


  # + double interaction between lockdown and time
  # cubic times effect
  # regional lockdown


mod.hier.5.int <- stan_glmer( resp ~  
    (1 |codice_regione)+ 
    as.factor(dummy_reg_rep)*tamp_lag  +
    as.factor(dummy_reg_rep)*tamponi  +
    as.factor(dummy_reg_rep)*osp_lag  +
    #as.factor(dummy_reg_rep) +
    lockdown_rep*I(times_scaled) + 
    lockdown_rep*I(times_scaled^2) + 
    lockdown_rep*I(times_scaled^3) + 
    tamp_lag  + osp_lag + 
    tamponi,  data = regioni_ristr, 
  family = poisson, iter = n.iter, cores = 4)

loo.hier.5.int <- loo(mod.hier.5.int)$estimates[3,1]
loo.hier.5.int

In the previous reports, we found that the model with medical swabs and lagged hospitalized as covariates along with interaction between lockdown and time, was the best in terms of goodness of fit measures. In general, the interaction between lockdown and time improves the fit. Here we improved the model fit by adding more covariates and more interactions.

Model estimates

Let’s retrieve some posterior estimates for the selected model:


There is much between-group variability (look at the parameter \(\sigma_{\alpha}\)) between the regions, whereas \(\beta_1,\ldots,\beta_5\) appear to be not much pretty different than zero.

We can check our model via posterior predictive checks: for instance, comparing the empirical disribution function of the true data with the empirical distribution functions from the replications under the model, or some statistics such as the mean and the standard deviation between replicated data and true data.




There could be some issues with the number of zeros (here not addressed).

Graphical analysis and predictions

We are going to use our models to make predictions for the next days. One difficulty regards the simulation of out-of-sample covariates, such as the medical swabs or the lagged hospitalized. The best fitting is obtained by the model with the lagged hospitalized covariate, mod.hier.5.int, thus we use this to make predictions.

Though, we need to sample new medical swabs for each region in the out-of-sample days, and this is drammatically difficult, because the swabs policies are not clear and known, and the number of swabs is a biased measure for sampling the entire population.

For such a reason, for each region we fit a local polynomial surface determined only by the time and squared time covariates:





The red points represent the local polynomial predictions for the swabs: again, these predictions underly on a very simplified assumption!

Now we can try to predict the number of total hospitalized for the 21 regions in the next day (summary tables are reported at the end of the document). Red ribbons represent 95% predictive intervals for out-of-sample predictions, whereas blue ribbons represent 95% predictive intervals for in-sample data. First of all, we depict the posterior intervals for Marche, whose fitting is a GLM. We report also the GLM fitting for Friuli Venezia Giulia, whose curve is better estimated than hierarchical fitting.




Models for intensive care units (ICU)

After similar analysis based on leave-one-out cross-validation to account for models’ goodness of fit, we detected a final model for the total intensive care, \(\mathsf{icu}\), as well. We can make predictions for each region, similarly to the previous framework:

\[\begin{align} \mathsf{icu}_{ij} \sim & \mathsf{Poisson}(\lambda_{ij})\\ \lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i + \beta_3 \mathsf{times}^3_i + \beta_4\mathsf{lockdown}_i + \\ & \beta_5 \mathsf{lag\_swabs}_i + \beta_6 \mathsf{lag\_icu}_i + \beta_7 \mathsf{times}_i\times \mathsf{lockdown}_i +\\ & \beta_8 \mathsf{times}^2_i\times \mathsf{lockdown}_i + \beta_9 \mathsf{times}^3_i\times \mathsf{lockdown}_i + \\ & \beta_{10}\mathsf{reg\_eff}_i +\beta_{11}\mathsf{reg\_eff}_i \mathsf{lag\_swabs}_i). \end{align}\]
# glm for Marche
icu_marche <- regioni$terapia_intensiva[regioni$denominazione_regione == "Marche"]
mod.glm.marche.icu <- stan_glm( icu_marche ~  
    #as.factor(dummy_reg_rep) +
    lockdown*I(times_marche) + 
    lockdown*I(times_marche^2) + 
    lockdown*I(times_marche^3) + 
    tamp_lag_marche  + osp_lag_marche,  data = regioni_marche, 
  family = poisson, iter = n.iter, cores = 4)

# hierarchical for other regions

icu <- regioni_ristr$terapia_intensiva
mod.hier.icu.2.int <- stan_glmer( icu ~  
    (1 |codice_regione)+
    as.factor(dummy_reg_rep)*tamp_lag +
    lockdown_rep*times_scaled+ 
    lockdown_rep*I(times_scaled^2) +
    lockdown_rep*I(times_scaled^3) + icu_lag + tamp_lag,
      data = regioni_ristr,
      family = poisson, iter = n.iter, cores = 4)
print(mod.hier.icu.2.int)

We retrieve posterior parameters estimations, and then we attempt one-day predictions: Marche is separated from the other regions, estimated according to a GLM.




Predictive tables for ICU and hospitalized

Here there are the tables for the predicted total hospitalized and ICU in the next day. Points predictions are expressed as medians, whereas Lower and Upper are the lower and the upper bounds of the 95\(\%\) credible intervals, respectively. We report also the observed previous values, for the day before:

14 apr 2020: Predicted number of tot. hospitalized
14 Apr 2020
Regione Prediction Lower Upper
Abruzzo 410 369 452
Basilicata 55 40 70
P.A. Bolzano 333 297 371
Calabria 167 142 193
Campania 633 581 685
Emilia-Romagna 4373 4231 4518
Lazio 1369 1287 1452
Liguria 1297 1221 1371
Lombardia 12355 12086 12622
Molise 41 29 54
Piemonte 3668 3541 3793
Puglia 652 602 707
Sardegna 139 116 163
Sicilia 546 499 593
Toscana 1373 1299 1450
P.A. Trento 372 334 411
Umbria 206 178 236
Valle d’Aosta 124 102 147
Veneto 1931 1838 2023
Friuli Venezia Giulia 176 142 212
Marche 956 488 1939
14 apr 2020: Predicted number of ICU
14 Apr 2020
Regione Prediction Lower Upper
Abruzzo 51 38 67
Basilicata 12 5 19
P.A. Bolzano 56 42 72
Calabria 16 9 25
Campania 115 94 137
Emilia-Romagna 272 238 307
Friuli Venezia Giulia 38 26 51
Lazio 187 159 218
Liguria 131 108 154
Lombardia 1044 975 1121
Molise 6 2 12
Piemonte 320 283 357
Puglia 84 66 103
Sardegna 17 9 25
Sicilia 51 37 65
Toscana 206 178 236
P.A. Trento 51 37 66
Umbria 33 22 46
Valle d’Aosta 16 8 25
Veneto 242 211 275
Marche 84 15 463

Predictive ability

We compute the predictive ability of our models until April, 9th: blue points in the plots below are the observed points, whereas red ribbons represent 95% predictive interval.



## [1] 0.1146594
## [1] 0.1910487
## [1] 0.1271562
## [1] 0.1946323
## [1] 0.5380952
## [1] 0.8333333
## [1] 0.5238095
## [1] 0.9047619