Introduction

This is not a game: predicting the number of deaths or related quantities due to Covid-19 should not be addressed without considering serious ethical concerns. We do not have the ambition to tell you how the virus will be spreading out in the future, we want just try to understand the trend across Italian regions. Take this analysis critically. We agree with the statistician George Box, who said: ‘All the models are wrong, but some are useful’. Hopefully, some of them may be useful, and this is our hope for this work.

We do not want to play with the life.

Main aims

The main aims of this analysis are the following ones:

  • study the trend of both total hospitalized people in each region and intensive care units (ICU);

  • provide 3/4 days-forward predictions immediately available for the regional health care systems to ease the patients’ allocation;

  • evaluate the effect of the following factors/covariates on the spreading outbreak of the disease: time, lockdown measures adopted by the Italian Government (see here for more details https://en.wikipedia.org/wiki/2020_Italy_coronavirus_lockdown), number of medical swabs, regional membership;

  • evaluate the appropriateness of the policies adopted by the Italian Government to alleviate the disease’s spreading outbreak.

Methodology

We develop Bayesian hierarchical Poisson regression models for the number of total hospitalized and ICU, by including the covariates listed above and considering possible interactions between the lockdown measures and the exposition times. Model comparison has been fulfilled by use of predictive information criteria based on leave-one-out cross-validation.
Out-of-sample predictions (95\(\%\) credible intervals) for the next days have been obtained by simulating from the posterior predictive distribution. Predictions for the future medical swabs have been generated through local polynomial regression (LOESS).

Moreover, we consider distinct starting points of exposition for each region, by acknowledging for the distinct intensities registered in each region:

  • February 24th: Piemonte, Valle d’ Aosta, Abruzzo, P.A. Bolzano, Emilia Romagna, Friuli Venezia Giulia, Lazio, Liguria, Lombardia, Marche, P.A. Trento, Toscana, Umbria, Veneto;

  • February 25th: Sicilia;

  • February 27th: Campania, Puglia, Basilicata, Molise, Calabria;

  • March 3rd: Sardegna.

The fit of the model is performed using the Stan Ecosystem softwares (see https://mc-stan.org/) enjoying Hamiltonian Monte Carlo sampling, an effective extension of the Markov Chain Monte Carlo methods. We used the statistical software R 3.6.3.

This is a dynamic modelling framework: with the same code, we can fit every day the models and update the predictions for the upcoming days.

Limitations of this work

  1. As mentioned above, we feel that the number of hospitalized people and ICU are the most effective and reliable data in this moment, being likely less associated than the number of the new positive cases with the medical swabs. Moreover, they can be fruitfully used to predict patients’ allocation according to the strength of the health care systems. Though, they offer a systematic lagged screenshot of the pandemic trend, if compared with the number of new cases.

  2. The policies for the swabs are not open-access and clearly defined in the different regions and. Thus, underlying on the swabs covariates for the out-of-sample days may yield very unstable predictions. For such a reason, we want to invite the readers to carefully analyze and criticize the assumption of exponentially increasing swabs.

  3. The use of the Poisson distribution could not take care of possible under/over-dispersion in the national data. Moreover, the Poisson regression model depending only on time covariate implies a monotonic trend, and this may be realistic only for the first period of outbreak.

  4. In this first step, we are not accounting for the regions’ populations. Though, we feel this is an important feature. We are going to include them as further covariates.

Data

We analyze the data freely available at “https://github.com/pcm-dpc/COVID-19/tree/master/dati-regioni”, in which the date, the total number of positive people, hospitalized people, intensive care and other relevant quantities are provided for each day and each Italian region from February 24th.

# loading packages
library(rstanarm)
library(ggplot2)
library(bayesplot)
library(tidyverse)
library(lme4)
library(arm)
library(loo)
library(kableExtra)

# daily data
regioni <- read.csv("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv")
head(regioni)
##                  data stato codice_regione denominazione_regione      lat
## 1 2020-02-24T18:00:00   ITA             13               Abruzzo 42.35122
## 2 2020-02-24T18:00:00   ITA             17            Basilicata 40.63947
## 3 2020-02-24T18:00:00   ITA              4          P.A. Bolzano 46.49933
## 4 2020-02-24T18:00:00   ITA             18              Calabria 38.90598
## 5 2020-02-24T18:00:00   ITA             15              Campania 40.83957
## 6 2020-02-24T18:00:00   ITA              8        Emilia-Romagna 44.49437
##       long ricoverati_con_sintomi terapia_intensiva totale_ospedalizzati
## 1 13.39844                      0                 0                    0
## 2 15.80515                      0                 0                    0
## 3 11.35662                      0                 0                    0
## 4 16.59440                      0                 0                    0
## 5 14.25085                      0                 0                    0
## 6 11.34172                     10                 2                   12
##   isolamento_domiciliare totale_positivi variazione_totale_positivi
## 1                      0               0                          0
## 2                      0               0                          0
## 3                      0               0                          0
## 4                      0               0                          0
## 5                      0               0                          0
## 6                      6              18                          0
##   nuovi_positivi dimessi_guariti deceduti totale_casi tamponi note_it note_en
## 1              0               0        0           0       5                
## 2              0               0        0           0       0                
## 3              0               0        0           0       1                
## 4              0               0        0           0       1                
## 5              0               0        0           0      10                
## 6             18               0        0          18     148
n_osp_giornalieri <- length(regioni$totale_ospedalizzati)

# storing variables
## number of distinc times
l_times <- length(unique(regioni$data))
times <- 1:l_times
## number of regions
n_reg <- length(unique(regioni$denominazione_regione))
n <- dim(regioni)[1]
## lag hospitalized
osp_lag <- c()
initialize <- rep(0, n_reg)
osp_lag <- initialize
for (j in (n_reg+1):n){
  osp_lag[j] <- regioni$totale_ospedalizzati[j-n_reg]
}
icu_lag <- c()
initialize <- rep(0, n_reg)
icu_lag <- initialize
for (j in (n_reg+1):n){
  icu_lag[j] <- regioni$terapia_intensiva[j-n_reg]
}
## recodifying distinct times
times <- as.factor(regioni$data)
levels(times) <- c(1:l_times)
times <- as.numeric(times)
regioni$codice_regione <- as.factor(regioni$codice_regione)
## tamponi_ritardati
tamp_lag <- c()
initialize <- rep(0, n_reg)
tamp_lag <- initialize
for (j in (n_reg+1):n){
  tamp_lag[j] <- regioni$tamponi[j-n_reg]
}

Model’s choice for the number of total hospitalized people

We start investigating the relationship between the number of hospitalized people in each region and some covariates. We will end up to select the best model in terms of both Bayesian and classical goodness measures of fit, based on leave-one-out cross-validation and deviance information criteria, respectively. Fit of Bayesian models is performed using Hamiltonian Monte Carlo sampling underlying the rstanarm package, belonging to the Stan Ecosystem. Unless otherwise stated, we use weakly-informative prior distributions for the regression coefficients.

Here we depict the observed total hospitalized and ICU collected until today in each Italian region (black points):



Poisson hierarchical regression

We move into a hierarchical modeling scenario, by adopting the following model for the total hospitalized number of people in each region, \(Y_{ij}, \ i =1,\ldots,n; \ j=1,\ldots, \mathsf{n\_reg}\):

\[\begin{align} Y_{ij} \sim & \mathsf{Poisson}(\lambda_{ij})\\ \lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i)\\ \alpha_j \sim & \mathsf{Normal}(\mu_{\alpha}, \sigma^2_{\alpha}), \end{align}\]

where \(\mu\) is the global intercept and \(\alpha_{j(i)}\) is the random intercept for the region \(j(i)\). \(\mu_{\alpha}\) and \(\sigma^2_{\alpha}\) are the hyperparameters for the random intercepts prior. Analogously as what we have done for the previous four models, we fit distinct hierarchical models, by first adding the squared times and the medical swabs. We estimate the hierarchical models by using the function glmer in the lme4 package, and the function stan_glmer in the rstanarm package.

As clearly emerges from the previous reports, hierarchical models drammatically improve the fit, with the latter being preferable to the model without medical swabs.

However, estimated curves and predictions for Marche’s region were not very good: for such a reason, we choose to fit Marche separately, through a GLM.

Furthermore, we extend our hierarchical models by adopting the following choices:

  • consider squared and cubic effects in the times, \(\mathsf{times}^2, \ \mathsf{times}^3\);

  • consider the \(\mathsf{swabs}\) and the lagged swabs, \(\mathsf{lag\_swabs}\);

  • consider the lagged hospitalized, \(\mathsf{lag\_hosp}\);

  • add the \(\mathsf{lockdown}\) covariate, a numerical non-decreasing function which should try to mimic the lockdown measures adopted by the Italian Government; the function value is zero before March 8th, and then assumes increasing values, with upper bound 1;

  • cut the interval for which each region is observed, in order to acknowledge for the different starting points of Covid-19 registered in each region;

  • evaluate a possible interaction between lockdown measures and exposition times;

  • considering a further lockdown dummy variable, \(\mathsf{reg\_eff}\) taking value 1 for some regions (Puglia, Marche, Lazio, Campania, Calabria, Basilicata), and zero for the others.

Then, the linear predictor for the completed model is then defined as:

\[\begin{align} \lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i + \beta_3 \mathsf{times}^3_i + \beta_4\mathsf{lockdown}_i + \\ & \beta_5 \mathsf{swabs}_i + \beta_6 \mathsf{lag\_swabs}_i + \beta_7 \mathsf{lag\_hosp}_i + \beta_8 \mathsf{times}_i\times \mathsf{lockdown}_i +\\ & \beta_9 \mathsf{times}^2_i\times \mathsf{lockdown}_i + \beta_{10} \mathsf{times}^3_i\times \mathsf{lockdown}_i + \beta_{11}\mathsf{reg\_eff}_i). \end{align}\]
resp <- regioni$totale_ospedalizzati 
n.iter <- 2500
## data redefinition
regioni_c <- as_tibble(cbind(times, osp_lag, icu_lag, regioni, tamp_lag))
regioni_ristr <- cbind(times, osp_lag, icu_lag, regioni, tamp_lag)
regioni_tibble <- as_tibble(regioni_ristr)
regioni_ristr<- regioni_tibble %>%
  filter( (times > 3 & (denominazione_regione =="Campania"|
      denominazione_regione =="Puglia" |
      denominazione_regione =="Basilicata"|
      denominazione_regione =="Molise" |
      denominazione_regione =="Calabria")) |
  (times > 1 & denominazione_regione == "Sicilia")|
  (times > 8 & denominazione_regione == "Sardegna")|
   (times >0 & (denominazione_regione =="Piemonte"|
      denominazione_regione =="Valle d'Aosta" |
      denominazione_regione =="Abruzzo"|
      denominazione_regione =="P.A. Bolzano" |
      denominazione_regione =="Emilia-Romagna"|
      denominazione_regione =="Friuli Venezia Giulia"|
          denominazione_regione =="Lazio" |
          denominazione_regione =="Liguria"|
          denominazione_regione =="Lombardia" |
          #denominazione_regione =="Marche"|
          denominazione_regione =="P.A. Trento"|
      denominazione_regione =="Toscana"|
      denominazione_regione =="Umbria"|
      denominazione_regione =="Veneto")))
# lockdown variable
 lockdown <- c()
 eff<-function(xx) { vec= rep(0, length(xx))
for (i in 1:length(xx)) {
   if (xx[i] <= 15)
   {vec[i]=xx[i]^2/225}
   else
   {vec[i]=1}}
vec
 }
# defined to be zero before march 8
lockdown <- c(rep(0, 14), eff(1:(l_times-14)))
resp <- regioni_ristr$totale_ospedalizzati
times <- regioni_ristr$times
lockdown_rep <- c()
for (j in 1: length(resp)){
   lockdown_rep[j] <- lockdown[times[j]]
}
n_reg <- length(unique(regioni_ristr$denominazione_regione))
times_scaled <- times - mean(times)
dummy_reg <- c(0, 0, 0, 1, 0, 0, 0, 0, 0,
  0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0)
dummy_reg_rep <- c()
veri_codici <- unique(regioni_ristr$codice_regione)

for (j in 1: dim(regioni_ristr)[1]){
  
  dummy_reg_rep[j] <- dummy_reg[
    (1:(n_reg))[regioni_ristr$codice_regione[j]] ]
}

# glm model for marche
regioni_marche <- regioni_c%>%
  filter(denominazione_regione=="Marche")
resp_marche <- regioni_marche$totale_ospedalizzati
times_marche <- c(1:l_times)
tamp_lag_marche <- tamp_lag[regioni$denominazione_regione==
    "Marche"]
osp_lag_marche <- osp_lag[regioni$denominazione_regione==
    "Marche"]
tamponi_marche <- regioni$tamponi[regioni$denominazione_regione=="Marche"]

mod.glm.marche <- stan_glm( resp_marche ~  
    #as.factor(dummy_reg_rep) +
    lockdown*I(times_marche) + 
    lockdown*I(times_marche^2) + 
    lockdown*I(times_marche^3) + 
    tamp_lag_marche  + osp_lag_marche,  data = regioni_marche, 
  family = poisson, iter = n.iter, cores = 4)


  # + double interaction between lockdown and time
  # cubic times effect
  # regional lockdown


mod.hier.5.int <- stan_glmer( resp ~  
    (1 |codice_regione)+ 
    as.factor(dummy_reg_rep)*tamp_lag  +
    as.factor(dummy_reg_rep)*tamponi  +
    as.factor(dummy_reg_rep)*osp_lag  +
    #as.factor(dummy_reg_rep) +
    lockdown_rep*I(times_scaled) + 
    lockdown_rep*I(times_scaled^2) + 
    lockdown_rep*I(times_scaled^3) + 
    tamp_lag  + osp_lag + 
    tamponi,  data = regioni_ristr, 
  family = poisson, iter = n.iter, cores = 4)

loo.hier.5.int <- loo(mod.hier.5.int)$estimates[3,1]
loo.hier.5.int

In the previous reports, we found that the model with medical swabs and lagged hospitalized as covariates along with interaction between lockdown and time, was the best in terms of goodness of fit measures. In general, the interaction between lockdown and time improves the fit. Here we improved the model fit by adding more covariates and more interactions.

Model estimates

Let’s retrieve some posterior estimates for the selected model:


There is much between-group variability (look at the parameter \(\sigma_{\alpha}\)) between the regions, whereas \(\beta_1,\ldots,\beta_5\) appear to be not much pretty different than zero.

We can check our model via posterior predictive checks: for instance, comparing the empirical disribution function of the true data with the empirical distribution functions from the replications under the model, or some statistics such as the mean and the standard deviation between replicated data and true data.




There could be some issues with the number of zeros (here not addressed).

Graphical analysis and predictions

We are going to use our models to make predictions for the next days. One difficulty regards the simulation of out-of-sample covariates, such as the medical swabs or the lagged hospitalized. The best fitting is obtained by the model with the lagged hospitalized covariate, mod.hier.5.int, thus we use this to make predictions.

Though, we need to sample new medical swabs for each region in the out-of-sample days, and this is drammatically difficult, because the swabs policies are not clear and known, and the number of swabs is a biased measure for sampling the entire population.

For such a reason, for each region we fit a local polynomial surface determined only by the time and squared time covariates:





The red points represent the local polynomial predictions for the swabs: again, these predictions underly on a very simplified assumption!

Now we can try to predict the number of total hospitalized for the 21 regions in the next day (summary tables are reported at the end of the document). Red ribbons represent 95% predictive intervals for out-of-sample predictions, whereas blue ribbons represent 95% predictive intervals for in-sample data. First of all, we depict the posterior intervals for Marche, whose fitting is a GLM.



Models for intensive care units (ICU)

After similar analysis based on leave-one-out cross-validation to account for models’ goodness of fit, we detected a final model for the total intensive care, \(\mathsf{icu}\), as well. We can make predictions for each region, similarly to the previous framework:

\[\begin{align} \mathsf{icu}_{ij} \sim & \mathsf{Poisson}(\lambda_{ij})\\ \lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i + \beta_3 \mathsf{times}^3_i + \beta_4\mathsf{lockdown}_i + \\ & \beta_5 \mathsf{lag\_swabs}_i + \beta_6 \mathsf{lag\_icu}_i + \beta_7 \mathsf{times}_i\times \mathsf{lockdown}_i +\\ & \beta_8 \mathsf{times}^2_i\times \mathsf{lockdown}_i + \beta_9 \mathsf{times}^3_i\times \mathsf{lockdown}_i + \\ & \beta_{10}\mathsf{reg\_eff}_i +\beta_{11}\mathsf{reg\_eff}_i \mathsf{lag\_swabs}_i). \end{align}\]
# glm for Marche
icu_marche <- regioni$terapia_intensiva[regioni$denominazione_regione == "Marche"]
mod.glm.marche.icu <- stan_glm( icu_marche ~  
    #as.factor(dummy_reg_rep) +
    lockdown*I(times_marche) + 
    lockdown*I(times_marche^2) + 
    lockdown*I(times_marche^3) + 
    tamp_lag_marche  + osp_lag_marche,  data = regioni_marche, 
  family = poisson, iter = n.iter, cores = 4)

# hierarchical for other regions

icu <- regioni_ristr$terapia_intensiva
mod.hier.icu.2.int <- stan_glmer( icu ~  
    (1 |codice_regione)+
    as.factor(dummy_reg_rep)*tamp_lag +
    lockdown_rep*times_scaled+ 
    lockdown_rep*I(times_scaled^2) +
    lockdown_rep*I(times_scaled^3) + icu_lag + tamp_lag,
      data = regioni_ristr,
      family = poisson, iter = n.iter, cores = 4)
print(mod.hier.icu.2.int)

We retrieve posterior parameters estimations, and then we attempt one-day predictions: Marche is separated from the other regions, estimated according to a GLM.




Predictive tables for ICU and hospitalized

Here there are the tables for the predicted total hospitalized and ICU in the next day. Points predictions are expressed as medians, whereas Lower and Upper are the lower and the upper bounds of the 95\(\%\) credible intervals, respectively. We report also the observed previous values, for the day before:

04-apr-2020: Predicted number of tot. hospitalized
3 Apr. 2020
4 Apr 2020
Regione Yesterday-true Yesterday-pred Prediction Lower Upper
Abruzzo 437 443 465 420 511
Basilicata 60 46 49 35 65
P.A. Bolzano 351 389 357 319 397
Calabria 200 164 164 139 194
Campania 647 626 609 553 663
Emilia-Romagna 4279 4363 4537 4385 4699
Friuli Venezia Giulia 262 344 357 319 401
Lazio 1382 1238 1204 1121 1288
Liguria 1320 1387 1451 1366 1535
Lombardia 13183 12792 12911 12568 13234
Molise 39 48 50 36 67
Piemonte 3752 3703 3881 3740 4042
Puglia 771 637 731 676 791
Sardegna 146 144 152 127 178
Sicilia 608 509 536 490 590
Toscana 1437 1514 1551 1460 1637
P.A. Trento 423 392 443 401 489
Umbria 213 225 236 206 271
Valle d’Aosta 110 128 134 111 161
Veneto 2049 2118 2084 1976 2183
Marche 1150 1643 1033 1131 1236
04-apr-2020: Predicted number of ICU
3 Apr. 2020
4 Apr 2020
Regione Yesterday-true Yesterday-pred Prediction Lower Upper
Abruzzo 76 77 80 62 99
Basilicata 19 16 17 9 26
P.A. Bolzano 60 67 75 58 94
Calabria 17 25 27 17 39
Campania 115 145 167 139 195
Emilia-Romagna 364 390 395 352 440
Friuli Venezia Giulia 61 61 62 47 79
Lazio 188 167 209 176 244
Liguria 173 194 199 171 231
Lombardia 1381 1226 1156 1069 1247
Molise 8 11 11 5 19
Piemonte 452 438 449 403 496
Puglia 123 87 102 81 123
Sardegna 24 22 23 14 34
Sicilia 73 74 77 59 95
Toscana 288 286 294 257 332
P.A. Trento 80 62 67 51 85
Umbria 48 51 52 38 68
Valle d’Aosta 25 24 25 16 36
Veneto 335 340 339 298 381
Marche 164 231 117 148 185