Introduction

This is not a game: predicting the number of deaths or related quantities due to Covid-19 should not be addressed without considering serious ethical concerns. We do not have the ambition to tell you how the virus will spread in the future; we just want to try to understand the trend across Italian regions. Take this analysis critically. We agree with the statistician George Box, who said: ‘All models are wrong, but some are useful’. Hopefully, some of them may be useful: this is our hope for this work.

We do not want to play with lives.

Main aims

The main aims of this analysis are the following:

  • study the trend of both the total number of hospitalized people and the number of intensive care unit (ICU) patients in each region;

  • provide 3-4 day-ahead predictions, immediately available to the regional health care systems to ease patient allocation;

  • evaluate the effect of the following factors/covariates on the spread of the disease: time, the lockdown measures adopted by the Italian Government (see https://en.wikipedia.org/wiki/2020_Italy_coronavirus_lockdown for more details), the number of medical swabs, and regional membership;

  • evaluate the appropriateness of the policies adopted by the Italian Government to mitigate the spread of the disease.

Methodology

We develop Bayesian hierarchical Poisson regression models for the total number of hospitalized people and of ICU patients, including the covariates listed above and considering possible interactions between the lockdown measures and the exposure times. Model comparison is carried out with predictive information criteria based on leave-one-out cross-validation.
Out-of-sample predictions (95\(\%\) credible intervals) for the next days are obtained by simulating from the posterior predictive distribution. Predictions for the future medical swabs are generated through local polynomial regression (LOESS).

Moreover, we consider a distinct exposure starting point for each region, to account for the different intensities registered in each region:

  • February 24th: Piemonte, Valle d'Aosta, Abruzzo, P.A. Bolzano, Emilia Romagna, Friuli Venezia Giulia, Lazio, Liguria, Lombardia, Marche, P.A. Trento, Toscana, Umbria, Veneto;

  • February 25th: Sicilia;

  • February 27th: Campania, Puglia, Basilicata, Molise, Calabria;

  • March 3rd: Sardegna.

The models are fitted with the Stan ecosystem software (see https://mc-stan.org/), which relies on Hamiltonian Monte Carlo sampling, an effective extension of Markov chain Monte Carlo methods. We used the statistical software R 3.6.3.

This is a dynamic modelling framework: with the same code, we can refit the models every day and update the predictions for the upcoming days.
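
As a minimal sketch of this daily workflow (a simplified illustration, not the full pipeline: it assumes a data frame that already contains the covariates built later in this document, namely times_scaled, lockdown_rep, tamponi and osp_lag), the refit-and-predict step could be wrapped in a single function and rerun each day:

library(rstanarm)

# sketch: refit the selected hierarchical model on the latest prepared data
# and return posterior predictive draws for the supplied out-of-sample days
update_predictions <- function(prepared_data, newdata_future, n_iter = 2000) {
  fit <- stan_glmer(totale_ospedalizzati ~ (1 | codice_regione) +
                      lockdown_rep * times_scaled + I(times_scaled^2) +
                      tamponi + osp_lag,
                    data = prepared_data, family = poisson, iter = n_iter)
  posterior_predict(fit, newdata = newdata_future)
}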

Limitations of this work

  1. As mentioned above, we feel that the numbers of hospitalized people and of ICU patients are the most effective and reliable data at this moment, since they are likely less dependent on the number of medical swabs than the number of new positive cases. Moreover, they can be fruitfully used to predict patient allocation according to the capacity of the health care systems. However, compared with the number of new cases, they offer a systematically lagged snapshot of the pandemic trend.

  2. The swab policies are not open-access and are not clearly defined across the different regions. Thus, relying on the swab covariates for the out-of-sample days may yield very unstable predictions. For this reason, we invite the readers to carefully analyze and criticize the assumption of exponentially increasing swabs.

  3. The Poisson distribution may not account for possible under/over-dispersion in the national data (a possible check is sketched after this list). Moreover, a Poisson regression model depending only on the time covariate implies a monotonic trend, and this may be realistic only for the first period of the outbreak.

  4. In this first step, we are not accounting for the regions’ populations. However, we feel this is an important feature, and we are going to include it as a further covariate.
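
Regarding point 3, a possible (rough) check is to refit the final model with a negative binomial likelihood, which adds a dispersion parameter, and compare the LOOIC values; a sketch, assuming the objects defined later in this document (regioni_ristr, resp, times_scaled, lockdown_rep, osp_lag, n.iter) are available:

# sketch: negative binomial alternative to the final Poisson model,
# to check for over-dispersion (compare its LOOIC with loo.hier.5.int)
mod.hier.5.int.nb <- stan_glmer(resp ~ (1|codice_regione) +
      lockdown_rep*times_scaled + I(times_scaled^2) +
      tamponi + osp_lag,
    data = regioni_ristr, family = neg_binomial_2, iter = n.iter)
loo(mod.hier.5.int.nb)$estimates[3,1]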

Data

We analyze the data freely available at https://github.com/pcm-dpc/COVID-19/tree/master/dati-regioni, which report, for each day and each Italian region from February 24th, the date, the total number of currently positive people, the number of hospitalized people, intensive care patients and other relevant quantities.

# loading packages
library(rstanarm)
library(ggplot2)
library(bayesplot)
library(tidyverse)
library(lme4)
library(arm)
library(loo)
library(kableExtra)

# daily data
regioni <- read.csv("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv")
head(regioni)
##                  data stato codice_regione denominazione_regione      lat
## 1 2020-02-24T18:00:00   ITA             13               Abruzzo 42.35122
## 2 2020-02-24T18:00:00   ITA             17            Basilicata 40.63947
## 3 2020-02-24T18:00:00   ITA              4          P.A. Bolzano 46.49933
## 4 2020-02-24T18:00:00   ITA             18              Calabria 38.90598
## 5 2020-02-24T18:00:00   ITA             15              Campania 40.83957
## 6 2020-02-24T18:00:00   ITA              8        Emilia Romagna 44.49437
##       long ricoverati_con_sintomi terapia_intensiva totale_ospedalizzati
## 1 13.39844                      0                 0                    0
## 2 15.80515                      0                 0                    0
## 3 11.35662                      0                 0                    0
## 4 16.59440                      0                 0                    0
## 5 14.25085                      0                 0                    0
## 6 11.34172                     10                 2                   12
##   isolamento_domiciliare totale_attualmente_positivi
## 1                      0                           0
## 2                      0                           0
## 3                      0                           0
## 4                      0                           0
## 5                      0                           0
## 6                      6                          18
##   nuovi_attualmente_positivi dimessi_guariti deceduti totale_casi tamponi
## 1                          0               0        0           0       5
## 2                          0               0        0           0       0
## 3                          0               0        0           0       1
## 4                          0               0        0           0       1
## 5                          0               0        0           0      10
## 6                         18               0        0          18     148
##   note_it note_en
## 1                
## 2                
## 3                
## 4                
## 5                
## 6
n_osp_giornalieri <- length(regioni$totale_ospedalizzati)

# storing variables
## number of distinct times
l_times <- length(unique(regioni$data))
times <- 1:l_times
## number of regions
n_reg <- length(unique(regioni$denominazione_regione))
n <- dim(regioni)[1]
## one-day lagged number of hospitalized (0 for the first observed day of each region)
osp_lag <- rep(0, n_reg)
for (j in (n_reg+1):n){
  osp_lag[j] <- regioni$totale_ospedalizzati[j-n_reg]
}
## recoding the dates as integer time indices 1..l_times
times <- as.factor(regioni$data)
levels(times) <- c(1:l_times)
times <- as.numeric(times)
regioni$codice_regione <- as.factor(regioni$codice_regione)

Model choice for the number of total hospitalized people

We start by investigating the relationship between the number of hospitalized people in each region and some covariates. We will select the best model in terms of both Bayesian and classical goodness-of-fit measures, based on leave-one-out cross-validation and the deviance information criterion, respectively. The Bayesian models are fitted using the Hamiltonian Monte Carlo sampling underlying the rstanarm package, which belongs to the Stan ecosystem. Unless otherwise stated, we use weakly-informative prior distributions for the regression coefficients.
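
The default weakly-informative priors chosen by rstanarm can be inspected on any fitted object, and overridden if desired through the prior arguments of the stan_glm/stan_glmer calls; a minimal sketch, using the model mod.1 fitted in the next section:

# inspect the (weakly-informative) default priors of a fitted rstanarm model
prior_summary(mod.1)

# sketch of an explicit alternative specification:
# stan_glm(resp ~ times, family = poisson, data = regioni,
#          prior = normal(0, 2.5), prior_intercept = normal(0, 5),
#          iter = n.iter)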

df <- data.frame(times = times,
        reg = unlist(list(regioni$denominazione_regione)),
        y = regioni$totale_ospedalizzati)
ggplot(df, aes(x=times, y =y))+
  geom_point(aes(x=times, y =y))+
   xlab("Days")+
    ylab("Total Hospitalized")+
    scale_x_discrete(limit = c( 8, 15, 22, 29, 36 ), 
                     labels = c( "2-3", "9-3", "16-3", 
                                 "23-3", "30-3") )+
    facet_wrap("reg", scales ="free")+
    theme(strip.text.x = element_text(size = 12, 
                                      colour = "black"),
          axis.text.x = element_text(face="bold", 
                                     color="#993333", 
                                      angle=45, size =9),
          axis.title.x = element_text(size=22),
          axis.title.y = element_text(size = 22))

Poisson GLM regression

We start by assuming a simple Poisson regression for the number of hospitalized people \(Y_i, \ i=1,\ldots,n\):

\[\begin{align} Y_i \sim & \mathsf{Poisson}(\lambda_i)\\ \lambda_i = & \exp(\alpha + \beta_1\mathsf{times}_i), \end{align}\] where \(\mathsf{times}\) is the temporal variable expressing the day on which the measurement was recorded. We may extend the linear predictor by adding the squared times:

\[\begin{align} \lambda_i = & \exp(\alpha + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i). \end{align}\]

We estimate the models with the base function glm and with the stan_glm function contained in the R package rstanarm. Then, we compare the goodness of fit of the Bayesian models with the loo function of the loo package: better models are associated with lower LOOIC values. In what follows, we use the classical model fits only as a further check: analyses, plots and predictions will be framed in a Bayesian perspective.

resp <- regioni$totale_ospedalizzati
n.iter <- 2000

### 1) glm models

  # only times

mod.1 <- stan_glm(resp ~ times, family = poisson, data = 
                    regioni, iter = n.iter)
mod.1.cl <- glm(resp ~ times, family = poisson, data = 
                  regioni)

loo.glm.1 <- loo(mod.1)$estimates[3,1]
summary.glm.1 <- summary(mod.1.cl)

  # squared times 

mod.2 <- stan_glm(resp ~ times + I (times^2), family = 
                    poisson, data = regioni, iter = n.iter)
mod.2.cl <- glm(resp ~ times + I (times^2), family = poisson, data = regioni)

loo.glm.2 <- loo(mod.2)$estimates[3,1]
summary.glm.2 <- summary(mod.2.cl)
c(loo.glm.1, loo.glm.2)
## [1] 989981.7 963142.3

Including the squared times improves the model fit. We can further extend the linear predictor by adding the number of medical swabs performed every day and the lagged number of hospitalized people:

\[\begin{align} \lambda_i = & \exp(\alpha + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i + \beta_3 \mathsf{swabs}_i+\beta_4 \mathsf{lag\_hosp}_i). \end{align}\]

  # + medical swabs

mod.3 <- stan_glm(resp ~  times + I (times^2) + tamponi, 
            family = poisson, data = regioni, iter = n.iter)
mod.3.cl <- glm(resp ~ times + I (times^2) + tamponi, 
            family = poisson, data = regioni)

loo.glm.3 <- loo(mod.3)$estimates[3,1]
summary.glm.3 <- summary(mod.3.cl)


  # medical swabs and  osp lag

mod.4 <- stan_glm(resp ~  times + I (times^2) + tamponi +  
      osp_lag, family = poisson, data = regioni, 
      iter = n.iter)
mod.4.cl <- glm(resp ~  times + I (times^2) + tamponi +  
      osp_lag, family = poisson, data = regioni)

loo.glm.4 <- loo(mod.4)$estimates[3,1]
summary.glm.4 <- summary(mod.4.cl)
c(loo.glm.3, loo.glm.4)
## [1] 411070.6 317511.2

The fourth model, including medical swabs and the lagged hospitalized, is the best among the GLMs tested here.

Poisson hierarchical regression

We move to a hierarchical modelling scenario, adopting the following model for the total number of hospitalized people in each region, \(Y_{ij}, \ i =1,\ldots,n; \ j=1,\ldots, \mathsf{n\_reg}\):

\[\begin{align} Y_{ij} \sim & \mathsf{Poisson}(\lambda_{ij})\\ \lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i)\\ \alpha_j \sim & \mathsf{Normal}(\mu_{\alpha}, \sigma^2_{\alpha}), \end{align}\] where \(\mu\) is the global intercept and \(\alpha_{j(i)}\) is the random intercept for region \(j(i)\); \(\mu_{\alpha}\) and \(\sigma^2_{\alpha}\) are the hyperparameters of the random-intercepts prior. Analogously to what we did for the previous four models, we fit distinct hierarchical models, first adding the squared times and then the medical swabs. We estimate the hierarchical models using the function glmer in the lme4 package and the function stan_glmer in the rstanarm package.

  # + squared times

mod.hier.1 <- stan_glmer(resp ~ (1|codice_regione)+
      times + I(times^2),family = poisson, data = regioni, 
      iter = n.iter)
mod.hier.1.cl <- glmer(resp ~ (1|codice_regione)+
      times + I(times^2),family = poisson, data = regioni)

loo.hier.1 <- loo(mod.hier.1)$estimates[3,1]
d1 <- display(mod.hier.1.cl)
dic.hier.1 <- as.double(d1$DIC)

  # + medical swabs

mod.hier.2 <- stan_glmer(resp ~ (1|codice_regione)+
      times + I(times^2) + tamponi, 
      data = regioni, family = poisson, 
      iter = n.iter)
mod.hier.2.cl <- glmer(resp ~  (1|codice_regione)+
      times + I(times^2) + tamponi,
      data = regioni, family = poisson)

loo.hier.2 <- loo(mod.hier.2)$estimates[3,1]
d2 <- display(mod.hier.2.cl)
dic.hier.2 <- as.double(d2$DIC)
c(loo.hier.1, loo.hier.2)
## [1] 22591.92 13794.48

As is clear, the hierarchical models dramatically improve the fit, with the latter model (including the medical swabs) being preferable to the one without them.

Furthermore, we extend our hierarchical models with three further ingredients:

  • add the \(\mathsf{lockdown}\) covariate, a non-decreasing numerical function which tries to mimic the lockdown measures adopted by the Italian Government; the function is zero before March 8th and then takes increasing values, with upper bound 1;

  • restrict the interval over which each region is observed, to account for the different starting points of Covid-19 registered in each region;

  • evaluate a possible interaction between the lockdown measures and the exposure times.

The linear predictor for the complete model is then defined as:

\[\begin{align} \lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i + \beta_3\mathsf{lockdown}_i + \\ & \beta_4 \mathsf{swabs}_i + \beta_5 \mathsf{lag\_hosp}_i + \beta_6 \mathsf{times}_i\times \mathsf{lockdown}_i +\\ & \beta_7 \mathsf{times}^2_i\times \mathsf{lockdown}_i). \end{align}\]

## data redefinition
regioni_ristr <- cbind(times, osp_lag, regioni)
regioni_tibble <- as_tibble(regioni_ristr)
regioni_ristr<- regioni_tibble %>%
  filter( (times > 3 & (denominazione_regione =="Campania"|
      denominazione_regione =="Puglia" |
      denominazione_regione =="Basilicata"|
      denominazione_regione =="Molise" |
      denominazione_regione =="Calabria")) |
  (times > 1 & denominazione_regione == "Sicilia")|
  (times > 8 & denominazione_regione == "Sardegna")|
   (times >0 & (denominazione_regione =="Piemonte"|
      denominazione_regione =="Valle d'Aosta" |
      denominazione_regione =="Abruzzo"|
      denominazione_regione =="P.A. Bolzano" |
      denominazione_regione =="Emilia Romagna"|
      denominazione_regione =="Friuli Venezia Giulia"|
          denominazione_regione =="Lazio" |
          denominazione_regione =="Liguria"|
          denominazione_regione =="Lombardia" |
          denominazione_regione =="Marche"|
          denominazione_regione =="P.A. Trento"|
      denominazione_regione =="Toscana"|
      denominazione_regione =="Umbria"|
      denominazione_regione =="Veneto")))
# lockdown variable: zero for the first 14 days (before March 8),
# then a quadratic ramp reaching the upper bound 1 after 15 days
eff <- function(xx) {
  vec <- rep(0, length(xx))
  for (i in 1:length(xx)) {
    if (xx[i] <= 15) {
      vec[i] <- xx[i]^2/225
    } else {
      vec[i] <- 1
    }
  }
  vec
}
lockdown <- c(rep(0, 14), eff(1:(l_times-14)))
resp <- regioni_ristr$totale_ospedalizzati
times <- regioni_ristr$times
lockdown_rep <- c()
for (j in 1: length(resp)){
   lockdown_rep[j] <- lockdown[times[j]]
}
n_reg <- length(unique(regioni_ristr$denominazione_regione))
times_scaled <- times - mean(times)

  # + dummy lockdown; distinct starting points; scaled times

mod.hier.3 <- stan_glmer( resp ~  (1|codice_regione)+ 
          times_scaled + I(times_scaled^2) + lockdown_rep, 
              data = regioni_ristr, family = poisson, 
              iter = n.iter)
mod.hier.3.cl <- glmer( resp ~  (1|codice_regione)+ 
            times_scaled + I(times_scaled^2) + lockdown_rep, 
              data = regioni_ristr, family = poisson)
loo.hier.3 <- loo(mod.hier.3)$estimates[3,1]
d3 <- display(mod.hier.3.cl)
dic.hier.3 <- as.double(d3$DIC)

 # + swabs

mod.hier.4 <- stan_glmer( resp ~  (1|codice_regione)+  
    times_scaled + I(times_scaled^2) +
    tamponi + lockdown_rep, 
  data = regioni_ristr, family = poisson, iter = n.iter)
mod.hier.4.cl <- glmer( resp ~  (1|codice_regione)+ 
    times_scaled + I(times_scaled^2) +
    tamponi + lockdown_rep, 
  data = regioni_ristr, family = poisson)

loo.hier.4 <- loo(mod.hier.4)$estimates[3,1]
d4 <- display(mod.hier.4.cl)
dic.hier.4 <- as.double(d4$DIC)

  # + interaction between lockdown and time

mod.hier.4.int <- stan_glmer( resp ~ (1|codice_regione) +
       lockdown_rep*times_scaled + 
        I(times_scaled^2) + tamponi,
    data = regioni_ristr, 
    family = poisson,  iter = n.iter)
mod.hier.4.cl.int <- glmer( resp ~  (1|codice_regione)+ 
       lockdown_rep*times_scaled +
      I(times_scaled^2) + tamponi , data = regioni_ristr, 
    family = poisson)

loo.hier.4.int <- loo(mod.hier.4.int)$estimates[3,1]
d4.int <- display(mod.hier.4.cl.int)
dic.hier.4.int <- as.double(d4.int$DIC)

  # + double interaction between lockdown and time

mod.hier.4.int.2 <- stan_glmer( resp ~ (1|codice_regione) +
       lockdown_rep*times_scaled +
       lockdown_rep*I(times_scaled^2) + tamponi,
       data = regioni_ristr,
       family = poisson,  iter = n.iter)
mod.hier.4.cl.int.2 <- glmer( resp ~  (1|codice_regione)+
       lockdown_rep*times_scaled +
       lockdown_rep*I(times_scaled^2) + tamponi ,
       data = regioni_ristr,
       family = poisson)

loo.hier.4.int.2 <- loo(mod.hier.4.int.2)$estimates[3,1]
d4.int.2 <- display(mod.hier.4.cl.int.2)
dic.hier.4.int.2 <- as.double(d4.int.2$DIC)

  # + lag hospitalized

mod.hier.5 <- stan_glmer( resp ~  (1|codice_regione)+
      times_scaled + I(times_scaled^2) + 
      tamponi + lockdown_rep +
      osp_lag, data = regioni_ristr, family = poisson, 
    iter = n.iter)
mod.hier.5.cl <- glmer( resp ~  (1|codice_regione)+  
      times_scaled + I(times_scaled^2) + 
      tamponi + lockdown_rep +                   
      osp_lag, data = regioni_ristr, family = poisson)

loo.hier.5 <- loo(mod.hier.5)$estimates[3,1]
d5 <- display(mod.hier.5.cl)
dic.hier.5 <- as.double(d5$DIC)

  # + interaction between lockdown and time

mod.hier.5.int <- stan_glmer( resp ~  (1|codice_regione)+ 
      lockdown_rep*times_scaled + 
      I(times_scaled^2) + tamponi + 
      osp_lag, data = regioni_ristr, 
      family = poisson,  iter = n.iter)
mod.hier.5.cl.int <- glmer( resp ~  (1|codice_regione)+ 
      lockdown_rep*times_scaled + 
      I(times_scaled^2) + tamponi  + osp_lag, 
      data = regioni_ristr, 
      family = poisson)

loo.hier.5.int <- loo(mod.hier.5.int)$estimates[3,1]
d5.int <- display(mod.hier.5.cl.int)
dic.hier.5.int <- as.double(d5.int$DIC)

  # + double interaction between lockdown and time

mod.hier.5.int.2 <- stan_glmer( resp ~  (1|codice_regione)+
      lockdown_rep*times_scaled +
      lockdown_rep*I(times_scaled^2) + tamponi +
      osp_lag, data = regioni_ristr,
      family = poisson,  iter = n.iter)
mod.hier.5.cl.int.2 <- glmer( resp ~  (1|codice_regione)+
      lockdown_rep*times_scaled +
      lockdown_rep*I(times_scaled^2) + tamponi  + osp_lag,
      data = regioni_ristr,
      family = poisson)

loo.hier.5.int.2 <- loo(mod.hier.5.int.2)$estimates[3,1]
d5.int.2 <- display(mod.hier.5.cl.int.2)
dic.hier.5.int.2 <- as.double(d5.int.2$DIC)
c(loo.hier.3, loo.hier.4, loo.hier.4.int,
  loo.hier.4.int.2, loo.hier.5, loo.hier.5.int,
  loo.hier.5.int.2)
## [1] 22345.19 13590.99 13458.34 13485.01 11870.41 11838.03 11860.90
c(dic.hier.3, dic.hier.4, dic.hier.4.int, dic.hier.4.int.2, 
  dic.hier.5, dic.hier.5.int, dic.hier.5.int.2)
## [1] 13129.485  4652.788  4494.606  4493.260  3030.830  2978.791  2977.587

The model with medical swabs and lagged hospitalized as covariates, and with an interaction between lockdown and time, is the best in terms of goodness-of-fit measures. In general, the interaction between lockdown and time improves the fit; however, the interaction between lockdown and squared times does not yield better predictive performance.

Model comparisons in terms of predictive information criteria

Let’s take a closer look at the model comparisons (LOOIC and DIC) for the hierarchical models:

#plots for looic and dic
loo.v <- c(loo.hier.1, loo.hier.2, loo.hier.3,
           loo.hier.4, loo.hier.4.int, loo.hier.4.int.2, 
           loo.hier.5, loo.hier.5.int, loo.hier.5.int.2)
dic.v <- c(dic.hier.1, dic.hier.2, dic.hier.3,
            dic.hier.4, dic.hier.4.int, dic.hier.4.int.2, 
           dic.hier.5, dic.hier.5.int, dic.hier.5.int.2)
sort.loo.v <- sort.int(loo.v, index.return = TRUE)$x
sort.dic.v <- sort.int(dic.v, index.return = TRUE)$x


par(xaxt="n", mfrow=c(1,2))
plot(sort.loo.v, type="b", xlab="", ylab="LOOIC")
par(xaxt="s")
axis(1, c(1:9), c("hier1", "hier2", "hier3", 
                  "hier4","hier4int", "hier4int2",  "hier5",
                  "hier5int", "hier5int2")[sort.int(loo.v,
                    index.return = TRUE)$ix],
                    las=2)
par(xaxt="n")
plot(sort.dic.v, type="b", xlab="", ylab="DIC")
par(xaxt="s")
axis(1, c(1:9), c("hier1", "hier2", "hier3", 
                  "hier4","hier4int", "hier4int2",  "hier5",
                  "hier5int", "hier5int2")[sort.int(dic.v,
                    index.return = TRUE)$ix], las=2)

The complete model (without the interaction between lockdown and squared times) seems to be the best model in terms of predictive information criteria.

Model estimates

Let’s retrieve some posterior estimates for the best model, the hierarchical Poisson model with five predictors and interaction between lockdown and time:

# labelling
alpha_names <- paste0("alpha[", n_reg:1, "]")
beta_names <- paste0("beta[", 5:1, "]")
new_names <- c(expression(sigma[alpha]), alpha_names, beta_names, expression(mu))
posterior <- as.array(mod.hier.5.int)
mcmc_intervals(posterior)+
  xaxis_text(on =TRUE, size=rel(1.9))+
  yaxis_text(on =TRUE, size=rel(1.9))+
  scale_y_discrete(labels = rev((parse(text = new_names))))+
  ggtitle("Parameter estimation")+
  theme(plot.title = element_text(hjust = 0.5, size =rel(2)))

There is a lot of between-region variability (look at the parameter \(\sigma_{\alpha}\)), whereas \(\beta_1,\ldots,\beta_5\) do not appear to be very different from zero.
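
To complement the plot with numbers, the posterior credible intervals can also be extracted directly; a small sketch:

# numeric 95% posterior intervals for all parameters of the selected model
round(posterior_interval(mod.hier.5.int, prob = 0.95), 3)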

We can check our model via posterior predictive checks: for instance, by comparing the empirical distribution function of the observed data with the empirical distribution functions of the replications under the model, or by comparing statistics such as the mean and the standard deviation between replicated and observed data.

# ecdf
color_scheme_set("blue")
ppc_ecdf_overlay(y = resp,
    yrep = posterior_predict(mod.hier.5.int, 
    draws = 200))

# stats
ppc_stat(y = resp,
         yrep = posterior_predict(mod.hier.5.int, 
         draws = 200), stat = "mean")

ppc_stat(y = resp,
         yrep = posterior_predict(mod.hier.5.int, 
         draws = 200), stat = "sd")

There could be some issues with the number of zeros; a quick check is sketched below, but the issue is not further addressed here.
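
One way to inspect this is another posterior predictive check, this time on the proportion of zeros; a sketch:

# sketch: compare the proportion of zeros in the observed data
# with the proportions in the posterior replications
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = resp,
         yrep = posterior_predict(mod.hier.5.int, draws = 200),
         stat = "prop_zero")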

Graphical analysis and predictions

We are going to use our models to make predictions for the next days. One difficulty regards the simulation of the out-of-sample covariates, such as the medical swabs or the lagged hospitalized. Although the best fit is obtained by the model with the lagged hospitalized covariate, mod.hier.5.int, we choose mod.hier.4.int because its covariates are simpler to handle.

Still, we need to sample new medical swabs for each region in the out-of-sample days, and this is dramatically difficult, because the swab policies are not clearly known and the number of swabs is a biased measure for sampling the entire population.

For this reason, for each region we fit a local polynomial surface determined only by the time and squared time covariates:

# data manipulation for out-of-sample
mod <- mod.hier.4.int
data <- regioni_ristr
prev <- 4
times <- as.factor(data$data)
levels(times) <- c(1:l_times)
times <- as.numeric(times)
new_times <- c((l_times):(l_times+prev))
pred <- posterior_predict(mod, type ="response")
starting <- rep(l_times, n_reg)
  for (j in 1:prev){
    times_out <- c(starting,
                   rep(l_times+j, n_reg ))
    starting <- times_out
  }
  
times_tot <- c(times, times_out)
centered_times_tot <- times_tot-mean(times)
stringa_regioni <- data$denominazione_regione[502:522]
stringa_codici <- data$codice_regione[502:522]
tamponi <- regioni_ristr$tamponi
new_tamponi <- matrix(NA, n_reg, prev+1)
se_new <- matrix(NA, n_reg, prev+1)
  
#loess prediction for the swabs
  par(mfrow=c(2,3))
  for (j in 1:n_reg){
    obj <- loess(tamponi[denominazione_regione==
              stringa_regioni[j]] ~
      times[denominazione_regione==stringa_regioni[j]]+
      I(times[denominazione_regione==stringa_regioni[j]]^2),
        data = regioni_ristr, 
        control = loess.control(surface = "direct"))
    new_tamponi[j, 1] <- obj$fitted[length(obj$fitted)]
    
    for (t in (new_times-1)){
      fit.loess <- predict(obj, 
                newdata = data.frame(times = t+1,           
             denominazione_regione = stringa_regioni[j]), 
             se = TRUE)
      new_tamponi[j, t-l_times+2] <- fit.loess$fit
      se_new[j, 1] <- predict(obj, 
              newdata = data.frame(times = l_times,         
              denominazione_regione = stringa_regioni[j]), 
              se = TRUE)$se
      se_new[j,t-l_times+2] <- fit.loess$se
    }
    plot(c(times[regioni_ristr$denominazione_regione==
      stringa_regioni[j]], new_times[2]:(l_times + prev)),  
      c(tamponi[regioni_ristr$denominazione_regione==  
      stringa_regioni[j]], rep(NA, length(2: (prev+1)))),
                   xlab = "Days", ylab ="swabs", main =
        stringa_regioni[j], ylim=c(0, 
      new_tamponi[j, prev+1] +0.1*new_tamponi[j, prev+1]))
    points(new_times[1]:(l_times + prev), 
           new_tamponi[j, 1: (prev+1)], col ="red", lwd=2 )
  }




The red points represent the local polynomial predictions for the swabs: again, these predictions rely on a very simplified assumption!

Now we can try to predict the number of total hospitalized for the 21 regions in the next 4 days (summary tables are reported at the end of the document). Red ribbons represent 95% predictive intervals for out-of-sample predictions, whereas blue ribbons represent 95% predictive intervals for in-sample data.

# data for predictions
newdata <- data.frame(times_scaled = 
               centered_times_tot[(length(times)+1):(length(times)+length(times_out))],
             denominazione_regione = rep(stringa_regioni, prev+1),
             codice_regione = rep(stringa_codici, prev+1),
             lockdown_rep = rep(1, n_reg*(prev+1)),
             tamponi = as.vector(new_tamponi))
  
regioni_plot <- function(mod, data, prev, newdata, 
                           new_points){
  
  pred.out <- posterior_predict(mod,
                                newdata = newdata ,
                                type = "response")
  
  pred_0025 <- apply(pred, 2, function(x) quantile(x,0.025))
  pred_025 <- apply(pred, 2, function(x) quantile(x,0.25))
  pred_med<- apply(pred, 2, function(x) quantile(x,0.5))
  pred_075<- apply(pred, 2, function(x) quantile(x,0.75))
  pred_0975<- apply(pred, 2, function(x) quantile(x,0.975))
  
  pred_0025_out <- apply(pred.out, 2, function(x) quantile(x,0.025))
  pred_025_out <- apply(pred.out, 2, function(x) quantile(x,0.25))
  pred_med_out <- apply(pred.out, 2, function(x) quantile(x,0.5))
  pred_075_out <- apply(pred.out, 2, function(x) quantile(x,0.75))
  pred_0975_out <- apply(pred.out, 2, function(x) quantile(x,0.975))
  
  if (missing(new_points)){
    new_points <- rep(NA, length(new_times)*(n_reg))
  }else{
    new_points <- c(rep(NA, n_reg), new_points)
  }
  
  df.regioni <- data.frame(
    y = c(resp, new_points),
    times = c(times, times_out), 
    med = c(pred_med, rep(NA, length(new_times)*n_reg)),
    lo = c(pred_0025, rep(NA, length(new_times)*n_reg)),
    lo.2 = c(pred_025, rep(NA, length(new_times)*n_reg)),
    hi= c(pred_0975, rep(NA, length(new_times)*n_reg)), 
    hi.2 = c(pred_075, rep(NA, length(new_times)*n_reg)),
    reg = unlist(list(data$denominazione_regione, rep(stringa_regioni, prev+1))),
    med.out = c(rep(NA, length(times)), pred_med_out),
    lo.out = c(rep(NA, length(times)), pred_0025_out),
    lo.2.out = c(rep(NA, length(times)), pred_025_out),
    hi.out= c(rep(NA, length(times)),
              pred_0975_out),
    hi.2.out = c(rep(NA, length(times)),pred_075_out)
  )
  
  if (prev <=7){
table_reg <- matrix(NA, n_reg, prev)
str_days <- Sys.Date()-1
  # c("30 mar", "31 mar",
  # "1 apr", "2 apr", "3 apr", "4 apr",
  # "5 apr", "6 apr")
options(knitr.table.format = "html")
for (j in 1:prev){
  str_days <- c(str_days, Sys.Date()+j-1)
  table_reg[,j] <-
  paste(round(pred_med_out[(n_reg*j+1):(n_reg*(j+1))]),
        " (",
  round(pred_0025_out[(n_reg*j+1):(n_reg*(j+1))]), ", ",
  round(pred_0975_out[(n_reg*j+1):(n_reg*(j+1))]), ")",
      sep="" )
}
str_days <- str_days[2:(prev+1)]
dimnames(table_reg) <- list(stringa_regioni,
                            format(str_days, 
                                   format = "%d-%b-%Y"))

ospt <- write.csv(table_reg, paste("osp", Sys.Date(),
                                   ".csv", sep=""))
}
  
  
  ggplot(df.regioni, aes(times, y))+
    geom_ribbon(aes(x=times, ymin=lo, ymax=hi, group=1),
                data=df.regioni,
                fill = color_scheme_get("blue")[[2]])+
    geom_ribbon(aes(x=times, ymin=lo.out, ymax=hi.out, group=1),
                data=df.regioni,
                fill = color_scheme_get("red")[[2]])+
    geom_line(aes(x= times, y= med),
              data=df.regioni,
              color = color_scheme_get("blue")[[4]],
              size =1.1)+
    geom_line(aes(x= times, y= med.out),
              data=df.regioni,
              color = color_scheme_get("red")[[4]],
              size =1.1)+
    geom_point(aes(x = times, y = y))+
    xlab("Days")+
    ylab("Total Hospitalized")+
    scale_x_discrete(limit = c( 8, 15, 22, 29, 36 ), 
        labels = c( "2-3", "9-3", "16-3", "23-3", "30-3") )+
    facet_wrap("reg", scales ="free")+
    theme(strip.text.x = element_text(size = 12, 
          colour = "black"),
          axis.text.x = element_text(face="bold", 
          color="#993333", 
          angle=45, size =9),
          axis.title.x = element_text(size=22),
          axis.title.y = element_text(size = 22))
  
  # return(list(pred_0025_out = pred_0025_out,
  #             pred_med_out = pred_med_out,
  #             pred_0975_out = pred_0975_out))
}

# 4-days predictions
regioni_plot(mod, data, prev, newdata)

We can also focus on a few regions, say Friuli Venezia Giulia and Lombardia.



And what happens in one month, according to the model? Here are the predicted swabs (solid red lines are the estimates, dashed red lines represent the estimates \(\pm 2\) standard errors):





And here are the total hospitalized at 30 days (red ribbons):

# data for predictions (prev, new_times, times_out and new_tamponi are assumed
# to have been recomputed for the 30-day horizon, analogously to the 4-day case)
newdata <- data.frame(times_scaled = 
               centered_times_tot[(length(times)+1):(length(times)+length(times_out))],
             denominazione_regione = rep(stringa_regioni, prev+1),
             codice_regione = rep(stringa_codici, prev+1),
             lockdown_rep = rep(1, n_reg*(prev+1)),
             tamponi = as.vector(new_tamponi))
             
regioni_plot(mod, data, prev, newdata)

Models for intensive care units (ICU)

After a similar analysis, we selected a final model for the total number of ICU patients, \(\mathsf{icu}\), as well. We can make predictions for each region, similarly to the previous framework:

\[\begin{align} \mathsf{icu}_{ij} \sim & \mathsf{Poisson}(\lambda_{ij})\\ \lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i + \beta_3\mathsf{lockdown}_i + \beta_4 \mathsf{times}_i\times \mathsf{lockdown}_i)\\ \alpha_j \sim & \mathsf{Normal}(\mu_{\alpha}, \sigma^2_{\alpha}), \end{align}\]

icu <- regioni_ristr$terapia_intensiva
mod.hier.icu.2.int <- stan_glmer( icu ~  
      lockdown_rep*times_scaled+ 
      (1|codice_regione)+ I(times_scaled^2),
      data = regioni_ristr,
      family = poisson, iter = n.iter)
print(mod.hier.icu.2.int)

We retrieve the posterior parameter estimates, and then we attempt 4-day predictions:

# labelling
alpha_names <- paste0("alpha[", n_reg:1, "]")
beta_names <- paste0("beta[", 4:1, "]")
new_names <- c(expression(sigma[alpha]), alpha_names, beta_names, expression(mu))
posterior2 <- as.array(mod.hier.icu.2.int)
mcmc_intervals(posterior2)+
  xaxis_text(on =TRUE, size=rel(1.9))+
  yaxis_text(on =TRUE, size=rel(1.9))+
  scale_y_discrete(labels = rev((parse(text = new_names))))+
  ggtitle("Parameter estimation")+
  theme(plot.title = element_text(hjust = 0.5, size =rel(2)))
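
The 4-day ICU predictions reported in the tables below can be obtained exactly as before, via the posterior predictive distribution; a sketch, assuming newdata has been rebuilt for a 4-day horizon as in the hospitalized case (the swab and lag columns it contains are simply ignored by this model):

# sketch: out-of-sample ICU predictions from the posterior predictive
pred.icu.out <- posterior_predict(mod.hier.icu.2.int, newdata = newdata)
icu_med  <- apply(pred.icu.out, 2, median)
icu_low  <- apply(pred.icu.out, 2, function(x) quantile(x, 0.025))
icu_high <- apply(pred.icu.out, 2, function(x) quantile(x, 0.975))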


Tables for ICU and hospitalized

Here are the tables of the predicted total hospitalized and ICU patients for the next 4 days. Point predictions are expressed as medians, whereas the lower and upper bounds of the 95\(\%\) credible intervals are reported in parentheses:

Predicted number of tot. hospitalized (in parentheses, the lower and the upper bound of the 95% credible intervals)
Region 31-mar-2020 01-apr-2020 02-apr-2020 03-apr-2020
Abruzzo 468 (423, 514) 484 (438, 531) 497 (451, 544) 504 (457, 554)
Basilicata 45 (32, 60) 47 (34, 62) 49 (35, 64) 50 (36, 66)
P.A. Bolzano 396 (358, 438) 410 (371, 450) 420 (379, 461) 427 (384, 470)
Calabria 168 (141, 196) 173 (147, 201) 177 (151, 206) 179 (151, 209)
Campania 637 (585, 688) 654 (600, 706) 664 (611, 720) 669 (614, 726)
Emilia Romagna 4088 (3954, 4225) 4025 (3886, 4166) 3923 (3781, 4060) 3780 (3638, 3924)
Friuli Venezia Giulia 366 (327, 405) 376 (336, 417) 383 (343, 426) 387 (346, 428)
Lazio 1168 (1100, 1242) 1184 (1114, 1257) 1188 (1118, 1264) 1180 (1104, 1256)
Liguria 1496 (1414, 1580) 1549 (1466, 1635) 1591 (1506, 1680) 1618 (1529, 1705)
Lombardia 11793 (11535, 12044) 11286 (11020, 11545) 10681 (10413, 10953) 9996 (9716, 10273)
Marche 1820 (1729, 1912) 1887 (1795, 1980) 1936 (1840, 2032) 1972 (1875, 2070)
Molise 53 (39, 69) 56 (41, 72) 58 (43, 75) 60 (45, 76)
Piemonte 4017 (3884, 4151) 4087 (3945, 4223) 4115 (3966, 4259) 4107 (3957, 4258)
Puglia 626 (575, 679) 646 (593, 699) 660 (608, 712) 667 (612, 720)
Sardegna 151 (126, 176) 156 (131, 184) 162 (136, 187) 165 (137, 193)
Sicilia 493 (449, 538) 506 (460, 552) 513 (465, 559) 514 (467, 563)
Toscana 1453 (1375, 1533) 1463 (1384, 1543) 1457 (1377, 1539) 1435 (1354, 1516)
P.A. Trento 421 (382, 462) 439 (398, 482) 451 (409, 497) 460 (415, 505)
Umbria 232 (202, 265) 241 (209, 274) 246 (214, 280) 249 (218, 286)
Valle d’Aosta 140 (116, 165) 146 (121, 171) 151 (126, 177) 155 (131, 183)
Veneto 1505 (1425, 1587) 1439 (1360, 1519) 1361 (1283, 1439) 1271 (1195, 1348)
Predicted number of ICU (in parentheses, the lower and the upper bound of the 95% credible intervals)
Region 31-mar-2020 01-apr-2020 02-apr-2020 03-apr-2020
Abruzzo 60 (45, 76) 59 (44, 75) 57 (42, 73) 55 (40, 71)
Basilicata 11 (5, 18) 11 (5, 17) 10 (5, 18) 10 (4, 17)
P.A. Bolzano 44 (32, 58) 43 (31, 57) 42 (29, 56) 40 (28, 54)
Calabria 18 (10, 28) 18 (10, 28) 18 (10, 27) 17 (9, 26)
Campania 98 (78, 119) 97 (78, 118) 94 (75, 114) 91 (72, 111)
Emilia Romagna 360 (322, 400) 354 (315, 396) 344 (305, 386) 331 (292, 373)
Friuli Venezia Giulia 47 (34, 62) 46 (33, 60) 45 (32, 58) 43 (30, 57)
Lazio 99 (79, 121) 98 (79, 119) 95 (76, 115) 91 (72, 112)
Liguria 155 (129, 182) 152 (127, 178) 148 (124, 175) 143 (119, 169)
Lombardia 1569 (1480, 1660) 1539 (1445, 1633) 1498 (1400, 1596) 1443 (1342, 1547)
Marche 190 (163, 221) 187 (159, 216) 182 (154, 210) 175 (148, 204)
Molise 9 (3, 15) 9 (3, 15) 8 (3, 15) 8 (3, 14)
Piemonte 396 (354, 437) 388 (348, 432) 378 (337, 421) 364 (322, 408)
Puglia 57 (42, 72) 56 (41, 72) 54 (40, 70) 52 (38, 68)
Sardegna 15 (8, 24) 15 (8, 23) 15 (8, 23) 14 (7, 23)
Sicilia 57 (43, 73) 56 (42, 72) 55 (40, 71) 53 (39, 68)
Toscana 251 (219, 284) 246 (213, 280) 239 (208, 273) 231 (199, 264)
P.A. Trento 44 (31, 58) 43 (31, 57) 42 (30, 56) 40 (29, 54)
Umbria 39 (27, 52) 38 (27, 51) 37 (25, 50) 36 (25, 49)
Valle d’Aosta 17 (10, 26) 17 (9, 26) 17 (9, 26) 16 (9, 25)
Veneto 332 (297, 372) 327 (289, 365) 317 (280, 356) 306 (267, 346)