Introduction
This is not a game: predicting the number of deaths or related quantities due to Covid-19 should not be attempted without considering serious ethical concerns. We do not have the ambition to tell you how the virus will spread in the future; we just want to try to understand the trend across the Italian regions. Take this analysis critically. We agree with the statistician George Box, who said: ‘All models are wrong, but some are useful’. We hope some of ours turn out to be useful.
We do not want to play with people’s lives.
Main aims
The main aims of this analysis are the following:
study the trend of both the total number of hospitalized people and the intensive care unit (ICU) patients in each region;
provide 3-4 day-ahead predictions, immediately available to the regional health care systems to ease patients’ allocation;
evaluate the effect of the following factors/covariates on the spread of the disease: time, the lockdown measures adopted by the Italian Government (see https://en.wikipedia.org/wiki/2020_Italy_coronavirus_lockdown for more details), the number of medical swabs, and regional membership;
evaluate the appropriateness of the policies adopted by the Italian Government to mitigate the spread of the disease.
Methodology
We develop Bayesian hierarchical Poisson regression models for the total number of hospitalized people and for the ICU patients, including the covariates listed above and considering possible interactions between the lockdown measures and the exposition times. Model comparison is carried out via predictive information criteria based on leave-one-out cross-validation.
Out-of-sample predictions (95\(\%\) credible intervals) for the next days are obtained by simulating from the posterior predictive distribution. Predictions for the future medical swabs are generated through local polynomial regression (LOESS).
Moreover, we consider a distinct starting point of exposition for each region, to acknowledge the different intensities registered in each region:
February 24th: Piemonte, Valle d’Aosta, Abruzzo, P.A. Bolzano, Emilia Romagna, Friuli Venezia Giulia, Lazio, Liguria, Lombardia, Marche, P.A. Trento, Toscana, Umbria, Veneto;
February 25th: Sicilia;
February 27th: Campania, Puglia, Basilicata, Molise, Calabria;
March 3rd: Sardegna.
The models are fitted with the Stan ecosystem software (see https://mc-stan.org/), which relies on Hamiltonian Monte Carlo sampling, an efficient extension of Markov chain Monte Carlo methods. We used the statistical software R 3.6.3.
This is a dynamic modelling framework: with the same code, we can refit the models every day and update the predictions for the upcoming days.
Limitations of this work
As mentioned above, we feel that the numbers of hospitalized people and ICU patients are the most effective and reliable data at the moment, since they are likely less associated with the number of medical swabs than the number of new positive cases is. Moreover, they can be fruitfully used to plan patients’ allocation according to the capacity of the health care systems. However, they offer a systematically lagged snapshot of the pandemic trend compared with the number of new cases.
The swab policies are neither open-access nor clearly defined across the different regions. Thus, relying on the swab covariates for the out-of-sample days may yield very unstable predictions. For this reason, we invite the readers to carefully analyze and criticize the assumption of exponentially increasing swabs.
The Poisson distribution may not account for possible under/over-dispersion in the national data. Moreover, a Poisson regression model depending only on the time covariate implies a monotonic trend, which may be realistic only for the first period of the outbreak.
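As a reference, a minimal hedged sketch of how over-dispersion could be accommodated is to swap the Poisson family for a negative binomial one in the hierarchical fits introduced below; the variable names refer to objects built later in the document, and this model is not part of our analysis:
# hedged sketch, not run here: negative binomial alternative to the Poisson
# hierarchical model (resp, regioni_ristr, etc. are defined in later sections)
mod.hier.nb <- stan_glmer(resp ~ (1|codice_regione) + times_scaled +
                            I(times_scaled^2) + tamponi + lockdown_rep,
                          data = regioni_ristr, family = neg_binomial_2(),
                          iter = n.iter)
loo(mod.hier.nb)  # directly comparable with the Poisson fits via LOOIC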
In this first step, we do not account for the regions’ populations. However, we feel this is an important feature, and we plan to include the populations as further covariates.
Data
We analyze the data freely available at https://github.com/pcm-dpc/COVID-19/tree/master/dati-regioni, where the date, the total number of positive people, hospitalized people, intensive care patients and other relevant quantities are provided for each day and each Italian region from February 24th.
# loading packages
library(rstanarm)
library(ggplot2)
library(bayesplot)
library(tidyverse)
library(lme4)
library(arm)
library(loo)
library(kableExtra)
# daily data
regioni <- read.csv("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv")
head(regioni)
## data stato codice_regione denominazione_regione lat
## 1 2020-02-24T18:00:00 ITA 13 Abruzzo 42.35122
## 2 2020-02-24T18:00:00 ITA 17 Basilicata 40.63947
## 3 2020-02-24T18:00:00 ITA 4 P.A. Bolzano 46.49933
## 4 2020-02-24T18:00:00 ITA 18 Calabria 38.90598
## 5 2020-02-24T18:00:00 ITA 15 Campania 40.83957
## 6 2020-02-24T18:00:00 ITA 8 Emilia Romagna 44.49437
## long ricoverati_con_sintomi terapia_intensiva totale_ospedalizzati
## 1 13.39844 0 0 0
## 2 15.80515 0 0 0
## 3 11.35662 0 0 0
## 4 16.59440 0 0 0
## 5 14.25085 0 0 0
## 6 11.34172 10 2 12
## isolamento_domiciliare totale_attualmente_positivi
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 6 18
## nuovi_attualmente_positivi dimessi_guariti deceduti totale_casi tamponi
## 1 0 0 0 0 5
## 2 0 0 0 0 0
## 3 0 0 0 0 1
## 4 0 0 0 0 1
## 5 0 0 0 0 10
## 6 18 0 0 18 148
## note_it note_en
## 1
## 2
## 3
## 4
## 5
## 6
n_osp_giornalieri <- length(regioni$totale_ospedalizzati)
# storing variables
## number of distinct times
l_times <- length(unique(regioni$data))
times <- 1:l_times
## number of regions
n_reg <- length(unique(regioni$denominazione_regione))
n <- dim(regioni)[1]
## lagged hospitalized (previous day's value for each region, 0 on the first day)
osp_lag <- rep(0, n_reg)
for (j in (n_reg+1):n){
  osp_lag[j] <- regioni$totale_ospedalizzati[j-n_reg]
}
## recoding the distinct dates as integer times 1, ..., l_times
times <- as.factor(regioni$data)
levels(times) <- c(1:l_times)
times <- as.numeric(times)
regioni$codice_regione <- as.factor(regioni$codice_regione)
Model choice for the total number of hospitalized people
We start by investigating the relationship between the number of hospitalized people in each region and some covariates. We will select the best model in terms of both Bayesian and classical goodness-of-fit measures, based on leave-one-out cross-validation and the deviance information criterion, respectively. The Bayesian models are fitted via Hamiltonian Monte Carlo sampling through the rstanarm package, which belongs to the Stan ecosystem. Unless otherwise stated, we use weakly-informative prior distributions for the regression coefficients.
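As an aside (a hedged illustration, not part of the model selection below), rstanarm’s default priors can be inspected with prior_summary() and overridden through the prior arguments; the toy model and its settings here are ours, not used elsewhere in the analysis:
# illustrative only: a toy Poisson GLM with explicit weakly-informative priors
toy.fit <- stan_glm(totale_ospedalizzati ~ tamponi, family = poisson,
                    data = regioni, iter = 500,
                    prior = normal(0, 2.5, autoscale = TRUE),
                    prior_intercept = normal(0, 5))
prior_summary(toy.fit)  # shows the priors actually used in the fit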
df <- data.frame(times = times,
reg = unlist(list(regioni$denominazione_regione)),
y = regioni$totale_ospedalizzati)
ggplot(df, aes(x=times, y =y))+
geom_point(aes(x=times, y =y))+
xlab("Days")+
ylab("Total Hospitalized")+
scale_x_discrete(limit = c( 8, 15, 22, 29, 36 ),
labels = c( "2-3", "9-3", "16-3",
"23-3", "30-3") )+
facet_wrap("reg", scales ="free")+
theme(strip.text.x = element_text(size = 12,
colour = "black"),
axis.text.x = element_text(face="bold",
color="#993333",
angle=45, size =9),
axis.title.x = element_text(size=22),
axis.title.y = element_text(size = 22))

Poisson GLM regression
We start by assuming a simple Poisson regression for the number of hospitalized people \(Y_i, \ i=1,\ldots,n\):
\[\begin{align}
Y_i \sim & \mathsf{Poisson}(\lambda_i)\\
\lambda_i = & \exp(\alpha + \beta_1\mathsf{times}_i),
\end{align}\] where \(\mathsf{times}\) is the temporal variable expressing the day when the measurement was recorded. We may extend the linear predictor by adding the squared times:
\[\begin{align}
\lambda_i = & \exp(\alpha + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i).
\end{align}\]
We estimate the models with the base R function glm, and with the stan_glm function contained in the R package rstanarm. Then, we compare the goodness of fit of the Bayesian models using the loo function of the loo package: better models are associated with lower LOOIC values. In what follows, we use the classical model fits only as a further check: analyses, plots and predictions will be framed in a Bayesian perspective.
resp <- regioni$totale_ospedalizzati
n.iter <- 2000
### 1) GLM models
# only times
mod.1 <- stan_glm(resp ~ times, family = poisson, data =
regioni, iter = n.iter)
mod.1.cl <- glm(resp ~ times, family = poisson, data =
regioni)
loo.glm.1 <- loo(mod.1)$estimates[3,1]
summary.glm.1 <- summary(mod.1.cl)
# squared times
mod.2 <- stan_glm(resp ~ times + I (times^2), family =
poisson, data = regioni, iter = n.iter)
mod.2.cl <- glm(resp ~ times + I (times^2), family = poisson, data = regioni)
loo.glm.2 <- loo(mod.2)$estimates[3,1]
summary.glm.2 <- summary(mod.2.cl)
c(loo.glm.1, loo.glm.2)
## [1] 989981.7 963142.3
Including the squared times improves the model fit. We can further extend the linear predictor by adding the number of medical swabs performed every day and the lagged number of hospitalized people:
\[\begin{align}
\lambda_i = & \exp(\alpha + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i + \beta_3 \mathsf{swabs}_i+\beta_4 \mathsf{lag\_hosp}_i).
\end{align}\]
# + medical swabs
mod.3 <- stan_glm(resp ~ times + I (times^2) + tamponi,
family = poisson, data = regioni, iter = n.iter)
mod.3.cl <- glm(resp ~ times + I (times^2) + tamponi,
family = poisson, data = regioni)
loo.glm.3 <- loo(mod.3)$estimates[3,1]
summary.glm.3 <- summary(mod.3.cl)
# medical swabs and osp lag
mod.4 <- stan_glm(resp ~ times + I (times^2) + tamponi +
osp_lag, family = poisson, data = regioni,
iter = n.iter)
mod.4.cl <- glm(resp ~ times + I (times^2) + tamponi +
osp_lag, family = poisson, data = regioni)
loo.glm.4 <- loo(mod.4)$estimates[3,1]
summary.glm.4 <- summary(mod.4.cl)
c(loo.glm.3, loo.glm.4)
## [1] 411070.6 317511.2
The fourth model, including medical swabs and lagged hospitalizations, is the best among the GLMs tested here.
Poisson hierarchical regression
We move to a hierarchical modeling scenario, adopting the following model for the total number of hospitalized people in each region, \(Y_{ij}, \ i =1,\ldots,n; \ j=1,\ldots, \mathsf{n\_reg}\):
\[\begin{align}
Y_{ij} \sim & \mathsf{Poisson}(\lambda_{ij})\\
\lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i)\\
\alpha_j \sim & \mathsf{Normal}(\mu_{\alpha}, \sigma^2_{\alpha}),
\end{align}\] where \(\mu\) is the global intercept and \(\alpha_{j(i)}\) is the random intercept for region \(j(i)\); \(\mu_{\alpha}\) and \(\sigma^2_{\alpha}\) are the hyperparameters of the random-intercepts prior. Analogously to what we did for the previous four models, we fit distinct hierarchical models, first adding the squared times and then the medical swabs. We estimate the hierarchical models using the glmer function in the lme4 package and the stan_glmer function in the rstanarm package.
# + squared times
mod.hier.1 <- stan_glmer(resp ~ (1|codice_regione)+
times + I(times^2),family = poisson, data = regioni,
iter = n.iter)
mod.hier.1.cl <- glmer(resp ~ (1|codice_regione)+
times + I(times^2),family = poisson, data = regioni)
loo.hier.1 <- loo(mod.hier.1)$estimates[3,1]
d1 <- display(mod.hier.1.cl)
dic.hier.1 <- as.double(d1$DIC)
# + medical swabs
mod.hier.2 <- stan_glmer(resp ~ (1|codice_regione)+
times + I(times^2) + tamponi,
data = regioni, family = poisson,
iter = n.iter)
mod.hier.2.cl <- glmer(resp ~ (1|codice_regione)+
times + I(times^2) + tamponi,
data = regioni, family = poisson)
loo.hier.2 <- loo(mod.hier.2)$estimates[3,1]
d2 <- display(mod.hier.2.cl)
dic.hier.2 <- as.double(d2$DIC)
c(loo.hier.1, loo.hier.2)
## [1] 22591.92 13794.48
As is clear, the hierarchical models dramatically improve the fit, with the model including medical swabs being preferable to the one without.
Furthermore, we extend our hierarchical models by considering three further ingredients:
add the \(\mathsf{lockdown}\) covariate, a non-decreasing numerical function meant to mimic the lockdown measures adopted by the Italian Government; its value is zero before March 8th and then increases, with upper bound 1;
restrict the interval over which each region is observed, to account for the different starting points of Covid-19 registered in each region;
evaluate a possible interaction between the lockdown measures and the exposition times.
The linear predictor for the complete model is then defined as:
\[\begin{align}
\lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i +
\beta_3\mathsf{lockdown}_i + \\
& \beta_4 \mathsf{swabs}_i +
\beta_5 \mathsf{lag\_hosp}_i + \beta_6 \mathsf{times}_i\times \mathsf{lockdown}_i +\\
& \beta_7 \mathsf{times}^2_i\times \mathsf{lockdown}_i).
\end{align}\]
## data redefinition
regioni_ristr <- cbind(times, osp_lag, regioni)
regioni_tibble <- as_tibble(regioni_ristr)
regioni_ristr<- regioni_tibble %>%
filter( (times > 3 & (denominazione_regione =="Campania"|
denominazione_regione =="Puglia" |
denominazione_regione =="Basilicata"|
denominazione_regione =="Molise" |
denominazione_regione =="Calabria")) |
(times > 1 & denominazione_regione == "Sicilia")|
(times > 8 & denominazione_regione == "Sardegna")|
(times >0 & (denominazione_regione =="Piemonte"|
denominazione_regione =="Valle d'Aosta" |
denominazione_regione =="Abruzzo"|
denominazione_regione =="P.A. Bolzano" |
denominazione_regione =="Emilia Romagna"|
denominazione_regione =="Friuli Venezia Giulia"|
denominazione_regione =="Lazio" |
denominazione_regione =="Liguria"|
denominazione_regione =="Lombardia" |
denominazione_regione =="Marche"|
denominazione_regione =="P.A. Trento"|
denominazione_regione =="Toscana"|
denominazione_regione =="Umbria"|
denominazione_regione =="Veneto")))
# lockdown variable: quadratic ramp reaching 1 after 15 days
eff <- function(xx) {
  vec <- rep(0, length(xx))
  for (i in 1:length(xx)) {
    if (xx[i] <= 15) {
      vec[i] <- xx[i]^2/225
    } else {
      vec[i] <- 1
    }
  }
  vec
}
# defined to be zero before March 8th (the first 14 days)
lockdown <- c(rep(0, 14), eff(1:(l_times-14)))
resp <- regioni_ristr$totale_ospedalizzati
times <- regioni_ristr$times
lockdown_rep <- c()
for (j in 1: length(resp)){
lockdown_rep[j] <- lockdown[times[j]]
}
n_reg <- length(unique(regioni_ristr$denominazione_regione))
times_scaled <- times - mean(times)
# + dummy lockdown; distinct starting points; scaled times
mod.hier.3 <- stan_glmer( resp ~ (1|codice_regione)+
times_scaled + I(times_scaled^2) + lockdown_rep,
data = regioni_ristr, family = poisson,
iter = n.iter)
mod.hier.3.cl <- glmer( resp ~ (1|codice_regione)+
times_scaled + I(times_scaled^2) + lockdown_rep,
data = regioni_ristr, family = poisson)
loo.hier.3 <- loo(mod.hier.3)$estimates[3,1]
d3 <- display(mod.hier.3.cl)
dic.hier.3 <- as.double(d3$DIC)
# + swabs
mod.hier.4 <- stan_glmer( resp ~ (1|codice_regione)+
times_scaled + I(times_scaled^2) +
tamponi + lockdown_rep,
data = regioni_ristr, family = poisson, iter = n.iter)
mod.hier.4.cl <- glmer( resp ~ (1|codice_regione)+
times_scaled + I(times_scaled^2) +
tamponi + lockdown_rep,
data = regioni_ristr, family = poisson)
loo.hier.4 <- loo(mod.hier.4)$estimates[3,1]
d4 <- display(mod.hier.4.cl)
dic.hier.4 <- as.double(d4$DIC)
# + interaction between lockdown and time
mod.hier.4.int <- stan_glmer( resp ~ (1|codice_regione) +
lockdown_rep*times_scaled +
I(times_scaled^2) + tamponi,
data = regioni_ristr,
family = poisson, iter = n.iter)
mod.hier.4.cl.int <- glmer( resp ~ (1|codice_regione)+
lockdown_rep*times_scaled +
I(times_scaled^2) + tamponi , data = regioni_ristr,
family = poisson)
loo.hier.4.int <- loo(mod.hier.4.int)$estimates[3,1]
d4.int <- display(mod.hier.4.cl.int)
dic.hier.4.int <- as.double(d4.int$DIC)
# + double interaction between lockdown and time
mod.hier.4.int.2 <- stan_glmer( resp ~ (1|codice_regione) +
lockdown_rep*times_scaled +
lockdown_rep*I(times_scaled^2) + tamponi,
data = regioni_ristr,
family = poisson, iter = n.iter)
mod.hier.4.cl.int.2 <- glmer( resp ~ (1|codice_regione)+
lockdown_rep*times_scaled +
lockdown_rep*I(times_scaled^2) + tamponi ,
data = regioni_ristr,
family = poisson)
loo.hier.4.int.2 <- loo(mod.hier.4.int.2)$estimates[3,1]
d4.int.2 <- display(mod.hier.4.cl.int.2)
dic.hier.4.int.2 <- as.double(d4.int.2$DIC)
# + lag hospitalized
mod.hier.5 <- stan_glmer( resp ~ (1|codice_regione)+
times_scaled + I(times_scaled^2) +
tamponi + lockdown_rep +
osp_lag, data = regioni_ristr, family = poisson,
iter = n.iter)
mod.hier.5.cl <- glmer( resp ~ (1|codice_regione)+
times_scaled + I(times_scaled^2) +
tamponi + lockdown_rep +
osp_lag, data = regioni_ristr, family = poisson)
loo.hier.5 <- loo(mod.hier.5)$estimates[3,1]
d5 <- display(mod.hier.5.cl)
dic.hier.5 <- as.double(d5$DIC)
# + interaction between lockdown and time
mod.hier.5.int <- stan_glmer( resp ~ (1|codice_regione)+
lockdown_rep*times_scaled +
I(times_scaled^2) + tamponi +
osp_lag, data = regioni_ristr,
family = poisson, iter = n.iter)
mod.hier.5.cl.int <- glmer( resp ~ (1|codice_regione)+
lockdown_rep*times_scaled +
I(times_scaled^2) + tamponi + osp_lag,
data = regioni_ristr,
family = poisson)
loo.hier.5.int <- loo(mod.hier.5.int)$estimates[3,1]
d5.int <- display(mod.hier.5.cl.int)
dic.hier.5.int <- as.double(d5.int$DIC)
# + double interaction between lockdown and time
mod.hier.5.int.2 <- stan_glmer( resp ~ (1|codice_regione)+
lockdown_rep*times_scaled +
lockdown_rep*I(times_scaled^2) + tamponi +
osp_lag, data = regioni_ristr,
family = poisson, iter = n.iter)
mod.hier.5.cl.int.2 <- glmer( resp ~ (1|codice_regione)+
lockdown_rep*times_scaled +
lockdown_rep*I(times_scaled^2) + tamponi + osp_lag,
data = regioni_ristr,
family = poisson)
loo.hier.5.int.2 <- loo(mod.hier.5.int.2)$estimates[3,1]
d5.int.2 <- display(mod.hier.5.cl.int.2)
dic.hier.5.int.2 <- as.double(d5.int.2$DIC)
c(loo.hier.3, loo.hier.4, loo.hier.4.int,
loo.hier.4.int.2, loo.hier.5, loo.hier.5.int,
loo.hier.5.int.2)
## [1] 22345.19 13590.99 13458.34 13485.01 11870.41 11838.03 11860.90
c(dic.hier.3, dic.hier.4, dic.hier.4.int, dic.hier.4.int.2,
dic.hier.5, dic.hier.5.int, dic.hier.5.int.2)
## [1] 13129.485 4652.788 4494.606 4493.260 3030.830 2978.791 2977.587
The model with medical swabs and lagged hospitalizations as covariates, and with the interaction between lockdown and time, is the best in terms of goodness-of-fit measures. In general, the interaction between lockdown and time improves the fit, whereas the interaction between lockdown and squared times does not yield better predictive performance.
Model comparisons in terms of predictive information criteria
Let’s take a closer look at the model comparisons (LOOIC and DIC) for the hierarchical models:
#plots for looic and dic
loo.v <- c(loo.hier.1, loo.hier.2, loo.hier.3,
loo.hier.4, loo.hier.4.int, loo.hier.4.int.2,
loo.hier.5, loo.hier.5.int, loo.hier.5.int.2)
dic.v <- c(dic.hier.1, dic.hier.2, dic.hier.3,
dic.hier.4, dic.hier.4.int, dic.hier.4.int.2,
dic.hier.5, dic.hier.5.int, dic.hier.5.int.2)
sort.loo.v <- sort.int(loo.v, index.return = TRUE)$x
sort.dic.v <- sort.int(dic.v, index.return = TRUE)$x
par(xaxt="n", mfrow=c(1,2))
plot(sort.loo.v, type="b", xlab="", ylab="LOOIC")
par(xaxt="s")
axis(1, c(1:9), c("hier1", "hier2", "hier3",
"hier4","hier4int", "hier4int2", "hier5",
"hier5int", "hier5int2")[sort.int(loo.v,
index.return = TRUE)$ix],
las=2)
par(xaxt="n")
plot(sort.dic.v, type="b", xlab="", ylab="DIC")
par(xaxt="s")
axis(1, c(1:9), c("hier1", "hier2", "hier3",
"hier4","hier4int", "hier4int2", "hier5",
"hier5int", "hier5int2")[sort.int(dic.v,
index.return = TRUE)$ix], las=2)

The complete model (without the interaction between lockdown and squared times) seems to be the best in terms of predictive information criteria.
Model estimates
Let’s retrieve some posterior estimates for the best model, the hierarchical Poisson model with five predictors and the interaction between lockdown and time:
# labelling
alpha_names <- paste0("alpha[", n_reg:1, "]")
beta_names <- paste0("beta[", 5:1, "]")
new_names <- c(expression(sigma[alpha]), alpha_names, beta_names, expression(mu))
posterior <- as.array(mod.hier.5.int)
mcmc_intervals(posterior)+
xaxis_text(on =TRUE, size=rel(1.9))+
yaxis_text(on =TRUE, size=rel(1.9))+
scale_y_discrete(labels = rev((parse(text = new_names))))+
ggtitle("Parameter estimation")+
theme(plot.title = element_text(hjust = 0.5, size =rel(2)))

There is substantial between-group variability across the regions (look at the parameter \(\sigma_{\alpha}\)), whereas \(\beta_1,\ldots,\beta_5\) do not appear to be very different from zero.
We can check our model via posterior predictive checks: for instance, by comparing the empirical distribution function of the observed data with the empirical distribution functions of replications under the model, or by comparing statistics such as the mean and the standard deviation of the replicated data with those of the observed data.
# ecdf
color_scheme_set("blue")
ppc_ecdf_overlay(y = resp,
yrep = posterior_predict(mod.hier.5.int,
draws = 200))

# stats
ppc_stat(y = resp,
yrep = posterior_predict(mod.hier.5.int,
draws = 200), stat = "mean")

ppc_stat(y = resp,
yrep = posterior_predict(mod.hier.5.int,
draws = 200), stat = "sd")

There could be some issues with the number of zeros (not addressed here).
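As a quick hedged diagnostic (which we do not pursue further), one could compare the proportion of zeros in the replicated datasets with the observed one:
# hedged sketch: posterior predictive check on the proportion of zeros
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = resp,
         yrep = posterior_predict(mod.hier.5.int, draws = 200),
         stat = "prop_zero")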
Graphical analysis and predictions
We are going to use our models to make predictions for the next days. One difficulty concerns the simulation of the out-of-sample covariates, such as the medical swabs or the lagged hospitalizations. Although the best fit is obtained by the model with the lagged hospitalized covariate, mod.hier.5.int, we choose mod.hier.4.int because its covariates are simpler to handle.
However, we still need to simulate new medical swabs for each region for the out-of-sample days, and this is dramatically difficult, because the swab policies are neither clear nor known, and the number of swabs is a biased measure with respect to sampling the entire population.
For this reason, for each region we fit a local polynomial surface determined only by the time and squared-time covariates:
# data manipulation for out-of-sample
mod <- mod.hier.4.int
data <- regioni_ristr
prev <- 4
times <- as.factor(data$data)
levels(times) <- c(1:l_times)
times <- as.numeric(times)
new_times <- c((l_times):(l_times+prev))
pred <- posterior_predict(mod, type ="response")
starting <- rep(l_times, n_reg)
for (j in 1:prev){
times_out <- c(starting,
rep(l_times+j, n_reg ))
starting <- times_out
}
times_tot <- c(times, times_out)
centered_times_tot <- times_tot-mean(times)
stringa_regioni <- data$denominazione_regione[502:522]
stringa_codici <- data$codice_regione[502:522]
tamponi <- regioni_ristr$tamponi
new_tamponi <- matrix(NA, n_reg, prev+1)
se_new <- matrix(NA, n_reg, prev+1)
#loess prediction for the swabs
par(mfrow=c(2,3))
for (j in 1:n_reg){
obj <- loess(tamponi[denominazione_regione==
stringa_regioni[j]] ~
times[denominazione_regione==stringa_regioni[j]]+
I(times[denominazione_regione==stringa_regioni[j]]^2),
data = regioni_ristr,
control = loess.control(surface = "direct"))
new_tamponi[j, 1] <- obj$fitted[length(obj$fitted)]
for (t in (new_times-1)){
fit.loess <- predict(obj,
newdata = data.frame(times = t+1,
denominazione_regione = stringa_regioni[j]),
se = TRUE)
new_tamponi[j, t-l_times+2] <- fit.loess$fit
se_new[j, 1] <- predict(obj,
newdata = data.frame(times = l_times,
denominazione_regione = stringa_regioni[j]),
se = TRUE)$se
se_new[j,t-l_times+2] <- fit.loess$se
}
plot(c(times[regioni_ristr$denominazione_regione==
stringa_regioni[j]], new_times[2]:(l_times + prev)),
c(tamponi[regioni_ristr$denominazione_regione==
stringa_regioni[j]], rep(NA, length(2: (prev+1)))),
xlab = "Days", ylab ="swabs", main =
stringa_regioni[j], ylim=c(0,
new_tamponi[j, prev+1] +0.1*new_tamponi[j, prev+1]))
points(new_times[1]:(l_times + prev),
new_tamponi[j, 1: (prev+1)], col ="red", lwd=2 )
}




The red points represent the local polynomial predictions for the swabs: again, these predictions rely on a very simplified assumption!
Now we can try to predict the total number of hospitalized people for the 21 regions over the next 4 days (summary tables are reported at the end of the document). Red ribbons represent 95% predictive intervals for the out-of-sample predictions, whereas blue ribbons represent 95% predictive intervals for the in-sample data.
# data for predictions
newdata <- data.frame(times_scaled =
centered_times_tot[(length(times)+1):(length(times)+length(times_out))],
denominazione_regione = rep(stringa_regioni, prev+1),
codice_regione = rep(stringa_codici, prev+1),
lockdown_rep = rep(1, n_reg*(prev+1)),
tamponi = as.vector(new_tamponi))
regioni_plot <- function(mod, data, prev, newdata,
new_points){
pred.out <- posterior_predict(mod,
newdata = newdata ,
type = "response")
pred_0025 <- apply(pred, 2, function(x) quantile(x,0.025))
pred_025 <- apply(pred, 2, function(x) quantile(x,0.25))
pred_med<- apply(pred, 2, function(x) quantile(x,0.5))
pred_075<- apply(pred, 2, function(x) quantile(x,0.75))
pred_0975<- apply(pred, 2, function(x) quantile(x,0.975))
pred_0025_out <- apply(pred.out, 2, function(x) quantile(x,0.025))
pred_025_out <- apply(pred.out, 2, function(x) quantile(x,0.25))
pred_med_out <- apply(pred.out, 2, function(x) quantile(x,0.5))
pred_075_out <- apply(pred.out, 2, function(x) quantile(x,0.75))
pred_0975_out <- apply(pred.out, 2, function(x) quantile(x,0.975))
if (missing(new_points)){
new_points <- rep(NA, length(new_times)*(n_reg))
}else{
new_points <- c(rep(NA, n_reg), new_points)
}
df.regioni <- data.frame(
y = c(resp, new_points),
times = c(times, times_out),
med = c(pred_med, rep(NA, length(new_times)*n_reg)),
lo = c(pred_0025, rep(NA, length(new_times)*n_reg)),
lo.2 = c(pred_025, rep(NA, length(new_times)*n_reg)),
hi= c(pred_0975, rep(NA, length(new_times)*n_reg)),
hi.2 = c(pred_075, rep(NA, length(new_times)*n_reg)),
reg = unlist(list(data$denominazione_regione, rep(stringa_regioni, prev+1))),
med.out = c(rep(NA, length(times)), pred_med_out),
lo.out = c(rep(NA, length(times)), pred_0025_out),
lo.2.out = c(rep(NA, length(times)), pred_025_out),
hi.out= c(rep(NA, length(times)),
pred_0975_out),
hi.2.out = c(rep(NA, length(times)),pred_075_out)
)
if (prev <=7){
table_reg <- matrix(NA, n_reg, prev)
str_days <- Sys.Date()-1
# c("30 mar", "31 mar",
# "1 apr", "2 apr", "3 apr", "4 apr",
# "5 apr", "6 apr")
options(knitr.table.format = "html")
for (j in 1:prev){
str_days <- c(str_days, Sys.Date()+j-1)
table_reg[,j] <-
paste(round(pred_med_out[(n_reg*j+1):(n_reg*(j+1))]),
" (",
round(pred_0025_out[(n_reg*j+1):(n_reg*(j+1))]), ", ",
round(pred_0975_out[(n_reg*j+1):(n_reg*(j+1))]), ")",
sep="" )
}
str_days <- str_days[2:(prev+1)]
dimnames(table_reg) <- list(stringa_regioni,
format(str_days,
format = "%d-%b-%Y"))
ospt <- write.csv(table_reg, paste("osp", Sys.Date(),
".csv", sep=""))
}
ggplot(df.regioni, aes(times, y))+
geom_ribbon(aes(x=times, ymin=lo, ymax=hi, group=1),
data=df.regioni,
fill = color_scheme_get("blue")[[2]])+
geom_ribbon(aes(x=times, ymin=lo.out, ymax=hi.out, group=1),
data=df.regioni,
fill = color_scheme_get("red")[[2]])+
geom_line(aes(x= times, y= med),
data=df.regioni,
color = color_scheme_get("blue")[[4]],
size =1.1)+
geom_line(aes(x= times, y= med.out),
data=df.regioni,
color = color_scheme_get("red")[[4]],
size =1.1)+
geom_point(aes(x = times, y = y))+
xlab("Days")+
ylab("Total Hospitalized")+
scale_x_discrete(limit = c( 8, 15, 22, 29, 36 ),
labels = c( "2-3", "9-3", "16-3", "23-3", "30-3") )+
facet_wrap("reg", scales ="free")+
theme(strip.text.x = element_text(size = 12,
colour = "black"),
axis.text.x = element_text(face="bold",
color="#993333",
angle=45, size =9),
axis.title.x = element_text(size=22),
axis.title.y = element_text(size = 22))
# return(list(pred_0025_out = pred_0025_out,
# pred_med_out = pred_med_out,
# pred_0975_out = pred_0975_out))
}
# 4-days predictions
regioni_plot(mod, data, prev, newdata)

We can also focus on a couple of regions, say Friuli Venezia Giulia and Lombardia.
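The plotting chunk for these panels is not shown; as a hedged sketch, the corresponding out-of-sample predictive summaries could be extracted from the objects defined above (mod and newdata) as follows:
# hedged sketch: out-of-sample predictive summaries for two regions only
focus <- c("Friuli Venezia Giulia", "Lombardia")
idx <- newdata$denominazione_regione %in% focus
pred.focus <- posterior_predict(mod, newdata = newdata[idx, ], type = "response")
round(apply(pred.focus, 2, quantile, probs = c(0.025, 0.5, 0.975)))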


And what happens in one month, according to the model? Here are the predicted swabs (solid red lines are the estimates, dashed red lines represent the estimates \(\pm 2\) standard errors):




And here are the total hospitalized people over the next 30 days (red ribbons):
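The chunk below re-uses the out-of-sample objects (centered_times_tot, times_out, new_tamponi); for the one-month horizon we assume they have been rebuilt with prev = 30, as in this minimal sketch (the swab predictions above use the same LOESS loop as before):
# hedged sketch: extend the prediction horizon to one month
prev <- 30
new_times <- l_times:(l_times + prev)
starting <- rep(l_times, n_reg)
for (j in 1:prev){
  times_out <- c(starting, rep(l_times + j, n_reg))
  starting <- times_out
}
times_tot <- c(times, times_out)
centered_times_tot <- times_tot - mean(times)
# new_tamponi and se_new are then re-filled with the same LOESS loop used above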
# data for predictions
newdata <- data.frame(times_scaled =
centered_times_tot[(length(times)+1):(length(times)+length(times_out))],
denominazione_regione = rep(stringa_regioni, prev+1),
codice_regione = rep(stringa_codici, prev+1),
lockdown_rep = rep(1, n_reg*(prev+1)),
tamponi = as.vector(new_tamponi))
regioni_plot(mod, data, prev, newdata)

Models for intensive care units (ICU)
After a similar analysis, we selected a final model for the number of intensive care patients, \(\mathsf{icu}\), as well. We can make predictions for each region, similarly to the previous framework:
\[\begin{align}
\mathsf{icu}_{ij} \sim & \mathsf{Poisson}(\lambda_{ij})\\
\lambda_{ij} = & \exp(\mu + \alpha_{j(i)} + \beta_1\mathsf{times}_i + \beta_2 \mathsf{times}^2_i +
\beta_3\mathsf{lockdown}_i + \beta_4 \mathsf{times}_i\times
\mathsf{lockdown}_i)\\
\alpha_j \sim & \mathsf{Normal}(\mu_{\alpha}, \sigma^2_{\alpha}),
\end{align}\]
icu <- regioni_ristr$terapia_intensiva
mod.hier.icu.2.int <- stan_glmer( icu ~
lockdown_rep*times_scaled+
(1|codice_regione)+ I(times_scaled^2),
data = regioni_ristr,
family = poisson, iter = n.iter)
print(mod.hier.icu.2.int)
We retrieve the posterior parameter estimates, and then we attempt 4-day predictions:
# labelling
alpha_names <- paste0("alpha[", n_reg:1, "]")
beta_names <- paste0("beta[", 4:1, "]")
new_names <- c(expression(sigma[alpha]), alpha_names, beta_names, expression(mu))
posterior2 <- as.array(mod.hier.icu.2.int)
mcmc_intervals(posterior2)+
xaxis_text(on =TRUE, size=rel(1.9))+
yaxis_text(on =TRUE, size=rel(1.9))+
scale_y_discrete(labels = rev((parse(text = new_names))))+
ggtitle("Parameter estimation")+
theme(plot.title = element_text(hjust = 0.5, size =rel(2)))
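The chunk producing the 4-day ICU predictions reported in the tables below is not displayed; a minimal hedged sketch, re-using the 4-day out-of-sample design (prev = 4, times_out, stringa_codici) built above for the hospitalized model:
# hedged sketch: 4-day-ahead ICU predictions for each region
# (the ICU model uses only times, lockdown and the regional intercepts)
newdata_icu <- data.frame(
  times_scaled = centered_times_tot[(length(times)+1):(length(times)+length(times_out))],
  codice_regione = rep(stringa_codici, prev+1),
  lockdown_rep = rep(1, n_reg*(prev+1)))
pred_icu <- posterior_predict(mod.hier.icu.2.int, newdata = newdata_icu,
                              type = "response")
round(t(apply(pred_icu, 2, quantile, probs = c(0.025, 0.5, 0.975))))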


Tables for ICU and hospitalized
Here are the tables of the predicted total number of hospitalized people and ICU patients for the next 4 days. Point predictions are expressed as medians; in parentheses are the lower and upper bounds of the 95\(\%\) credible intervals:
Predicted number of total hospitalized people (in parentheses, the lower and the upper bounds of the 95% credible intervals)

| Regione | 31-mar-2020 | 01-apr-2020 | 02-apr-2020 | 03-apr-2020 |
|---|---|---|---|---|
| Abruzzo | 468 (423, 514) | 484 (438, 531) | 497 (451, 544) | 504 (457, 554) |
| Basilicata | 45 (32, 60) | 47 (34, 62) | 49 (35, 64) | 50 (36, 66) |
| P.A. Bolzano | 396 (358, 438) | 410 (371, 450) | 420 (379, 461) | 427 (384, 470) |
| Calabria | 168 (141, 196) | 173 (147, 201) | 177 (151, 206) | 179 (151, 209) |
| Campania | 637 (585, 688) | 654 (600, 706) | 664 (611, 720) | 669 (614, 726) |
| Emilia Romagna | 4088 (3954, 4225) | 4025 (3886, 4166) | 3923 (3781, 4060) | 3780 (3638, 3924) |
| Friuli Venezia Giulia | 366 (327, 405) | 376 (336, 417) | 383 (343, 426) | 387 (346, 428) |
| Lazio | 1168 (1100, 1242) | 1184 (1114, 1257) | 1188 (1118, 1264) | 1180 (1104, 1256) |
| Liguria | 1496 (1414, 1580) | 1549 (1466, 1635) | 1591 (1506, 1680) | 1618 (1529, 1705) |
| Lombardia | 11793 (11535, 12044) | 11286 (11020, 11545) | 10681 (10413, 10953) | 9996 (9716, 10273) |
| Marche | 1820 (1729, 1912) | 1887 (1795, 1980) | 1936 (1840, 2032) | 1972 (1875, 2070) |
| Molise | 53 (39, 69) | 56 (41, 72) | 58 (43, 75) | 60 (45, 76) |
| Piemonte | 4017 (3884, 4151) | 4087 (3945, 4223) | 4115 (3966, 4259) | 4107 (3957, 4258) |
| Puglia | 626 (575, 679) | 646 (593, 699) | 660 (608, 712) | 667 (612, 720) |
| Sardegna | 151 (126, 176) | 156 (131, 184) | 162 (136, 187) | 165 (137, 193) |
| Sicilia | 493 (449, 538) | 506 (460, 552) | 513 (465, 559) | 514 (467, 563) |
| Toscana | 1453 (1375, 1533) | 1463 (1384, 1543) | 1457 (1377, 1539) | 1435 (1354, 1516) |
| P.A. Trento | 421 (382, 462) | 439 (398, 482) | 451 (409, 497) | 460 (415, 505) |
| Umbria | 232 (202, 265) | 241 (209, 274) | 246 (214, 280) | 249 (218, 286) |
| Valle d’Aosta | 140 (116, 165) | 146 (121, 171) | 151 (126, 177) | 155 (131, 183) |
| Veneto | 1505 (1425, 1587) | 1439 (1360, 1519) | 1361 (1283, 1439) | 1271 (1195, 1348) |
Predicted number of ICU patients (in parentheses, the lower and the upper bounds of the 95% credible intervals)

| Regione | 31-mar-2020 | 01-apr-2020 | 02-apr-2020 | 03-apr-2020 |
|---|---|---|---|---|
| Abruzzo | 60 (45, 76) | 59 (44, 75) | 57 (42, 73) | 55 (40, 71) |
| Basilicata | 11 (5, 18) | 11 (5, 17) | 10 (5, 18) | 10 (4, 17) |
| P.A. Bolzano | 44 (32, 58) | 43 (31, 57) | 42 (29, 56) | 40 (28, 54) |
| Calabria | 18 (10, 28) | 18 (10, 28) | 18 (10, 27) | 17 (9, 26) |
| Campania | 98 (78, 119) | 97 (78, 118) | 94 (75, 114) | 91 (72, 111) |
| Emilia Romagna | 360 (322, 400) | 354 (315, 396) | 344 (305, 386) | 331 (292, 373) |
| Friuli Venezia Giulia | 47 (34, 62) | 46 (33, 60) | 45 (32, 58) | 43 (30, 57) |
| Lazio | 99 (79, 121) | 98 (79, 119) | 95 (76, 115) | 91 (72, 112) |
| Liguria | 155 (129, 182) | 152 (127, 178) | 148 (124, 175) | 143 (119, 169) |
| Lombardia | 1569 (1480, 1660) | 1539 (1445, 1633) | 1498 (1400, 1596) | 1443 (1342, 1547) |
| Marche | 190 (163, 221) | 187 (159, 216) | 182 (154, 210) | 175 (148, 204) |
| Molise | 9 (3, 15) | 9 (3, 15) | 8 (3, 15) | 8 (3, 14) |
| Piemonte | 396 (354, 437) | 388 (348, 432) | 378 (337, 421) | 364 (322, 408) |
| Puglia | 57 (42, 72) | 56 (41, 72) | 54 (40, 70) | 52 (38, 68) |
| Sardegna | 15 (8, 24) | 15 (8, 23) | 15 (8, 23) | 14 (7, 23) |
| Sicilia | 57 (43, 73) | 56 (42, 72) | 55 (40, 71) | 53 (39, 68) |
| Toscana | 251 (219, 284) | 246 (213, 280) | 239 (208, 273) | 231 (199, 264) |
| P.A. Trento | 44 (31, 58) | 43 (31, 57) | 42 (30, 56) | 40 (29, 54) |
| Umbria | 39 (27, 52) | 38 (27, 51) | 37 (25, 50) | 36 (25, 49) |
| Valle d’Aosta | 17 (10, 26) | 17 (9, 26) | 17 (9, 26) | 16 (9, 25) |
| Veneto | 332 (297, 372) | 327 (289, 365) | 317 (280, 356) | 306 (267, 346) |