--- title: "Checking the Distribution and Dispersion of a Model" author: "Thierry Onkelinx" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_caption: yes vignette: > %\VignetteIndexEntry{Checking the Distribution and Dispersion of a Model} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} library(knitr) opts_chunk$set( fig.width = 6, collapse = TRUE, comment = "#>" ) set.seed(20181011) no_stored_data <- interactive() || Sys.getenv("WERCKER") == "true" || !file.exists("distribution.Rda") if (!no_stored_data) { load("distribution.Rda") } ``` ## Generate data We created `generate_data()` to manufacture a basic data set with variables from different distributions: Poisson, negative binomial, binomial, zero inflated Poisson and zero inflated negative binomial. All variables share the common latent variable `eta` which is defined as $\eta_{ij} = a + b_i$ where $b_i \sim \mathcal{N}(0, \sigma_r ^ 2)$. ```{r load-packages, message = FALSE} library(tibble) library(dplyr) library(ggplot2) library(INLA) library(inlatools) ``` ```{r generate-data, eval = no_stored_data} intercept <- 2 sigma_random <- 0.75 nb_size <- 0.5 dataset <- generate_data( a = intercept, sigma_random = sigma_random, nb_size = nb_size ) ``` ```{r dataset} glimpse(dataset) summary(dataset) ``` ## Overdispersion ### Fitting the models Data sets in ecology often exhibit overdispersion. Let's take the `negbin` variable from the simulate data. This one has a negative binomial distribution. First we fit the models. Each model has the same covariate structure and the covariate structure is a perfect match with the data generating process. Note that we have to add `control.compute = list(config = TRUE)` to the `inla()` call for `distribution_check()` and `control.predictor = list(compute = TRUE)` for the `fast_distribution_check()`. More details on the distinction between those functions will be in the following section. We fit the model with three different distributions: 1) a negative binomial distribution, 2) a Poisson distribution, and 3) a Poisson distribution and a so called "observation level random effect" (`OLRE`). ```{r negbin-fit, eval = no_stored_data} negbin_negbin_model <- inla( negbin ~ f(group_id, model = "iid"), data = dataset, family = "nbinomial", control.predictor = list(compute = TRUE) ) negbin_poisson_model <- inla( negbin ~ f(group_id, model = "iid"), data = dataset, family = "poisson", control.compute = list(config = TRUE), control.predictor = list(compute = TRUE) ) bind_rows( negbin_negbin_model$summary.fixed %>% rownames_to_column("parameter") %>% mutate(distribution = "negative binomial"), negbin_poisson_model$summary.fixed %>% rownames_to_column("parameter") %>% mutate(distribution = "Poisson") ) %>% select( distribution, parameter, mean, lcl = `0.025quant`, ucl = `0.975quant` ) -> summary_fixed bind_rows( negbin_negbin_model$summary.hyperpar %>% rownames_to_column("parameter") %>% mutate(distribution = "negative binomial"), negbin_poisson_model$summary.hyperpar %>% rownames_to_column("parameter") %>% mutate(distribution = "Poisson") ) %>% select( distribution, parameter, mean, lcl = `0.025quant`, ucl = `0.975quant` ) -> summary_hyper ``` Once a model is fit, we can look at the summary. Both the negative binomial and the Poisson model do a good job when we look at the fixed effects estimates. Note that the true intercept is `r intercept`. Adding an observation level random effect to the Poisson model to compensate for overdispersion badly affects the fixed effects estimates. ```{r negbin-print-fixed, echo = FALSE} kable( summary_fixed, digits = 2, caption = "Estimates of the fixed effect for the different models" ) ``` We have set $\sigma_r = `r sigma_random`$. This is equivalent with a precision of `r signif(sigma_random ^ -2, 4)`. All models seem to recover this precision reasonably well. ```{r negbin-print-hyperpar, echo = FALSE} kable( summary_hyper, digits = 2, caption = "Hyperparameters of the different models" ) ``` ### Distribution checks The main difference between `distribution_check()` and `fast_distribution_check()` is that the latter is based on the fitted values while the former takes the marginal distribution of the fitted values into account. Hence, `distribution_check()` is more accurate but takes about 10 to 20 times slower. Often the output of `fast_distribution_check()` is sufficient to highlight the worst problems. ```{r negbin-cal-fdc1, eval = no_stored_data} system.time( negbin_poisson_dc <- distribution_check(negbin_poisson_model) ) system.time( negbin_poisson_fdc <- fast_distribution_check(negbin_poisson_model) ) ``` Both `distribution_check()` and `fast_distribution_check` return a data.frame with the empirical cumulative distribution of the raw data (`ecdf`) and the envelope of empirical cumulative distribution from data simulated from the model (`mean`, `lcl` and `ucl`). `n` is the number of observations with response `x`. ```{r glimpse-fdc} glimpse(negbin_poisson_fdc) ``` There is a `plot()` method for the distribution check data. It shows the ratio of the `ecdf` of the raw data divided by the `ecdf` of the simulated data based on the model. If the model makes sense, then the observed `ecdf` is within the envelope of the simulated `ecdf`. Hence the ratio observed / expected would be near to 1. This is the reference line depicted in the plot. ```{r negbin-poisson-dc, fig.cap = "Distribution check on a negative binomial response which is modelled using a Poisson distribution"} plot(negbin_poisson_dc) ``` The plot from the `distribution_check()` is very similar to that of the `fast_distribution_check()`. Notice that the ratio observed / expected is (much) larger than 1 for low values of `x`. The model with Poisson distribution doesn't capture the low response values very well. ```{r negbin-poisson-fdc, fig.cap = "Fast distribution check on a negative binomial response which is modelled using a Poisson distribution"} plot(negbin_poisson_fdc) ``` Let's see what happens if we apply `fast_distribution_check()` on the model with negative binomial distribution and Poisson distribution with `OLRE`. The distribution check of the model with negative binomial distribution yields a plot where the envelope of the ratio observed versus expected is always very close to the reference (1). ```{r negbin-calc-fdc2, eval = no_stored_data} negbin_negbin_fdc <- fast_distribution_check(negbin_negbin_model) ``` ```{r negbin-negbin-fdc, fig.cap = "Fast distribution check on a negative binomial response which is modelled using a negative binomial distribution"} plot(negbin_negbin_fdc) ``` ### Dispersion checks The classics dispersion measure is to sum the squared Pearson residuals and divide them by the number of observations minus the degrees of freedom. This value should ideally be close to one. Values above one indicate overdispersion, while values below one indicate underdispersion. The only agreement there is on number the degrees of freedom of mixed models, is that is hard to define them in a generic way. Therefore we take the simulation approach. Suppose we calculate the dispersion $D$ as the sum of the squared Pearson residuals $r_i$ divided by the number of observations $N$ and some degrees of freedom $x$. We calculate this value for the observed data ($D|data$). $$D = \frac{\sum r_i^2}{N - x}$$ Instead of comparing $D|data$ with some value, we simulate data from the model and for each of the simulated data sets we calculate the dispersion $(D|model)$. Next we compare the value $D|data$ with the distribution of $D|model$. If there is no over- or underdispersion, $P(D|data < D|model) \approx 0.5$. When $P(D|data > D|model) \approx 0$ indicates underdispersion, while $P(D|data > D|model) \approx 1$ overdispersion. Both $D|data$ and $D|model$ use the same values for $N$ and $x$. Redefining $D$ as the average squared Pearson residuals $D = \sum r_i^2/N$, has no effect on the value of $P(D|data > D|model)$. Hence can omit $x$ from the formula and no longer need to worry about its "true" value. `dispersion_check()` returns a list with two items: the dispersion value based on the data (a single value in `data`) and a vector of dispersion values for a number of simulated data sets (`model`). The `plot()` function as a method to handle this dispersion check object. It displays the density of the dispersion of the simulated data sets. The vertical dashed line shows the dispersion of the original data. The title contains $P(D|data > D|model)$. ```{r negbin-negbin-disp-c, eval = no_stored_data} negbin_negbin_disp <- dispersion_check(negbin_negbin_model) ``` ```{r negbin-negbin-disp, fig.cap = "Dispersion check on a negative binomial response which is modelled using a negative binomial distribution."} glimpse(negbin_negbin_disp) plot(negbin_negbin_disp) ``` The dispersion check of the model with the negative binomial distribution indicates no over- or underdispersion. The model with the Poisson distribution has clearly overdispersion. ```{r negbin-poisson-disp-c, eval = no_stored_data} negbin_poisson_disp <- dispersion_check(negbin_poisson_model) ``` ```{r negbin-poisson-disp, fig.cap = "Dispersion check on a negative binomial response which is modelled using a Poisson distribution."} plot(negbin_poisson_disp) ``` ## Zero inflation Let's see what happens if we do the same things for a zero inflated Poisson variable. We'll fit the variable using a Poisson, negative binomial, zero inflated Poisson and zero inflated negative binomial distribution. ```{r zip-fit, eval = no_stored_data} pcprior <- list(theta = list(prior = "pc.prec", param = c(1, 0.01))) zip_poisson_model <- inla( zipoisson ~ f(group_id, model = "iid", hyper = pcprior), data = dataset, family = "poisson", control.predictor = list(compute = TRUE) ) zip_negbin_model <- inla( zipoisson ~ f(group_id, model = "iid", hyper = pcprior), data = dataset, family = "nbinomial", control.predictor = list(compute = TRUE) ) zip_zipoisson_model <- inla( zipoisson ~ f(group_id, model = "iid", hyper = pcprior), data = dataset, family = "zeroinflatedpoisson1", control.predictor = list(compute = TRUE) ) zip_zinegbin_model <- inla( zipoisson ~ f(group_id, model = "iid", hyper = pcprior), data = dataset, family = "zeroinflatednbinomial1", control.predictor = list(compute = TRUE) ) bind_rows( zip_negbin_model$summary.fixed %>% rownames_to_column("parameter") %>% mutate( distribution = "negative binomial", zeroinflation = FALSE ), zip_poisson_model$summary.fixed %>% rownames_to_column("parameter") %>% mutate( distribution = "Poisson", zeroinflation = FALSE ), zip_zinegbin_model$summary.fixed %>% rownames_to_column("parameter") %>% mutate( distribution = "negative binomial", zeroinflation = TRUE ), zip_zipoisson_model$summary.fixed %>% rownames_to_column("parameter") %>% mutate( distribution = "Poisson", zeroinflation = TRUE ) ) %>% select( distribution, zeroinflation, parameter, mean, lcl = `0.025quant`, ucl = `0.975quant` ) -> summary_zip_fixed bind_rows( zip_negbin_model$summary.hyperpar %>% rownames_to_column("parameter") %>% mutate( distribution = "negative binomial", zeroinflation = FALSE ), zip_poisson_model$summary.hyperpar %>% rownames_to_column("parameter") %>% mutate( distribution = "Poisson", zeroinflation = FALSE ), zip_zinegbin_model$summary.hyperpar %>% rownames_to_column("parameter") %>% mutate( distribution = "negative binomial", zeroinflation = TRUE ), zip_zipoisson_model$summary.hyperpar %>% rownames_to_column("parameter") %>% mutate( distribution = "Poisson", zeroinflation = TRUE ) ) %>% select( distribution, zeroinflation, parameter, mean, lcl = `0.025quant`, ucl = `0.975quant` ) %>% arrange(parameter, distribution, zeroinflation) -> summary_zip_hyper ``` Using a distribution without zero inflation adds a downward bias to the intercept. ```{r zip-print-fixed, echo = FALSE} kable( summary_zip_fixed, digits = 2, caption = "Estimates of the fixed effect for the different models" ) ``` We have set $\sigma_r = `r sigma_random`$. This is equivalent with a precision of `r signif(sigma_random ^ -2, 4)`. All distributions lead to a large precision, so they all shrink the group effects. Note that the analyses are based on a very small dataset of only 200 observations and about half of them stem from the zero inflation part. The zero inflation probability is recovered very well. The size parameter of the negative binomial is very small when assuming no zero inflation. This indicates that it handles the zero inflation by increasing the variance. The zero inflated negative binomial has a very large size indicating that there is no overdispersion. Which is the case with a zero inflated Poisson response. ```{r zip-print-hyperpar, echo = FALSE} kable( summary_zip_hyper, digits = 2, caption = "Hyperparameters of the different models" ) ``` ### Dispersion checks ```{r zip-calc-disp, eval = no_stored_data} dispersion_check(zip_poisson_model) -> zip_poisson_disp dispersion_check(zip_negbin_model) -> zip_negbin_disp dispersion_check(zip_zipoisson_model) -> zip_zipoisson_disp dispersion_check(zip_zinegbin_model) -> zip_zinegbin_disp ``` Using the Poisson distribution we get strong indications for overdispersion. ```{r zip-poisson-disp, fig.cap = "Dispersion check for a zero inflated Poisson reponse modelled with a Poisson distribution"} plot(zip_poisson_disp) ``` The negative binomial distribution has a weak indication for underdispersion. More important, the range of dispersion values based on the simulation data is quite large. ```{r zip-negbin-disp, fig.cap = "Dispersion check for a zero inflated Poisson reponse modelled with a negative binomial distribution"} plot(zip_negbin_disp) ``` The dispersion checks on both the zero inflation Poisson and the zero inflated negative binomial models give a go-ahead. The observed dispersion is well within the range of the dispersion of the simulated data and that range is quite narrow. ```{r zip-zipoisson-disp, fig.cap = "Dispersion check for a zero inflated Poisson reponse modelled with a zero inflated Poisson distribution"} plot(zip_zipoisson_disp) ``` ```{r zip-zinegbin-disp, fig.cap = "Dispersion check for a zero inflated Poisson reponse modelled with a zero inflated negative binomial distribution"} plot(zip_zinegbin_disp) ``` ### Distribution checks Another feature of the distribution checks is that they can handle a list of models. The resulting data.frame gains a `model` variables which contains the name of the list (or their index in case of an unnamed list). Plotting the resulting check will create a subplot for each model (using `facet_wrap()`). By default all subplots will have the same scale. The strong ratio observed / expected for the Poisson model will swamp all information of the other models. Therefore we used `scales = "free"`, resulting in a dedicated x and y axis for each subplot. ```{r zip-calc-fdc, eval = no_stored_data} list( poisson = zip_poisson_model, negbin = zip_negbin_model, zipoisson = zip_zipoisson_model, zinegbin = zip_zinegbin_model ) %>% fast_distribution_check() -> zip_fdc ``` The plot for the Poisson model clearly indicates a problem with the (near) zero values, the data has much more of those than the model predicts. The negative binomial models is better at modelling the large number of zero's but it fails on the small, non-zero values which are overrepresented by the model. Both the zero inflated Poisson and the zero inflated negative binomial models pass the distribution check with flying colours. ```{r zip-fdc, fig.cap = "Fast distribution check on a zero inflated Poisson response which is modelled using several distributions"} glimpse(zip_fdc) plot(zip_fdc, scales = "free") ``` ## Underdispersion `generate_data()` has a binomial variable. We use this as an example for underdispersed data. The dispersion check clearly indicates underdispersion. The distribution check has a ratio smaller than 1 has low values. ```{r binom-fit, eval = no_stored_data} binom_poisson_model <- inla( binom ~ f(group_id, model = "iid"), data = dataset, family = "nbinomial", control.predictor = list(compute = TRUE) ) binom_poisson_disp <- dispersion_check(binom_poisson_model) binom_poisson_fdc <- fast_distribution_check(binom_poisson_model) ``` ```{r binom-poisson-disp, fig.cap = "Dispersion check on a binomial response which is modelled using a Poisson distribution."} plot(binom_poisson_disp) ``` ```{r binom-poisson-fdc, fig.cap = "Fast distribution check on a binomial response which is modelled using a negative binomial distribution"} plot(binom_poisson_fdc) ``` ```{r eval = no_stored_data, echo = FALSE} save( dataset, summary_fixed, summary_hyper, sigma_random, intercept, nb_size, negbin_negbin_fdc, negbin_negbin_disp, negbin_poisson_dc, negbin_poisson_fdc, negbin_poisson_disp, summary_zip_fixed, summary_zip_hyper, zip_fdc, zip_poisson_disp, zip_negbin_disp, zip_zipoisson_disp, zip_zinegbin_disp, binom_poisson_fdc, binom_poisson_disp, file = "distribution.Rda") ```