Fitting a spatiotemporal model with a GAM and deriving an abundance index

Adapted from Lewis Barnett’s sdmTMB tutorial for the Fisheries Survey Course at UW: GitHub pbs-assess/sdmTMB-teaching 2023-02-14

Goals of the original exercise:

Practice fitting a basic spatiotemporal model.
Understand how to inspect the model output.
Practice predicting from the model on new data and making visualizations of those predictions.
Gain familiarity with fitting, comparing and interpreting different random field structures.
Calculate an area-weighted biomass index and compare how model structure can impact an index.

Our goals for this vignette:

Fit a GAM and obtain an area-weighted biomass index
Fit a basic spatiotemporal model (a GLMM!) using sdmTMB and compare GAM to GLMM

PKG <- c(
  "sdmTMB", # install.packages("sdmTMB", dependencies = TRUE)
  "mgcv", 
  "gratia",
  "visreg", 
  "gstat",
  "dplyr", 
  "ggplot2", 
  "INLA",
  "prediction",
  "inlabru", 
  "purrr")

for (p in PKG) {
  if(!require(p,character.only = TRUE)) {  
    install.packages(p)
    require(p,character.only = TRUE)}
}

options(ggplot2.continuous.colour = "viridis")
options(ggplot2.continuous.fill = "viridis")
theme_set(theme_light())

Check your system:

sessionInfo()

The data

We will work with CPUE data from yellowfin sole in the Eastern Bering Sea summer bottom trawl survey. This data is publicly available through FOSS.

The dataset contains sampling locations (longitude and latitude) and year. It also contains sampling depth in meters and sample CPUE density in units of tonnes/km².

ggplot(data = dat, 
       mapping = aes(x = longitude_dd, y = latitude_dd, size = cpue_kgha, color = bottom_temperature_c)) + 
  geom_point(alpha = 0.3)

Fit a GAM to spatial data (analogous to spatial-only model in sdmTMB)

Add UTM columns, log depth, and year as factor

dat <- add_utm_columns(dat, 
                       ll_crs = 4326,
                       ll_names = c("longitude_dd", "latitude_dd"))

dat$log_depth <- log(dat$depth_m)
dat$year_factor <- as.factor(dat$year)

dat[,c("X","Y")]
range(dat$X)
range(dat$Y)

ggplot(dat, aes(X, Y, size = cpue_kgkm2)) +
  geom_point(shape = 21) +
  coord_fixed()

Temporal effect plus spatial smoother, no covariates

First, fit a GAM without any covariates: just year as a factor and a spatial smoother \(s(X,Y)\). The as.factor(year) part is a common component of SDMs that are being used to generate indices.

start.time <- Sys.time()
fit_gam <- gam(
  formula = cpue_kgha ~ as.factor(year) +
    s(X, Y, k = 50),
  family = tw(link = "log"),
  data = dat
)

cat(
  "The GAM took ",
  difftime(Sys.time(), start.time, units = "mins"),
  " mins to run"
)

Get diagnostics and perform model checking:

gam.check(fit_gam)

Include a 2-D smooth over space, and depth as a covariate.

start.time <- Sys.time()
fit_gam_s <- gam(
  formula = cpue_kgha ~ s(depth_m) + as.factor(year) + 
    s(X,Y), 
  family = tw(link = "log"),
  data = dat
)

cat("The GAM took ", 
    difftime(Sys.time(),start.time, units = "mins"), 
    "mins to run")

gam.check(fit_gam_s)

Fit a GAM analogous to a spatiotemporal model in sdmTMB

NOTE: This takes a long time to fit. Include a 2-D smooth over space for each year. In this case, there a different 2-D smooth for each year. This will take much longer to fit than the other two.

start.time <- Sys.time()

fit_gam_st <- gam(
  formula = count ~ as.factor(year) +
    s(X, Y, by = as.factor(year)),
  family = tw(link = "log"),
  data = dat
)

cat(
  "The GAM took ",
  difftime(Sys.time(), start.time, units = "mins"),
  "mins to run"
)

Get diagnostics and perform model checking

gam.check(fit_gam_st)

Review console output to help verify convergence, and whether there were an adequate number of basis functions (k).

Examine the four diagnostic plots. Each of these gives a different way of looking at your model residuals. On the top-left is a Q-Q plot, which compares the model residuals to the expected/assumed distribution family. A well-fit model’s residuals will be close to the 1-1 line, otherwise there may be under- or over-dispersion present. On bottom left is a histogram of residuals. We want this to have a shape similar to the distribution family we specified. On top-right is a plot of residual values as a function of the linear predictor. These should be evenly distributed around zero in a well-fitted model. Finally, on the bottom-right is plot of response against fitted values. A well-fitted model would show values near the 1-1 line.

Predict to full survey area with new data

Load the grid

Extrapolation grid for the EBS.

# This is another form of the same grid:
load(here::here("data/pred_grid_ebs.rda")) # object: pred_grid_ebs
#pred_grid_ebs <- read.csv(here::here("data/ebs_2022_epsg3338.csv"),header = TRUE)

get_crs(dat = pred_grid_ebs,ll_names =c("lon","lat"))

grid <- add_utm_columns(pred_grid_ebs, 
                       #ll_crs = 32603, 
                       ll_names = c("lon", "lat"))
range(grid$X)

# Grid from example
#wcvi_grid <- readRDS(here::here("data/wcvi-grid.rds"))

When you have spatiotemporal data, you need a grid for each year. There is a nice tidy little chunk of purrr code that will do that for you!

grid <- purrr::map_dfr(unique(dat$year), ~ tibble(grid, year = .x))

Predict CPUE across the grid

To get predicted

pred_gam <- predict(fit_gam, type = "response", newdata = grid) #This takes a long time.
pred_gam_df <- cbind(grid, pred_gam)
pred_gam_df$Shape_Area_ha <- pred_gam_df$Shape_Area*0.0001 # original Shape_area is in m^2
pred_gam_df$predicted_tot_grid <- pred_gam_df$Shape_Area_ha*pred_gam_df$pred_gam

Plot predictions over survey area. Note: Because the EBS grid is irregularly spaced, you have to use geom_point() here instead of geom_tile() or geom_raster().

pred_gam_df |>
  filter(year==1999) |> # single year 
ggplot(aes(X, Y, color=pred_gam)) + 
  geom_point(size = 0.1) + 
  scale_fill_viridis_c() + 
  #facet_wrap(~year) + 
  coord_fixed() +
  labs(color = expression(CPUE (kg/km^2))) +
  theme_light() +
  theme(legend.position = "bottom")

Calculate biomass index from GAM via simulation

Generally, model-based indices of abundance use the area of each grid cell and the prediction for that area to calculate a total index across the survey area. The simplest way to do this is to sum the predicted biomasses across the full grid like so:

gam_idx_mt <- pred_gam_df |> 
  dplyr::group_by(year) |> 
  summarize(total_wt_mt = sum(predicted_tot_grid)/1000)

head(gam_idx_mt)

Get uncertainties for GAM-based index by sampling from posteriors

{gratia} is an R package for evaluating and displaying GAM fits. This script uses the fitted_samples() function, which draws fitted values from the posterior of a model using a Gaussian approximation.

sims <- gratia::fitted_samples(fit_gam, n=10, data=grid, 
                               scale="response", seed=9)
sims$year <- grid$year[sims$row]
sims$area <- rep(pred_gam_df$Shape_Area_ha, times = 10) # matching the # of draws
sims$biomass <- sims$fitted * sims$area # expand from density to biomass, given area

level <- 0.95 # specify probability for confidence interval

# Get sum of simulated biomass (density*area) across grid cells, with CI
lwr_fn <- function(x) {as.numeric(quantile(x, probs = (1 - level) / 2))}
upr_fn <- function(x) {as.numeric(quantile(x, probs = 1 - (1 - level) / 2))}

sims_sum <-  sims %>% 
  group_by(year,draw) %>% 
  summarise_at("biomass", list(biomass = sum)) %>%
  group_by(year) %>%
  summarise_at("biomass", list(est = median, # could use mean
                           lwr = lwr_fn,
                           upr = upr_fn))

Index standardization using `sdmTMB`

To calculate an index from any of these models, we need to run the predict.sdmTMB() function with the argument return_tmb_object = TRUE. We can then run the get_index() function to extract the total biomass calculations and standard errors.

We can set the area argument to our cell_area column in km². In this case the value is 4 km² for all of the cells, since our grid cells are 2 km x 2 km. If some grid cells were not fully in the survey domain (or were on land), we could feed a vector of grid areas to the area argument that matched the number of grid cells. Because the density units are tonnes per km² for this data, the index is in tonnes.

p <- predict(fit, newdata = grid, return_tmb_object = TRUE)
index <- get_index(p, area = grid$cell_area, bias_correct = FALSE)

ggplot(index, aes(year, est)) +
  geom_line() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  xlab("Year") +
  ylab("Biomass estimate (tonnes)")

We used bias_correction = FALSE to speed things up, but for any final result you will want to use the bias correction. Let’s see how much the scale of the index changes with bias correction.

index_c <- get_index(p, area = grid$cell_area, bias_correct = TRUE)
index_c$Method <- "Bias correction"

bind_rows(index, index_c) %>%
  ggplot(aes(year, est, fill = Method)) +
  geom_line(aes(colour = Method)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  xlab("Year") +
  ylab("Biomass estimate (tonnes)")

Calculate biomass index from GAM via simulation

sims <- gratia::fitted_samples(fit_gam, n=10, newdata=grid, 
                               scale="response", seed=9)
sims$year <- grid$year[sims$row]
sims$biomass <- sims$fitted * 4 # expand from density to biomass, given area

level <- 0.95 # specify probability for confidence interval

# Get sum of simulated biomass (density*area) across grid cells, with CI
lwr_fn <- function(x) {as.numeric(quantile(x, probs = (1 - level) / 2))}
upr_fn <- function(x) {as.numeric(quantile(x, probs = 1 - (1 - level) / 2))}

sims_sum <-  sims %>% 
  group_by(year,draw) %>% 
  summarise_at("biomass", list(biomass = sum)) %>%
  group_by(year) %>%
  summarise_at("biomass", list(est = median, # could use mean
                           lwr = lwr_fn,
                           upr = upr_fn))

Note that this approach uses a Gaussian approximation to the posterior, which is all that is implemented currently in the gratia package. However, a better estimate of uncertainty could be derived from sampling from the actual posterior distribution. However, this is beyond the scope of today’s lesson.

Plot the biomass index:

ggplot(sims_sum, aes(year, est)) + geom_line() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  xlab('Year') + ylab('Biomass estimate (kg)')

October 24 2023

Goals of the original exercise:

Our goals for this vignette:

The data

Fit a GAM to spatial data (analogous to spatial-only model in sdmTMB)

Add UTM columns, log depth, and year as factor

Temporal effect plus spatial smoother, no covariates

Fit a GAM analogous to a spatiotemporal model in sdmTMB

Predict to full survey area with new data

Load the grid

Predict CPUE across the grid

Calculate biomass index from GAM via simulation

Get uncertainties for GAM-based index by sampling from posteriors

Index standardization using sdmTMB

Calculate biomass index from GAM via simulation

Index standardization using `sdmTMB`