Fitting a spatiotemporal model with a GAM and deriving an abundance index
October 24 2023
B_gam_sdmTMB_example.Rmd
Adapted from Lewis Barnett’s sdmTMB tutorial for the Fisheries Survey Course at UW: GitHub pbs-assess/sdmTMB-teaching 2023-02-14
Goals of the original exercise:
- Practice fitting a basic spatiotemporal model.
- Understand how to inspect the model output.
- Practice predicting from the model on new data and making visualizations of those predictions.
- Gain familiarity with fitting, comparing and interpreting different random field structures.
- Calculate an area-weighted biomass index and compare how model structure can impact an index.
Our goals for this vignette:
- Fit a GAM and obtain an area-weighted biomass index
- Fit a basic spatiotemporal model (a GLMM!) using sdmTMB and compare GAM to GLMM
PKG <- c(
"sdmTMB", # install.packages("sdmTMB", dependencies = TRUE)
"mgcv",
"gratia",
"visreg",
"gstat",
"dplyr",
"ggplot2",
"INLA",
"prediction",
"inlabru",
"purrr")
for (p in PKG) {
if(!require(p,character.only = TRUE)) {
install.packages(p)
require(p,character.only = TRUE)}
}
options(ggplot2.continuous.colour = "viridis")
options(ggplot2.continuous.fill = "viridis")
theme_set(theme_light())
Check your system:
The data
We will work with CPUE data from yellowfin sole in the Eastern Bering Sea summer bottom trawl survey. This data is publicly available through FOSS.
The dataset contains sampling locations (longitude
and
latitude
) and year
. It also contains sampling
depth
in meters and sample CPUE density
in
units of tonnes/km2.
ggplot(data = dat,
mapping = aes(x = longitude_dd, y = latitude_dd, size = cpue_kgha, color = bottom_temperature_c)) +
geom_point(alpha = 0.3)
Fit a GAM to spatial data (analogous to spatial-only model in sdmTMB)
Add UTM columns, log depth, and year as factor
dat <- add_utm_columns(dat,
ll_crs = 4326,
ll_names = c("longitude_dd", "latitude_dd"))
dat$log_depth <- log(dat$depth_m)
dat$year_factor <- as.factor(dat$year)
dat[,c("X","Y")]
range(dat$X)
range(dat$Y)
ggplot(dat, aes(X, Y, size = cpue_kgkm2)) +
geom_point(shape = 21) +
coord_fixed()
Temporal effect plus spatial smoother, no covariates
First, fit a GAM without any covariates: just year as a factor and a
spatial smoother \(s(X,Y)\). The
as.factor(year)
part is a common component of SDMs that are
being used to generate indices.
start.time <- Sys.time()
fit_gam <- gam(
formula = cpue_kgha ~ as.factor(year) +
s(X, Y, k = 50),
family = tw(link = "log"),
data = dat
)
cat(
"The GAM took ",
difftime(Sys.time(), start.time, units = "mins"),
" mins to run"
)
Get diagnostics and perform model checking:
gam.check(fit_gam)
Include a 2-D smooth over space, and depth as a covariate.
Fit a GAM analogous to a spatiotemporal model in sdmTMB
NOTE: This takes a long time to fit. Include a 2-D smooth over space for each year. In this case, there a different 2-D smooth for each year. This will take much longer to fit than the other two.
start.time <- Sys.time()
fit_gam_st <- gam(
formula = count ~ as.factor(year) +
s(X, Y, by = as.factor(year)),
family = tw(link = "log"),
data = dat
)
cat(
"The GAM took ",
difftime(Sys.time(), start.time, units = "mins"),
"mins to run"
)
Get diagnostics and perform model checking
gam.check(fit_gam_st)
Review console output to help verify convergence, and whether there were an adequate number of basis functions (k).
Examine the four diagnostic plots. Each of these gives a different way of looking at your model residuals. On the top-left is a Q-Q plot, which compares the model residuals to the expected/assumed distribution family. A well-fit model’s residuals will be close to the 1-1 line, otherwise there may be under- or over-dispersion present. On bottom left is a histogram of residuals. We want this to have a shape similar to the distribution family we specified. On top-right is a plot of residual values as a function of the linear predictor. These should be evenly distributed around zero in a well-fitted model. Finally, on the bottom-right is plot of response against fitted values. A well-fitted model would show values near the 1-1 line.
Predict to full survey area with new data
Load the grid
Extrapolation grid for the EBS.
# This is another form of the same grid:
load(here::here("data/pred_grid_ebs.rda")) # object: pred_grid_ebs
#pred_grid_ebs <- read.csv(here::here("data/ebs_2022_epsg3338.csv"),header = TRUE)
get_crs(dat = pred_grid_ebs,ll_names =c("lon","lat"))
grid <- add_utm_columns(pred_grid_ebs,
#ll_crs = 32603,
ll_names = c("lon", "lat"))
range(grid$X)
# Grid from example
#wcvi_grid <- readRDS(here::here("data/wcvi-grid.rds"))
When you have spatiotemporal data, you need a grid for each year. There is a nice tidy little chunk of purrr code that will do that for you!
Predict CPUE across the grid
To get predicted
pred_gam <- predict(fit_gam, type = "response", newdata = grid) #This takes a long time.
pred_gam_df <- cbind(grid, pred_gam)
pred_gam_df$Shape_Area_ha <- pred_gam_df$Shape_Area*0.0001 # original Shape_area is in m^2
pred_gam_df$predicted_tot_grid <- pred_gam_df$Shape_Area_ha*pred_gam_df$pred_gam
Plot predictions over survey area. Note: Because the EBS grid is
irregularly spaced, you have to use geom_point()
here
instead of geom_tile()
or geom_raster()
.
pred_gam_df |>
filter(year==1999) |> # single year
ggplot(aes(X, Y, color=pred_gam)) +
geom_point(size = 0.1) +
scale_fill_viridis_c() +
#facet_wrap(~year) +
coord_fixed() +
labs(color = expression(CPUE (kg/km^2))) +
theme_light() +
theme(legend.position = "bottom")
Calculate biomass index from GAM via simulation
Generally, model-based indices of abundance use the area of each grid cell and the prediction for that area to calculate a total index across the survey area. The simplest way to do this is to sum the predicted biomasses across the full grid like so:
gam_idx_mt <- pred_gam_df |>
dplyr::group_by(year) |>
summarize(total_wt_mt = sum(predicted_tot_grid)/1000)
head(gam_idx_mt)
Get uncertainties for GAM-based index by sampling from posteriors
{gratia}
is an R package for evaluating and displaying GAM fits. This script uses
the fitted_samples()
function, which draws fitted values
from the posterior of a model using a Gaussian approximation.
sims <- gratia::fitted_samples(fit_gam, n=10, data=grid,
scale="response", seed=9)
sims$year <- grid$year[sims$row]
sims$area <- rep(pred_gam_df$Shape_Area_ha, times = 10) # matching the # of draws
sims$biomass <- sims$fitted * sims$area # expand from density to biomass, given area
level <- 0.95 # specify probability for confidence interval
# Get sum of simulated biomass (density*area) across grid cells, with CI
lwr_fn <- function(x) {as.numeric(quantile(x, probs = (1 - level) / 2))}
upr_fn <- function(x) {as.numeric(quantile(x, probs = 1 - (1 - level) / 2))}
sims_sum <- sims %>%
group_by(year,draw) %>%
summarise_at("biomass", list(biomass = sum)) %>%
group_by(year) %>%
summarise_at("biomass", list(est = median, # could use mean
lwr = lwr_fn,
upr = upr_fn))
Index standardization using sdmTMB
To calculate an index from any of these models, we need to run the
predict.sdmTMB()
function with the argument
return_tmb_object = TRUE
. We can then run the
get_index()
function to extract the total biomass
calculations and standard errors.
We can set the area argument to our cell_area
column in
km2. In this case the value is 4 km2 for all of
the cells, since our grid cells are 2 km x 2 km. If some grid cells were
not fully in the survey domain (or were on land), we could feed a vector
of grid areas to the area argument that matched the number of grid
cells. Because the density units are tonnes per km2 for this
data, the index is in tonnes.
p <- predict(fit, newdata = grid, return_tmb_object = TRUE)
index <- get_index(p, area = grid$cell_area, bias_correct = FALSE)
ggplot(index, aes(year, est)) +
geom_line() +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4) +
xlab("Year") +
ylab("Biomass estimate (tonnes)")
We used bias_correction = FALSE
to speed things up, but
for any final result you will want to use the bias correction. Let’s see
how much the scale of the index changes with bias correction.
index_c <- get_index(p, area = grid$cell_area, bias_correct = TRUE)
index_c$Method <- "Bias correction"
bind_rows(index, index_c) %>%
ggplot(aes(year, est, fill = Method)) +
geom_line(aes(colour = Method)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4) +
xlab("Year") +
ylab("Biomass estimate (tonnes)")
Calculate biomass index from GAM via simulation
sims <- gratia::fitted_samples(fit_gam, n=10, newdata=grid,
scale="response", seed=9)
sims$year <- grid$year[sims$row]
sims$biomass <- sims$fitted * 4 # expand from density to biomass, given area
level <- 0.95 # specify probability for confidence interval
# Get sum of simulated biomass (density*area) across grid cells, with CI
lwr_fn <- function(x) {as.numeric(quantile(x, probs = (1 - level) / 2))}
upr_fn <- function(x) {as.numeric(quantile(x, probs = 1 - (1 - level) / 2))}
sims_sum <- sims %>%
group_by(year,draw) %>%
summarise_at("biomass", list(biomass = sum)) %>%
group_by(year) %>%
summarise_at("biomass", list(est = median, # could use mean
lwr = lwr_fn,
upr = upr_fn))
Note that this approach uses a Gaussian approximation to the posterior, which is all that is implemented currently in the gratia package. However, a better estimate of uncertainty could be derived from sampling from the actual posterior distribution. However, this is beyond the scope of today’s lesson.
Plot the biomass index: