Title: | Site Occupancy Modeling for Environmental DNA Metabarcoding |
---|---|
Description: | Fits multispecies site occupancy models to environmental DNA metabarcoding data collected using spatially-replicated survey design. Model fitting results can be used to evaluate and compare the effectiveness of species detection to find an efficient survey design. Reference: Fukaya et al. (2022) <doi:10.1111/2041-210X.13732>. |
Authors: | Keiichi Fukaya [aut, cre], Ken Kellner [cph] (summary method for occumbFit class), Mika Takahashi [aut] |
Maintainer: | Keiichi Fukaya <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.0.9008 |
Built: | 2024-11-05 08:23:54 UTC |
Source: | https://github.com/fukayak/occumb |
eval_util_L()
evaluates the expected utility of a local
species diversity assessment by using Monte Carlo integration.
eval_util_L( settings, fit = NULL, z = NULL, theta = NULL, phi = NULL, N_rep = 1, cores = 1L )
eval_util_L( settings, fit = NULL, z = NULL, theta = NULL, phi = NULL, N_rep = 1, cores = 1L )
settings |
A data frame that specifies a set of conditions under which
utility is evaluated. It must include columns named |
fit |
An |
z |
Sample values of site occupancy status of species stored in an array
with sample |
theta |
Sample values of sequence capture probabilities of species
stored in a matrix with sample |
phi |
Sample values of sequence relative dominance of species stored in
a matrix with sample |
N_rep |
Controls the sample size for the Monte Carlo integration.
The integral is evaluated using |
cores |
The number of cores to use for parallelization. |
The utility of local species diversity assessment for a given set of sites
can be defined as the expected number of detected species per site
(Fukaya et al. 2022). eval_util_L()
evaluates this utility for arbitrary
sets of sites that can potentially have different values for site occupancy
status of species, , sequence capture probabilities of species,
, and sequence relative dominance of species,
, for the combination of
K
and N
values specified in the
conditions
argument.
Such evaluations can be used to balance K
and N
to maximize the utility
under a constant budget (possible combinations of K
and N
under a
specified budget and cost values are easily obtained using list_cond_L()
;
see the example below).
It is also possible to examine how the utility varies with different K
and N
values without setting a budget level, which may be useful for determining
a satisfactory level of K
and N
from a purely technical point of view.
The expected utility is defined as the expected value of the conditional utility in the form:
where is a latent indicator variable representing
the inclusion of the sequence of species
in replicate
at site
, and
is a latent variable that
is proportional to the relative frequency of the sequence of species
, conditional on its presence in replicate
at site
(Fukaya et al. 2022).
Expectations are taken with respect to the posterior (or possibly prior)
predictive distributions of
and
, which are evaluated numerically using
Monte Carlo integration. The predictive distributions of
and
depend on the model
parameters
,
, and
values.
Their posterior (or prior) distribution is specified by supplying an
occumbFit
object containing their posterior samples via the fit
argument,
or by supplying a matrix or array of posterior (or prior) samples of
parameter values via the z
, theta
, and phi
arguments. Higher
approximation accuracy can be obtained by increasing the value of N_rep
.
The eval_util_L()
function can be executed by supplying the fit
argument
without specifying the z
, theta
, and phi
arguments, by supplying the
three z
, theta
, and phi
arguments without the fit
argument, or by
supplying the fit
argument and any or all of the z
, theta
, and phi
arguments. If z
, theta
, or phi
arguments are specified in addition
to the fit
, the parameter values given in these arguments are used
preferentially to evaluate the expected utility. If the sample sizes differ among
parameters, parameters with smaller sample sizes are resampled with
replacements to align the sample sizes across parameters.
The expected utility is evaluated assuming homogeneity of replicates, in the
sense that and
, the model parameters
associated with the species detection process, are constant across
replicates within a site. For this reason,
eval_util_L()
does not accept
replicate-specific and
. If the
occumbFit
object supplied in the fit
argument has a replicate-specific
parameter, the parameter samples to be used in the utility evaluation must be
provided explicitly via the theta
or phi
arguments.
The Monte Carlo integration is executed in parallel on multiple CPU cores, where
the cores
argument controls the degree of parallelization.
A data frame with a column named Utility
in which the estimates of the
expected utility are stored. This is obtained by adding the Utility
column
to the data frame provided in the settings
argument.
K. Fukaya, N. I. Kondo, S. S. Matsuzaki and T. Kadoya (2022) Multispecies site occupancy modelling and study design for spatially replicated environmental DNA metabarcoding. Methods in Ecology and Evolution 13:183–193. doi:10.1111/2041-210X.13732
set.seed(1) # Generate a random dataset (20 species * 2 sites * 2 reps) I <- 20 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K))) # Fitting a null model fit <- occumb(data = data) ## Estimate expected utility # Arbitrary K and N values (util1 <- eval_util_L(expand.grid(K = 1:3, N = c(1E3, 1E4, 1E5)), fit)) # K and N values under specified budget and cost (util2 <- eval_util_L(list_cond_L(budget = 1E5, lambda1 = 0.01, lambda2 = 5000, fit), fit)) # K values restricted (util3 <- eval_util_L(list_cond_L(budget = 1E5, lambda1 = 0.01, lambda2 = 5000, fit, K = 1:5), fit)) # theta and phi values supplied (util4 <- eval_util_L(list_cond_L(budget = 1E5, lambda1 = 0.01, lambda2 = 5000, fit, K = 1:5), fit, theta = array(0.5, dim = c(4000, I, J)), phi = array(1, dim = c(4000, I, J)))) # z, theta, and phi values, but no fit object supplied (util5 <- eval_util_L(list_cond_L(budget = 1E5, lambda1 = 0.01, lambda2 = 5000, fit, K = 1:5), fit = NULL, z = array(1, dim = c(4000, I, J)), theta = array(0.5, dim = c(4000, I, J)), phi = array(1, dim = c(4000, I, J))))
set.seed(1) # Generate a random dataset (20 species * 2 sites * 2 reps) I <- 20 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K))) # Fitting a null model fit <- occumb(data = data) ## Estimate expected utility # Arbitrary K and N values (util1 <- eval_util_L(expand.grid(K = 1:3, N = c(1E3, 1E4, 1E5)), fit)) # K and N values under specified budget and cost (util2 <- eval_util_L(list_cond_L(budget = 1E5, lambda1 = 0.01, lambda2 = 5000, fit), fit)) # K values restricted (util3 <- eval_util_L(list_cond_L(budget = 1E5, lambda1 = 0.01, lambda2 = 5000, fit, K = 1:5), fit)) # theta and phi values supplied (util4 <- eval_util_L(list_cond_L(budget = 1E5, lambda1 = 0.01, lambda2 = 5000, fit, K = 1:5), fit, theta = array(0.5, dim = c(4000, I, J)), phi = array(1, dim = c(4000, I, J)))) # z, theta, and phi values, but no fit object supplied (util5 <- eval_util_L(list_cond_L(budget = 1E5, lambda1 = 0.01, lambda2 = 5000, fit, K = 1:5), fit = NULL, z = array(1, dim = c(4000, I, J)), theta = array(0.5, dim = c(4000, I, J)), phi = array(1, dim = c(4000, I, J))))
eval_util_R()
evaluates the expected utility of a regional
species diversity assessment using Monte Carlo integration.
eval_util_R( settings, fit = NULL, psi = NULL, theta = NULL, phi = NULL, N_rep = 1, cores = 1L )
eval_util_R( settings, fit = NULL, psi = NULL, theta = NULL, phi = NULL, N_rep = 1, cores = 1L )
settings |
A data frame that specifies a set of conditions under which
utility is evaluated. It must include columns named |
fit |
An |
psi |
Sample values of the site occupancy probabilities of species
stored in a matrix with sample |
theta |
Sample values of sequence capture probabilities of species
stored in a matrix with sample |
phi |
Sample values of sequence relative dominance of species stored in
a matrix with sample |
N_rep |
Controls the sample size for the Monte Carlo integration.
The integral is evaluated using a total of |
cores |
The number of cores to use for parallelization. |
The utility of a regional species diversity assessment can be defined as
the number of species expected to be detected in the region of interest
(Fukaya et al. 2022). eval_util_R()
evaluates this utility for the region
modeled in the occumbFit
object for the combination of J
, K
, and N
values specified in the conditions
argument.
Such evaluations can be used to balance J
, K
, and N
to maximize the
utility under a constant budget (possible combinations of J
, K
, and N
under a specified budget and cost values are easily obtained using
list_cond_R()
; see the example below).
It is also possible to examine how the utility varies with different J
,
K
, and N
values without setting a budget level, which may be useful in determining
the satisfactory levels of J
, K
, and N
from a purely technical point of
view.
The expected utility is defined as the expected value of the conditional utility in the form:
where is a latent indicator variable representing
the inclusion of the sequence of species
in replicate
at site
, and
is a latent variable that
is proportional to the relative frequency of the sequence of species
, conditional on its presence in replicate
at site
(Fukaya et al. 2022).
Expectations are taken with respect to the posterior (or possibly prior)
predictive distributions of
and
, which are evaluated numerically using
Monte Carlo integration. The predictive distributions of
and
depend on the model
parameters
,
, and
values.
Their posterior (or prior) distribution is specified by supplying an
occumbFit
object containing their posterior samples via the fit
argument,
or by supplying a matrix or array of posterior (or prior) samples of
parameter values via the psi
, theta
, and phi
arguments. Higher
approximation accuracy can be obtained by increasing the value of N_rep
.
The eval_util_R()
function can be executed by supplying the fit
argument
without specifying the psi
, theta
, and phi
arguments, by supplying the
three psi
, theta
, and phi
arguments without the fit
argument, or by
supplying the fit
argument and any or all of the psi
, theta
, and phi
arguments. If the psi
, theta
, or phi
arguments are specified in addition
to the fit
, the parameter values given in these arguments are preferentially
used to evaluate the expected utility. If the sample sizes differed among
parameters, parameters with smaller sample sizes are resampled with
replacements to align the sample sizes across parameters.
The expected utility is evaluated assuming homogeneity of replicates, in the
sense that and
, the model parameters
associated with the species detection process, are constant across
replicates within a site. For this reason,
eval_util_R()
does not accept
replicate-specific and
. If the
occumbFit
object supplied in the fit
argument has a replicate-specific
parameter, the parameter samples to be used in the utility evaluation must be
provided explicitly via the theta
or phi
arguments.
If the parameters are modeled as a function of site covariates in the fit
object, or if the psi
, theta
, and/or phi
arguments have site dimensions,
the expected utility is evaluated to account for the site heterogeneity of
the parameters. To incorporate site heterogeneity, the
parameter values for each J
site are determined by selecting site-specific
parameter values in the fit
, or those supplied in psi
, theta
, and phi
via random sampling with replacement. Thus, expected utility is
evaluated by assuming a set of supplied parameter values as a statistical
population of site-specific parameters.
The Monte Carlo integration is executed in parallel on multiple CPU cores, where
the cores
argument controls the degree of parallelization.
A data frame with a column named Utility
in which the estimates of
the expected utility are stored. This is obtained by adding the Utility
column
to the data frame provided in the settings
argument.
K. Fukaya, N. I. Kondo, S. S. Matsuzaki and T. Kadoya (2022) Multispecies site occupancy modelling and study design for spatially replicated environmental DNA metabarcoding. Methods in Ecology and Evolution 13:183–193. doi:10.1111/2041-210X.13732
set.seed(1) # Generate a random dataset (20 species * 2 sites * 2 reps) I <- 20 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K))) # Fitting a null model fit <- occumb(data = data) ## Estimate expected utility # Arbitrary J, K, and N values (util1 <- eval_util_R(expand.grid(J = 1:3, K = 1:3, N = c(1E3, 1E4, 1E5)), fit)) # J, K, and N values under specified budget and cost (util2 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000), fit)) # K values restricted (util3 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000, K = 1:5), fit)) # J and K values restricted (util4 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000, J = 1:3, K = 1:5), fit)) # theta and phi values supplied (util5 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000, J = 1:3, K = 1:5), fit, theta = array(0.5, dim = c(4000, I, J)), phi = array(1, dim = c(4000, I, J)))) # psi, theta, and phi values, but no fit object supplied (util6 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000, J = 1:3, K = 1:5), fit = NULL, psi = array(0.9, dim = c(4000, I, J)), theta = array(0.9, dim = c(4000, I, J)), phi = array(1, dim = c(4000, I, J))))
set.seed(1) # Generate a random dataset (20 species * 2 sites * 2 reps) I <- 20 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K))) # Fitting a null model fit <- occumb(data = data) ## Estimate expected utility # Arbitrary J, K, and N values (util1 <- eval_util_R(expand.grid(J = 1:3, K = 1:3, N = c(1E3, 1E4, 1E5)), fit)) # J, K, and N values under specified budget and cost (util2 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000), fit)) # K values restricted (util3 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000, K = 1:5), fit)) # J and K values restricted (util4 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000, J = 1:3, K = 1:5), fit)) # theta and phi values supplied (util5 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000, J = 1:3, K = 1:5), fit, theta = array(0.5, dim = c(4000, I, J)), phi = array(1, dim = c(4000, I, J)))) # psi, theta, and phi values, but no fit object supplied (util6 <- eval_util_R(list_cond_R(budget = 50000, lambda1 = 0.01, lambda2 = 5000, lambda3 = 5000, J = 1:3, K = 1:5), fit = NULL, psi = array(0.9, dim = c(4000, I, J)), theta = array(0.9, dim = c(4000, I, J)), phi = array(1, dim = c(4000, I, J))))
A dataset of fish eDNA metabarcoding collected in the Kasumigaura watershed, Japan.
fish
fish
An occumbData class object containing the sequence read count y
,
a species covariate mismatch
, and a site covariate riverbank
.
mismatch
represents the total number of mismatched bases in the
priming region of the forward and reverse primers for each species.
riverbank
indicates whether the riverbank at each site lacks aquatic
and riparian vegetation.
Sequence reads were obtained from three replicates (collected from the
center of the river and near the left and right riverbanks)
from 50 sites across the watershed, of which read counts from six samples were
missing.
The resulting sequence counts of 50 freshwater fish taxa were recorded.
K. Fukaya, N. I. Kondo, S. S. Matsuzaki, T. Kadoya (2021) Data from: Multispecies site occupancy modeling and study design for spatially replicated environmental DNA metabarcoding. Dryad Digital Repository. doi:10.5061/dryad.3bk3j9kkm
A dataset of fish eDNA metabarcoding collected in the Kasumigaura watershed, Japan.
fish_raw
fish_raw
A list containing an array of sequence read count, y
,
a vector of the total number of mismatched bases in the priming region of
the forward and reverse primers for each species, mismatch
,
and a factor indicating whether the riverbank at each site lacked aquatic
and riparian vegetation, riverbank
.
Sequence reads were obtained from three replicates (collected from the
center of the river and near the left and right riverbanks)
from 50 sites across the watershed, of which read counts from six samples were
missing.
The resulting sequence counts of 50 freshwater fish taxa detected were
recorded.
K. Fukaya, N. I. Kondo, S. S. Matsuzaki, T. Kadoya (2021) Data from: Multispecies site occupancy modeling and study design for spatially replicated environmental DNA metabarcoding. Dryad Digital Repository. doi:10.5061/dryad.3bk3j9kkm
get_post_samples()
extracts posterior samples of the specified
parameters from a model-fit object.
get_post_summary()
extracts posterior summary of the specified
parameters from a model-fit object.
get_post_samples( fit, parameter = c("z", "pi", "phi", "theta", "psi", "alpha", "beta", "gamma", "alpha_shared", "beta_shared", "gamma_shared", "Mu", "sigma", "rho") ) get_post_summary( fit, parameter = c("z", "pi", "phi", "theta", "psi", "alpha", "beta", "gamma", "alpha_shared", "beta_shared", "gamma_shared", "Mu", "sigma", "rho") )
get_post_samples( fit, parameter = c("z", "pi", "phi", "theta", "psi", "alpha", "beta", "gamma", "alpha_shared", "beta_shared", "gamma_shared", "Mu", "sigma", "rho") ) get_post_summary( fit, parameter = c("z", "pi", "phi", "theta", "psi", "alpha", "beta", "gamma", "alpha_shared", "beta_shared", "gamma_shared", "Mu", "sigma", "rho") )
fit |
An |
parameter |
A string of parameter name. See Details for possible choices and corresponding parameters. |
The functions return posterior samples or a summary of one of the
following parameters in the model, stored in the model-fit object
fit
:
z
Site occupancy status of species.
pi
Multinomial probabilities of species sequence read counts.
phi
Sequence relative dominance of species.
theta
Sequence capture probabilities of species.
psi
Site occupancy probabilities of species.
alpha
Species-specific effects on sequence relative dominance
(phi
).
beta
Species-specific effects on sequence capture
probabilities (theta
).
gamma
Species-specific effects on site occupancy
probabilities (psi
).
alpha_shared
Effects on sequence relative dominance
(phi
) common across species.
beta_shared
Effects on sequence capture probabilities
(theta
) that are common across species.
gamma_shared
Effects on site occupancy probabilities
(psi
) that are common across species.
Mu
Community-level averages of species-specific effects
(alpha
, beta
, gamma
).
sigma
Standard deviations of species-specific effects
(alpha
, beta
, gamma
).
rho
Correlation coefficients of the species-specific effects
(alpha
, beta
, gamma
).
See the package vignette for details of these parameters.
The parameter may have dimensions corresponding to species, sites,
replicates, and effects (covariates) and the dimension
and label
attributes are added to the output object to inform these dimensions.
If the sequence read count data y
have species, site, or replicate
names appended as the dimnames
attribute (see Details in
occumbData()
), they are copied into the label
attribute of the returned object.
get_post_samples()
returns a vector, matrix, or array of posterior
samples for a selected parameter.
get_post_summary()
returns a table (matrix) of the posterior summary
of the selected parameters. The elements of the posterior summary are the
same as those obtained with the jags()
function in the
jagsUI
package: they include the mean, standard deviation, percentiles
of posterior samples; the Rhat
statistic; the effective sample size,
n.eff
; overlap0
, which checks if 0 falls in the parameter's 95%
credible interval; and the proportion of the posterior with the same sign
as the mean, f
.
The dimension
and label
attributes of the output object
provide information regarding the dimensions of the parameter.
# Generate the smallest random dataset (2 species * 2 sites * 2 reps) I <- 2 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates y_named <- array(sample.int(I * J * K), dim = c(I, J, K)) dimnames(y_named) <- list(c("species 1", "species 2"), c("site 1", "site 2"), NULL) data_named <- occumbData(y = y_named) # Fitting a null model fit <- occumb(data = data_named, n.iter = 10100) # Extract posterior samples (post_sample_z <- get_post_samples(fit, "z")) # Look dimensions of the parameter attributes(post_sample_z) # Extract posterior summary (post_summary_z <- get_post_summary(fit, "z")) # Look dimensions of the parameter attributes(post_summary_z)
# Generate the smallest random dataset (2 species * 2 sites * 2 reps) I <- 2 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates y_named <- array(sample.int(I * J * K), dim = c(I, J, K)) dimnames(y_named) <- list(c("species 1", "species 2"), c("site 1", "site 2"), NULL) data_named <- occumbData(y = y_named) # Fitting a null model fit <- occumb(data = data_named, n.iter = 10100) # Extract posterior samples (post_sample_z <- get_post_samples(fit, "z")) # Look dimensions of the parameter attributes(post_sample_z) # Extract posterior summary (post_summary_z <- get_post_summary(fit, "z")) # Look dimensions of the parameter attributes(post_summary_z)
gof()
calculates omnibus discrepancy measures and
their Bayesian -values for the fitted model using the posterior
predictive check approach.
gof( fit, stats = c("Freeman_Tukey", "deviance", "chi_squared"), cores = 1L, plot = TRUE, ... )
gof( fit, stats = c("Freeman_Tukey", "deviance", "chi_squared"), cores = 1L, plot = TRUE, ... )
fit |
An |
stats |
The discrepancy statistics to be applied. |
cores |
The number of cores to use for parallelization. |
plot |
Logical, determine if draw scatter plots of the fit statistics. |
... |
Additional arguments passed to the default |
A discrepancy statistic for the fitted model is obtained using a posterior predictive checking procedure. The following statistics are currently available:
where ,
, and
are the subscripts of species, site, and
replicate, respectively,
is sequence read count data,
is multinomial cell probabilities of sequence read
counts,
is expected
value of the sequence read counts conditional on their cell probabilities,
and
is the multinomial log-likelihood of the sequence read counts in replicate
of site
conditional on their cell probabilities.
The Bayesian -value is estimated as the probability that the value of
the discrepancy statistics of the replicated dataset is more extreme than that
of the observed dataset.
An extreme Bayesian
-value may indicate inadequate model fit.
See Gelman et al. (2014), Kéry and Royle (2016), and
Conn et al. (2018) for further details on the procedures used for posterior
predictive checking.
Computations can be run in parallel on multiple CPU cores where the cores
argument controls the degree of parallelization.
A list with the following named elements:
stats
The discrepancy statistics applied.
p_value
Bayesian -value.
stats_obs
Discrepancy statistics for observed data.
stats_rep
Discrepancy statistics for repeated data.
P. B. Conn, D. S. Johnson, P. J. Williams, S. R. Melin and M. B. Hooten. (2018) A guide to Bayesian model checking for ecologists. Ecological Monographs 88:526–542. doi:10.1002/ecm.1314
A. Gelman, J. B. Carlin, H. S. Stern D. B. Dunson, A. Vehtari and D. B. Rubin (2013) Bayesian Data Analysis. 3rd edition. Chapman and Hall/CRC. http://www.stat.columbia.edu/~gelman/book/
M. Kéry and J. A. Royle (2016) Applied Hierarchical Modeling in Ecology — Analysis of Distribution, Abundance and Species Richness in R and BUGS. Volume 1: Prelude and Static Models. Academic Press. https://www.mbr-pwrc.usgs.gov/pubanalysis/keryroylebook/
# Generate the smallest random dataset (2 species * 2 sites * 2 reps) I <- 2 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K)), spec_cov = list(cov1 = rnorm(I)), site_cov = list(cov2 = rnorm(J), cov3 = factor(1:J)), repl_cov = list(cov4 = matrix(rnorm(J * K), J, K))) # Fitting a null model fit <- occumb(data = data) # Goodness-of-fit assessment gof_result <- gof(fit) gof_result
# Generate the smallest random dataset (2 species * 2 sites * 2 reps) I <- 2 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K)), spec_cov = list(cov1 = rnorm(I)), site_cov = list(cov2 = rnorm(J), cov3 = factor(1:J)), repl_cov = list(cov4 = matrix(rnorm(J * K), J, K))) # Fitting a null model fit <- occumb(data = data) # Goodness-of-fit assessment gof_result <- gof(fit) gof_result
list_cond_L()
constructs a list of possible local species
diversity assessment conditions under the specified budget and cost values.
list_cond_L(budget, lambda1, lambda2, fit, K = NULL)
list_cond_L(budget, lambda1, lambda2, fit, K = NULL)
budget |
A numeric specifying budget amount. The currency unit
is arbitrary but must be consistent with that of |
lambda1 |
A numeric specifying the cost per sequence read for
high-throughput sequencing. The currency unit is arbitrary but must be
consistent with that of |
lambda2 |
A numeric specifying the cost per replicate for library
preparation. The currency unit is arbitrary but must be consistent with that
of |
fit |
An |
K |
An optional vector for manually specifying the number of replicates. |
This function can generate a data frame object to be given to the
settings
argument of eval_util_L()
; see Examples of eval_util_L()
.
By default, it outputs a list of all feasible combinations of values for the
number of replicates per site K
and the sequencing depth per replicate
N
based on the given budget, cost values, and number of sites
(identified by reference to the fit
object). The resulting N
can be a
non-integer because it is calculated simply by assuming that the maximum
value can be obtained. To obtain a list for only a subset of the
possible K
values under a given budget and cost value, the K
argument is used to provide a vector of the desired K
values.
A data frame containing columns named budget
, lambda1
, lambda2
,
K
, and N
.
list_cond_R()
constructs a list of possible regional species
diversity assessment conditions under the specified budget and cost values.
list_cond_R(budget, lambda1, lambda2, lambda3, J = NULL, K = NULL)
list_cond_R(budget, lambda1, lambda2, lambda3, J = NULL, K = NULL)
budget |
A numeric specifying budget amount. The currency unit is
arbitrary but must be consistent with that of |
lambda1 |
A numeric specifying the cost per sequence read for
high-throughput sequencing. The currency unit is arbitrary but must be
consistent with that of |
lambda2 |
A numeric specifying the cost per replicate for library
preparation. The currency unit is arbitrary but must be consistent with that
of |
lambda3 |
A numeric specifying the visiting cost per site. The currency
unit is arbitrary but must be consistent with that of |
J |
An optional vector for manually specifying the number of sites |
K |
An optional vector used to specify the number of replicates manually.
For computational convenience, the |
This function can generate a data frame object to be given to the
settings
argument of eval_util_R()
; see Examples of eval_util_R()
.
By default, it outputs a list of all feasible combinations of values for the
number of sites J
, number of replicates per site K
, and sequencing
depth per replicate N
based on the given budget and cost values.
The resulting N
can be a non-integer because it is calculated simply by
assuming that the maximum value can be obtained.
If one wants to obtain a list for only a subset of the possible values of J
and K
under a given budget and cost value, use the J
and/or K
arguments (in fact, it is recommended that a relatively small number of K
values be specified using the K
argument because the list of all
conditions achievable under moderate budget and cost values can be large, and
it is rarely practical to have a vast number of replicates per site). If a
given combination of J
and K
values is not feasible under the specified
budget and cost values, the combination will be ignored and excluded from
the output.
A data frame containing columns named budget
, lambda1
, lambda2
,
lambda3
, J
, K
, and N
.
occumb()
fits the multispecies site-occupancy model for eDNA
metabarcoding (Fukaya et al. 2022) and returns a model-fit object containing
posterior samples.
occumb( formula_phi = ~1, formula_theta = ~1, formula_psi = ~1, formula_phi_shared = ~1, formula_theta_shared = ~1, formula_psi_shared = ~1, prior_prec = 1e-04, prior_ulim = 10000, data, n.chains = 4, n.adapt = NULL, n.burnin = 10000, n.thin = 10, n.iter = 20000, parallel = FALSE, ... )
occumb( formula_phi = ~1, formula_theta = ~1, formula_psi = ~1, formula_phi_shared = ~1, formula_theta_shared = ~1, formula_psi_shared = ~1, prior_prec = 1e-04, prior_ulim = 10000, data, n.chains = 4, n.adapt = NULL, n.burnin = 10000, n.thin = 10, n.iter = 20000, parallel = FALSE, ... )
formula_phi |
A right-hand side formula describing species-specific
effects of sequence relative dominance ( |
formula_theta |
A right-hand side formula describing species-specific
effects of sequence capture probability ( |
formula_psi |
A right-hand side formula describing species-specific
effects of occupancy probability ( |
formula_phi_shared |
A right-hand side formula describing effects of
sequence relative dominance ( |
formula_theta_shared |
A right-hand side formula describing effects of
sequence capture probability ( |
formula_psi_shared |
A right-hand side formula describing effects of
occupancy probability ( |
prior_prec |
Precision of normal prior distribution for the community-level average of species-specific parameters and effects common across species. |
prior_ulim |
Upper limit of uniform prior distribution for the standard deviation of species-specific parameters. |
data |
A dataset supplied as an |
n.chains |
Number of Markov chains to run. |
n.adapt |
Number of iterations to run in the JAGS adaptive phase. |
n.burnin |
Number of iterations at the beginning of the chain to discard. |
n.thin |
Thinning rate. Must be a positive integer. |
n.iter |
Total number of iterations per chain (including burn-in). |
parallel |
If TRUE, run MCMC chains in parallel on multiple CPU cores. |
... |
Additional arguments passed to |
occumb()
allows the fitting of a range of multispecies site
occupancy models, including covariates at different levels of the data
generation process.
The most general form of the model can be written as follows (the notation
follows that of the original article; see References).
Sequence read counts:
Relative frequency of species sequences:
Capture of species sequences:
Site occupancy of species:
where the variations of ,
, and
are modeled by specifying model formulas in
formula_phi
, formula_theta
, formula_psi
,
formula_phi_shared
, formula_theta_shared
, and
formula_psi_shared.
Each parameter may have species-specific effects and effects that are common
across species, where the former is specified by formula_phi
,
formula_theta
, and formula_psi
, whereas
formula_phi_shared
, formula_theta_shared
, and
formula_psi_shared
specify the latter.
As species-specific intercepts are specified by default, the intercept
terms in formula_phi_shared
, formula_theta_shared
, and
formula_psi_shared
are always ignored.
Covariate terms must be found in the names of the list elements stored
in the spec_cov
, site_cov
, or repl_cov
slots in the dataset
object provided with the data
argument.
Covariates are modeled using the log link function for
and logit link function for
and
The two arguments, prior_prec
and prior_ulim
, control the
prior distribution of parameters. For the community-level average of
species-specific effects and effects common across species, a normal
prior distribution with a mean of 0 and precision (i.e., the inverse of the
variance) prior_prec
is specified. For the standard deviation of
species-specific effects, a uniform prior distribution with a lower limit of
zero and an upper limit of prior_ulim
is specified. For the correlation
coefficient of species-specific effects, a uniform prior distribution in the
range of 1 to 1 is specified by default.
See the package vignette
for details on the model specifications in occumb()
.
The data
argument requires a dataset object to be generated using
ocumbData()
; see the document of occumbData()
.
The model is fit using the jags()
function of the
jagsUI package, where
Markov chain Monte Carlo methods are used to
obtain posterior samples of the parameters and latent variables.
Arguments n.chains
, n.adapt
, n.burnin
, n.thin
,
n.iter
, and parallel
are passed on to arguments of the
same name in the jags()
function.
See the document of jagsUI's
jags()
function for details.
An S4 object of the occumbFit
class containing the results of
the model fitting and the supplied dataset.
K. Fukaya, N. I. Kondo, S. S. Matsuzaki and T. Kadoya (2022) Multispecies site occupancy modelling and study design for spatially replicated environmental DNA metabarcoding. Methods in Ecology and Evolution 13:183–193. doi:10.1111/2041-210X.13732
# Generate the smallest random dataset (2 species * 2 sites * 2 reps) I <- 2 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K)), spec_cov = list(cov1 = rnorm(I)), site_cov = list(cov2 = rnorm(J), cov3 = factor(1:J)), repl_cov = list(cov4 = matrix(rnorm(J * K), J, K))) # Fitting a null model (includes only species-specific intercepts) res0 <- occumb(data = data) # Add species-specific effects of site covariates in occupancy probabilities res1 <- occumb(formula_psi = ~ cov2, data = data) # Continuous covariate res2 <- occumb(formula_psi = ~ cov3, data = data) # Categorical covariate res3 <- occumb(formula_psi = ~ cov2 * cov3, data = data) # Interaction # Add species covariate in the three parameters # Note that species covariates are modeled as common effects res4 <- occumb(formula_phi_shared = ~ cov1, data = data) # phi res5 <- occumb(formula_theta_shared = ~ cov1, data = data) # theta res6 <- occumb(formula_psi_shared = ~ cov1, data = data) # psi # Add replicate covariates # Note that replicate covariates can only be specified for theta and phi res7 <- occumb(formula_phi = ~ cov4, data = data) # phi res8 <- occumb(formula_theta = ~ cov4, data = data) # theta # Specify the prior distribution and MCMC settings explicitly res9 <- occumb(data = data, prior_prec = 1E-2, prior_ulim = 1E2, n.chains = 1, n.burnin = 1000, n.thin = 1, n.iter = 2000) res10 <- occumb(data = data, parallel = TRUE, n.cores = 2) # Run MCMC in parallel
# Generate the smallest random dataset (2 species * 2 sites * 2 reps) I <- 2 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K)), spec_cov = list(cov1 = rnorm(I)), site_cov = list(cov2 = rnorm(J), cov3 = factor(1:J)), repl_cov = list(cov4 = matrix(rnorm(J * K), J, K))) # Fitting a null model (includes only species-specific intercepts) res0 <- occumb(data = data) # Add species-specific effects of site covariates in occupancy probabilities res1 <- occumb(formula_psi = ~ cov2, data = data) # Continuous covariate res2 <- occumb(formula_psi = ~ cov3, data = data) # Categorical covariate res3 <- occumb(formula_psi = ~ cov2 * cov3, data = data) # Interaction # Add species covariate in the three parameters # Note that species covariates are modeled as common effects res4 <- occumb(formula_phi_shared = ~ cov1, data = data) # phi res5 <- occumb(formula_theta_shared = ~ cov1, data = data) # theta res6 <- occumb(formula_psi_shared = ~ cov1, data = data) # psi # Add replicate covariates # Note that replicate covariates can only be specified for theta and phi res7 <- occumb(formula_phi = ~ cov4, data = data) # phi res8 <- occumb(formula_theta = ~ cov4, data = data) # theta # Specify the prior distribution and MCMC settings explicitly res9 <- occumb(data = data, prior_prec = 1E-2, prior_ulim = 1E2, n.chains = 1, n.burnin = 1000, n.thin = 1, n.iter = 2000) res10 <- occumb(data = data, parallel = TRUE, n.cores = 2) # Run MCMC in parallel
occumbData()
creates a data list compatible with the model fitting
function occumb()
.
occumbData(y, spec_cov = NULL, site_cov = NULL, repl_cov = NULL)
occumbData(y, spec_cov = NULL, site_cov = NULL, repl_cov = NULL)
y |
A 3-D array of sequence read counts ( |
spec_cov |
A named list of species covariates.
Each covariate can be a vector of continuous ( |
site_cov |
A named list of site covariates.
Each covariate can be a vector of continuous ( |
repl_cov |
A named list of replicate covariates.
Each covariate can be a matrix of continuous ( |
The element (i.e., covariate) names for spec_cov
, site_cov
, and
repl_cov
must all be unique.
If y
has a dimnames
attribute, it is retained in the resulting
occumbData
object, and can be referenced in subsequent analyses.
An S4 object of the occumbData
class.
# Generate the smallest random dataset (2 species * 2 sites * 2 reps) I <- 2 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K)), spec_cov = list(cov1 = rnorm(I)), site_cov = list(cov2 = rnorm(J), cov3 = factor(1:J)), repl_cov = list(cov4 = matrix(rnorm(J * K), J, K)) ) # A case for named y (with species and site names) y_named <- array(sample.int(I * J * K), dim = c(I, J, K)) dimnames(y_named) <- list(c("common species", "uncommon species"), c("good site", "bad site"), NULL) data_named <- occumbData( y = y_named, spec_cov = list(cov1 = rnorm(I)), site_cov = list(cov2 = rnorm(J), cov3 = factor(1:J)), repl_cov = list(cov4 = matrix(rnorm(J * K), J, K)) ) # A real data example data(fish_raw) fish <- occumbData( y = fish_raw$y, spec_cov = list(mismatch = fish_raw$mismatch), site_cov = list(riverbank = fish_raw$riverbank) ) # Get an overview of the datasets summary(data) summary(data_named) summary(fish)
# Generate the smallest random dataset (2 species * 2 sites * 2 reps) I <- 2 # Number of species J <- 2 # Number of sites K <- 2 # Number of replicates data <- occumbData( y = array(sample.int(I * J * K), dim = c(I, J, K)), spec_cov = list(cov1 = rnorm(I)), site_cov = list(cov2 = rnorm(J), cov3 = factor(1:J)), repl_cov = list(cov4 = matrix(rnorm(J * K), J, K)) ) # A case for named y (with species and site names) y_named <- array(sample.int(I * J * K), dim = c(I, J, K)) dimnames(y_named) <- list(c("common species", "uncommon species"), c("good site", "bad site"), NULL) data_named <- occumbData( y = y_named, spec_cov = list(cov1 = rnorm(I)), site_cov = list(cov2 = rnorm(J), cov3 = factor(1:J)), repl_cov = list(cov4 = matrix(rnorm(J * K), J, K)) ) # A real data example data(fish_raw) fish <- occumbData( y = fish_raw$y, spec_cov = list(mismatch = fish_raw$mismatch), site_cov = list(riverbank = fish_raw$riverbank) ) # Get an overview of the datasets summary(data) summary(data_named) summary(fish)
Applies jagsUI's
plot method to an occumbFit
object to draw trace plots
and density plots of MCMC samples of model parameters.
## S4 method for signature 'occumbFit' plot(x, y = NULL, ...)
## S4 method for signature 'occumbFit' plot(x, y = NULL, ...)
x |
An |
y |
|
... |
Additional arguments passed to the plot method for jagsUI object. |
Returns NULL
invisibly.
Draws a scatter plot of fit statistics.
## S4 method for signature 'occumbGof' plot(x, y = NULL, ...)
## S4 method for signature 'occumbGof' plot(x, y = NULL, ...)
x |
An |
y |
|
... |
Additional arguments passed to the default |
Returns NULL
invisibly.
Obtain predictions of parameters related to species occupancy
and detection from an occumbFit
model object.
## S4 method for signature 'occumbFit' predict( object, newdata = NULL, parameter = c("phi", "theta", "psi"), scale = c("response", "link"), type = c("quantiles", "mean", "samples") )
## S4 method for signature 'occumbFit' predict( object, newdata = NULL, parameter = c("phi", "theta", "psi"), scale = c("response", "link"), type = c("quantiles", "mean", "samples") )
object |
An |
newdata |
An optional |
parameter |
The parameter to be predicted. |
scale |
The scale on which the prediction is made.
|
type |
The type of prediction.
|
Applying predict()
to an occumbFit
object generates predictions
for the specified parameter (phi
, theta
, or psi
) based
on the estimated effects and the given covariates. It is important to
recognize that the predictions are specific to the individual species being
modeled since they depend on the estimated species-specific effects (i.e.,
alpha
, beta
, and gamma
; see
the package vignette for details).
When providing newdata
, it must thus be assumed that the set of
species contained in newdata
is the same as that of the data being
fitted.
Predictions are obtained as a matrix or array that can have dimensions
corresponding to statistics (or samples), species, sites, and replicates.
The dimension
and label
attributes are added to the output
object to inform these dimensions.
If the sequence read count data y
has species, site, or replicate
names appended as the dimnames
attribute (see Details in
occumbData()
), they will be copied into the label
attribute of the returned object.
Summarizes dataset stored in an occumbData
object.
## S4 method for signature 'occumbData' summary(object)
## S4 method for signature 'occumbData' summary(object)
object |
An |
Returns NULL
invisibly.
Summarizes model fitting result stored in an occumbFit
object.
## S4 method for signature 'occumbFit' summary(object)
## S4 method for signature 'occumbFit' summary(object)
object |
An |
Returns NULL
invisibly.