The purpose of this document is to introduce you how to use the MOSAIC application. This application is based on the R software1 and especially the rjags
library (version 4.10)2 andthe jagsUI
library (version 1.5.1)3, to estimate parameters of Toxico-Kinetic (TK) models under a Bayesian framework. MOSAIC is developed as an R-Shiny interface (version 1.6.0)4. If you want to be kept informed, please email us: sandrine.charles@univ-lyon1.fr.
=======
The purpose of this document is to introduce you how to use the MOSAIC application. This application is based on the R software1 and especially the rjags
library (version 4.10)2 andthe jagsUI
library (version 1.5.1)3, to estimate parameters of Toxico-Kinetic (TK) models under a Bayesian framework. MOSAIC is developed as an R-Shiny interface (version 1.5.0)4. If you want to be kept informed, please email us: sandrine.charles@univ-lyon1.fr.
>>>>>>> Gamma
The MOSAIC application is a turn-key web tool providing bioaccumulation factors (BCF/BMF/BSAF) from a TK model fitted to accumulation and depuration data. It is designed to fulfill the requirements of regulators when examining applications for market authorization of active substances.
Toxico-Kinetic/Toxico-Dynamic (TKTD) models are used to describe and predict the toxicity and the effects of chemical substances on individual traits based on experimental data. The TK part describes the relationship between the exposure medium and the organism, considering various processes such as ADME (accumulation, depuration, metabolization and excretion)5. Regulation No 283/2013 (EU)6 defines the data requirements for active substances of plant protection products in marketing authorization applications. In particular, a bioaccumulation study on fish is required following OECD guideline 3057. Achieved in agreement with EFSA’s scientific opinion on good modeling practices8,9, this ready-to-use on-line service allows to easily estimate BCF/BMF/BSAF as required in a regulatory framework, accounting for bioaccumulation of parent compounds and their metabolites through biotransformation. MOSAIC does not expect any input besides the accumulation-depuration data sets according to exposure concentrations. The service automatically fits the TK model, initially defined from the appropriately user data and optimizes the estimation of its parameters. Then, the service provides the corresponding bioaccumulation factors, as well as all goodness-of-fit criteria required to carefully check the reliability of the results10. All calculations are based on the JAGS software and its companion R package rjags
2,11.
If you use the web interface (https://mosaic.univ-lyon1.fr/bioacc), you don’t need to install anything.
However, if you want to run the R script (downloadable from the application) by yourself, you need to install:
rjags
package2. You can install it directly from the R software > Tools > Install Packages > rjags or from the CRAN website http://cran.r-project.org/web/packages/rjags/index.html.jagsUI
package3. You can install it directly from the R software > Tools > Install Packages > jagsUI or from the CRAN website https://cran.r-project.org/web/packages/jagsUI/index.html.tidyverse
, gridExtra
, ggmcmc
, GGally
, ggmcmc
, stringr
and DT
.Here is an example of the R code to install the required packages:
if(is.element('rjags', installed.packages()[,1]) == FALSE)
{install.packages('rjags')}
if(is.element('jagsUI', installed.packages()[,1]) == FALSE)
{install.packages('jagsUI')}
if(is.element('tidyverse', installed.packages()[,1]) == FALSE)
{install.packages('tidyverse')}
if(is.element('gridExtra', installed.packages()[,1]) == FALSE)
{install.packages('gridExtra')}
if(is.element('ggmcmc', installed.packages()[,1]) == FALSE)
{install.packages('ggmcmc')}
if(is.element('GGally', installed.packages()[,1]) == FALSE)
{install.packages('GGally')}
if(is.element('stringr', installed.packages()[,1]) == FALSE)
{install.packages('stringr')}
if(is.element('DT', installed.packages()[,1]) == FALSE)
{install.packages('DT')}
When using MOSAIC, the first step is to upload input data (Fig. 1.1):
You can upload your own data (click on “Browse”) by taking care about the format specification of your file. MOSAIC expects to receive data as a .txt file or a .csv file (comma, semicolon or tabular separator). Each line of the table corresponds to a time point for a given replicate and a given exposure concentration of the contaminant. The table must contain the four following columns, with exact header names (Table 1.1):
According to your data, further columns can be added in the file:
Then be careful to the units:
time | conc | expw | replicate |
---|---|---|---|
0 | 0.000 | 0.0044 | 2 |
3 | 0.225 | 0.0044 | 2 |
7 | 0.355 | 0.0044 | 2 |
14 | 0.553 | 0.0044 | 2 |
21 | 0.658 | 0.0044 | 2 |
28 | 0.785 | 0.0044 | 2 |
Once the upload is complete, you have to manually select the corresponding separator, the appropriate time unit and the duration of the accumulation phase.
Instead of using your own data, you can try MOSAIC from several example files that are provided. Three data sets use a simple TK model (one exposure route and one elimination process), whereas two data sets consider a more complex TK model, accounting for metabolization and growth (Fig. 1.2).
Please note that more data are, the more the TK model is complex, and the more the calculations take time. Thus, we indicated the approximative mean time calculation for each example data sets (Fig.1.2).
In case you upload a data set with several exposure concentrations, select the one for which you want to see the results. We propose two types of visualization to check if the file has correctly been uploaded: as a table or a plot (Fig. 1.3 and 1.4).
According to your data, you can visualize the plot for the parent compound, the metabolite(s) and/or for growth (Fig. 1.5).
Don’t forget to select the appropriate duration of the accumulation phase and the growth unit (if required) before to continue (Fig. 1.3 and 1.4, left side).
The most complete model and its corresponding parameters are automatically selected according to the experimental design as given within the uploaded data set (Fig. 2.1). For now, the test of other parameter selections is not available yet. The equations of the most complete model are provided to the users on-line (Fig. 2.2).
For the accumulation phase:
For the depuration phase:
Also, \(\sigma\) are the expected standard deviations of the measured contaminant concentration or growth of the organisms:
A mathematical model is composed of two parts: the deterministic and the stochastic parts. In the case of TK models fitted to concentration measurements, it can be written as follows (Eq. (1)): \[
y = f(x, \theta) + \varepsilon \quad (1)
\] with \(f\) the function describing the mean relationship between \(x\) and \(y\), \(y\) the observed variable, \(x\) the controlled variable, \(\theta\) the parameter vector to estimate and \(\varepsilon\) the random variable describing the variability of the data around the mean tendency.
The deterministic part
The organisms are here considered as single compartments for which a generic first-order kinetic bioaccumulation model can be expressed as follows12 (Eqs. (2) and (3) for the accumulation phase and Eqs. (4) and (5) for the elimination phase):
\(\left\{\begin{array}{l}\frac{d C_p(t)}{d t} = U - (E + M) C_p(t) \quad (2) \\ \frac{d C_{m_\ell}(t)}{d t} = k_{m_\ell} C_p(t) - k_{e_\ell} C_{m_\ell}(t) \quad \forall \ell=1 \ldots M \quad (3) \end{array}\right. \quad \text { for } 0 \leqslant t \leqslant t_{c}\)
\(\left\{\begin{array}{l}\frac{d C_p(t)}{d t} = - (E + M) C_p(t) \quad (4)\\ \frac{d C_{m_\ell}(t)}{d t} = k_{m_\ell} C_p(t) - k_{e_\ell} C_{m_\ell}(t) \quad \forall \ell=1\ldots M \quad (5)\end{array}\right.\quad\; \ \ \ \ \ \text { for } t>t_{c}\)
with:
Symbol | Meaning |
---|---|
\(I\) | total number of exposure sources |
\(J\) | total number of elimination processes |
\(L\) | total number of metabolites |
\(i\) | index of exposure sources, \(i = 1 \dots I\) |
\(j\) | index of elimination processes, \(j = 1 \dots J\) |
\(\ell\) | index of metabolites, \(\ell = 1 \dots L\) |
\(t\) | time (expressed in time units) |
\(c_i\) | exposure concentration of route \(i\) (in \(\mu g.m L^{-1}\)) |
\(C_p(t)\) | internal concentration of the parent compound at time \(t\) (in \(\mu g.g^{-1}\)) |
\(C_{m_\ell}(t)\) | internal concentration of metabolite \(\ell\) (in \(\mu g.g^{-1}\)) |
\(k_{u_i}\) | uptake rate of exposure source \(i\) (expressed per time units) |
\(k_{e_j}\) | elimination rates of elimination process \(j\) (expressed per time units) |
\(k_{e_{m\ell}}\) | elimination rates of metabolite \(\ell\) (expressed per time units) |
\(k_{m_\ell}\) | metabolization rate of metabolite \(\ell\) (expressed per time units) |
\(t_c\) | duration of the accumulation phase |
\(U = \sum_{i=1}^{I} k_{u_i} c_i\) | sum of all uptake terms |
\(E = \sum_{j=1}^{J} k_{e_j}\) | sum of all elimination terms for the parent compound |
\(M = \sum_{\ell=1}^{L} k_{m_\ell}\) | sum of all elimination terms for metabolite \(\ell\) |
The simplest model in MOSAIC application considers only one exposure route (for example by water, parameter \(k_{u_w}\)) with the corresponding elimination rate (excretion, parameter \(k_{e_e}\)), as given by Eq. (6):
\[
\left \{\begin{array}{l}
\frac{dC_p(t)}{dt} = k_{u_w} c_{w} - k_{e_e} C_p(t) \quad \text { for } 0 \leqslant t \leqslant t_{c} \\
\frac{dC_p(t)}{dt} = - k_{e_e} C_p(t) \quad\quad\quad\quad\quad\: \: \: \text { for } t>t_{c}
\end{array}\right. \quad (6)
\]
where \(k_{u_w}\) is the uptake rate from water (\(time^{-1}\)), \(c_w\) is the mean contaminant concentration in water (\(\mu g.mL^{-1}\)) and \(k_{e_e}\) is the rate related to the excretion process (\(time^{-1}\)).
The more complex model in MOSAIC application considers a maximum of four exposure routes (water: \(k_{u_w}\), pore water: \(k_{u_{pw}}\), sediment: \(k_{u_s}\) and food: \(k_{u_f}\)) and a maximum of three elimination processes (excretion, \(k_{e_e}\), biotransformation: \(k_{m\ell}\), growth dilution: \(k_{e_g}\)), as given by Eq. (7) for parent compound, Eq. (8) for metabolite \(\ell\) and Eq. (9) for growth (Von Bertalanffy equations, classically used for fishes):
\(\left \{\begin{array}{l} \frac{dC_p(t)}{dt} = k_{u_w} c_{w} + k_{u_{pw}} c_{pw} + k_{u_s} c_{s} + k_{u_f} c_{f} - (k_{e_e} + k_{e_g} + k_{m_\ell}) C_p(t) \quad \text { for } 0 \leqslant t \leqslant t_{c} \\ \frac{dC_p(t)}{dt} = - (k_{e_e} + k_{e_g} + k_{m_\ell}) C_p(t) \quad\quad\quad\quad\quad\: \: \: \text { for } t>t_{c}\end{array}\right. \:(7)\)
\(\left \{\begin{array}{l}\frac{dC_{m_\ell}(t)}{dt} = k_{m_\ell} C_p(t) - k_{e_{m\ell}} C_{m_\ell}(t) \quad \text { for } 0 \leqslant t \leqslant t_{c} \\ \frac{dC_{m_\ell}(t)}{dt} = - k_{e_{m\ell}} C_{m_\ell}(t) \quad\quad\quad\quad\quad\: \: \: \text { for } t>t_{c} \end{array}\right. \quad (8)\)
\(\frac{dG(t)}{dt} = k_{e_g} (g_{max}- G(t)) \quad (9)\)
with \(G(t)\) the measured growth of the organism at time \(t\) (in growth unit) and \(g_{max}\) the asymptotic growth \(\ell\) (in \(\mu g.g^{-1}\)).
More details and especially the exact solutions of these equations are given here.
The stochastic part
We assumed a Gaussian (normal) probability distribution of the concentration measurements within the organism as follows (Eqs. (10) to (12)):
\(\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad C_{obs_p}(t) \sim \mathcal{N}(C_p(t),\sigma_p^2) \quad (10)\)
\(\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad C_{obs_{m_\ell}}(t) \sim \mathcal{N}(C_{m_\ell}(t),\sigma_{m_\ell}^2) \quad (11)\)
\(\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad G_{obs}(t) \sim \mathcal{N}(G(t),\sigma_g^2) \quad (12)\)
where \(\mathcal{N}\) stands for the normal law, \(C_{obs_p}(t)\) and \(C_{obs_{m\ell}}(t)\) correspond to the measured parent and metabolite \(\ell\) concentrations within the organism measured at time \(t\), \(C_p(t)\) is the parent compound concentration at time \(t\) predicted by the model based on to Eqs. (2) and (4), \(C_{m_\ell}(t)\) is the concentration of metabolite \(\ell\) at time \(t\) predicted by the model based on Eqs. (3) and (5), \(G_{obs}(t)\) is the measured growth of the organism at time \(t\), \(G(t)\) is the growth at time \(t\) predicted by the model based on Eq. (9), \(\sigma\) are the expected standard deviations of the measured contaminant concentration or growth of the organisms (\(\mu g.g^{-1}\) or growth unit).
A Directed Acyclic Graph (DAG) is given on Fig. 2.3, which symbolize the deterministic links between parameters and variables for the generic TK model (Eqs.(2) to (5)) and the stochastic links between the observed and predicted data (Eqs. (11) to (13)).
In MOSAIC, prior choice is hidden to the user. However, we give here some information to help the user to understand the model behind. Before conducting an experimental study, prior distributions are defined for each parameter according to information available from the literature, expert knowledge and/or previous experiments. Depending on the sources where the information come from, informative, semi-informative or non-informative prior distributions can be used. If a parameter was already estimated in previous studies or if previous data are available, a prior distribution can easily be characterized with an appropriate probability distribution. However, if no information is available but an order of magnitude is (positive only, for example), it is possible to use a weakly informative prior. If any information is available on the order of magnitude of a parameter, its prior is preferably defined on a decimal logarithm scale in order to consider with equal probability both low and high expected values.
As MOSAIC application has to be the most generic as possible, priors were assumed to be non-informative log10-uniform distributions within [-5, 5] for all uptake and elimination rate constants. For growth, a uniform prior distribution [\(10 \times mean(G)/4\), \(10 \times mean(G)\)] was assumed for parameter \(g_{max}\) and a uniform prior distribution [\(0\), \((10 \times mean(G))/(8-10^{-10})\)] for parameter \(g_{0}\). We assumed a non-informative Gamma (0.001, 0.001) prior for all \(\sigma\) as usually done13,14. We are aware that Gelman et al. (2006)15 did not recommend the Gamma(\(\varepsilon\), \(\varepsilon\)) family for non-informative prior distributions. Indeed, in cases where \(\sigma\) is estimated to be near zero, the resulting inference may be highly sensitive to the choice of \(\varepsilon\). They rather recommend the use of a prior uniform density on \(\sigma\) on a wide range (for example, \(U(0, A)\) with large \(A\)) or a half-normal prior distribution centered at 0 with standard deviation set to a high value such as 100 (for example \(\mathcal{N}(0, 100^2)\)). After some tests, no differences were obtained whatever the prior choice on \(\sigma\) in our cases.
The Bayesian approach considers that data are fixed but that the parameters are unknown random variables following a probabilistic distribution. This leads to the following practical implications: (i) the Bayesian process optimises the probability of parameter vector \(\theta\) given the data set \(\mathrm{Y}\) used for calibration (the so-called posterior distribution) not only the likelihood (see below); (ii) there is a need to provide reasonable prior information, then updating this information by accounting for the data. Below is a short introduction to Bayesian principles16.
In short, the Bayesian approach requires the following steps:
• Choose the prior distributions on parameters based on previous results, literature or expert knowledge (without looking at the data to fit): \(\mathrm{P}(\theta)\);
• Define the probabilistic model from the data, that is the random variable whose data would be one realisation assuming known values of parameters, namely the likelihood: \(\mathrm{P(Y} \, \mid\, \theta)\);
• Calculate the joint posterior distribution of the parameters given the data via the Bayes formula: \(\mathrm{P}(\theta \, \mid \, \mathrm{Y})\);
• Provide statistical summaries of parameter estimates (namely, appropriate quantiles);
• Get any function of the parameter estimates as posterior probability distribution, like for example BCF calculations or predictions of new observations.
Basic principles
The keystone of the Bayesian approach is the Bayes formula (Eq. (13): \[
\mathrm{P}(\theta \mid \mathrm{Y})=\frac{\mathrm{P}(\theta) \mathrm{P}(\mathrm{Y} \mid \theta)}{\mathrm{P}(\mathrm{Y})} \quad (13)
\] where Y are the observed data; \(\mathrm{P}(\theta \mid \mathrm{Y})\) is the joint posterior distribution of parameter vector ; \(\mathrm{P}(\mathrm{Y} \mid \theta)\) is the likelihood of the data given the parameters; \(\mathrm{P}(\theta)\) is the joint prior distribution of parameter vector \(\theta\). Given that \(\mathrm{P(Y)}\) is known and fixed, it is often not considered as it does not depend on and will not influence the posterior distribution. Hence (Eq. (14)): \[
\mathrm{P}(\theta \mid \mathrm{Y}) \simeq \mathrm{P}(\theta) \mathrm{P}(\mathrm{Y} \mid \theta) \quad (14)
\]
with \(\mathrm{P}(\theta)\mathrm{P}(Y|\theta)\) the unormalised posterior density leading to (Eq. (15)):
\[
\mathrm{P(Y)}=\int \mathrm{P}(\theta) \mathrm{P(Y} \mid \theta) d \theta \quad (15)
\] The prior distribution \(\mathrm{P}(\theta)\) expresses the available parameter information without knowing the observed data, while the posterior distribution \(\mathrm{P}(\theta \, \mid\, \mathrm{Y})\) combines this prior information (which may be more or less informative depending on what is known about the value of the parameters beforehand) with evidence from the data (expressed through the likelihood) into a joint posterior density probability distribution for the parameters. The overall expectation is to get a narrower posterior distribution compared to the prior one after the computing of the Bayesian algorithm: the difference between the two distributions reflects the information provided by the data. When the non-informative prior is vague (translated for example into a flat uniform distribution), and the data sufficiently informative, the results are similar than those obtained under a frequentist approach.
Joint posterior distribution
The joint posterior distribution has the dimension of the number of parameters times the number of iterations within the Monte Carlo Markov Chain (MCMC) chains. It can be plotted in planes of parameter pairs to visualise correlations between parameters. In an example case with two binormally distributed parameters, the joint posterior distribution can be plotted in the 2D-parameter space as illustrated by ellipses on Fig. 2.4; in this example, parameters \(\theta_1\) and \(\theta_2\) appear slightly positively correlated. From the joint posterior distribution, the marginal posterior distributions for each parameter (as illustrated by grey normal distributions on bottom and left sides of Fig. 2.4) can be extracted. Then, from the marginal posterior distributions, some statistical summaries on parameter estimates can be extracted, usually the median (illustrated by vertical and horizontal plain grey lines on Fig. 2.4) as well as 2.5% and 97.5% quantiles to serve as 95% credible intervals (illustrated by vertical and horizontal dotted grey lines on Fig. 2.4). Another advantage of having the joint posterior distribution is that the posterior distribution of any function of the parameters can also be obtained.
Parameter uncertainties
One implication of adopting a Bayesian approach is that the uncertainty on a parameter can easily be expressed as a probability distribution from which a credible interval (also called a Bayesian confidence interval) can be extracted. For example, the 95% credible interval delimits a range of values where the parameters lies with a 95% probability.
Numerical computation
Many numerical methods have been developed to approximately compute the joint posterior distribution, mainly based on simulations by Monte Carlo Markov Chain (MCMC) sampling methods used to generate random numbers from complex joint distributions. MCMC algorithms are a general method based on drawing values of parameter vector \(\theta\) from approximate distributions and then correcting those draws to better approximate the target posterior distribution \(\mathrm{P}(\theta \, \mid\, \mathrm{Y})\). The sampling is done sequentially, with the distribution of the sampled draws depending only on the last value drawn; hence, the draws form a Markov Chain. The key to the method success, however, is not the Markov property but rather the fact that the approximate distributions are improved at each step of the simulation, in the sense that it finally converges to the target posterior distribution after an enough number of iterations. Indeed, with MCMC algorithms, the simulation process must run long enough so that the distribution of the current draws is close enough to the desired target posterior distribution.
MCMC algorithms use random walk algorithms, among which the Metropolis algorithm (and its generalisation, the Metropolis–Hasting algorithm) as an adaptation of a random walk with an acceptance/rejection rule to converge to the specified target distribution17,18. The Gibbs sampler is a special case of the Metropolis–Hastings algorithm applicable when the joint distribution is not known explicitly, or when it is difficult to directly sample from, while the conditional distribution of each parameter is known and it is easy (or at least, easier) to sample from19.
Several tools are available to automatically perform these computations. In MOSAIC, JAGS11 (version 4.3.0. (2017-08-10)) and R software1 (version 4.0.2 (2020-06-22)) are used. The models are fitted to bioaccumulation data using Bayesian inference via Monte Carlo Markov Chain (MCMC) sampling based on a Gibbs-type algorithm. For each model, we start by running a short sampling phase with three chains (5,000 iterations after a burn-in phase of 10,000 iterations) then using the Raftery and Lewis20 method to set the necessary thinning and number of iterations to reach an accurate estimation of the joint posterior distribution.
Once data uploaded and the model stated, you can click on button “Calculate and Display.” We recommend you to pay attention to the selected exposure concentration before to proceed to calculations. Besides, calculations can take several minutes for the most complicated models.
You will find at the end of this user guide an appendix (section 5) which gathers together other types of results which can be obtained with other data sets and how to interpret them.
<<<<<<< HEAD
To illustrate the result section, here we used the example file ‘Oncorhynchus_two.csv,’ where Oncorhynchus mykiss are exposed to a highly hydrophobic substance (logKow = 9) spiked water at \(0.0044\) \(\mu g.mL^{-1}\) and where the duration of the accumulation phase is equal to 49 days.
=======
To illustrate the result section, here we used the example file ‘Oncorhynchus_two.csv’, where Oncorhynchus mykiss are exposed to a highly hydrophobic substance (logKow = 9) spiked water at \(0.0044\) \(\mu g.mL^{-1}\) and where the duration of the accumulation phase is equal to 49 days.
>>>>>>> Gamma
As a first result, we provide the bioaccumulation factors (BCF, BCFpw, BSAF and BMF). Please, note that this section is reactive according to your data. If required, change tab to get the other bioaccumulation factors (Fig. 3.1).
As a first result, we provide the kinetic bioconcentration factor (\(BCF_{k}\)). The BCF at steady-state (\(BCF_{ss}\)) can be asked if you consider a steady-state is almost reached at the end of the accumulation phase. These factors are mathematically given by Equations (16) and (17) respectively:
\[ BCF_{k} =\frac{k_{u_w}}{\sum_{j=1}^{J} k_{e_j} + \sum_{\ell=1}^{L} k_{m_\ell}} \quad (16) \]
\[ BCF_{ss} =\frac{C_p(t_{c})}{c_{w}} \quad (17) \]
where \(C_p(t_c)\) is the contaminant concentration within the organism at steady-state that is at the end of the accumulation phase (\(\mu g. g^{-1}\)) and \(c_w\) is the contaminant concentration in water (\(\mu g . mL^{-1}\)).
BCF are given as probability distributions (Fig. 3.2) and summarized with their median and their 95% uncertainty limits (95% credible intervals, Table 3.1).
2.5% | 50% | 97.5% | |||
---|---|---|---|---|---|
\(BCF_k\) | <<<<<<< HEAD233 | 276 | 337 | ||
\(BCF_{ss}\) | 148 | 233 | =======232 | 276 | 336 |
\(BCF_{ss}\) | 147 | 232 | >>>>>>> Gamma316 |
In the above example, the steady-state is almost reached at time \(t_c\) (Fig. 3.3), thus it is reasonable to ask for the \(BCF_{ss}\).
<<<<<<< HEAD
Note that the OECD guideline 3057 reports that “greater emphasis must be on kinetic BCF estimate (when possible) next to estimating the BCF at steady-state.” Thus we recommend to preferentially consider the \(BCF_k\) rather than the \(BCF_{ss}\).
When appropriate, we provide the kinetic pore water bioconcentration factor (\(BCF_{pw_{k}}\)).Again, the corresponding BCF at steady-state (\(BCF_{pw_{ss}}\)) can be asked. These factors are mathematically given by Equations (18) and (19) respectively:
\[ BCF_{pw_{k}} =\frac{k_{u_{pw}}}{\sum_{j=1}^{J} k_{e_j} + \sum_{\ell=1}^{L} k_{m_\ell}} \quad (18) \]
\[ BCF_{pw_{ss}} =\frac{C_p(t_{c})}{c_{pw}} \quad (19) \]
where \(C_p(t_c)\) is the contaminant concentration within the organism at time \(t_c\) (\(\mu g. g^{-1}\)) and \(c_{pw}\) is the contaminant concentration in pore water (\(\mu g . mL^{-1}\)). BCFpw are given as probability distributions and summarized with their median and their 95% uncertainty limits (95% credible intervals).
When appropriate, we provide the kinetic biota-sediment factor (\(BSAF_{k}\)) and the BSAF at steady-state (\(BSAF_{ss}\)) can be asked. These factors are mathematically given by Equations (20) and (21) respectively:
\[ BSAF_{k} =\frac{k_{u_s}}{\sum_{j=1}^{J} k_{e_j} + \sum_{\ell=1}^{L} k_{m_\ell}} \quad (20) \]
\[ BSAF_{ss} =\frac{C_p(t_{c})}{c_{s}} \quad (21) \]
where \(C_p(t_c)\) is the contaminant concentration within the organism at time \(t_c\) (\(\mu g. g^{-1}\)) and \(c_s\) is the contaminant concentration in sediment (\(\mu g . g^{-1}\)). BSAF are given as probability distributions and summarized with their median and their 95% uncertainty limits (95% credible intervals).
When appropriate, we provide the kinetic biomagnification factor (\(BMF_{k}\)) and the BMF at steady-state (\(BSAF_{ss}\)) can be asked. These factors are mathematically given by Equations (22) and (23) respectively:
\[ BMF_{k} =\frac{k_{u_f}}{\sum_{j=1}^{J} k_{e_j} + \sum_{\ell=1}^{L} k_{m_\ell}} \quad (22) \]
\[ BMF_{ss} =\frac{C_p(t_{c})}{c_{f}} \quad (23) \]
where \(C_p(t_c)\) is the contaminant concentration within the organism at time \(t_c\) (\(\mu g. g^{-1}\)) and \(c_f\) is the contaminant concentration in food (\(\mu g . g^{-1}\)). BMF are given as probability distributions and summarized with their median and their 95% uncertainty limits (95% credible intervals).
A coefficient of variation (\(CV\)) for each bioaccumulation factor can be calculated as follows (Eq. (24)):
\[ CV =\frac{Q_{k}97.5-Q_{k}2.5}{4 \times Q_{k}50} \quad (24) \]
with \(Q_{k}2.5\), \(Q_{k}50\) and \(Q_{k}97.5\) the 2.5%, 50% and 97.5% quantiles of the kinetic bioaccumulation factor. Based on our experiment, the coefficient of variation is expected to not exceed \(0.5\).
If the bioaccumulation factor at steady-state is asked, the corresponding \(CV\) is given (Eq. (25)):
\[ CV =\frac{Q_{ss}97.5-Q_{ss}2.5}{4 \times Q_{ss}50} \quad (25) \]
with \(Q_{ss}2.5\), \(Q_{ss}50\) and \(Q_{ss}97.5\) the 2.5%, 50% and 97.5% quantiles of the steady-state bioaccumulation factor.
We first provide the fitted curve superimposed to the observations (Fig. 3.3, black dots): the orange plain line is the median curve, the gray zone is the uncertainty band delimited by 2.5% and 97.5% quantiles of predictions in orange dotted lines. This section is reactive according to your data: if there is biotransformation or growth, the fitted curve superimposed to the observations for parent compound, metabolite(s) and/or growth data are given (for example from the example file “Male_Gammarus_seanine.csv,” Fig. 3.4).
We first provide the fitted curve superimposed to the observations (Fig. 3.3, black dots): the orange plain line is the median curve, the gray zone is the uncertainty band delimited by 2.5% and 97.5% quantiles of predictions in orange dotted lines. This section is reactive according to your data: if there is biotransformation or growth, the fitted curve superimposed to the observations for parent compound, metabolite(s) and/or growth data are given (for example from the example file “Male_Gammarus_seanine.csv”, Fig. 3.4).
From the joint posterior distribution, we can obtain the marginal posterior distributions for each parameter, which can be summarized by their medians and their 95% credible intervals (Table 3.2).
2.5% | 50% | 97.5% | |||||
---|---|---|---|---|---|---|---|
\(k_{u_w}\) | <<<<<<< HEAD7.374 | 10.58 | 15.36 | =======7.447 | 10.59 | 15.33 | >>>>>>> Gamma\(d^{-1}\) |
\(k_{e_e}\) | <<<<<<< HEAD0.02329 | 0.03836 | 0.06098 | =======0.02361 | 0.0386 | 0.06109 | >>>>>>> Gamma\(d^{-1}\) |
\(\sigma_p\) | <<<<<<< HEAD0.1247 | 0.1676 | 0.2453 | =======0.125 | 0.1682 | 0.2459 | >>>>>>> Gamma\(\mu g.g^{-1}\) |
Goodness-of-fit criteria are given below in our prioritised order; the Posterior Predictive Check (PPC) and the prior-posterior comparison are the most important to check; if they do not correspond to the expectation, you must consider your results with an even more particular attention. As an indication, if at least two criteria are fulfilled, the results obtained can be considered as good enough. We suggest that you refer to the appendix at the end of the document for more information about cases not in accordance with the classical expectations (section 5).
The PPC shows the observed values against their corresponding estimated predictions (black dots), along with their 95% credible interval (vertical segments). If the fit is correct, we expect to see 95% of the data within the intervals. Ideally, observations and predictions should coincide, so we would expect to see black dots along the first bisector \(y = x\) (plain black line). The 95% credible intervals are colored in green if they overlap this line, in red otherwise. In the following example (Fig. 3.5), 95.24% of the measured concentrations (\(n=20/21\)) are in the 95% credible intervals of their predictions.
An example of prior and posterior distributions is illustrated in Fig. 3.6. The prior distribution is represented by the gray area and the posterior distribution by the orange area. The precision of the model parameter estimation can be visualized by comparing prior and posterior distributions: the overall expectation is to get a narrower posterior distribution compared to the prior one, what reflects that data contributed enough to precisely estimate parameters. In the example of Fig. 3.6, marginal posterior distributions for \(k_{u_w}\) and \(k_{e_e}\) are narrower (orange area) than their respective prior distributions (grey area). Within the application, you can chose to visualize either the deterministic or the stochastic parameters by selecting the corresponding tab.
MOSAIC gives a colored matrix in order to highlight correlations between parameters (Fig. 3.7, 3.8). This output allows you to see at a glance the most correlated or anti-correlated parameters, in order to diagnose potential problems of precision due to highly correlated parameters.
Correlations between parameters can also be visualized by projecting the joint posterior distribution in a plot matrix with planes of parameter pairs (Fig. 3.9, lower triangular elements), marginal posterior distribution of each model parameter (Fig. 3.9, diagonal) and Pearson correlation coefficients (Fig. 3.9, upper triangular elements). Correlations are expected to be low (reflected by “potatoid” shapes of density lines in orange, e.g., \(k_{e_e}\) and \(\sigma_p\) in Fig. 3.9); a leaning elliptical shape translates high correlations (positive if leaning to the right, e.g., \(k_{u_w}\) and \(k_{e_e}\) in Fig. 3.9, negative if leaning to the left).
In the example of Fig. 3.9, \(k_{u_w}\) and \(k_{e_e}\) are highly positively correlated (\(r = 0.941\)), meaning that the estimate obtained for one of these two parameters will strongly influence the estimate of the other parameter. This high correlation may be due to the model structure (e.g., by definition \(k_{u_w}\) and \(k_{e_e}\) are correlated), or comes from data not fully appropriate to precisely estimate the parameters.
Convergence of the MCMC can be checked with the Gelman-Rubin diagnostic expressed with the potential scale reduction factor (PSRF). Approximate convergence is diagnosed when the PSRF is close to \(1.0\) (Fig. 3.10)21. In the example of Fig. 3.10, the PSRF is equal to \(1.0\) for each model parameter, thus the convergence of the MCMC was correctly achieved.
Information criteria offer a computationally appealing way of estimating the generalization performance of the model. A fully Bayesian criterion is the widely applicable information criterion (WAIC) by Watanabe a penalized deviance statistics accounting for the uncertainty in the parameters and can be used also for singular models. WAIC is widely used in model comparison for a same dataset (e.g., with or without \(k_{e_e}\)). Sub-models with lower WAIC values will be preferred.
For example, for highly hydrophobic substances, two models can be compared, one considering water and food exposure (which we call here model (a)) and one model considering food-only exposure due to the physico-chemical properties of the substance (model (b)). If the WAIC of model (a) is smaller than the WAIC of model (b), then model (a) will be preferred.
In this version of MOSAIC, this criterion cannont be used as only one model is proposed per data set. It can however be kept as information to be used later, for example in future versions when it will be possible to test different models of decreasing complexity.
This criterion, denoted DIC, is a penalized deviance statistics accounting for the number of parameters for use in model comparison for a same data set (e.g., with or without \(k_{e_e}\)). Sub-models with lower DIC values will be preferred20. DIC value can be negative. However, DIC value itself is not important, what matters is the difference between two DICs, what can help in deciding which model is the most appropriate.
For example, for highly hydrophobic substances, two models can be compared, one considering water and food exposure (which we call here model (a)) and one model considering food-only exposure due to the physico-chemical properties of the substance (model (b)). If the DIC of model (a) is smaller than the DIC of model (b), then model (a) will be preferred.
In this version of MOSAIC, this criterion cannont be used as only one model is proposed per data set. It can however be kept as information to be used later, for example in future versions when it will be possible to test different models of decreasing complexity.
You can get more information of the use of the DIC in Ratier et al. (2019)12.
A traceplot is also an essential plot for assessing convergence and diagnosing of MCMC. It shows the time series of the sampling process leading to the joint posterior distribution. Different colors are used for each of the chains (here three) to assess the within-chain convergence. The user must check whether all MCMC converge towards the same distribution limit (overlapping of the chains). This can be verified visually by observing the simulated values for each node of interest as a function of the number of iterations (Fig. 3.11). In the following example, the three MCMC overlap and converge towards the same distribution limit for each model parameter. Thus, the algorithm has suitably converged.
You can download all plots as displayed by the application in several formats (.png, .jpg, .pdf, .svg, .tiff and .eps): equations, bioaccumulation factors, fitting results and goodness-of-fit criteria. They can be downloadable separately or simultaneously in a .zip file.
You can also download table results in .txt or .csv, as for example the BCF numerical values and the joint posterior distribution for all parameters (columns) and all iterations of the MCMC algorithm (lines).
You can easily download a full report of your calculations, which summarize all the results in a .pdf, .html or .docx file. We warn you that the creation of the report may take some time depending on your data set.
The R script can be downloaded as a gateway to identically reproduce all calculations provided within the application (this guarantees the full reproducibility of the results) and to perform further calculations directly within the R software with your own computer.
In practice, you may encounter situations where the results are not “ideal,” conversely to those presented in sections 1 to 5 of this document. This appendix will allow you to better interpret the results in such cases. For the goodness-of-fit criteria, we suggest to consider as enough the fact that two criteria as given by MOSAIC are receivable.
If you asked for the \(BCF_{ss}\), \(BCF_{pw_{ss}}\), \(BSAF_{ss}\) or \(BMF_{ss}\) but the median value and the 95% credible interval are not of the same order of magnitude than for the \(BCF_{k}\), \(BCF_{pw_{k}}\), \(BSAF_{k}\) or \(BMF_{k}\), then ensure the accumulation phase really reached the steady-state (i.e., at least three successive measured concentrations with no statistical differences).
On Fig. 5.1, you can see a counter-example for asking for the \(BCF_{ss}\):
Indeed, the steady-state was not reached at the end of the accumulation phase (day 4, Fig. 5.2). Thus, the median value for \(BCF_{ss}\) is totally different from the one of \(BCF_{k}\). The same consideration can be applied for the other bioaccumulation factors.
Large 95% credible intervals can sometimes be obtained for some parameters, especially for parameter \(k_{e_e}\) (Table 5.1). Such a situation leads to non precise estimate of these parameters, what should question their use for predictions. This can be due to the model structure or the experimental data themselves. Thus, it is an information to consider when you interpret the fitting results.
2.5% | 50% | 97.5% | |||||
---|---|---|---|---|---|---|---|
\(k_{u_w}\) | <<<<<<< HEAD639.7 | 726.8 | 841.3 | =======639.6 | 727 | 835.7 | >>>>>>> Gamma\(d^{-1}\) |
\(k_{e_e}\) | <<<<<<< HEAD1.598e-05 | 0.006017 | 0.02015 | =======1.619e-05 | 0.00607 | 0.01982 | >>>>>>> Gamma\(d^{-1}\) |
\(\sigma_p\) | <<<<<<< HEAD0.04669 | 0.063 | 0.09079 | =======0.04673 | 0.06283 | 0.08992 | >>>>>>> Gamma\(\mu g.g^{-1}\) |
If the fit is correct, it is expected to get 95% of the data within the 95% credible intervals of their predictions. So, if the range of the percentage of data within the credible intervals is between 92 and 96%, calculations and predictions can be considered as good enough. If the percentage is under 92%, predictions are considered as underestimated. If the percentage is upper 96%, predictions are considered as overestimated.
On Fig. 5.3, you can see a counter-example with large uncertainties of the model predictions leading to 100% of the data within their credible intervals.
We remind you that prior distributions are defined by default to be the most generic as possible. However, it can happen that your data would require other prior distributions (e.g., inspired by literature, expert knowledge or by a previous study leading to parameter estimations outside of the default values as used in MOSAICbioacc).
The precision of the model parameter estimation can be visualized by comparing prior and posterior distributions: the overall expectation is to get a narrower posterior distribution compared to the prior one, what reflects that data contributed enough to precisely estimate parameters.
If one of the posterior distribution for a model parameter has bounds close to the lower or the upper bound of the prior distributions (e.g. \(log10(\theta) \simeq -5\) or \(log10(\theta) \simeq 5\) with \(\theta\) being one of the model parameter), then the prior distribution may be not well defined. For example, if a distribution tail is observed at these bounds as illustrated on the left of Fig. 5.4 (here the distribution tail of \(log10(k_{e_e})\) is too close to \(-5\), the lower bound), then the inference process needs to be questioned. Conversely, the fit can be considered as correct if you obtain prior and posterior distributions as illustrated on the right of Fig. 5.4.
In MOSAIC, it is not possible to change the prior distributions of parameters directly within the application. To do this, we suggest you to download the R code and to change the prior distributions directly in the R software. We remind you that to define the prior distributions you should not have a look at your data, but only on previous experiments, literature data or expert knowledge.
You can also check this goodness-of-fit criterion through the estimated values of the model parameters. If one of the model parameter is closed to the lower or the upper bound of its 95% credible interval (\(\theta\simeq10^{-5}\) or \(\theta\simeq10^{5}\) with \(\theta\) one of the model parameter), the prior distribution may be not appropriate. As for example illustrated in Table 5.1, the distribution tail of \(log10(k_{e_e})\) is too closed to the lower bound \(-5\).
If a high correlation is obtained between two parameters (e.g., more than 0.7 or less than -0.7 for the Pearson correlation coefficient), it is an information to consider, not necessarily a bad result. It means that the estimate obtained for one of these two parameters will strongly influence the estimate of the other one. Such a high correlation may be due to the model structure itself (for example, by definition \(k_{u_w}\) and \(k_{e_e}\) are correlated, Fig. 5.5, or to the data so that it cannot be avoided).
Sometimes, you may get a bimodal posterior distribution for one or several parameters what translates into a double maximum on density plots (e.g., parameters \(k_{u_w}\) and \(k_{e_e}\) in Fig. 5.5). To make the application as generic as possible, we defined priors for each parameter the most global as possible. However, depending on the experimental conditions, the parameters may not be really included within the prior chosen range of values. In such a case, we recommend you to contact sandrine.charles@univ-lyon1.fr if you are not experimented with Bayesian inference and R software, or to change the downloadable R script by yourself.
This criterion must be as close as possible to 1 for each model parameter to ensure that the between-chain variability is small compared to the within-chain variability. Based on our experience, from a value of 1.03, the results should be questioned. Most often, such a case appears when priors are not fully appropriate or when the data do not contain enough information. One of the solution may be to increase the number of iterations in the MCMC by using the R script directly.
The WAIC or the DIC are not criteria to consider to check the goodness-of-fit. However, they are crucial criteria to consider when two models or more are compared after fitting on a same data set. In this version of the application, the model comparison is not available. Please contact us to be kept informed.
You must check whether the MCMC converge towards the same distribution limit (overlapping of the chains). As shown in Fig. 5.6, the three MCMC do not overlap and do not converge towards the same distribution limit for two model parameters (\(k_{u_w}\) and \(k_{e_e}\)). If at least two of the other criteria are good enough, this can be disregarded. If not, your experimental data could be not sufficient to performed bioaccumulation factor calculations and to estimate parameters of a TK model.
Bioconcentration factor (BCF): BCF is a parameter describing bioaccumulation of water-associated organic compounds or metals into tissues of ecological receptors.
Biomagnification factor (BMF): BMF is a parameter describing bioaccumulation of food (or in the predator’s prey)-associated organic compounds or metals into tissues of ecological receptors.
Biota-Sediment Accumulation Factor (BSAF): BSAF is a parameter describing bioaccumulation of sediment-associated organic compounds or metals into tissues of ecological receptors.
Credible Interval (CI): A credible interval is the interval in which an parameter has a given probability. It is the Bayesian equivalent of the confidence interval.
Directed Acyclic Graph (DAG): It symbolize the deterministic links between parameters and variables for a model and the stochastic links between the observed and predicted data.
Monte Carlo Markov Chain (MCMC): A method which comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by recording states from the chain.
Potential Scale Reduction Factor (PSRF): Gelman-Rubin diagnostic to check the convergence of the MCMC.
Toxico-Kinetic model (TK model): Toxico-kinetics is the mathematical description of the uptake and disposition of a chemical in the organism. TK modeling is usually implemented by describing the time course of the amount or concentration of the parent substance and its metabolites in all the organism.
(3) Kellner, K. (2019) jagsUI: A Wrapper Around ’rjags’ to Streamline ’JAGS’ Analyses. R package version 1.5.1.
(4) Chang, W., Cheng, J., Allaire, J. J., Xie, Y., and McPherson, J. (2020) Shiny: Web Application Framework for R.
(5) MacKay, D., and Fraser, A. (2000) Bioaccumulation of persistent organic chemicals: Mechanisms and models. Environmental Pollution 110, 375–391.
(6) European Commission. (2013) COMMISSION REGULATION (EU) No 283/2013 of 1 March 2013 setting out the data requirements for active substances, in accordance with Regulation (EC) No 1107/2009 of the European Parliament and of the Council concerning the placing of plant protection produc.
(7) OECD. (2012) Test No. 305: Bioaccumulation in Fish: Aqueous and Dietary Exposure. Paris.
(8) Charles, S., Veber, P., and Delignette-Muller, M. L. (2018) MOSAIC: a web-interface for statistical analyses in ecotoxicology. Environmental Science and Pollution Research 25, 11295–11302.
(9) EFSA. (2014) Scientific Opinion on good modelling practice in the context of mechanistic effect models for risk assessment of plant protection products. EFSA Journal 12.
(10) Fernández-i-Marín, X. (2016) ggmcmc: Analysis of MCMC samples and Bayesian inference. Journal of Statistical Software 70(9), 1–20.
(11) Plummer, M. (2003) JAGS: A program for analysis of Bayesian models using Gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing; Vienna, Austria.
(12) Ratier, A., Lopes, C., Labadie, P., Budzinski, H., Delorme, N., Quéau, H., Peluhet, L., Geffard, O., and Babut, M. (2019) A Bayesian framework for estimating parameters of a generic toxicokinetic model for the bioaccumulation of organic chemicals by benthic invertebrates: Proof of concept with PCB153 and two freshwater species. Ecotoxicology and Environmental Safety 180.
(13) Lambert, P. C., Sutton, A. J., Burton, P. R., Abrams, K. R., and Jones, D. R. (2005) How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS. Statistics in Medicine 24, 2401–2428.
(14) Richards, R. G., and Chaloupka, M. (2009) Temperature-dependent bioaccumulation of copper in an estuarine oyster. Science of the Total Environment 407, 5901–5906.
(15) Gelman, A. (2006) Prior Distribution for Variance Parameters in Hierarchical Models. Bayesian Analysis 3, 5901–5906.
(16) Ockleford, C., Adriaanse, P., Berny, P., Brock, T., Duquesne, S., Grilli, S., Hernandez-Jerez, A. F., Bennekou, S. H., Klein, M., Kuhl, T., Laskowski, R., Machera, K., Pelkonen, O., Pieper, S., Smith, R. H., Stemmer, M., Sundh, I., Tiktak, A., Topping, C. J., Wolterink, G., Cedergreen, N., Charles, S., Focks, A., Reed, M., Arena, M., Ippolito, A., Byers, H., and Teodorovic, I. (2018) Scientific Opinion on the state of the art of Toxicokinetic/Toxicodynamic (TKTD) effect models for regulatory risk assessment of pesticides for aquatic organisms.
(17) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953) Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21, 1087–1092.
(18) Hastings, W. K. (1970) Monte carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.
(19) Geman, S., and Geman, D. (1984) Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.
(20) Raftery, A. E., and Lewis, S. M. (1992) [Practical Markov chain Monte Carlo]: Comment: one long run with diagnostics: implementation strategies for Markov chain Monte Carlo. Statistical Science 7, 493–497.
(21) Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van Der Linde, A. (2002) Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society. Series B: Statistical Methodology 64, 583–639.
>>>>>>> Gamma