Title: | Code and Data Accompanying the Eco-Stats Text (Warton 2022) |
---|---|
Description: | Functions and data supporting the Eco-Stats text (Warton, 2022, Springer), and solutions to exercises. Functions include tools for using simulation envelopes in diagnostic plots, and a function for diagnostic plots of multivariate linear models. Datasets mentioned in the package are included here (where not available elsewhere) and there is a vignette for each chapter of the text with solutions to exercises. |
Authors: | David Warton [aut, cre], Christopher Chung [ctb], Mark Donoghoe [ctb], Eve Slavich [ctb] |
Maintainer: | David Warton <[email protected]> |
License: | LGPL (>= 2.1) |
Version: | 1.1.11 |
Built: | 2025-02-03 04:43:11 UTC |
Source: | https://github.com/dwarton/ecostats |
Adds a smoother to a plot of y
against x
, and a confidence band around the smoother with pointwise coverage
(not global, unlike plotenvelope
). The smoother is constructed using gam
and confidence bands use Wald intervals, extracting the standard error from predict.gam
.
addSmooth( x, y, conf.level = 0.05, line.col = "slateblue4", envelope.col = adjustcolor(line.col, alpha.f = 0.1), ... )
addSmooth( x, y, conf.level = 0.05, line.col = "slateblue4", envelope.col = adjustcolor(line.col, alpha.f = 0.1), ... )
x |
is the |
y |
is the |
conf.level |
the confidence level to use in constructing the confidence band around the smoother. |
line.col |
color of smoother. Defaults to "slateblue4". |
envelope.col |
color of the global envelope around the expected trend. All data points should always stay within this envelope
(and will for a proportion |
... |
further arguments sent through to |
A challenge when interpreting diagnostic plots is understanding the extent to which
deviations from the expected pattern could be due to random noise (sampling variation)
rather than actual assumption violations. This function is intended to assess this,
by simulating multiple realizations of residuals (and fitted values) in situations where
assumptions are satisfied, and plotting a global simulation envelope around these at level conf.level
.
This function adds a smoother to a plot of y
against x
using gam
, and a confidence band
around the smoother with pointwise coverage (not global, unlike plotenvelope
) using Wald intervals
that take the standard error of predicted values from predict.gam
.
When constructing a plot to diagnose model fit, plotenvelope
is preferred, because this constructs
simulation envelopes globally, simplifying interpretation, and tends to give more accurate envelopes, via simulation.
This function is intended to fit trends to data for descriptive purposes, but it could be used (with caution)
to diagnose model fit in residual plots in situations where plotenvelope
is unavailable.
A smoother is added to the existing plot, with confidence band. In addition, a data frame is invisibly returned with the following components:
X
The values of x
at which the smoother is evaluated.
Yhat
Values of the smoother used to predict y
, for each value of x
.
Yhi
The upper bound on the confidence band, for each value of x
.
Ylo
The lower bound on the global envelope, for each value of x
.
David Warton <[email protected]>
Warton DI (2022) Eco-Stats - Data Analysis in Ecology, from t-tests to multivariate abundances. Springer, ISBN 978-3-030-88442-0
data(globalPlants) plot(log(height)~lat,data=globalPlants) with(globalPlants, addSmooth(lat,log(height)) )
data(globalPlants) plot(log(height)~lat,data=globalPlants) with(globalPlants, addSmooth(lat,log(height)) )
Compute analysis of variance (or deviance) tables for two fitted model objects, with the P-value estimated using a parametric bootstrap – by repeatedly simulating new responses from the fitted model under the null hypothesis.
anovaPB( objectNull, object, n.sim = 999, colRef = switch(class(object)[1], lm = 5, lmerMod = 6, 4), rowRef = 2, ... )
anovaPB( objectNull, object, n.sim = 999, colRef = switch(class(object)[1], lm = 5, lmerMod = 6, 4), rowRef = 2, ... )
objectNull |
is the fitted model under the null hypothesis. This can be
any object that responds to |
object |
is the fitted model under the alternative hypothesis. This can be
any object that responds to |
n.sim |
the number of simulated datasets to generate for parametric bootstrap testing. Defaults to 999. |
colRef |
the column number where the test statistic of interest can be found
in the standard |
rowRef |
the row number where the test statistic of interest can be found
in the standard |
... |
further arguments sent through to |
The anova
function typically relies on classical asymptotic results
which sometimes don't apply well, e.g. when using penalised likelihood to fit a
generalised additive model, or when testing whether a random effect should be
included in a mixed model. In such cases we can get a more accurate test by
using simulation to estimate the P-value – repeatedly simulating
new sets of simulated responses, refitting the null and alternative models, and
recomputing the test statistic. This process allows us to estimate the
distribution of the test statistic when the null hypothesis is true, hence
draw a conclusion about whether the observed test statistic is large relative
to what would be expected under the null hypothesis. The process is often
referred to as a parametric bootstrap (Davison & Hinkley 1997), which
the PB in the function name (anovaPB
) is referring to.
This function will take any pair of fitted models, a null (objectNull
)
and an alternative (object
), and use simulation to estimate the
P-value of the test of whether there is evidence against the model in
objectNull
in favour of the model in object
. This function works
by repeatedly performing an anova
to compare objectNull
to
object
, where the responses in the model.frame
have been replaced
with values simulated by applying simulate
to objectNull
. Hence
for this function to work, the objects being compared need to respond
to the anova
, simulate
and model.frame
functions.
This function needs to be able to find the test statistic in the anova
output, but it is stored in different places for different types of objects.
It is assumed that anova(objectNull,object)
returns a matrix and that the
test statistic is stored in the (rowRef,colRef)
th element, where both
rowRef
and colRef
have been chosen to be correct for common
model types (glm
, gam
, lmer
).
A matrix which has the appearance and behaviour of an object returned by
anova(objectNull,object)
, except that the P-value has been computed
by parametric bootstrap, and the matrix now inherits class anovaPB
.
David Warton <[email protected]>
Davison A.C. & Hinkley D. V. (1997) Bootstrap methods and their application, Cambridge University Press. Warton DI (2022) Eco-Stats - Data Analysis in Ecology, from t-tests to multivariate abundances. Springer, ISBN 978-3-030-88442-0
# fit a Poisson regression to random data: y = rpois(50,lambda=1) x = 1:50 rpois_glm = glm(y~x,family=poisson()) rpois_int = glm(y~1,family=poisson()) anovaPB(rpois_int,rpois_glm,n.sim=99)
# fit a Poisson regression to random data: y = rpois(50,lambda=1) x = 1:50 rpois_glm = glm(y~x,family=poisson()) rpois_int = glm(y~1,family=poisson()) anovaPB(rpois_int,rpois_glm,n.sim=99)
A study of the effects of bird exclusion on biological control of aphids in oat and wheat fields in Germany (Grass et al. 2017). Many thanks to Ingo for providing the raw data. In each of two fields (one of oats, one of wheat) there were eight plots, four with plastic netting to exclude birds, and four without. Aphid abundance was counted on seven different occasions over the first 38 weeks following netting. The expectation was that aphid numbers would decrease on bird exclusion, because an important food source to tree sparrows is aphid predators, hoverflies and ladybird beetles, so presence of birds may be limit the effectiveness of a biological control of aphids.
data(aphids)
data(aphids)
A list containing two dataframes, oat
and wheat
, depending on the crop. Each
dataframe contains:
The plot ID, a factor with eight levels
A factor indicating whether birds were excluded, excluded
or present
.
The number of days since netting was applied.
Aphid abundance (counts)
log(y+1)-transformed aphid abundance
Grass et al. (2017) Insectivorous birds disrupt biological control of cereal aphids. Ecology, 98 1583-90.
data(aphids) cols=c(rgb(1,0,0,alpha=0.5),rgb(0,0,1,alpha=0.5)) #transparent colours with(aphids$oat, interaction.plot(Time,Plot,logcount,legend=FALSE, col=cols[Treatment], lty=1, ylab="Counts [log(y+1) scale]", xlab="Time (days since treatment)") ) legend("bottomleft",c("Excluded","Present"),col=cols,lty=1)
data(aphids) cols=c(rgb(1,0,0,alpha=0.5),rgb(0,0,1,alpha=0.5)) #transparent colours with(aphids$oat, interaction.plot(Time,Plot,logcount,legend=FALSE, col=cols[Treatment], lty=1, ylab="Counts [log(y+1) scale]", xlab="Time (days since treatment)") ) legend("bottomleft",c("Excluded","Present"),col=cols,lty=1)
A subset from a study of the effects of bird exclusion on biological control of aphids in a German oat
field (Grass et al. 2017). Many thanks to Ingo for providing the raw data. The full data are
in aphids
, here we provide just two sampling times, 3 days and 30 days after netting.
OK, this is not quite a BACI design, 3 days after netting is not quite "before" but effects at this
point should be negligible...
There were eight plots, four with plastic netting to exclude birds, and four without. Aphid abundance was counted on 5 shoots per plot. The expectation was that aphid numbers would decrease on bird exclusion, because an important food source to tree sparrows is aphid predators, hoverflies and ladybird beetles, so presence of birds may be limit the effectiveness of a biological control of aphids.
data(aphidsBACI)
data(aphidsBACI)
A dataframe containing:
The plot ID, a factor with eight levels
A factor indicating whether birds were excluded, excluded
or present
.
The data of sampling (June 18th or July 15th, 3 and 30 days after netting, respectively)
Aphid abundance (counts)
log(y+1)-transformed aphid abundance
Grass et al. (2017) Insectivorous birds disrupt biological control of cereal aphids. Ecology, 98 1583-90.
data(aphidsBACI) plot(logcount~interaction(Treatment,Time),data=aphidsBACI,col=c("lightgreen","pink"))
data(aphidsBACI) plot(logcount~interaction(Treatment,Time),data=aphidsBACI,col=c("lightgreen","pink"))
A spatial moving block bootstrap resampling scheme is constructed, that can then
be used in resampling-based inference (e.g. using anova.manyglm
).
Blocks are constructed as discs or tiles of observations of a user-specified size and shape.
BlockBootID( x, y, block_L, nBoot = 499, Grid_space, lookuptables.folderpath = NA, shape = "disc", ... )
BlockBootID( x, y, block_L, nBoot = 499, Grid_space, lookuptables.folderpath = NA, shape = "disc", ... )
x |
the easting of spatial coordinate for the site. |
y |
the northing of the spatial coordinate for the site. |
block_L |
the size of the spatial blocks to be resampled. |
nBoot |
the number of resamples required (defaults to 499). |
Grid_space |
the resolution of the lattice used to sample centres of spatial blocks.
Defaults to a third of the resolution of |
lookuptables.folderpath |
(optional) path to a directory where lookup tables can be found that assign observations to spatial blocks. Such tables would considerably speed up the process. |
shape |
shape of the spatial blocks to resample. Default is |
... |
additional arguments passed to |
A spatial moving block bootstrap resampling scheme is constructed, that can then
be used in resampling-based inference. A matrix of IDs is returned, with different resamples
in different rows of the matrix, which can be used as input to functions designed for
resampling-based inference (such as anova.manyglm
).
Blocks are constructed as discs or tiles of observations, whose size is controlled by blockl_L
,
these blocks are kept together in resampling. The centre of each block is chosen
at random along a lattice specified via Grid_space
.
The most computationally intensive part of this process is working out which observations
belong in which blocks. If repeated analyses are to be undertaken in the same spatial domain
using the same sites, it is best to run this process only once and save the result as a set of
lookup tables.
Then the path to the directory containing these tables is specified via lookuptables.folderpath
and the whole thing runs heaps faster.
A matrix of IDs for each moving block bootstrap resample, with different resamples in rows and observations in columns.
Eve Slavich <[email protected]>
data(Myrtaceae) # fit a lm: library(mvabund) Myrtaceae$logrich=log(Myrtaceae$richness+1) mft_richAdd = manylm(logrich~soil+poly(TMP_MAX,degree=2)+ poly(TMP_MIN,degree=2)+poly(RAIN_ANN,degree=2), data=Myrtaceae) # construct a boot ID matrix: BootID = BlockBootID(x = Myrtaceae$X, y = Myrtaceae$Y, block_L = 20, nBoot = 199, Grid_space = 5) anova(mft_richAdd,resamp="case",bootID=BootID)
data(Myrtaceae) # fit a lm: library(mvabund) Myrtaceae$logrich=log(Myrtaceae$richness+1) mft_richAdd = manylm(logrich~soil+poly(TMP_MAX,degree=2)+ poly(TMP_MIN,degree=2)+poly(RAIN_ANN,degree=2), data=Myrtaceae) # construct a boot ID matrix: BootID = BlockBootID(x = Myrtaceae$X, y = Myrtaceae$Y, block_L = 20, nBoot = 199, Grid_space = 5) anova(mft_richAdd,resamp="case",bootID=BootID)
Predicted values using full conditional models derived from a multivariate
linear model (mlm
) object. The full conditionals model each response as a
linear model with all other responses used as predictors in addition to the
regressors specified in the formula of the mlm
object.
cpredict(object, standardize = TRUE, ...)
cpredict(object, standardize = TRUE, ...)
object |
a |
standardize |
logical defaults to |
... |
further arguments passed to |
Predictions using an mlm
object but based on the full conditional model,
that is, from a linear model for each response as a function of all responses
as well as predictors. This can be used in plots to diagnose the multivariate
normality assumption.
By default predictions are standardised to facilitate overlay plots of multiple
responses, as in plotenvelope
.
This function behaves much like predict.lm
, but currently, standard
errors and confidence intervals around predictions are not available.
A matrix of predicted values from full conditional models.
David Warton <[email protected]>
Warton DI (2022) Eco-Stats - Data Analysis in Ecology, from t-tests to multivariate abundances. Springer, ISBN 978-3-030-88442-0
cresiduals
, lm
, plotenvelope
, predict.lm
data(iris) # fit a mlm: iris.mlm=lm(cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)~Species,data=iris) # predict each response conditionally on the values of all other responses: cpredict(iris.mlm)
data(iris) # fit a mlm: iris.mlm=lm(cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)~Species,data=iris) # predict each response conditionally on the values of all other responses: cpredict(iris.mlm)
Residuals from full conditionals of a Multivariate
Linear Model (mlm
) object. The full conditional for each response is a
linear model with all other responses used as predictors in addition to the
regressors specified in the formula of the mlm
object. This is used to
diagnose the multivariate normality assumption in plotenvelope
.
cresiduals(object, standardize = TRUE, ...)
cresiduals(object, standardize = TRUE, ...)
object |
a |
standardize |
logical defaults to |
... |
further arguments passed to |
A residuals
function for mlm
objects, which returns residuals from a full
conditional model, that is, a linear model of each response against all responses
as well as predictors, which can be used to diagnose the multivariate normality assumption.
These can be standardized (standardize=TRUE
) to facilitate overlay plots of multiple
responses, as in plotenvelope
.
A matrix of residuals
David Warton <[email protected]>
Warton DI (2022) Eco-Stats - Data Analysis in Ecology, from t-tests to multivariate abundances. Springer, ISBN 978-3-030-88442-0
cpredict
, lm
, plotenvelope
, residuals
, rstandard
data(iris) # fit a mlm: iris.mlm=lm(cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)~Species,data=iris) # construct full conditional residuals: cresiduals(iris.mlm)
data(iris) # fit a mlm: iris.mlm=lm(cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)~Species,data=iris) # construct full conditional residuals: cresiduals(iris.mlm)
Data from an observational study of whether there is a different in microinvertebrate communities between estuaries that have been heavily modified by human activity and those that have not, across seven estuaries along the coast of New South Wales, Australia (Clark et al. 2015).
data(estuaries)
data(estuaries)
A dataframe containing (amongst other things):
A factor describing whether the sample was taken from a 'Modified' or 'Pristine' estuary.
Whether the sample was taken from Inner (upstream) or Outer (downstream) zone of the estuary.
A factor with seven levels identifying which estuary the sample was taken from.
Total abundance of all invertebrates in the sample
Richness of taxa in the sample – the number of responses (of those in columns 8-94) taking a non-zero value
Other variables in the dataset give invertebrate counts separately for different taxa.
Clark, G. F., Kelaher, B. P., Dafforn, K. A., Coleman, M. A., Knott, N. A., Marzinelli, E. M., & Johnston, E. L. (2015). What does impacted look like? high diversity and abundance of epibiota in modified estuaries. Environmental Pollution 196, 12-20.
data(estuaries) plot(Total~Estuary,data=estuaries,col=c(4,2,2,4,2,4,2))
data(estuaries) plot(Total~Estuary,data=estuaries,col=c(4,2,2,4,2,4,2))
Data from an observational study of whether there is a different in microinvertebrate communities between estuaries that have been heavily modified by human activity and those that have not, across seven estuaries along the coast of New South Wales, Australia (Clark et al. 2015). Sampling was undertaken in both Inner and Outer zones of the estuary.
data(estuaryZone)
data(estuaryZone)
A dataframe containing (amongst other things):
A factor describing whether the sample was taken from a 'Modified' or 'Pristine' estuary.
Whether the sample was taken from Inner (upstream) or Outer (downstream) zone of the estuary.
A factor with seven levels identifying which estuary the sample was taken from.
Total abundance of all invertebrates in the sample
Richness of taxa in the sample – the number of responses (of those in columns 8-94) taking a non-zero value
Other variables in the dataset give invertebrate counts separately for different taxa.
Clark, G. F., Kelaher, B. P., Dafforn, K. A., Coleman, M. A., Knott, N. A., Marzinelli, E. M., & Johnston, E. L. (2015). What does impacted look like? high diversity and abundance of epibiota in modified estuaries. Environmental Pollution 196, 12-20.
data(estuaryZone) cols=c("blue","red","lightblue","pink") plot(Total~interaction(Estuary,Zone),data=estuaryZone,col=cols[c(1,2,2,1,2,1,2,3,4,4,3,4,3,4)])
data(estuaryZone) cols=c("blue","red","lightblue","pink") plot(Total~interaction(Estuary,Zone),data=estuaryZone,col=cols[c(1,2,2,1,2,1,2,3,4,4,3,4,3,4)])
Data inspired by a worldwide study of plant height and its association with climate and latitude (Moles et al. 2009). Height is recorded for one dominant plant at each of 178 sites, along with dozens of climatic variables extracted from BIOCLIM. Sites, climate data and species lists come from Moles et al. (2009) but height data are not actually field measurements, due to data unavailability. Instead height has been taken from species descriptions available on Google, accessed 7/1/21.
data(globalPlants)
data(globalPlants)
A dataframe containing a whole bunch of stuff, the main things being:
Maximum plant height, in metres, taken from species descriptions on Google (where available)
latitude of the site this plant was found at
average daily maximum temperature
total annual precipitation
Rainfall in the wettest month
Moles et al. (2009) Global patterns in plant height. Journal of Ecology 97, 923-932.
data(globalPlants) plot(height~lat,data=globalPlants)
data(globalPlants) plot(height~lat,data=globalPlants)
Data derived from a study of the effects of nicotine on the development of guineapig offspring (Johns et al 1993). Ten pregnant guineapigs were injected with nicotine hydrogen tartite solution, and ten control guinea pigs were injected with saline solution as a control. Learning capabilities of offspring were then measured as the number of errors made when trying to remember where to find food in a simple maze. Data presented are not from the original paper- they have been generated to match the summary statistics reported in Johns et al (1993).
data(guineapig)
data(guineapig)
A list containing two dataframes, oat
and wheat
, depending on the crop. Each
dataframe contains:
Number of errors made
N=Nicotine, C=Control
Johns et al. (1993) The effects of chronic prenatal exposure to nicotine on the behavior of guinea pigs (Cavia porcellus). The Journal of General Psychology 120, 49-63.
data(guineapig) plot(errors~treatment,data=guineapig)
data(guineapig) plot(errors~treatment,data=guineapig)
Displays of 14 male Anolis lizards were recorded in Puerto Rican forest (Ord et al. 2016), along with key environmental characteristics. These lizards bob their head up and down (and do push-ups) in attempts to attract the attention of females, and advertise territory ownership to other males. How fast a lizard bobs its head was recorded, and it was of interest to understand which environmental features (out of temperature, light and noisiness) were related to head bobbing speed.
data(headbobLizards)
data(headbobLizards)
A dataframe containing:
A factor indicating Which lizard we are talking about. There is one observation per lizard (but more were originally collected).
Temperature, in degrees Celsius, at the location where lizard was first observed.
Ambient light recorded using a handheld sensor pointed towards the bobbing is happening.
Visual background noise measured using video analysis.
Maximum head-bobbing speed.
Date and time of observation.
Ord, T. J., Charles, G. K., Palmer, M. & Stamps, J. A. (2016). Plasticity in social communication and its implications for the colonization of novel habitats. Behavioral Ecology 27, 341-351
data(headbobLizards) plot(Hbspd_max~Bg_noise_max, data=headbobLizards)
data(headbobLizards) plot(Hbspd_max~Bg_noise_max, data=headbobLizards)
Monthly average measurements of carbon dioxide concentration from the Mauna Loa Observatory in Hawaii, from March 1958 to February 2021. Data available courtesy of the Global Monitoring Laboratory at the National Oceanic and Atmospheric Administration (NOAA) in the United States (https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html).
data(maunaloa)
data(maunaloa)
A dataframe containing:
The data of the measurement in date format. One measurement is available for each month, the first day of the month is assumed here.
The year of the measurement.
The month of the measurement.
The date in numerical format, as year+month/12
.
Carbon dioxide measurement in parts per million. Calculated as the average of all daily measurements for the month.
data(maunaloa) plot(co2~Date, type="l", data=maunaloa)
data(maunaloa) plot(co2~Date, type="l", data=maunaloa)
Data derived from NSW National Parks and Wildlife Service resources on species richness for members of the Myrtaceae family (eucalypts and relates species) in 1000 monitoring transects west of Sydney in the Greater Blue Mountains World Heritage Area. Also included is soil type classified into 9 categories, and a few climate variables derived from Worldclim.
data(Myrtaceae)
data(Myrtaceae)
A dataframe containing (amongst other things):
Easting of the site (in km).
Nothing of the site (in km).
Total number of Mrtaceae species observed at this site.
Annual average of daily maximum temperature (degrees Celsius).
Annual average of daily minimum temperature (degrees Celsius).
Annual precipitation (in millimetres).
Soil type, classified into nine categories.
Aspect of the site (in degrees, 0=360=due North).
data(Myrtaceae) library(ggplot2) ggplot(Myrtaceae, aes(x=X, y=Y))+geom_point(aes(color=richness))+ theme_classic()+xlab("West<-->East (km)")+ylab("South<-->North (km)")+ scale_color_gradient(low="lightgreen",high="black")+ labs(color="Species richness [log(y+1)-scale]")+theme(legend.position="top")+ guides(color = guide_colorbar(title.position = "top",ticks=FALSE, barwidth=15,barheight=0.5))+coord_fixed()
data(Myrtaceae) library(ggplot2) ggplot(Myrtaceae, aes(x=X, y=Y))+geom_point(aes(color=richness))+ theme_classic()+xlab("West<-->East (km)")+ylab("South<-->North (km)")+ scale_color_gradient(low="lightgreen",high="black")+ labs(color="Species richness [log(y+1)-scale]")+theme(legend.position="top")+ guides(color = guide_colorbar(title.position = "top",ticks=FALSE, barwidth=15,barheight=0.5))+coord_fixed()
Produces diagnostic plots of a fitted model y
, and
adds global envelopes constructed by simulating new residuals, to see how
departures from expected trends compare to what might be expected if the
fitted model were correct. Global envelopes are constructed using the
GET
package (Myllymäki et al 2017) for simultaneous control of error rates over the
whole plot, by simulating new responses from the fitted model then recomputing residuals
(which can be computationally intensive), or alternatively, by simulating residuals directly from the (multivariate) normal distribution
(faster, but not always a smart move). Options for diagnostic plots to construct are a residual vs fits,
a normal quantile plot, or scale-location plot, along the lines of plot.lm
.
plotenvelope( y, which = 1:2, sim.method = "refit", n.sim = 199, conf.level = 0.95, type = "st", overlay = TRUE, transform = NULL, main = c("Residuals vs Fitted Values", "Normal Quantile Plot", "Scale-Location Plot"), xlab = c("Fitted values", "Theoretical Quantiles", "Fitted Values"), ylab = c("Residuals", "Residuals", expression(sqrt("|Residuals|"))), col = NULL, line.col = if (add.smooth) c("slateblue4", "olivedrab", "slateblue4") else rep("olivedrab", 3), envelope.col = adjustcolor(line.col, 0.1), add.smooth = TRUE, plot.it = TRUE, resFunction = NULL, predFunction = NULL, fitMin = if (inherits(y, "glm") | inherits(y, "manyglm")) -6 else -Inf, ... )
plotenvelope( y, which = 1:2, sim.method = "refit", n.sim = 199, conf.level = 0.95, type = "st", overlay = TRUE, transform = NULL, main = c("Residuals vs Fitted Values", "Normal Quantile Plot", "Scale-Location Plot"), xlab = c("Fitted values", "Theoretical Quantiles", "Fitted Values"), ylab = c("Residuals", "Residuals", expression(sqrt("|Residuals|"))), col = NULL, line.col = if (add.smooth) c("slateblue4", "olivedrab", "slateblue4") else rep("olivedrab", 3), envelope.col = adjustcolor(line.col, 0.1), add.smooth = TRUE, plot.it = TRUE, resFunction = NULL, predFunction = NULL, fitMin = if (inherits(y, "glm") | inherits(y, "manyglm")) -6 else -Inf, ... )
y |
is any object that responds to |
which |
a vector specifying the diagnostic plots to construct:
These are the first three options in |
sim.method |
How should residuals be simulated? The default for most objects is |
n.sim |
the number of simulated sets of residuals to be generated, to which the observed residuals will be compared. The default is 199 datasets. |
conf.level |
the confidence level to use in constructing the envelope. |
type |
the type of global envelope to construct, see
|
overlay |
logical. For multivariate data, determines whether residuals from different responses are overlaid on a single plot (default), or plotted separately. |
transform |
a character vector pointing to a function that should be applied to both
axes of the normal quantile plot. The most common use is to set |
main |
the plot title (if a plot is produced). A vector of three titles, one for each plot. If only one value is given that will be used for all plots. |
xlab |
|
ylab |
|
col |
color of points |
line.col |
a vector of length three containing the colors of the lines on the three diagnostic plots. Defaults to "slateblue4" for smoothers and to "olivedrab" otherwise. Because it's cool. |
envelope.col |
color of the global envelope around the expected trend. All data points should always stay within this envelope
(and will for a proportion |
add.smooth |
logical defaults to |
plot.it |
logical. Should the result be plotted? If not, a list of analysis outputs is returned, see Value. |
resFunction |
the function used to compute residuals for all diagnostic plots. Defaults
to |
predFunction |
the function used to compute predicted values in residual vs fits or
scale-location plots. Defaults to |
fitMin |
the minimum fitted value to use in plots, any fitted value less than this will be truncated to
|
... |
further arguments sent through to |
A challenge when interpreting diagnostic plots is understanding the extent to which
deviations from the expected pattern could be due to random noise (sampling variation)
rather than actual assumption violations. This function is intended to assess this,
by simulating multiple realizations of residuals (and fitted values) in situations where
assumptions are satisfied, and plotting a global simulation envelope around these at level conf.level
.
This function can take any fitted model, and construct any of three diagnostic plots, as determined by which
:
Residual vs fits plot (optionally, with a smoother)
Normal quantile plot
Scale-Location plot (optionally, with smoother)
and see if the trend is behaving as expected if the model were true. As long as
the fitted model responds to residuals
and predict
(and when sim.method="refit"
, simulate
and update
) then a simulation envelope
will be constructed for each plot.
Simulation envelopes are global, constructed using the GET-package
, meaning that
(for example) a 95% global envelope on a quantile plot should contain all residuals for 95% of datasets
that satisfy model assumptions. So if any data points lie outside the
quantile plot's envelope we have evidence that assumptions of the fitted model are not satisfied.
The GET-package
was originally constructed to place envelopes around functions, motivated by
the problem of diagnostic testing of spatial processes (Myllymäki et al 2017), but it can equally
well be applied here, by treating the set of residuals (ordered according to the x-axis) as point-wise evaluations of a function.
For residual vs fits and scale-location plots, if do.smooth=TRUE
, global envelopes are constructed for
the smoother, not for the data, hence we are looking to see if the smoother
is wholly contained within the envelope. The smoother is constructed using gam
from the mgcv
package with maximum likelihood estimation (method="ML"
).
Note that the global envelopes only tell you if there is evidence of violation of model assumptions – they do not tell you whether the violations are large enough to worry about. For example, in linear models, violations of normality are usually much less important than violations of linearity or equal variance. And in all cases, modest violations tend to only have modest effects on the validity of inferences, so you need to think about how big observed violations are rather than just thinking about whether or not anything is outside its simulation envelope.
The method used to simulate data for the global envelopes is controlled by sim.method
.
The default method (sim.method="refit"
) uses a parametric bootstrap approach: simulate
new responses from the fitted model, refit the model and recompute residuals and fitted values.
This directly assesses whether trends in observed trends depart from what would be expected if the fitted model
were correct, without any further assumptions. For complex models or large datasets this would however be super-slow.
A fast, more approximate alternative (sim.method="norm"
) is to simulate new (multivariate) normal residuals
repeatedly and use these to assess whether trends in the observed data depart from what would be expected
for independent (multivariate) normal residuals. If residuals are expected to be standard
normal, a more refined check is to simulate from the standard normal using (sim.method="stand.norm"
).
This might for example be useful when diagnosing a model fitted using the mvabund
package (Wang et al. 2012),
since this calculates Dunn-Smyth residuals (Dunn & Smyth 1996), which are approximately standard normal when assumptions are satisfied.
If y
is a glm
with non-Gaussian family then residuals will not be normal and "refit"
is the
only appropriate option, hence other choices will be overridden with a warning reporting that this
has been done.
Note that for Multivariate Linear Models (mlm
), cresiduals
and cpredict
are used to construct residuals and fitted values (respectively) from the full conditional models
(that is, models constructed by regressing each response against all other responses
together with predictors). This is done because full conditionals are diagnostic of joint
distributions, so any violation of multivariate normality is expressed as a violation of
linear model assumptions on full conditionals. Results for all responses are overlaid on a single plot,
future versions of this function may have an overlay option to separately plot each response.
The simulated data and subsequent analysis are also used to obtain a P-value
for the test that model assumptions are correct, for each plot. This tests if sample residuals or their smoothers
are unusually far from the values expected of them if model assumptions were satisfied. For details see
global_envelope_test
.
Up to three diagnostic plots with simulation envelopes are returned, and additionally a list of three objects used in plotting, for plots 1-3 respectively. Each is a list with five components:
x
X-values used for the envelope. In plots 1 and 3 this is the fitted values, or if add.smooth=TRUE
, this is 500 equally spaced points covering
the range of fitted values. For plot 2, this is sorted normal quantiles corresponding to observed data.
y
The Y-values used for the envelope. In plots 1 and 3 this is the residuals, or with add.smooth=TRUE
, this is the values of the smoother corresponding to x
. For plot 2,
this is the sorted residuals.
lo
The lower bound on the global envelope, for each value of x
.
hi
The upper bound on the global envelope, for each value of x
.
p.value
A P-value for the test that observed smoother or data are consistent with what
would be expected if the fitted model were correct, computed in global_envelope_test
.
David Warton <[email protected]>
Dunn, P. K., & Smyth, G. K. (1996), Randomized quantile residuals. J. Comp. Graphical Stat. 5, 236-244. doi:10.1080/10618600.1996.10474708
Myllymäki, M., Mrkvička, T., Grabarnik, P., Seijo, H. and Hahn, U. (2017), Global envelope tests for spatial processes. J. R. Stat. Soc. B, 79: 381-404. doi:10.1111/rssb.12172
Wang, Y. I., Naumann, U., Wright, S. T., & Warton, D. I. (2012), mvabund
- an R
package for model-based analysis of multivariate abundance data. Methods in Ecology and Evolution, 3, 471-474. doi:10.1111/j.2041-210X.2012.00190.x
Warton DI (2022) Eco-Stats - Data Analysis in Ecology, from t-tests to multivariate abundances. Springer, ISBN 978-3-030-88442-0
cpredict
, cresiduals
, qqenvelope
# fit a Poisson regression to random data: y = rpois(50,lambda=1) x = 1:50 rpois_glm = glm(y~x,family=poisson()) plotenvelope(rpois_glm,n.sim=59) # Fit a multivariate linear model to the iris dataset: data(iris) Y = with(iris, cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)) iris_mlm=lm(Y~Species,data=iris) # check normality assumption: plotenvelope(iris_mlm,n.sim=59,which=2) # A few more plots, with envelopes around data not smoothers: plotenvelope(iris_mlm, which=1:3, add.smooth=FALSE) # Note minor violation on the scale/location plot. # Repeat but with smoothers and with separate plots for each response and # a multiple testing adjustment to sim envelopes: plotenvelope(iris_mlm, which=1:3, overlay=FALSE, conf.level=1-0.05/4)
# fit a Poisson regression to random data: y = rpois(50,lambda=1) x = 1:50 rpois_glm = glm(y~x,family=poisson()) plotenvelope(rpois_glm,n.sim=59) # Fit a multivariate linear model to the iris dataset: data(iris) Y = with(iris, cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)) iris_mlm=lm(Y~Species,data=iris) # check normality assumption: plotenvelope(iris_mlm,n.sim=59,which=2) # A few more plots, with envelopes around data not smoothers: plotenvelope(iris_mlm, which=1:3, add.smooth=FALSE) # Note minor violation on the scale/location plot. # Repeat but with smoothers and with separate plots for each response and # a multiple testing adjustment to sim envelopes: plotenvelope(iris_mlm, which=1:3, overlay=FALSE, conf.level=1-0.05/4)
Produces a normal QQ plot of data, or of residuals from a fitted model y
, with a user-specified
line to compare to "theoretical" quantiles, and global envelopes constructed
by simulating new residuals. Global envelopes are constructed using the
GET
package for simultaneous control of error rates over the whole curve.
qqenvelope(y, n.sim = 199, conf.level = 0.95, ylab = "Sample Quantiles", ...)
qqenvelope(y, n.sim = 199, conf.level = 0.95, ylab = "Sample Quantiles", ...)
y |
can be a set of values for which we wish to check (multivariate) normality,
or it can be any object that responds to the |
n.sim |
the number of simulated sets of residuals to be generated, to which the observed residuals will be compared. The default is 199 datasets. |
conf.level |
the confidence level to use in constructing the envelope. |
ylab |
|
... |
further arguments sent through to |
A challenge when interpreting a qqplot
is understanding the extent to which
deviations from expected values could be due to random noise (sampling variation)
rather than actual assumption violations. This function is intended to assess this,
by simulating multiple realizations of residuals in situations where assumptions
are satisfied, and plotting a global (or "simultaneous") simulation envelope around these at level conf.level
.
All data points should lie if assumptions are satisfied, and will do so for a proportion conf.level
of
datasets which satisfy their assumptions.
This function can take data (univariate or multivariate) and check for (multivariate) normality, or it can take a fitted model and use qq plots to interrogate residuals and see if they are behaving as we would expect them to if the model were true.
The envelope is global, constructed using the GET-package
. So if any data points lie outside the
envelope we have evidence that assumptions are not satisfied.
The GET-package
was originally constructed to place envelopes around functions, motivated by
the problem of diagnostic testing of spatial processes (Myllymäki et al 2017), but it can equally
well be applied here, by treating sorted residuals as point-wise evaluations of a function.
For further details refer to plotenvelope
, which is called to construct the plot.
a qqplot with simulation envelope is returned, and additionally:
x |
a vector of theoretical quantiles from the standard normal sorted from smallest to largest |
y |
a vector of observed residuals sorted from smallest to largest |
lo |
lower bounds on the global simulation envelope for residuals |
hi |
upper bounds on the global simulation envelope for residuals |
p.value |
A P-value for the test that model assumptions are correct, using a 'parametric bootstrap' test, based on how far sample residuals depart from the values expected of them if model assumptions were satisfied. |
David Warton <[email protected]>
Myllymäki, M., Mrkvička, T., Grabarnik, P., Seijo, H. and Hahn, U. (2017), Global envelope tests for spatial processes. J. R. Stat. Soc. B, 79: 381-404. doi:10.1111/rssb.12172 Warton DI (2022) Eco-Stats - Data Analysis in Ecology, from t-tests to multivariate abundances. Springer, ISBN 978-3-030-88442-0
# simulate some data and fit a qq plot: y=rnorm(20) qqenvelope(y) # fit a multivariate linear model to the iris dataset: data(iris) Y = with(iris, cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)) iris.mlm=lm(Y~Species,data=iris) # check normality assumption: qqenvelope(iris.mlm,n.sim=79)
# simulate some data and fit a qq plot: y=rnorm(20) qqenvelope(y) # fit a multivariate linear model to the iris dataset: data(iris) Y = with(iris, cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)) iris.mlm=lm(Y~Species,data=iris) # check normality assumption: qqenvelope(iris.mlm,n.sim=79)
Data from a study of whether ravens fly towards the sound of gunshots (White 2005). Many thanks to Crow White for providing the raw data. There were twelve locations, at which four treatments were applied (1=gunshot, 2-air horn, 3=whistle, 4=no sound). Ravens within a 100 metre radius of the author were counted ten minutes before and ten minutes after each treatment was applied. Primary interest was in assessing if there was an effect of gunshot on raven numbers. Posthoc analyses suggested that ravens were attracted to gunshots in forested areas only.
data(ravens)
data(ravens)
A dataframe containing:
The number of ravens before the treatment was applied
The number of ravens after the treatment was applied
After-Before
count
Location, one of twelve in Jackson Hole, Wyoming's ungulate hunting zone
1=gunshot, 2=air horn, 3=whistle, 4=no sound
1=forested habitat (>300 trees within 100 metres of observer), 0=open
Crow White (2005) Hunters ring dinner bell for ravens: experimental evidence of a unique foraging strategy. Ecology, 86 1057-60.
data(ravens) ravens1 = ravens[ravens$treatment==1,] t.test(ravens1$Before,ravens1$After,paired=TRUE,alternative="less") boxplot(ravens1$delta,ylab="After-Before counts") abline(h=0,col="red",lty=3)
data(ravens) ravens1 = ravens[ravens$treatment==1,] t.test(ravens1$Before,ravens1$After,paired=TRUE,alternative="less") boxplot(ravens1$delta,ylab="After-Before counts") abline(h=0,col="red",lty=3)
Data from a study looking at the effect of revegetation on invertebrate communities (data from Anthony Pik, Macquarie University). Invertebrates were sampled in 4-5 pitfall traps at eight sites that had undergone revegetation, and two sites that hadn't, and it was of interest to see what the effect of revegetation was on the invertebrate community.
data(reveg)
data(reveg)
A list containing three objects:
A data frame with abundances of 24 Orders of invertebrate.
A vector specifying the number of pitfall traps that were collected at each site.
Whether the site was a 'Control' or a site that had undergone revegetation ('Impact').
data(reveg) worms = reveg$abund$Haplotaxida plot(worms~treatment, data=reveg, horizontal=TRUE, las=1, xlab="",ylab="Worm abundance")
data(reveg) worms = reveg$abund$Haplotaxida plot(worms~treatment, data=reveg, horizontal=TRUE, las=1, xlab="",ylab="Worm abundance")
Data from a study of habitat configuration, specifically, does density of invertebrate epifauna on seaweed vary across sites with different levels of isolation from each other.
data(seaweed)
data(seaweed)
A dataframe containing:
A character vector describing size of experimental plots as "SMALL" or "LARGE"
Distance of isolation – 0, 2 or 10 metres from other algal beds
Sampling time - either 5 or 10 weeks from the start of the experiment.
The replicate number (1 to 5).
Wet mass of the algal bed for that plot.
Total invertebrate density in the plot, calculated as nuber of individuals divided by Wmass
.
Other variables in the dataset give invertebrate counts separately for different taxa.
Roberts, D. A. & Poore, A. G. (2006). Habitat configuration affects colonisation of epifauna in a marine algal bed. Biological Conservation 127, 18-26.
data(seaweed) boxplot(Total~Dist, data=seaweed)
data(seaweed) boxplot(Total~Dist, data=seaweed)
Germination rates of Abutilon angulatum from 29 different studies, undertaken at different ambient temperatures. We would like to know how germination rate varies with temperature, in particular, the range of temperatures at which Abutilon angulatum will germinate. These data come from a larger study across species and environments to look for a latitudinal signal in tolerance to changing temperature (Sentinella et al 2020).
data(seedsTemp)
data(seedsTemp)
A dataframe containing:
The number of seeds sown.
The number of seeds that germinated.
The ambient temperature (in degrees Celsius) of the location at which seeds were sown.
Sentinella, AT, Warton, DI, Sherwin, WB, Offord, CA, Moles, AT. (2020) Tropical plants do not have narrower temperature tolerances, but are more at risk from warming because they are close to their upper thermal limits. Global Ecol Biogeogr. 29, 1387-1398.
data(seedsTemp) seedsTemp$propGerm = seedsTemp$NumGerm / seedsTemp$NumSown plot(propGerm/(1-propGerm)~Test.Temp,data=seedsTemp,log="y", ylab="Germination rate [logit scale]", xlab="Temperature (Celsius)")
data(seedsTemp) seedsTemp$propGerm = seedsTemp$NumGerm / seedsTemp$NumSown plot(propGerm/(1-propGerm)~Test.Temp,data=seedsTemp,log="y", ylab="Germination rate [logit scale]", xlab="Temperature (Celsius)")
Simulates new responses for a manyglm object.
## S3 method for class 'manyglm' simulate(object, nsim = 1, seed = NULL, newdata = object$data, ...)
## S3 method for class 'manyglm' simulate(object, nsim = 1, seed = NULL, newdata = object$data, ...)
object |
a |
nsim |
number of simulated datasets to generate. |
seed |
a seed for random number generation (defaults to NULL) |
newdata |
a new dataset with predictors to simulate new values for. Defaults to data model was fitted to. |
... |
additional optional arguments. |
Returns a data frame containing the response and predictors. This function just calls simulate.cord
from the ecoCopula
package, on a cord
object constructed under default settings – that is,
it fits a copula latent variable model with two latent variables, then uses this to simulate new data.
Simulates a data frame of new values for responses. If multiple datasets are requested, these are
stacked one under the other (see example(simulate.cord)
.
David Warton <[email protected]>
Simulate one or more sets of responses from a Multivariate Linear Model (mlm
) object.
## S3 method for class 'mlm' simulate(object, nsim = 1, seed = NULL, ...)
## S3 method for class 'mlm' simulate(object, nsim = 1, seed = NULL, ...)
object |
a |
nsim |
number of replicate datasets to simulate. If |
seed |
an object specifying if and how the random number generator should be initialized (‘seeded’). Either NULL or an integer that will be used in a call to set.seed before simulating the response vectors. If set, the value is saved as the "seed" attribute of the returned value. The default, NULL will not change the random generator state, and return .Random.seed as the "seed" attribute, see ‘Value’ |
... |
additional optional arguments. |
A simulate
function for mlm
objects, which simulates one or more
sets of responses from a Multivariate Linear Model (mlm
) object.
If multiple sets of responses are requested, they are returned in a 3D array,
with simulation number along the third dimension.
The weights argument is currently ignored – a constant
variance-covariance matrix assumed for mlm
.
A matrix of simulated values for the response (or an array, for nsim
greater than 1)
David Warton <[email protected]>
# fit a mlm to iris data: data(iris) iris.mlm=lm(cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)~Species,data=iris) # simulate new responses: simulate(iris.mlm)
# fit a mlm to iris data: data(iris) iris.mlm=lm(cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)~Species,data=iris) # simulate new responses: simulate(iris.mlm)
Data from a study of how time from snowmelt to flowering relates to snowmelt date and elevation (Wheeler et al 2016) in Salix herbacea, based on about 120 plots on each of three mountains in the Swiss Alps.
data(snowmelt)
data(snowmelt)
A dataframe containing:
Plot patch ID
Number of days from snowmelt day to flowering
Snowmelt day of year
Sex of patch, corrected across years, NA - no flowering, 0 - male, 1 - female
Elevation (from dGPS)
Wheeler, J. A., Cortés, A. J., Sedlacek, J., Karrenberg, S., van Kleunen, M., Wipf, S., Hoch G., Bossdorf, O. & Rixen C. (2016) The snow and the willows: earlier spring snowmelt reduces performance in the low-lying alpine shrub Salix herbacea. Journal of Ecology 104, 1041-50.
data(snowmelt) plot(flow~snow,data=snowmelt, log="y")
data(snowmelt) plot(flow~snow,data=snowmelt, log="y")
Data from a study of the association between water quality and catchment area. Fish composition were used to construct an index of biotic integrity (IBI) and relate it to catchment are in the Seine river basin in France.
data(waterQuality)
data(waterQuality)
A dataframe containing:
The catchment area in square kilometres
Index of Biotic Integrity (IBI)
log base 10 of catchment area
Oberdorff, T. & Hughes, R. M. (1992). Modification of an index of biotic integrity based on fish assemblages to characterize rivers of the Seine Basin, France". Hydrobiologia, 228, 117-130.
data(waterQuality) plot(quality~logCatchment, data=waterQuality)
data(waterQuality) plot(quality~logCatchment, data=waterQuality)
Data from a study of the effect of offshore wind farms on fish communities (Bergström et al. 2013), with measurements before (2003) and after (2010) wind farm installation at 36 different sites in three zones – two affected by wind farms, and two control zones. Abundances have been recorded for 16 different taxa.
data(windFarms)
data(windFarms)
A list containing three objects:
A data frame with descriptors of location and time of sampling. These include:
'Year', a factor giving year of sampling, only 2003 and 2010 measurements are available here;
'Zone', a factor giving zone of sampling, WF
for wind farm, N
for Northern zone,
S
for Southern zone;
'Station', a factor indicating station ID;
'Impact', a factor indicating whether sampling is 'Before' or 'After' wind farm construction.
A data frame containing abundances of 16 different fish taxa.
The total abundance of fish at each site.
Bergström, L., Sundqvist, F., & Bergström, U. (2013). Effects of an offshore wind farm on temporal and spatial patterns in the demersal fish community. Marine Ecology Progress Series 485, 199-210.
data(windFarms) eels =windFarms$abund[,14] plot(eels~interaction(Impact,Zone), data=windFarms$X, horizontal=TRUE, las=1, xlab="",ylab="Eel abundance")
data(windFarms) eels =windFarms$abund[,14] plot(eels~interaction(Impact,Zone), data=windFarms$X, horizontal=TRUE, las=1, xlab="",ylab="Eel abundance")