R and stats resources

A quick Google or YouTube search will reveal an abundance of tutorials and materials designed to help you learn how to program and conduct statistical analyses in R. Below, we list some free resources that have helped us in the past; the list is by no means exhaustive.

Books

Modelling and stats

If you haven’t taken the ANM203 course at UniSC (Statistics with Teeth: Understanding Ecological Data), contact Dave for the course notes. This is a second-year undergraduate course offered in the Bachelor of Animal Ecology program that teaches students how to use R in a quantitative ecology context. Lessons include introductory programming, the basics of data visualisation with ggplot2, general linear models, generalised linear models, and conditional inference trees.

Similarly, ask Dave or Kylie for notes and recordings from the now-retired third-year course ENS315 (Advanced Numerical Techniques in Ecology), an advanced unit that covered generalised additive models, generalised linear mixed models, and multivariate analyses (e.g., PCA, NMDS).

Dave co-runs a series of R workshops each year at the University of Queensland, in collaboration with Anthony Richardson, Jason Everett, and Christina Buelow. These workshops cover a range of skill levels, techniques and modelling approaches, and students from our lab get free registration and access to all course notes. Contact Dave for upcoming dates and further information.

Dr. Chris Brown (UTAS) often posts free R and stats tutorials on the Seascape Models website.

Dr. Alain Zuur and Dr. Elena Ieno publish an excellent collection of textbooks on statistics with R that are specifically designed for ecologists. Topics include GLMs/GAMs, mixed-effects models, R-INLA, data visualisation, zero-inflated modelling, etc. They also run online and in-person courses. Kylie, Dave and Jessie have some personal hard copies you might be able to borrow; otherwise, check out the Highland Statistics website here.

GLMs and GAMs

Mixed effects models

Note

Mixed-effects models, hierarchical models, nested models, multilevel models, clustered models... these all refer to the same thing. Different academic communities use different terminology; here, we refer to them as mixed-effects models.
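As a minimal sketch, a random-intercept model fitted with the widely used lme4 package might look like the following (the data frame fish and its columns are hypothetical placeholders, not from a real dataset):

```r
library(lme4)

# Random-intercept mixed-effects model: fish length as a function of
# temperature, with a separate baseline (intercept) for each site.
# `fish`, `length`, `temperature` and `site` are hypothetical placeholders.
m <- lmer(length ~ temperature + (1 | site), data = fish)
summary(m)
```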

Error distributions

Below are some of the more common distributions used in GLMs and GLMMs that we suggest you familiarise yourself with (a code sketch fitting each family follows the list):

  • Gaussian (AKA ‘normal’). Continuous data. Has two parameters: mean and variance. Assumption: data are normally distributed around the mean (after accounting for predictors). E.g., modelling the heights of university students in a classroom.
  • Gamma. Continuous data. Unlike the Gaussian distribution, the gamma can only take positive values (i.e., negative values are not allowed) and is distinctly right-skewed. E.g., modelling rainfall amounts on wet days.
  • Tweedie. Continuous or count data. Positive values, with zeros allowed. Good when there’s a spike at zero and/or right skewness. Can also be used if a Poisson results in overdispersion. E.g., commonly used to model fisheries biomass.
  • Poisson. Count data. Non-negative integers only. Assumption: the variance equals the mean. Unlike a Gaussian distribution, which has two parameters (mean and variance), the Poisson distribution has only one (the mean, which is ALSO the variance: µ). E.g., the number of times a bird feeds a chick in a day.
  • Negative binomial. Count data. Non-negative integers only. Typically used for overdispersed count data with a large range that can’t be accurately modelled with the Poisson distribution. Includes an extra dispersion parameter (k), giving a variance that increases quadratically with the mean (µ + µ²/k) and so allows for more variation in the response than a Poisson. E.g., the number of endoparasites observed in the muscle tissue of a fish.
  • Zero-truncated Poisson/negative binomial. Count data. Positive integers only, where the value of 0 cannot occur. E.g., the number of times a whale surfaces to breathe (this can’t be 0, or the whale will die).
  • Binomial. Binary data: 0 or 1. Presence/absence, survival/death, yes/no, etc. E.g., presence of crabs on a beach.
  • Beta. Data bounded between 0 and 1, such as percentages, proportions or probabilities. E.g., the proportion or percentage of healthy and diseased tissue on a coral head.
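As a rough guide, the sketch below shows one common way to fit each family in R. The data frame dat and all of its column names are hypothetical placeholders, and the packages used (MASS, mgcv, betareg, VGAM) are popular choices rather than the only options:

```r
library(MASS)      # glm.nb() for the negative binomial
library(mgcv)      # gam() supports the Tweedie family via tw()
library(betareg)   # betareg() for beta regression

# `dat` and all of its columns are hypothetical placeholders
m_gaus  <- glm(height ~ 1, family = gaussian, data = dat)              # Gaussian
m_gamma <- glm(rain_mm ~ 1, family = Gamma(link = "log"), data = dat)  # gamma
m_tw    <- gam(biomass ~ 1, family = tw(), data = dat)                 # Tweedie
m_pois  <- glm(visits ~ 1, family = poisson, data = dat)               # Poisson
m_nb    <- glm.nb(parasites ~ 1, data = dat)                           # negative binomial
m_binom <- glm(present ~ 1, family = binomial, data = dat)             # binomial
m_beta  <- betareg(prop_healthy ~ 1, data = dat)                       # beta

# Zero-truncated counts need a dedicated package,
# e.g. VGAM::vglm() with family = pospoisson()
```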

Regular expressions

Regular expressions (i.e., “regexps”) are a very powerful tool for working with string patterns using advanced wildcard options. A regular expression is a pattern (or filter) that describes a set of matching strings. Regex is supported in R, and we can use it, for example, to extract information from file names (amongst other things).

If our filename is do_u_1981_RG.nc, we can extract just the year by running gsub("[^0-9]+", "", "do_u_1981_RG.nc"). This removes all characters from the string, EXCEPT the numbers.
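Run as a quick check, that call returns the year (output shown as a comment):

```r
fname <- "do_u_1981_RG.nc"

# Delete every run of non-digit characters, leaving only the digits
gsub("[^0-9]+", "", fname)
#> [1] "1981"
```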

Here are some other regular expressions we can use with the base R function gsub():

test <- "hello123!@#"
- gsub("\\d", "", test) - return everything EXCEPT digits
- gsub("\\D", "", test) - return JUST digits
- gsub("l", "", test) - return everything EXCEPT the lower-case letter l
- gsub("[A-Za-z0-9]", "", test) - return special characters ONLY
- ls(pattern = "hello.*!@#") - list all objects in the global environment whose names contain hello, followed by any characters, followed by the three special characters (i.e., !@#).

A common way of using regexps in R is via the stringr package. See here for a tutorial and overview of key functions.
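For example, the year-extraction example above could be rewritten with stringr::str_extract(), which returns the first match rather than deleting everything else:

```r
library(stringr)

# Match the first run of exactly four digits in the filename
str_extract("do_u_1981_RG.nc", "\\d{4}")
#> [1] "1981"
```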