contact@statdoe.com

Tutorials

One-Way ANOVA Step-by-Step – R tutorial

ANOVA, Tukey’s test, sumarised table, compact letter display

Table of Contents:

In this tutorial we are going to analyse the results of a one-factor experiment using analysis of variance (ANOVA).

We are going to analyse a data set the measure the release of radon in showers with different aperture diameters, published in the Environment International Journal.1

Loading the appropriate libraries

We are going to start by loading the appropriate libraries, the readr to load the data from a csv file, the ggplot2 for the plots, the dplyr for building a table with the summarized data (mean and standard deviation) and the multcompView to use the compact letter display to indicate significant differences in Tukey’s test.

# loading the appropriate libraries
library(readr)
library(ggplot2)
library(multcompView)
library(dplyr)


Loading and checking the data

The first step of the analysis is to load the data file. We will use the head function to see the first rows and the str function for the file structure.

# loading and checking the data

radon <- data.frame(D = c(0.37,0.37,0.37,0.37,0.51,0.51,0.51,0.51,0.71,0.71,0.71,0.71,1.02,1.02,1.02,1.02,1.4,1.4,1.4,1.4,1.99,1.99,1.99,1.99),
                    RR = c(80,83,83,85,75,75,79,79,74,73,76,77,67,72,74,74,62,62,67,69,60,61,64,66))
str(radon)
## 'data.frame':    24 obs. of  2 variables:
##  $ D : num  0.37 0.37 0.37 0.37 0.51 0.51 0.51 0.51 0.71 0.71 ...
##  $ RR: num  80 83 83 85 75 75 79 79 74 73 ...

The data file presents one column for the factor being studied, \(D\), the diameter, and a second column for the response variable, the percentage of radon released (\(RR\)). The file structure shows that we have 24 observations of 2 variables, and both variables are numeric.


Creating a simple plot for data visualisation

Before doing the analysis of variance, let’s plot the data to have an idea of its behavior. We are going to use the plot function, where the first argument is the \(x\)-variable, the diameter \(D\), and the second argument is the \(y\)-variable, the radon released, \(RR\). The dollar sign indicates that the variables \(D\) and \(RR\) are from the data file \(radon\).

# building a simple plot for data visualisation
plot(radon$D, radon$RR)

The plot shows a clear decrease in the radon released with the diameter. We will probably find a significant effect on the analysis of variance.


Analysis of variance for one factor – One-Way ANOVA

The next step is to perform the analysis of variance, mostly known as ANOVA, using the aov function. The arguments are the response variable \(RR\) as a function of the explanatory variable \(D\).

However, to perform the analysis of variance we need the independent variable, in this case the shower diameter, to be defined as a factor. We are going to do it using the function as.factor. We also need to point to the data file.

The result will be stored in an object called “anova.rr”, and to visualize it, we need to run the function summary.

# analysis of variance
anova.rr <- aov(RR ~ as.factor(D), data = radon)
summary(anova.rr)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(D)  5 1133.4  226.68   30.85 3.16e-08 ***
## Residuals    18  132.2    7.35                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result shows the ANOVA table with the degrees of freedom, the sum of squares, mean squares, F-value, and p-value. We consider that a factor is significant if the p-value is lower than the level of significance \(\alpha\), usually 0.05. The p-value of 3.16e-08 indicates that the shower diameter has a strong effect on the radon released. This result is expected considering the previous visualisation of the data performed in the previous section.


Creating a table with factors, means and standard deviation

To create a table with the resulting mean and standard deviation for each condition, we are going to use the dplyr library, using the functions group_by and summarise.

We are going to group the results in the radon data file by the variable D, the diameter.

In the new data set, that I have called “radon_summary”, we are going to create columns for the mean, and the standard deviation, sd, for each treatment. The response variable is the radon released, RR.

When creating de table, we are going to arrange the means in a descendent order, which will be necessary for adding the superscript letters from Tukey’s test.

# table with factors, means and standard deviation
radon_summary <- group_by(radon, D) %>%
  summarise(mean=mean(RR), sd=sd(RR)) %>%
  arrange(desc(mean))
print(radon_summary)
## # A tibble: 6 × 3
##       D  mean    sd
##   <dbl> <dbl> <dbl>
## 1  0.37  82.8  2.06
## 2  0.51  77    2.31
## 3  0.71  75    1.83
## 4  1.02  71.8  3.30
## 5  1.4   65    3.56
## 6  1.99  62.8  2.75

The summarised table presents the average and the standard deviation of the results for the radon released for each diameter.


Comparing means by Tukey’s test

The means comparison by Tukey’s test can be run on the object resulting from the analysis of variance (anova.rr).

The result (below) is an extensive table with all pairwise comparisons and the p-value for each one of them.

This data can be tricky to interpret and it is usual to use letters to indicate significant differences among the means.

# Tukey's test
tukey.rr <- TukeyHSD(anova.rr)
print(tukey.rr)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = RR ~ as.factor(D), data = radon)
## 
## $`as.factor(D)`
##             diff        lwr         upr     p adj
## 0.51-0.37  -5.75 -11.841234   0.3412336 0.0707511
## 0.71-0.37  -7.75 -13.841234  -1.6587664 0.0084181
## 1.02-0.37 -11.00 -17.091234  -4.9087664 0.0002404
## 1.4-0.37  -17.75 -23.841234 -11.6587664 0.0000004
## 1.99-0.37 -20.00 -26.091234 -13.9087664 0.0000001
## 0.71-0.51  -2.00  -8.091234   4.0912336 0.8968057
## 1.02-0.51  -5.25 -11.341234   0.8412336 0.1153360
## 1.4-0.51  -12.00 -18.091234  -5.9087664 0.0000841
## 1.99-0.51 -14.25 -20.341234  -8.1587664 0.0000089
## 1.02-0.71  -3.25  -9.341234   2.8412336 0.5513482
## 1.4-0.71  -10.00 -16.091234  -3.9087664 0.0007059
## 1.99-0.71 -12.25 -18.341234  -6.1587664 0.0000650
## 1.4-1.02   -6.75 -12.841234  -0.6587664 0.0249971
## 1.99-1.02  -9.00 -15.091234  -2.9087664 0.0021152
## 1.99-1.4   -2.25  -8.341234   3.8412336 0.8432736


Superscript letters to indicate significant differences

The use of letters to indicate significant differences in pairwise comparisons is called compact letter display, and can simplify the visualisation and discussion of significant differences among means. We are going to use the multcompLetters4 function from the multcompView package. The arguments are the object from an aov function and the object from the TukeyHSD function.

# creating the compact letter display
cld.rr <- multcompLetters4(anova.rr, tukey.rr)
print(cld.rr)
## $`as.factor(D)`
## 0.37 0.51 0.71 1.02  1.4 1.99 
##  "a" "ab"  "b"  "b"  "c"  "c"

The code output shows the compact letter display for the radon released with the shower diameter. This last one is the one we are going to use and the next code extract and adds the results of the compact letter display for each treatment to the table with the data summary.

# adding the compact letter display to the table with means and sd
cld <- as.data.frame.list(cld.rr$`as.factor(D)`)
radon_summary$Tukey <- cld$Letters
print(radon_summary)
## # A tibble: 6 × 4
##       D  mean    sd Tukey
##   <dbl> <dbl> <dbl> <chr>
## 1  0.37  82.8  2.06 a    
## 2  0.51  77    2.31 ab   
## 3  0.71  75    1.83 b    
## 4  1.02  71.8  3.30 b    
## 5  1.4   65    3.56 c    
## 6  1.99  62.8  2.75 c

The table can be exported as an csv file using the write.csv function.

write.csv("data_summary.csv")

If the idea is to present the final data as a table (as shown below), the analysis can end here. However, it is always interesting to present the results in a visual form. So the next sections show codes to build bar and scatter plots suitable for the presentation of this particular result.

Diameter (mm)
Radon Released (%)
Influence of shower head orifice size on the release of radon during showering.
0.37 82.8 ± 2.1 a
0.51 77.0 ± 2.3 ab
0.71 75.0 ± 1.8 b
1.02 71.8 ± 3.3 b
1.40 65.0 ± 3.6 c
1.99 62.8 ± 2.8 c


Plots for the final presentation of the results.

These section shows examples of bar and scatter plots suitable for presentations, posters and written reports. The plots show the means, the standard deviation and the compact letter display for each treatment. When using scatterplots, as the explanatory variable, the diameter, is a continuous numeric variable, it is possible do add a trendline showing the data behavior.

They plots are presented in gray-scale and colored according the results of Tukey’s test.

Barplots

A step-by-step tutorial on how to build these plots can be found on building barplots for one factor in R.



Scatterplots

A step-by-step tutorial on how to build these plots can be found on building scatterplots for one factor in R.




  1. Data source: Environment International Volume 18, Issue 4, 1992, Pages 363-369. https://doi.org/10.1016/0160-4120(92)90067-E↩︎

Leave a Reply