One-Way ANOVA Step-by-Step – R tutorial
ANOVA, Tukey’s test, sumarised table, compact letter display
Table of Contents:
- Loading the appropriate libraries
- Loading and checking the data
- Creating a simple plot for data visualisation
- Analysis of variance for one factor – One-Way ANOVA
- Creating a table with factors, means and standard deviation
- Comparing means by Tukey’s test
- Superscript letters to indicate significant differences
- Plots for the final presentation of the results.
- Related video tutorials
In this tutorial we are going to analyse the results of a one-factor experiment using analysis of variance (ANOVA).
We are going to analyse a data set the measure the release of radon in showers with different aperture diameters, published in the Environment International Journal.1
Loading the appropriate libraries
We are going to start by loading the appropriate libraries, the
readr
to load the data from a csv file, the
ggplot2
for the plots, the dplyr
for building
a table with the summarized data (mean and standard deviation) and the
multcompView
to use the compact letter display to indicate
significant differences in Tukey’s test.
# loading the appropriate libraries
library(readr)
library(ggplot2)
library(multcompView)
library(dplyr)
Loading and checking the data
The first step of the analysis is to load the data file. We will use
the head
function to see the first rows and the
str
function for the file structure.
# loading and checking the data
<- data.frame(D = c(0.37,0.37,0.37,0.37,0.51,0.51,0.51,0.51,0.71,0.71,0.71,0.71,1.02,1.02,1.02,1.02,1.4,1.4,1.4,1.4,1.99,1.99,1.99,1.99),
radon RR = c(80,83,83,85,75,75,79,79,74,73,76,77,67,72,74,74,62,62,67,69,60,61,64,66))
str(radon)
## 'data.frame': 24 obs. of 2 variables:
## $ D : num 0.37 0.37 0.37 0.37 0.51 0.51 0.51 0.51 0.71 0.71 ...
## $ RR: num 80 83 83 85 75 75 79 79 74 73 ...
The data file presents one column for the factor being studied, \(D\), the diameter, and a second column for the response variable, the percentage of radon released (\(RR\)). The file structure shows that we have 24 observations of 2 variables, and both variables are numeric.
Creating a simple plot for data visualisation
Before doing the analysis of variance, let’s plot the data to have an
idea of its behavior. We are going to use the plot
function, where the first argument is the \(x\)-variable, the diameter \(D\), and the second argument is the \(y\)-variable, the radon released, \(RR\). The dollar sign indicates that the
variables \(D\) and \(RR\) are from the data file \(radon\).
# building a simple plot for data visualisation
plot(radon$D, radon$RR)
The plot shows a clear decrease in the radon released with the diameter. We will probably find a significant effect on the analysis of variance.
Analysis of variance for one factor – One-Way ANOVA
The next step is to perform the analysis of variance, mostly known as
ANOVA, using the aov
function. The arguments are the
response variable \(RR\) as a function
of the explanatory variable \(D\).
However, to perform the analysis of variance we need the independent
variable, in this case the shower diameter, to be defined as a factor.
We are going to do it using the function as.factor
. We also
need to point to the data file.
The result will be stored in an object called “anova.rr”, and to visualize it, we need to run the function summary.
# analysis of variance
<- aov(RR ~ as.factor(D), data = radon)
anova.rr summary(anova.rr)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(D) 5 1133.4 226.68 30.85 3.16e-08 ***
## Residuals 18 132.2 7.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result shows the ANOVA table with the degrees of freedom, the sum of squares, mean squares, F-value, and p-value. We consider that a factor is significant if the p-value is lower than the level of significance \(\alpha\), usually 0.05. The p-value of 3.16e-08 indicates that the shower diameter has a strong effect on the radon released. This result is expected considering the previous visualisation of the data performed in the previous section.
Creating a table with factors, means and standard deviation
To create a table with the resulting mean and standard deviation for
each condition, we are going to use the dplyr
library,
using the functions group_by
and
summarise
.
We are going to group the results in the radon data file by the variable D, the diameter.
In the new data set, that I have called “radon_summary”, we are going to create columns for the mean, and the standard deviation, sd, for each treatment. The response variable is the radon released, RR.
When creating de table, we are going to arrange the means in a descendent order, which will be necessary for adding the superscript letters from Tukey’s test.
# table with factors, means and standard deviation
<- group_by(radon, D) %>%
radon_summary summarise(mean=mean(RR), sd=sd(RR)) %>%
arrange(desc(mean))
print(radon_summary)
## # A tibble: 6 × 3
## D mean sd
## <dbl> <dbl> <dbl>
## 1 0.37 82.8 2.06
## 2 0.51 77 2.31
## 3 0.71 75 1.83
## 4 1.02 71.8 3.30
## 5 1.4 65 3.56
## 6 1.99 62.8 2.75
The summarised table presents the average and the standard deviation of the results for the radon released for each diameter.
Comparing means by Tukey’s test
The means comparison by Tukey’s test can be run on the object resulting from the analysis of variance (anova.rr).
The result (below) is an extensive table with all pairwise comparisons and the p-value for each one of them.
This data can be tricky to interpret and it is usual to use letters to indicate significant differences among the means.
# Tukey's test
<- TukeyHSD(anova.rr)
tukey.rr print(tukey.rr)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = RR ~ as.factor(D), data = radon)
##
## $`as.factor(D)`
## diff lwr upr p adj
## 0.51-0.37 -5.75 -11.841234 0.3412336 0.0707511
## 0.71-0.37 -7.75 -13.841234 -1.6587664 0.0084181
## 1.02-0.37 -11.00 -17.091234 -4.9087664 0.0002404
## 1.4-0.37 -17.75 -23.841234 -11.6587664 0.0000004
## 1.99-0.37 -20.00 -26.091234 -13.9087664 0.0000001
## 0.71-0.51 -2.00 -8.091234 4.0912336 0.8968057
## 1.02-0.51 -5.25 -11.341234 0.8412336 0.1153360
## 1.4-0.51 -12.00 -18.091234 -5.9087664 0.0000841
## 1.99-0.51 -14.25 -20.341234 -8.1587664 0.0000089
## 1.02-0.71 -3.25 -9.341234 2.8412336 0.5513482
## 1.4-0.71 -10.00 -16.091234 -3.9087664 0.0007059
## 1.99-0.71 -12.25 -18.341234 -6.1587664 0.0000650
## 1.4-1.02 -6.75 -12.841234 -0.6587664 0.0249971
## 1.99-1.02 -9.00 -15.091234 -2.9087664 0.0021152
## 1.99-1.4 -2.25 -8.341234 3.8412336 0.8432736
Superscript letters to indicate significant differences
The use of letters to indicate significant differences in pairwise
comparisons is called compact letter display, and can simplify the
visualisation and discussion of significant differences among means. We
are going to use the multcompLetters4
function from the
multcompView
package. The arguments are the object from an
aov
function and the object from the TukeyHSD
function.
# creating the compact letter display
<- multcompLetters4(anova.rr, tukey.rr)
cld.rr print(cld.rr)
## $`as.factor(D)`
## 0.37 0.51 0.71 1.02 1.4 1.99
## "a" "ab" "b" "b" "c" "c"
The code output shows the compact letter display for the radon released with the shower diameter. This last one is the one we are going to use and the next code extract and adds the results of the compact letter display for each treatment to the table with the data summary.
# adding the compact letter display to the table with means and sd
<- as.data.frame.list(cld.rr$`as.factor(D)`)
cld $Tukey <- cld$Letters
radon_summaryprint(radon_summary)
## # A tibble: 6 × 4
## D mean sd Tukey
## <dbl> <dbl> <dbl> <chr>
## 1 0.37 82.8 2.06 a
## 2 0.51 77 2.31 ab
## 3 0.71 75 1.83 b
## 4 1.02 71.8 3.30 b
## 5 1.4 65 3.56 c
## 6 1.99 62.8 2.75 c
The table can be exported as an csv file using the write.csv function.
write.csv("data_summary.csv")
If the idea is to present the final data as a table (as shown below), the analysis can end here. However, it is always interesting to present the results in a visual form. So the next sections show codes to build bar and scatter plots suitable for the presentation of this particular result.
Diameter (mm)
|
Radon Released (%)
|
|||
---|---|---|---|---|
0.37 | 82.8 | ± | 2.1 | a |
0.51 | 77.0 | ± | 2.3 | ab |
0.71 | 75.0 | ± | 1.8 | b |
1.02 | 71.8 | ± | 3.3 | b |
1.40 | 65.0 | ± | 3.6 | c |
1.99 | 62.8 | ± | 2.8 | c |
Plots for the final presentation of the results.
These section shows examples of bar and scatter plots suitable for presentations, posters and written reports. The plots show the means, the standard deviation and the compact letter display for each treatment. When using scatterplots, as the explanatory variable, the diameter, is a continuous numeric variable, it is possible do add a trendline showing the data behavior.
They plots are presented in gray-scale and colored according the results of Tukey’s test.
Barplots
A step-by-step tutorial on how to build these plots can be found on building barplots for one factor in R.
Scatterplots
A step-by-step tutorial on how to build these plots can be found on building scatterplots for one factor in R.
Data source: Environment International Volume 18, Issue 4, 1992, Pages 363-369. https://doi.org/10.1016/0160-4120(92)90067-E↩︎