Principal Coordinate Analysis (PCoA) in R
Here we will see how we can perform a principal coordinate analysis (PCoA) in R.
I have used a microbiome data from a gut microbiome study. This is just to demonstrate the workflow of how to perform the PCoA. This is not an attempt to do any meaningful scientific analysis as it requires sufficient expertise in the field of microbiome research.
Following are the steps to perform PCoA:
- Import data
- Preprocess
- Calculate distance matrix
vegdist
function fromvegan
library is used.
- Perform PCoA on the distance matrix
- the
cmdscale
function from base R is used.
- the
Import data files
library(tidyverse)
library(vegan)
library(stats)
Read data
I have preprocessed the data so that it can be directly used to calculate distance matrix. The contains sample ids and the abundances of different taxa in those samples.
data <- read.csv('micr_data.csv') %>%
as.data.frame()
We will view the data as tibble to see what it contains.
tibble(data)
## # A tibble: 185 × 295
## X Acetanaerobacterium Acetatifactor Acetivibrio Acidaminococcus
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 S00001 0 0.000006 0.00121 0
## 2 S10021 0.000002 0.000109 0.000081 0.000087
## 3 S10030 0.00001 0.000009 0.000062 0
## 4 S10059 0.000118 0.000068 0.00053 0.000031
## 5 S10067 0.000131 0.000043 0.000378 0.00785
## 6 S10073 0.000203 0.000042 0.0006 0.000068
## 7 S10074 0.000452 0.000022 0.000677 0.00076
## 8 S10091 0.000514 0.000037 0.000886 0
## 9 S10097 0.000043 0.000056 0.000567 0.00119
## 10 S10099 0.00112 0 0.000223 0.000034
## # ℹ 175 more rows
## # ℹ 290 more variables: Actinobacillus <dbl>, Actinomyces <dbl>,
## # Adlercreutzia <dbl>, Aerococcus <dbl>, Aggregatibacter <dbl>,
## # Alistipes <dbl>, Alkaliphilus <dbl>, Alkalitalea <dbl>, Allisonella <dbl>,
## # Alloprevotella <dbl>, Anaerobacterium <dbl>, Anaerofilum <dbl>,
## # Anaerophaga <dbl>, Anaerosinus <dbl>, Anaerosporobacter <dbl>,
## # Anaerostipes <dbl>, Anaerotruncus <dbl>, Anaerovorax <dbl>, …
We can see that the first column X has all sample names. The other columns represent microbial taxa and their values in the sample. The values represent proportion of a taxa in one sample. Therefore the values are less than one. Also, preprocessing of the data has deleted several sample as well as taxa from the original data.
We will set the sample names as the row names of the data frame and
remove the X
column.
# set row names as sample names
rownames(data) = data$X
# remove column X
data = data[,-1]
We do not know what these samples represent. The information about this is found in another file called ‘sample_data.csv’.
Read sample info
The other table we have is the sample ids and the categories that these samples belong to.
sample_info <- tibble(read.csv('sample_data.csv'))
sample_info
## # A tibble: 616 × 2
## sample Group
## <chr> <chr>
## 1 S00001 Stage_III_IV
## 2 S10021 Stage_I_II
## 3 S10023 Healthy
## 4 S10025 Healthy
## 5 S10029 Healthy
## 6 S10030 Stage_I_II
## 7 S10031 Healthy
## 8 S10033 Stage_0
## 9 S10034 HS
## 10 S10036 Healthy
## # ℹ 606 more rows
Calculate distance matrix
The data
is of type data.frame
.
class(data)
## [1] "data.frame"
We will convert this into a matrix.
data_mat = as.matrix(data)
class(data_mat)
## [1] "matrix" "array"
From this matrix will calculate the distance matrix. To make the results reproducible we will set a seed number first.
Then to calculate the distance matrix, we will use the
vegdist
function from vegan
library.
# set seed
set.seed(500)
# calculate distance matrix
dist <- vegdist(data_mat, method='jaccard')
Perform PCoA
We will use the cmdscale
function from base R here to do
the PCoA. The argument k
below makes the data to be
represented in two dimensions.
# fit pcoa
pcoa_result <- cmdscale(dist, k=2, eig=TRUE, add=TRUE)
The PCoA result has a few components. To se what these are, use the
attributes
function.
attributes(pcoa_result)
## $names
## [1] "points" "eig" "x" "ac" "GOF"
The points
contains the PCoA result data as two columns
as we had asked it to give.
The eig
contains the eigenvalues of all the dimensions
of the PCoA result. As we have got back only two dimensions of the total
result, let’s see how much of the variation in data these two coordinate
axes.
pcoa_result$eig[1:2]
## [1] 12.874834 5.373867
These two axis captures about 29% of the variation in the data. We can get this number as follows.
sum(pcoa_result$eig[1:2]) / sum(pcoa_result$eig)*100
## [1] 28.79456
We can now plot this result as a scatter plot and colour according to
sample category from sample_info
.
The points
is a matrix now. We will add column names to
the matrix.
# get x, y points
positions <- pcoa_result$points
# name the columns
colnames(positions) = c('pcoa1', 'pcoa2')
And convert it into a tibble and make the sample ids as the row names.
# conert the matrix into tibble
positions_tibble <- positions %>%
as_tibble(rownames = 'sample')
The sample names in the plot data does not contain the category of
the samples yet. We have read it into the sample_info
variable. Remember, that all samples are not present in the
position_tibble as the preprocessing steps have deleted several of them.
We want to include the category information about only those samples
that are in our final results.
The following code will select those samples from
sample_info
that are present in
positions_tibble
.
filt_sample_info <- sample_info %>%
semi_join(positions_tibble, by='sample')
Now we can add the sample information to the plot data
(positions_tibble
) and plot it simultaneously.
positions_tibble %>%
mutate(labels = filt_sample_info$Group) %>%
ggplot(aes(x=pcoa1, y=pcoa2, color = labels)) +
geom_point()