Principal Coordinate Analysis (PCoA) in R

Here we will see how we can perform a principal coordinate analysis (PCoA) in R.

I have used a microbiome data from a gut microbiome study. This is just to demonstrate the workflow of how to perform the PCoA. This is not an attempt to do any meaningful scientific analysis as it requires sufficient expertise in the field of microbiome research.










code_pcoa.knit

Following are the steps to perform PCoA:

  • Import data
  • Preprocess
  • Calculate distance matrix
    • vegdistfunction from vegan library is used.
  • Perform PCoA on the distance matrix
    • the cmdscale function from base R is used.

Import data files

library(tidyverse)
library(vegan)
library(stats)

Read data

I have preprocessed the data so that it can be directly used to calculate distance matrix. The contains sample ids and the abundances of different taxa in those samples.

data <- read.csv('micr_data.csv') %>%
  as.data.frame()

We will view the data as tibble to see what it contains.

tibble(data)
## # A tibble: 185 × 295
##    X      Acetanaerobacterium Acetatifactor Acetivibrio Acidaminococcus
##    <chr>                <dbl>         <dbl>       <dbl>           <dbl>
##  1 S00001            0             0.000006    0.00121         0       
##  2 S10021            0.000002      0.000109    0.000081        0.000087
##  3 S10030            0.00001       0.000009    0.000062        0       
##  4 S10059            0.000118      0.000068    0.00053         0.000031
##  5 S10067            0.000131      0.000043    0.000378        0.00785 
##  6 S10073            0.000203      0.000042    0.0006          0.000068
##  7 S10074            0.000452      0.000022    0.000677        0.00076 
##  8 S10091            0.000514      0.000037    0.000886        0       
##  9 S10097            0.000043      0.000056    0.000567        0.00119 
## 10 S10099            0.00112       0           0.000223        0.000034
## # ℹ 175 more rows
## # ℹ 290 more variables: Actinobacillus <dbl>, Actinomyces <dbl>,
## #   Adlercreutzia <dbl>, Aerococcus <dbl>, Aggregatibacter <dbl>,
## #   Alistipes <dbl>, Alkaliphilus <dbl>, Alkalitalea <dbl>, Allisonella <dbl>,
## #   Alloprevotella <dbl>, Anaerobacterium <dbl>, Anaerofilum <dbl>,
## #   Anaerophaga <dbl>, Anaerosinus <dbl>, Anaerosporobacter <dbl>,
## #   Anaerostipes <dbl>, Anaerotruncus <dbl>, Anaerovorax <dbl>, …

We can see that the first column X has all sample names. The other columns represent microbial taxa and their values in the sample. The values represent proportion of a taxa in one sample. Therefore the values are less than one. Also, preprocessing of the data has deleted several sample as well as taxa from the original data.

We will set the sample names as the row names of the data frame and remove the X column.

# set row names as sample names
rownames(data) = data$X

# remove column X
data = data[,-1]

We do not know what these samples represent. The information about this is found in another file called ‘sample_data.csv’.

Read sample info

The other table we have is the sample ids and the categories that these samples belong to.

sample_info <- tibble(read.csv('sample_data.csv'))
sample_info
## # A tibble: 616 × 2
##    sample Group       
##    <chr>  <chr>       
##  1 S00001 Stage_III_IV
##  2 S10021 Stage_I_II  
##  3 S10023 Healthy     
##  4 S10025 Healthy     
##  5 S10029 Healthy     
##  6 S10030 Stage_I_II  
##  7 S10031 Healthy     
##  8 S10033 Stage_0     
##  9 S10034 HS          
## 10 S10036 Healthy     
## # ℹ 606 more rows

Calculate distance matrix

The data is of type data.frame.

class(data)
## [1] "data.frame"

We will convert this into a matrix.

data_mat = as.matrix(data)
class(data_mat)
## [1] "matrix" "array"

From this matrix will calculate the distance matrix. To make the results reproducible we will set a seed number first.

Then to calculate the distance matrix, we will use the vegdist function from vegan library.

# set seed
set.seed(500)

# calculate distance matrix
dist <- vegdist(data_mat, method='jaccard')

Perform PCoA

We will use the cmdscale function from base R here to do the PCoA. The argument k below makes the data to be represented in two dimensions.

# fit pcoa
pcoa_result <- cmdscale(dist, k=2, eig=TRUE, add=TRUE)

The PCoA result has a few components. To se what these are, use the attributes function.

attributes(pcoa_result)
## $names
## [1] "points" "eig"    "x"      "ac"     "GOF"

The points contains the PCoA result data as two columns as we had asked it to give.

The eig contains the eigenvalues of all the dimensions of the PCoA result. As we have got back only two dimensions of the total result, let’s see how much of the variation in data these two coordinate axes.

pcoa_result$eig[1:2]
## [1] 12.874834  5.373867

These two axis captures about 29% of the variation in the data. We can get this number as follows.

sum(pcoa_result$eig[1:2]) / sum(pcoa_result$eig)*100
## [1] 28.79456

We can now plot this result as a scatter plot and colour according to sample category from sample_info.

The points is a matrix now. We will add column names to the matrix.

# get x, y points
positions <- pcoa_result$points

# name the columns
colnames(positions) = c('pcoa1', 'pcoa2')

And convert it into a tibble and make the sample ids as the row names.

# conert the matrix into tibble
positions_tibble <- positions %>%
  as_tibble(rownames = 'sample')

The sample names in the plot data does not contain the category of the samples yet. We have read it into the sample_info variable. Remember, that all samples are not present in the position_tibble as the preprocessing steps have deleted several of them. We want to include the category information about only those samples that are in our final results.

The following code will select those samples from sample_info that are present in positions_tibble.

filt_sample_info <- sample_info %>%
  semi_join(positions_tibble, by='sample')

Now we can add the sample information to the plot data (positions_tibble) and plot it simultaneously.

positions_tibble %>%
  mutate(labels = filt_sample_info$Group) %>%
  ggplot(aes(x=pcoa1, y=pcoa2, color = labels)) +
  geom_point()

Popular posts from this blog

Principal Coordinate analysis in R and python