Principal Coordinate analysis in R and python
Principal Coordinate Analysis (PCoA) is used in microbiome research for summarizing the compositional differences in the microbial community between samples. In the few papers I read, the analysis was done in R. I found one paper which had used python for doing PCoA. Here I tried to compare, how PCoA works on both platforms. I wanted to test if I can keep on working in Python which I am comfortable with or will I need to get comfortable with R for microbiome analysis.
Dataset
A small dataset was generated which basically is a matrix of distances between few cities in India. I chose India because I know the relative position of these cities on map. I made a csv file in excel. Following is how the data looks like.
The general steps in PCoA are as follows:
- Get or calculate the distance matrix. In this case we already have it in csv file format.
- Perform PCoA
- Plot the results
We will do the anlaysis in R first and then in python.
PCoA in R
Read the data
data <- read.csv('datasets/indian_city_distance.csv')
data
The above data is in the from of dataframe. We need to convert it to a matrix.
distance_matrix <- as.dist(data[,c(2:12)])
print(distance_matrix)
Perform PCoA
We use cmdscale() function from stats library.
#import stats library
library(stats)
# fit pcoa
my_pcoa <- cmdscale(distance_matrix, k=2, eig=TRUE, add=TRUE)
#print the new coordinates fro each city
print(my_pcoa)
Plot the result
# plot the pcoa results
plot(my_pcoa$points)
#add labels of city names
text(my_pcoa$points[,1], my_pcoa$points[,2], labels=rownames(my_pcoa$points), cex=1)
The relative position of the cities on this are very similar to what they actually are on India's map.
PCoA in python
Now, I wanted to test the same thing in python. Let's do it.
Read the data
The pandas package in python was used to read the csv file. scikit-bio package was used to perform PCoA. For this, a symmetric distance matrix is required. So, I converted the distance matrix which initially was lower triangle matrix into a symmetric matrix.
import pandas as pd
data = pd.read_csv('datasets\indian_city_distance.csv')
data
The dataframe was then converted to matrix form.
distance_matrix = data.iloc[:,1:].to_numpy() #data.iloc[1::,:]
print(distance_matrix)
Perform PCoA
#%% Perform PCoA
from skbio.stats.ordination import pcoa
pcoa_result = pcoa(distance_matrix, number_of_dimensions=2)
print('\n\n', pcoa_result)
Here we can see that the some of the eigenvalues calculated are negative. We will discuss it later.
The dimension reduced data can be seen by :
print('\n\n', pcoa_result.samples)
Plot the result
import matplotlib.pyplot as plt
df = pcoa_result.samples[['PC1', 'PC2']]
df['city_names'] = data['city_names'].to_numpy()
df = df.set_index('city_names')
print('\n\n', df)
#ax = pcoa_result.samples.plot.scatter(x='PC1', y='PC2')
fig, ax = plt.subplots()
df.plot('PC1', 'PC2', kind='scatter', ax=ax)
for k, v in df.iterrows():
ax.annotate(k, v)
Here also, the relative positions of the cities are similar to what they actually are.
Thoughts
It is apparent that PCoA performed similarly in both python and R for this example. However, there were some differences in the way I ran both programs. The R program ran on on my laptop (Windows). It was very convinient for me. The python program which used scikit-bio package, required linux machine to run. I used the linus server available to me. Writing codes there while debugging is quite tedious. So, for my research purpose, using R in this respect seems logical. This is true when I will have a large microbiome datasets. The microbiome dataset files that I deal with about 10 MB size.
The plot with python looked better, but that could be improved with R as I gain experience. R has good libraries like ggplot2 to render great diagrams.
Also, the python PCoA function gave following warning.
...../python3.10/site-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:143: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -699819.7351834401 and the largest is 11583666.045464104.
warn(
Some of the eigen values calculated were negative. Currently there is no method implemented in python to avoid this. The scikit-bio site notes as follows:
If the distance is not euclidean (for example if it is a semimetric and the triangle inequality doesn’t hold), negative eigenvalues can appear. There are different ways to deal with that problem (see Legendre & Legendre 1998, S 9.2.3), but none are currently implemented here. However, a warning is raised whenever negative eigenvalues appear, allowing the user to decide if they can be safely ignored.
While in R, this situation was avoided by adding the parameter add=TRUE to the cmdscale function.
This means that the PCoA method implemented in scikit-bio is still to be improved. Hence, using R for PCoA seems logical for my work where I would have microbiome data from thousands of taxa.