Posts

Showing posts with the label data analysis

Making boxplot with custom statistical values for the boxes

We will see here how to draw boxplots with Matplotlib when we already have the values that represent the box statistics, such as median, mean, minimum, maximum, first quartile, and third quartile. In some cases we may also have the confidence interval of the median. We will use the bxp() method of the Axes object returned by pyplot.subplots(). We begin by importing the necessary packages to draw the boxplots, namely NumPy and Matplotlib.

    import numpy as np
    import matplotlib.pyplot as plt

The data

Let’s take an example dataset. Shown in the table below is the data, which can be read from an Excel or CSV file into a pandas DataFrame (df).

df

       label  whislo  q1  med  mean   q3  whishi  cilo  cihi
    0      A       8  26   56    49   96     116    54    62
    1      B       7  21   53    45   96     122    51    59
    2      C      10  22   57    54  101     120    55    63
    3      D       9  26   54    50   98     116    52    60

Convert data into list of dictionaries

For making the plot...
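The excerpt above is cut off before the plotting step, so what follows is only a minimal sketch of how precomputed statistics can be passed to Axes.bxp(), not necessarily the code used in the full post. The values are copied from the table; the choice to hide fliers (the table has no "fliers" column) and the forced Agg backend are assumptions made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line for interactive use
import matplotlib.pyplot as plt

# One dictionary per box, with the keys that Axes.bxp() understands.
# Values are copied from the table in the excerpt above.
stats = [
    {"label": "A", "whislo": 8, "q1": 26, "med": 56, "mean": 49,
     "q3": 96, "whishi": 116, "cilo": 54, "cihi": 62},
    {"label": "B", "whislo": 7, "q1": 21, "med": 53, "mean": 45,
     "q3": 96, "whishi": 122, "cilo": 51, "cihi": 59},
    {"label": "C", "whislo": 10, "q1": 22, "med": 57, "mean": 54,
     "q3": 101, "whishi": 120, "cilo": 55, "cihi": 63},
    {"label": "D", "whislo": 9, "q1": 26, "med": 54, "mean": 50,
     "q3": 98, "whishi": 116, "cilo": 52, "cihi": 60},
]

fig, ax = plt.subplots()

# showmeans draws the "mean" values, shownotches uses "cilo"/"cihi".
# showfliers=False because the statistics contain no "fliers" entries.
artists = ax.bxp(stats, showmeans=True, shownotches=True, showfliers=False)
ax.set_ylabel("value")
```

If the statistics are already in a DataFrame with these column names, df.to_dict("records") produces the same list-of-dictionaries structure. Calling fig.savefig(...) or, in an interactive session, plt.show() then renders the figure.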

Principal Coordinate Analysis (PCoA) in R

Here we will see how to perform a principal coordinate analysis (PCoA) in R. I have used data from a gut microbiome study, purely to demonstrate the PCoA workflow. This is not an attempt at a meaningful scientific analysis, which would require sufficient expertise in the field of microbiome research.

Fill missing values using SimpleImputer

Data often contain missing values. Sometimes it makes sense to fill the missing values with an appropriate value. For example, we may want to fill a missing value with the mean of the available values. We can do this by calculating the mean of the column and using the pandas fillna() method. However, if several columns have missing values, we might have to repeat this process several times or write a loop. Scikit-learn offers a class called SimpleImputer that fills missing values across columns in one step.
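As a rough illustration of the idea, here is a minimal sketch that fills missing values in two columns with the column means using SimpleImputer. The DataFrame and its column names (height, weight) are made up for this example and do not come from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: each column has one missing value.
df = pd.DataFrame({
    "height": [150.0, np.nan, 170.0],
    "weight": [50.0, 60.0, np.nan],
})

# strategy="mean" replaces each NaN with the mean of its own column.
imputer = SimpleImputer(strategy="mean")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
filled = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)
print(filled)
```

Other values accepted by the strategy parameter include "median", "most_frequent", and "constant".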

Divide numerical data into categories

Sometimes we need to group numerical values into categories. For example, the population of a town might need to be divided into income groups, or the marks of students might need to be converted into grade levels. The pandas cut() function does this kind of binning very easily.
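The student-marks example could look something like the sketch below. The marks, the bin edges, and the grade labels are all invented for illustration; only pd.cut() itself comes from the text above.

```python
import pandas as pd

# Hypothetical marks out of 100.
marks = pd.Series([35, 58, 72, 91, 100], name="marks")

# Assumed grade boundaries: (0-40], (40-60], (60-80], (80-100].
bins = [0, 40, 60, 80, 100]
labels = ["F", "C", "B", "A"]

# include_lowest=True makes the first interval include a mark of 0.
grades = pd.cut(marks, bins=bins, labels=labels, include_lowest=True)
print(pd.DataFrame({"marks": marks, "grade": grades}))
```

By default each bin includes its right edge, so a mark of exactly 60 would fall into "C" here; passing right=False changes that behaviour.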

How to convert categorical text data into numerical data using OneHotEncoder

Machine learning algorithms handle numerical data better than text data. A dataset can contain categorical data in text form, such as gender, food_type, taxonomic_class, etc. To make better use of machine learning algorithms, we have to convert this categorical text data into numerical form. This can be done using encoders. Scikit-learn has a few types of encoders that convert categorical data into either binary or numerical data. Here we will learn about OneHotEncoder. OneHotEncoder converts categorical data into binary data: each category in a dataframe column becomes a separate column whose value is 1 in the rows where that category is present and 0 elsewhere. For example, if the gender in row number 12 of a dataset is 'male', then the column that OneHotEncoder creates for the 'male' category will have 1 in row number 12. We will see an example how to encod...