Posts

Making boxplot with custom statistical values for the boxes

Image
We will see here how to draw boxplots, using Matplotlib, when we have a set of values that represent the box statistics, such as median, mean, minimum, maximum, first quartile, and third quartile. In some case we may also have confidence interval of the median values. We will use the .bxp attribute of the plot axes in pyplot.subplots . We begin first by importing the necessary packages to draw the boxplots, namely, NumPy and Matplotlib . import numpy as np import matplotlib.pyplot as plt The data Let’s take an example of a data. Show in the table below is the data which can be read from a excel or csv file into a pandas dataframe ( df ). df label whislo q1 med mean q3 whishi cilo cihi 0 A 8 26 56 49 96 116 54 62 1 B 7 21 53 45 96 122 51 59 2 C 10 22 57 54 101 120 55 63 3 D 9 26 54 50 98 116 52 60 Convert data into list of dictionaries For making the plot...

Running python code from file using exec()

Image
The exec function in python can be used to execute strings as python code. One of the simplest example can be as follows: a=2 exec("print(a)") 2 Note that we have not directly used the print() function to print the value of variable a , but have given the code print(a) as a string input to the exec function. The ability to excute strings as python code can be used to excute code stored in text files. A larger software tool can generrate a code in terms of variable that can be used in later steps of the program. It can also, be used to perfrom repetative tasks. Let’s say we have to run some imports statements and some code everytime you start a new session of python. We want those imported packages and variables available on the workspace console instead of inside a module. In this case you can just make a file that contains all those commands and run those commands via exec . We will see an example here. Let’s say we want to import matplotlib, numpy, pandas, s...

Principal Coordinate Analysis (PCoA) in R

Image
Here we will see how we can perform a principal coordinate analysis (PCoA) in R. I have used a microbiome data from a gut microbiome study. This is just to demonstrate the workflow of how to perform the PCoA. This is not an attempt to do any meaningful scientific analysis as it requires sufficient expertise in the field of microbiome research.

Why accuracy is not a good metric for scoring classification models?

  Accuracy of machine learning models trained to classify data into discreet categories is the proportion of samples the model is correctly able to classify. For example, in data that contains two categories, if the model is able to correctly predict 45 out of 50 samples, then the accuracy of the model is 90%.

Fill missing values using SimpleImputer

Data often would contain missing values. Sometime it makes sense to fill the missing values with some appropriate value. For example we may want to fill the missing value with, say, mean of the available values. We can fill such missing values by calculating the mean of the column and using the fillna() function. However, if several columns have missing values then we might have to repeat this process several times or write a loop. Scikit-learn offers functionality called as SimpleImputer to easily fill the missing values . 

Divide numerical data into categories

Sometimes we need to categories numerical values into different categories. For example, the population of town might be needed to be categorized into different income groups. Or, the marks of students might be needed to be categorized into different grade levels. Pandas’ cut() method can be used to categorize the numerical values very easily.

Python classes - Inheritance

In the previous post we saw how to create python classes and methods under them. We create a DNA class representing a DNA sequence. However, we can treat a DNA sequence as a string. A special kind of string that consists of only four letters, namely, ‘A’, ‘T’, ‘G’, and ‘C’ representing the nucleotides adenine, thiamine, guanine and cytosine, respectively. The DNA sequence should not contain any other characters. For ease of use we will allow entry of small and capital case letters which would be converted to capital case letter inside the class definition. Here we will create a class that is inherits properties from the built-in str class. class subclass(parent_class): # class definition To do so we just have to put the parent class in brackets while defining our current class. We can create as many subclasses that are themselves inherited from other subclasses in this way. class subclass(parent_class): # class definition class subclass_2(subclass): # class def...

Python classes : introduction

Classes are user-defined objects. Python has several built-in object types such as integers, float and strings. The programmers can create objects required for their program. We have seen some of the python objects previously. For example: a = 3 print ( type (a)) <class 'int'> Here, a is an int type of object.

Chi-square distribution and acceptable range.

Image
When to use χ 2 test? χ 2 test is used to check the goodness of fit of data points calculated by a function against the observed data points. One of the applications I know of it is when unknown parameters when passed to a function give an observed data. function(parameters) --> data points In such cases we can back calculate the parameters from observed data points. The way to solve these problems is to optimize the parameters by passing them to the functions and trying to minimize the sum of squared differences (SSR) of the calculated and observed values. Following is a rough pseudo-code for this.

Functions in python - args and kwargs

Earlier we saw how to write functions in python and how to call them in our program. For those functions, the number of inputs or arguments were defined. They took fixed number of positional or keyword arguments or had a default value assigned to one or more of the arguments.