Divide numerical data into categories

Sometimes we need to sort numerical values into categories. For example, the population of a town might need to be divided into income groups, or the marks of students might need to be mapped to grade levels.

The pandas cut() function makes it easy to categorize numerical values.

Let’s say we have a set of integers from 1 to 100 and we want to categorize them into the following ranges:

1 to 20
21 to 50
51 to 80
81 to 100

It can be done as follows:

First we will import pandas and numpy. We will use numpy for generating numbers.

import numpy as np
import pandas as pd

We will then create an array of ten thousand random integers from 1 to 100. Note that np.random.randint() excludes the upper bound, so high must be 101 for the value 100 to be possible.

We will also put this array of numbers into a pandas DataFrame.

numbers = np.random.randint(low=1, high=101, size=10000)

data = pd.DataFrame()
data['numbers'] = numbers
data.head()

	numbers
0	53
1	96
2	75
3	85
4	84
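Because the numbers are random, the rows shown above will differ from run to run. A minimal sketch of one way to make the data reproducible, assuming NumPy's Generator API is acceptable in place of np.random.randint (the seed value 0 and the name sample are arbitrary choices for illustration, not part of the original code):

```python
import numpy as np

# Seeded generator so that every run produces the same numbers
# (the seed value is an arbitrary, illustrative choice)
rng = np.random.default_rng(seed=0)

# Like np.random.randint, integers() excludes `high` by default,
# so high=101 allows values from 1 to 100 inclusive
sample = rng.integers(low=1, high=101, size=10000)

# Quick sanity check of the generated range
print(sample.min(), sample.max())
```

Printing the minimum and maximum is a cheap way to confirm that the generated values stay inside the range the bins are meant to cover.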

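To categorize these numbers into the ranges listed above, the following code is used. Each bin excludes its left edge and includes its right edge, so the first edge is 0 rather than 1.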

data['categories'] = pd.cut(data['numbers'],
                            bins=[0, 20, 50, 80, 100])
data.head()

	numbers	categories
0	53	(50, 80]
1	96	(80, 100]
2	75	(50, 80]
3	85	(80, 100]
4	84	(80, 100]

We can see above that the categories are created.
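The notation (50, 80] means that 50 is excluded and 80 is included. A small sketch to check this on boundary values, using the same bin edges as above (the hand-picked values and the names edge_values, edge_bins and bin_counts are illustrative assumptions, not part of the original data):

```python
import pandas as pd

# Hand-picked values sitting on both sides of every bin edge
edge_values = pd.Series([1, 20, 21, 50, 51, 80, 81, 100])

# Same edges as in the post; by default each bin is open on the left
# and closed on the right: (0, 20], (20, 50], (50, 80], (80, 100]
edge_bins = pd.cut(edge_values, bins=[0, 20, 50, 80, 100])

# Count how many values landed in each bin, in bin order
bin_counts = edge_bins.value_counts(sort=False)
print(bin_counts)
```

Here 20 falls into the first bin and 21 into the second, which matches the ranges 1 to 20 and 21 to 50 listed at the start. Any value outside the outermost edges (for example 0 or 101) would be assigned NaN rather than a bin.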

The categories can be labelled according to our needs as follows:

data['categories'] = pd.cut(data['numbers'],
                            bins=[0, 20, 50, 80, 100],
                            labels=[1, 2, 3, 4])
data.head()

	numbers	categories
0	53	3
1	96	4
2	75	3
3	85	4
4	84	4

We can label the categories with strings too, but numerical labels are often more useful because many machine learning algorithms expect numerical input.
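As a rough illustration of both options, the sketch below labels the bins with strings and then derives integer codes from them. The label names and the sample values are hypothetical and chosen only for this example:

```python
import pandas as pd

# Illustrative values, one per bin
scores = pd.Series([15, 35, 65, 95])

# String labels: one label per bin, given in bin order
# (the label names are hypothetical)
groups = pd.cut(scores, bins=[0, 20, 50, 80, 100],
                labels=["low", "medium", "high", "very high"])

# The result is categorical; .cat.codes gives 0-based integer codes
codes = groups.cat.codes

# Alternatively, labels=False returns the 0-based bin index directly
bin_index = pd.cut(scores, bins=[0, 20, 50, 80, 100], labels=False)

print(groups.tolist())
print(codes.tolist())
print(bin_index.tolist())
```

Keeping the string labels for readability and converting to codes only when a model needs numbers is one possible compromise. Note that these codes start at 0, unlike the labels 1 to 4 used above.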

Finally, the complete code is shown in picture form below.

Image generated using Carbon (https://carbon.now.sh/)
