Divide numerical data into categories
Sometimes we need to sort numerical values into categories. For example, the population of a town might need to be grouped into income brackets, or students' marks might need to be mapped to grade levels.
Pandas’ cut() function makes it easy to categorize numerical values.
Let’s say we have a set of integers from 1 to 100 and we want to categorize them into the following ranges:
- 1 to 20
- 21 to 50
- 51 to 80
- 81 to 100
It can be done as follows:
First we will import pandas and numpy. We will use numpy for generating the numbers.
import numpy as np
import pandas as pd
We will then create an array of ten thousand random integers between 1 and 100 and put it into a pandas dataframe.
# high is exclusive in randint, so 101 is needed for 100 to be included
numbers = np.random.randint(low=1, high=101, size=10000)

data = pd.DataFrame()
data['numbers'] = numbers
data.head()
| | numbers |
|---|---|
| 0 | 53 |
| 1 | 96 |
| 2 | 75 |
| 3 | 85 |
| 4 | 84 |
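Because the numbers are random, the rows shown will differ from run to run. A minimal sketch of a repeatable variant, assuming numpy's newer Generator interface is available (the seed value 42 is an arbitrary choice, not something from the original code):

```python
import numpy as np
import pandas as pd

# Assumption: the seed 42 is arbitrary; any fixed integer makes the draw repeatable.
rng = np.random.default_rng(seed=42)

# endpoint=True makes the upper bound inclusive, so 100 can be drawn.
numbers = rng.integers(low=1, high=100, size=10000, endpoint=True)

data = pd.DataFrame({'numbers': numbers})
print(data['numbers'].min(), data['numbers'].max())
```

Running this twice with the same seed should give the same dataframe both times.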
To categorize these numbers into the ranges mentioned above, the following code is used.
data['categories'] = pd.cut(data['numbers'],
                            bins=[0, 20, 50, 80, 100])
data.head()
| | numbers | categories |
|---|---|---|
| 0 | 53 | (50, 80] |
| 1 | 96 | (80, 100] |
| 2 | 75 | (50, 80] |
| 3 | 85 | (80, 100] |
| 4 | 84 | (80, 100] |
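We can see above that each number has been assigned to an interval. An interval such as (50, 80] excludes its left edge and includes its right edge, so it covers 51 to 80 for integers.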
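To check how values that sit exactly on a bin edge are handled, here is a small sketch using a hand-picked series (the values are illustrative and not part of the dataset above). It relies on the default right=True behaviour of pd.cut:

```python
import pandas as pd

# Hand-picked values that sit on both sides of every bin edge.
values = pd.Series([1, 20, 21, 50, 51, 80, 81, 100])

# Default right=True: each bin includes its right edge and excludes its left edge.
binned = pd.cut(values, bins=[0, 20, 50, 80, 100])
print(binned.astype(str).tolist())
# ['(0, 20]', '(0, 20]', '(20, 50]', '(20, 50]', '(50, 80]', '(50, 80]', '(80, 100]', '(80, 100]']

# A value equal to the lowest edge (0 here) falls outside every bin and
# becomes NaN unless include_lowest=True is passed.
with_zero = pd.cut(pd.Series([0, 1]), bins=[0, 20, 50, 80, 100])
print(with_zero.isna().tolist())  # [True, False]
```

This is why the first bin edge is 0 rather than 1: with bins starting at 1, the value 1 itself would not be assigned to any category.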
The categories can be labelled according to our needs as follows:
data['categories'] = pd.cut(data['numbers'],
                            bins=[0, 20, 50, 80, 100],
                            labels=[1, 2, 3, 4])
data.head()
| | numbers | categories |
|---|---|---|
| 0 | 53 | 3 |
| 1 | 96 | 4 |
| 2 | 75 | 3 |
| 3 | 85 | 4 |
| 4 | 84 | 4 |
We can label the categories with strings too, but numerical labels are often more useful because many machine learning algorithms expect numerical input.
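As a sketch of the string-label option, the example below uses hypothetical label names ("low", "medium", "high", "very high") on a small illustrative series. It also shows one possible way to recover integer codes from string labels through the .cat.codes accessor:

```python
import pandas as pd

# Illustrative values, one per bin.
scores = pd.Series([15, 35, 65, 95])

# Hypothetical label names; any list with one label per bin works.
labels = ['low', 'medium', 'high', 'very high']

groups = pd.cut(scores, bins=[0, 20, 50, 80, 100], labels=labels)
print(groups.tolist())  # ['low', 'medium', 'high', 'very high']

# .cat.codes gives the zero-based position of each label.
codes = groups.cat.codes
print(codes.tolist())  # [0, 1, 2, 3]
```

With this approach the readable labels can be kept for reporting while the codes are passed to a model.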
Finally, the complete code is shown in picture form below.
[Image: the complete code, generated using Carbon (https://carbon.now.sh/)]