Import data

In [1]:
import pandas as pd

from sklearn.datasets import load_iris

colnames = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth']

iris = load_iris()

x = iris.data
y = iris.target

x = pd.DataFrame(x, columns=colnames)
y = pd.Series(y, name='class')

iris_data = pd.concat([x, y], axis=1)

Let's see how many data points there are for each class.

In [2]:
y.value_counts()
Out[2]:
2    50
1    50
0    50
Name: class, dtype: int64

As you can see above, each class has 50 data points, so we will take only 10 instances of class 0 and all 50 instances of class 1 to create an imbalanced dataset.

But first, let's apply PCA to reduce the data to two dimensions, so that we can visualize the different balancing techniques on 2D graphs.

In [3]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
x = pca.fit_transform(x)
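
Before moving on, it is worth checking how much of the original variance the two components retain (a quick sanity check; for the iris data the first two components capture roughly 92% and 5% of the variance):

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)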

Let's name the two new columns x1 and x2, and add the class labels as another column, class.

In [4]:
data = pd.DataFrame(x, columns=['x1', 'x2'])
data['class'] = iris.target

Let's take only the first 10 observations of class 0

In [5]:
data_class0 = data.iloc[:10]

Let's take all 50 observations of class 1

In [6]:
data_class1 = data.iloc[50:100]

Let's combine these 10 observations of class 0 and 50 observations of class 1 to get an imbalanced dataset.

In [7]:
data = pd.concat([data_class0, data_class1], ignore_index=True)
In [8]:
data
Out[8]:
x1 x2 class
0 -2.684126 0.319397 0
1 -2.714142 -0.177001 0
2 -2.888991 -0.144949 0
3 -2.745343 -0.318299 0
4 -2.728717 0.326755 0
5 -2.280860 0.741330 0
6 -2.820538 -0.089461 0
7 -2.626145 0.163385 0
8 -2.886383 -0.578312 0
9 -2.672756 -0.113774 0
10 1.284826 0.685160 1
11 0.932489 0.318334 1
12 1.464302 0.504263 1
13 0.183318 -0.827959 1
14 1.088103 0.074591 1
15 0.641669 -0.418247 1
16 1.095061 0.283468 1
17 -0.749123 -1.004891 1
18 1.044132 0.228362 1
19 -0.008745 -0.723082 1
20 -0.507841 -1.265971 1
21 0.511699 -0.103981 1
22 0.264977 -0.550036 1
23 0.984935 -0.124818 1
24 -0.173925 -0.254854 1
25 0.927861 0.467179 1
26 0.660284 -0.352970 1
27 0.236105 -0.333611 1
28 0.944734 -0.543146 1
29 0.045227 -0.583834 1
30 1.116283 -0.084617 1
31 0.357888 -0.068925 1
32 1.298184 -0.327787 1
33 0.921729 -0.182738 1
34 0.714853 0.149056 1
35 0.900174 0.328504 1
36 1.332024 0.244441 1
37 1.557802 0.267495 1
38 0.813291 -0.163350 1
39 -0.305584 -0.368262 1
40 -0.068126 -0.705172 1
41 -0.189622 -0.680287 1
42 0.136429 -0.314032 1
43 1.380026 -0.420954 1
44 0.588006 -0.484287 1
45 0.806858 0.194182 1
46 1.220691 0.407620 1
47 0.815095 -0.372037 1
48 0.245958 -0.268524 1
49 0.166413 -0.681927 1
50 0.464800 -0.670712 1
51 0.890815 -0.034464 1
52 0.230548 -0.404386 1
53 -0.704532 -1.012248 1
54 0.356981 -0.504910 1
55 0.331934 -0.212655 1
56 0.376216 -0.293219 1
57 0.642576 0.017738 1
58 -0.906470 -0.756093 1
59 0.299001 -0.348898 1

Let's draw the graph below and see how the data is distributed along the new x1 and x2 axes. Blue datapoints belong to class 0 and orange to class 1.

In [9]:
import seaborn as sns
%matplotlib inline
sns.scatterplot(x="x1", y="x2", hue="class", data=data)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x277c596c588>

We will use the imblearn package's functions to balance the data. To install it, run pip install imbalanced-learn in the Anaconda Prompt (pip install imblearn also works; it simply pulls in imbalanced-learn).

In [10]:
from imblearn.under_sampling import RandomUnderSampler
X = data[['x1', 'x2']]
y = data['class']
RUSampler = RandomUnderSampler()
X_rusampled, y_rusampled = RUSampler.fit_resample(X, y)

# sample_indices_ holds the indexes of the rows the sampler kept
kept_indexes = RUSampler.sample_indices_
print('Kept indexes:', kept_indexes)

rusampled_balanced_data = pd.DataFrame(X_rusampled, columns=['x1', 'x2'])
rusampled_balanced_data['class'] = y_rusampled
Kept indexes: [ 0  1  2  3  4  5  6  7  8  9 50 30 25 42 44 39 16 43 46 55]

Above you can see the indexes of the datapoints that were kept: all 10 class 0 points, plus 10 randomly selected class 1 points. The remaining 40 class 1 points were removed to undersample that class.
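
If you instead want the indexes that were dropped, you can take the set difference between all row indexes and the kept ones (a minimal sketch, assuming numpy and the kept_indexes variable from the cell above):

import numpy as np

# Rows present in the original data but absent from the undersampled data
dropped_indexes = np.setdiff1d(np.arange(len(y)), kept_indexes)
print('Dropped indexes:', dropped_indexes)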

Let's see the datapoints remaining in the dataset for each class.

In [11]:
rusampled_balanced_data['class'].value_counts()
Out[11]:
1    10
0    10
Name: class, dtype: int64

Now, let's plot the undersampled data and see how it looks.

In [12]:
sns.scatterplot(x="x1", y="x2", hue="class", data=rusampled_balanced_data)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x277c5a55c50>

So, as you saw in the above example, undersampling balances the dataset by removing majority class datapoints from it.
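
The same effect can be reproduced with plain pandas, without imblearn (a hedged sketch; GroupBy.sample requires pandas >= 1.1, and random_state=42 is an arbitrary choice):

# Draw 10 rows per class, matching the minority class size
manual_undersampled = data.groupby('class').sample(n=10, random_state=42)
print(manual_undersampled['class'].value_counts())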

Oversampling

Now, let's oversample the data by adding duplicate points of the minority class into the dataset.

In [13]:
from imblearn.over_sampling import RandomOverSampler

X = data[['x1', 'x2']]
y = data['class']

ROSampler = RandomOverSampler()
X_rosampled, y_rosampled = ROSampler.fit_resample(X, y)

rosampled_balanced_data = pd.DataFrame(X_rosampled, columns=['x1', 'x2'])
rosampled_balanced_data['class'] = y_rosampled

Let's see the shape of the new oversampled data.

In [14]:
rosampled_balanced_data.shape
Out[14]:
(100, 3)

The dataset has grown from 60 rows to 100. Now, let's see how many datapoints there are from each class.

In [15]:
rosampled_balanced_data['class'].value_counts()
Out[15]:
1    50
0    50
Name: class, dtype: int64

As you can see above, class 0 datapoints have increased from 10 to 50.

Let's have a look at the class 0 datapoints in the oversampled data. You will find that the 40 new datapoints are just replicas of the 10 original datapoints.

In [16]:
rosampled_balanced_data[rosampled_balanced_data['class']==0]
Out[16]:
x1 x2 class
0 -2.684126 0.319397 0
1 -2.714142 -0.177001 0
2 -2.888991 -0.144949 0
3 -2.745343 -0.318299 0
4 -2.728717 0.326755 0
5 -2.280860 0.741330 0
6 -2.820538 -0.089461 0
7 -2.626145 0.163385 0
8 -2.886383 -0.578312 0
9 -2.672756 -0.113774 0
60 -2.745343 -0.318299 0
61 -2.280860 0.741330 0
62 -2.728717 0.326755 0
63 -2.728717 0.326755 0
64 -2.626145 0.163385 0
65 -2.745343 -0.318299 0
66 -2.684126 0.319397 0
67 -2.886383 -0.578312 0
68 -2.714142 -0.177001 0
69 -2.888991 -0.144949 0
70 -2.684126 0.319397 0
71 -2.888991 -0.144949 0
72 -2.820538 -0.089461 0
73 -2.672756 -0.113774 0
74 -2.626145 0.163385 0
75 -2.672756 -0.113774 0
76 -2.684126 0.319397 0
77 -2.280860 0.741330 0
78 -2.820538 -0.089461 0
79 -2.684126 0.319397 0
80 -2.745343 -0.318299 0
81 -2.728717 0.326755 0
82 -2.886383 -0.578312 0
83 -2.684126 0.319397 0
84 -2.728717 0.326755 0
85 -2.684126 0.319397 0
86 -2.714142 -0.177001 0
87 -2.684126 0.319397 0
88 -2.886383 -0.578312 0
89 -2.684126 0.319397 0
90 -2.684126 0.319397 0
91 -2.728717 0.326755 0
92 -2.672756 -0.113774 0
93 -2.672756 -0.113774 0
94 -2.745343 -0.318299 0
95 -2.728717 0.326755 0
96 -2.684126 0.319397 0
97 -2.672756 -0.113774 0
98 -2.820538 -0.089461 0
99 -2.280860 0.741330 0

Let's see the datapoints graphically now.

In [17]:
sns.scatterplot(x="x1", y="x2", hue="class", data=rosampled_balanced_data)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x277c5ca6128>

In the above graph, you won't be able to see all 50 blue class 0 datapoints: the 40 new datapoints are exact replicas of the 10 original ones, so they are hidden behind them.
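
If you want to confirm this, count how often each (x1, x2) pair occurs among the class 0 rows (a quick check, not part of the original workflow; any count above 1 is a replica):

class0 = rosampled_balanced_data[rosampled_balanced_data['class'] == 0]
# Exact duplicates group together because they share identical coordinates
print(class0.groupby(['x1', 'x2']).size().sort_values(ascending=False))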

SMOTE

We have seen undersampling and oversampling examples above. Now let's see how we can synthetically create datapoints for the minority class using SMOTE (Synthetic Minority Over-sampling Technique), which interpolates between existing minority samples rather than duplicating them.

In [18]:
from imblearn.over_sampling import SMOTE

smoteSampler = SMOTE(sampling_strategy='minority')  # resample only the minority class
X_smotesampled, y_smotesampled = smoteSampler.fit_resample(X, y)

smotesampled_balanced_data = pd.DataFrame(X_smotesampled, columns=['x1', 'x2'])
smotesampled_balanced_data['class'] = y_smotesampled
In [19]:
smotesampled_balanced_data['class'].value_counts()
Out[19]:
1    50
0    50
Name: class, dtype: int64

So, 40 new class 0 datapoints have been created.

Let's see how datapoints look graphically.

In [20]:
sns.scatterplot(x="x1", y="x2", hue="class", data=smotesampled_balanced_data)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x277c5d72780>

In the above graph, you can see the new, synthetically created blue datapoints; unlike the duplicates from random oversampling, they occupy new positions between the original class 0 points.
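
For intuition, here is a minimal sketch of the idea behind SMOTE: each synthetic point lies at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. This is illustrative only, not imblearn's actual implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Minority (class 0) points from the imbalanced dataset
minority = data.loc[data['class'] == 0, ['x1', 'x2']].to_numpy()

# Nearest minority neighbours of each point (the first neighbour is the point itself)
nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)

rng = np.random.default_rng(0)
i = rng.integers(len(minority))        # pick a random minority sample
j = rng.choice(neighbor_idx[i][1:])    # pick one of its neighbours
gap = rng.random()                     # interpolation factor in [0, 1)
synthetic = minority[i] + gap * (minority[j] - minority[i])
print('Synthetic point:', synthetic)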