Import data

In [1]:
import pandas as pd

from sklearn.datasets import load_iris

colnames = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth']

iris = load_iris()

x = iris.data
y = iris.target

x = pd.DataFrame(x, columns=colnames)
y = pd.Series(y, name='class')

iris_data = pd.concat([x, y], axis=1)

Let's see how many data points there are for each class.

In [2]:
y.value_counts()
Out[2]:
2    50
1    50
0    50
Name: class, dtype: int64

As you can see above, each class has 50 data points, so we will take only 10 instances of class 0 and all 50 instances of class 1 to create an imbalanced dataset.

But first, let's apply PCA to reduce the data to two dimensions, so that we can visualize the different balancing techniques on 2D graphs.

In [3]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
x = pca.fit_transform(x)
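
Before moving on, it is worth checking how much of the original variance the two components retain (a quick sanity check; for the iris data the first two components capture roughly 92% and 5% of the variance):

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)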

Let's name the two new columns x1 and x2, and add the class labels as another column, class.

In [4]:
data = pd.DataFrame(x, columns=['x1', 'x2'])
data['class'] = iris.target

Let's take only the first 10 observations of class 0

In [5]:
data_class0 = data.iloc[:10]

Let's take all 50 observations of class 1

In [6]:
data_class1 = data.iloc[50:100]

Let's combine these 10 observations of class 0 and 50 observations of class 1 to get an imbalanced dataset.

In [7]:
data = pd.concat([data_class0, data_class1], ignore_index=True)
In [8]:
data
Out[8]:
x1 x2 class
0 -2.684126 0.319397 0
1 -2.714142 -0.177001 0
2 -2.888991 -0.144949 0
3 -2.745343 -0.318299 0
4 -2.728717 0.326755 0
5 -2.280860 0.741330 0
6 -2.820538 -0.089461 0
7 -2.626145 0.163385 0
8 -2.886383 -0.578312 0
9 -2.672756 -0.113774 0
10 1.284826 0.685160 1
11 0.932489 0.318334 1
12 1.464302 0.504263 1
13 0.183318 -0.827959 1
14 1.088103 0.074591 1
15 0.641669 -0.418247 1
16 1.095061 0.283468 1
17 -0.749123 -1.004891 1
18 1.044132 0.228362 1
19 -0.008745 -0.723082 1
20 -0.507841 -1.265971 1
21 0.511699 -0.103981 1
22 0.264977 -0.550036 1
23 0.984935 -0.124818 1
24 -0.173925 -0.254854 1
25 0.927861 0.467179 1
26 0.660284 -0.352970 1
27 0.236105 -0.333611 1
28 0.944734 -0.543146 1
29 0.045227 -0.583834 1
30 1.116283 -0.084617 1
31 0.357888 -0.068925 1
32 1.298184 -0.327787 1
33 0.921729 -0.182738 1
34 0.714853 0.149056 1
35 0.900174 0.328504 1
36 1.332024 0.244441 1
37 1.557802 0.267495 1
38 0.813291 -0.163350 1
39 -0.305584 -0.368262 1
40 -0.068126 -0.705172 1
41 -0.189622 -0.680287 1
42 0.136429 -0.314032 1
43 1.380026 -0.420954 1
44 0.588006 -0.484287 1
45 0.806858 0.194182 1
46 1.220691 0.407620 1
47 0.815095 -0.372037 1
48 0.245958 -0.268524 1
49 0.166413 -0.681927 1
50 0.464800 -0.670712 1
51 0.890815 -0.034464 1
52 0.230548 -0.404386 1
53 -0.704532 -1.012248 1
54 0.356981 -0.504910 1
55 0.331934 -0.212655 1
56 0.376216 -0.293219 1
57 0.642576 0.017738 1
58 -0.906470 -0.756093 1
59 0.299001 -0.348898 1

Let's draw the graph below and see how the data is distributed along the new x1 and x2 axes. Blue datapoints belong to class 0 and orange to class 1.

In [9]:
import seaborn as sns
%matplotlib inline
sns.scatterplot(x="x1", y="x2", hue="class", data=data)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x277c596c588>

We will use the imblearn package's functions to balance the data. To install it, run pip install imbalanced-learn in the Anaconda Prompt (pip install imblearn also works; it simply pulls in imbalanced-learn).

In [10]:
from imblearn.under_sampling import RandomUnderSampler
X = data[['x1', 'x2']]
y = data['class']
RUSampler = RandomUnderSampler()
X_rusampled, y_rusampled = RUSampler.fit_resample(X, y)

# sample_indices_ holds the indexes of the rows the sampler kept
kept_indexes = RUSampler.sample_indices_
print('Kept indexes:', kept_indexes)

rusampled_balanced_data = pd.DataFrame(X_rusampled, columns=['x1', 'x2'])
rusampled_balanced_data['class'] = y_rusampled
Kept indexes: [ 0  1  2  3  4  5  6  7  8  9 50 30 25 42 44 39 16 43 46 55]

Above you can see the indexes of the datapoints that were kept: all 10 class 0 points, plus 10 randomly selected class 1 points. The remaining 40 class 1 points were removed to undersample that class.
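
If you instead want the indexes that were dropped, you can take the set difference between all row indexes and the kept ones (a minimal sketch, assuming numpy and the kept_indexes variable from the cell above):

import numpy as np

# Rows present in the original data but absent from the undersampled data
dropped_indexes = np.setdiff1d(np.arange(len(y)), kept_indexes)
print('Dropped indexes:', dropped_indexes)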

Let's see the datapoints remaining in the dataset for each class.

In [11]:
rusampled_balanced_data['class'].value_counts()
Out[11]:
1    10
0    10
Name: class, dtype: int64

Now, let's plot the undersampled data and see how it looks.

In [12]:
sns.scatterplot(x="x1", y="x2", hue="class", data=rusampled_balanced_data)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x277c5a55c50>

So, as you saw in the above example, undersampling balances the dataset by removing majority class datapoints from it.
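
The same effect can be reproduced with plain pandas, without imblearn (a hedged sketch; GroupBy.sample requires pandas >= 1.1, and random_state=42 is an arbitrary choice):

# Draw 10 rows per class, matching the minority class size
manual_undersampled = data.groupby('class').sample(n=10, random_state=42)
print(manual_undersampled['class'].value_counts())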

Oversampling

Now, let's oversample the data by adding duplicate points of the minority class into the dataset.

In [13]:
from imblearn.over_sampling import RandomOverSampler

X = data[['x1', 'x2']]
y = data['class']

ROSampler = RandomOverSampler()
X_rosampled, y_rosampled = ROSampler.fit_resample(X, y)

rosampled_balanced_data = pd.DataFrame(X_rosampled, columns=['x1', 'x2'])
rosampled_balanced_data['class'] = y_rosampled

Let's see the shape of the new oversampled data.

In [14]:
rosampled_balanced_data.shape
Out[14]:
(100, 3)

The dataset has grown from 60 rows to 100. Now, let's see how many datapoints there are from each class.

In [15]:
rosampled_balanced_data['class'].value_counts()
Out[15]:
1    50
0    50
Name: class, dtype: int64

As you can see above, class 0 datapoints have increased from 10 to 50.

Let's have a look at the class 0 datapoints in the oversampled data. You will find that the 40 new datapoints are just replicas of the 10 original datapoints.

In [16]:
rosampled_balanced_data[rosampled_balanced_data['class']==0]
Out[16]:
x1 x2 class
0 -2.684126 0.319397 0
1 -2.714142 -0.177001 0
2 -2.888991 -0.144949 0
3 -2.745343 -0.318299 0
4 -2.728717 0.326755 0
5 -2.280860 0.741330 0
6 -2.820538 -0.089461 0
7 -2.626145 0.163385 0
8 -2.886383 -0.578312 0
9 -2.672756 -0.113774 0
60 -2.745343 -0.318299 0
61 -2.280860 0.741330 0
62 -2.728717 0.326755 0
63 -2.728717 0.326755 0
64 -2.626145 0.163385 0
65 -2.745343 -0.318299 0
66 -2.684126 0.319397 0
67 -2.886383 -0.578312 0
68 -2.714142 -0.177001 0
69 -2.888991 -0.144949 0
70 -2.684126 0.319397 0
71 -2.888991 -0.144949 0
72 -2.820538 -0.089461 0
73 -2.672756 -0.113774 0
74 -2.626145 0.163385 0
75 -2.672756 -0.113774 0
76 -2.684126 0.319397 0
77 -2.280860 0.741330 0
78 -2.820538 -0.089461 0
79 -2.684126 0.319397 0
80 -2.745343 -0.318299 0
81 -2.728717 0.326755 0
82 -2.886383 -0.578312 0
83 -2.684126 0.319397 0
84 -2.728717 0.326755 0
85 -2.684126 0.319397 0
86 -2.714142 -0.177001 0
87 -2.684126 0.319397 0
88 -2.886383 -0.578312 0
89 -2.684126 0.319397 0
90 -2.684126 0.319397 0
91 -2.728717 0.326755 0
92 -2.672756 -0.113774 0
93 -2.672756 -0.113774 0
94 -2.745343 -0.318299 0
95 -2.728717 0.326755 0
96 -2.684126 0.319397 0
97 -2.672756 -0.113774 0
98 -2.820538 -0.089461 0
99 -2.280860 0.741330 0

Let's see the datapoints graphically now.

In [17]:
sns.scatterplot(x="x1", y="x2", hue="class", data=rosampled_balanced_data)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x277c5ca6128>

In the above graph, you won't be able to see all 50 blue class 0 datapoints: the 40 new datapoints are exact replicas of the 10 original ones, so they are hidden behind them.
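
If you want to confirm this, count how often each (x1, x2) pair occurs among the class 0 rows (a quick check, not part of the original workflow; any count above 1 is a replica):

class0 = rosampled_balanced_data[rosampled_balanced_data['class'] == 0]
# Exact duplicates group together because they share identical coordinates
print(class0.groupby(['x1', 'x2']).size().sort_values(ascending=False))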

SMOTE

We have seen undersampling and oversampling examples above. Now let's see how we can synthetically create datapoints for the minority class using SMOTE (Synthetic Minority Over-sampling Technique), which interpolates between existing minority samples rather than duplicating them.

In [18]:
from imblearn.over_sampling import SMOTE

smoteSampler = SMOTE(sampling_strategy='minority')  # resample only the minority class
X_smotesampled, y_smotesampled = smoteSampler.fit_resample(X, y)

smotesampled_balanced_data = pd.DataFrame(X_smotesampled, columns=['x1', 'x2'])
smotesampled_balanced_data['class'] = y_smotesampled
In [19]:
smotesampled_balanced_data['class'].value_counts()
Out[19]:
1    50
0    50
Name: class, dtype: int64

So, 40 new class 0 datapoints have been created.

Let's see how datapoints look graphically.

In [20]:
sns.scatterplot(x="x1", y="x2", hue="class", data=smotesampled_balanced_data)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x277c5d72780>

In the above graph, you can see the new, synthetically created blue datapoints; unlike the duplicates from random oversampling, they occupy new positions between the original class 0 points.
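
For intuition, here is a minimal sketch of the idea behind SMOTE: each synthetic point lies at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. This is illustrative only, not imblearn's actual implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Minority (class 0) points from the imbalanced dataset
minority = data.loc[data['class'] == 0, ['x1', 'x2']].to_numpy()

# Nearest minority neighbours of each point (the first neighbour is the point itself)
nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)

rng = np.random.default_rng(0)
i = rng.integers(len(minority))        # pick a random minority sample
j = rng.choice(neighbor_idx[i][1:])    # pick one of its neighbours
gap = rng.random()                     # interpolation factor in [0, 1)
synthetic = minority[i] + gap * (minority[j] - minority[i])
print('Synthetic point:', synthetic)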