Import data
import pandas as pd
from sklearn.datasets import load_iris
colnames = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth']
iris = load_iris()
x = iris.data
y = iris.target
x = pd.DataFrame(x, columns=colnames)
y = pd.Series(y, name='class')
iris_data = pd.concat([x, y], axis=1)
Let's see how many data points there are for each class.
y.value_counts()
As you can see above, each class has 50 data points. To create an imbalanced dataset, we will keep only 10 instances of class 0 and all 50 instances of class 1, and drop class 2.
But first, let's apply PCA to reduce the data to two dimensions so that we can visualize the different balancing techniques on 2D graphs.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
x = pca.fit_transform(x)
Let's name the two new columns x1 and x2, and add the class label as another column named class.
data = pd.DataFrame(x, columns=['x1', 'x2'])
data['class'] = iris.target
Let's take only the first 10 observations of class 0.
data_class0 = data.iloc[:10]
Let's take all 50 observations of class 1.
data_class1 = data.iloc[50:100]
Let's combine these 10 observations of class 0 and 50 observations of class 1 to get an imbalanced dataset.
# DataFrame.append is not available in recent pandas versions, so use pd.concat
data = pd.concat([data_class0, data_class1], ignore_index=True)
data
Draw the graph below and see how the data is distributed along the new x1 and x2 axes. Blue datapoints belong to class 0 and orange datapoints to class 1.
import seaborn as sns
%matplotlib inline
sns.scatterplot(x="x1", y="x2", hue="class", data=data)
We will use functions from the imblearn package to balance the data. The package is published as imbalanced-learn; to install it, run the following command in Anaconda Prompt: pip install imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler
X = data[['x1', 'x2']]
y = data['class']
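RUSampler = RandomUnderSampler()
X_rusampled, y_rusampled = RUSampler.fit_resample(X, y)
# sample_indices_ holds the positions of the rows that were kept; every other row was removed
removed_indexes = sorted(set(range(len(X))) - set(RUSampler.sample_indices_))
print('Removed indexes:', removed_indexes)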
rusampled_balanced_data = pd.DataFrame(X_rusampled, columns=['x1', 'x2'])
rusampled_balanced_data['class'] = y_rusampled
The output above lists the indexes of the datapoints that were removed from class 1 to undersample it.
Let's see how many datapoints remain in the dataset for each class.
rusampled_balanced_data['class'].value_counts()
Now, let's plot the undersampled data and see how it looks.
sns.scatterplot(x="x1", y="x2", hue="class", data=rusampled_balanced_data)
As the example above shows, undersampling balances the dataset by removing datapoints of the majority class.
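To make the idea concrete outside of imblearn, here is a minimal pure-Python sketch of random undersampling. It is only an illustration: the helper name random_undersample and the toy label list are hypothetical, and RandomUnderSampler itself offers more options than this.

```python
import random
from collections import Counter

def random_undersample(labels, seed=0):
    """Return the sorted positions to keep so that every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)
    positions_by_class = {}
    for pos, label in enumerate(labels):
        positions_by_class.setdefault(label, []).append(pos)
    target = min(len(p) for p in positions_by_class.values())
    kept = []
    for positions in positions_by_class.values():
        # Sample without replacement; the smallest class is kept whole.
        kept.extend(rng.sample(positions, target))
    return sorted(kept)

# Hypothetical labels with the same 10-vs-50 split as the data above
labels = [0] * 10 + [1] * 50
kept = random_undersample(labels)
removed = sorted(set(range(len(labels))) - set(kept))

print(Counter(labels[i] for i in kept))
print(len(removed), "positions removed")
```

Every removed position belongs to the majority class, because the minority class is already at the target size.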
Now, let's oversample the data by adding duplicate points of the minority class to the dataset.
from imblearn.over_sampling import RandomOverSampler
X = data[['x1', 'x2']]
y = data['class']
ROSampler = RandomOverSampler()
X_rosampled, y_rosampled = ROSampler.fit_resample(X, y)
rosampled_balanced_data = pd.DataFrame(X_rosampled, columns=['x1', 'x2'])
rosampled_balanced_data['class'] = y_rosampled
Let's see the shape of the new oversampled data.
rosampled_balanced_data.shape
The number of rows has increased from 60 to 100. Now, let's see how many datapoints there are in each class.
rosampled_balanced_data['class'].value_counts()
As you can see above, the number of class 0 datapoints has increased from 10 to 50.
Let's have a look at the class 0 datapoints in the oversampled data. You will find that the 40 new datapoints are just replicas of the 10 original datapoints.
rosampled_balanced_data[rosampled_balanced_data['class']==0]
Let's see the datapoints graphically now.
sns.scatterplot(x="x1", y="x2", hue="class", data=rosampled_balanced_data)
In the graph above, you won't be able to see 50 separate blue class 0 datapoints, because the 40 new datapoints are exact replicas of the 10 original datapoints and are hidden behind them.
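The duplication can also be confirmed without a plot. The following is a minimal pure-Python sketch of random oversampling, not the imblearn implementation: the helper name random_oversample and the toy points are hypothetical and were chosen only so that every original point is distinct.

```python
import random
from collections import Counter

def random_oversample(points, labels, seed=0):
    """Grow every class to the size of the largest class by
    drawing existing points with replacement."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_points, new_labels = list(points), list(labels)
    for label, count in counts.items():
        pool = [p for p, l in zip(points, labels) if l == label]
        # Draw with replacement, so every extra row copies an existing one.
        extra = rng.choices(pool, k=target - count)
        new_points.extend(extra)
        new_labels.extend([label] * len(extra))
    return new_points, new_labels

# Hypothetical toy data: 60 distinct points, 10 of class 0 and 50 of class 1
points = [(float(i), float(i % 3)) for i in range(60)]
labels = [0] * 10 + [1] * 50

new_points, new_labels = random_oversample(points, labels)
class0_points = [p for p, l in zip(new_points, new_labels) if l == 0]

print(Counter(new_labels))
print(len(class0_points), "class 0 rows,", len(set(class0_points)), "distinct")
```

The number of distinct class 0 points stays at 10 even though the class now has 50 rows, which is why the scatter plot shows no new locations.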
We have seen examples of undersampling and oversampling above. Now let's see how we can synthetically create new datapoints for the minority class.
from imblearn.over_sampling import SMOTE
smoteSampler = SMOTE(sampling_strategy='minority')  # 'minority' means resample only the minority class
X_smotesampled, y_smotesampled = smoteSampler.fit_resample(X, y)
smotesampled_balanced_data = pd.DataFrame(X_smotesampled, columns=['x1', 'x2'])
smotesampled_balanced_data['class'] = y_smotesampled
smotesampled_balanced_data['class'].value_counts()
So, 40 new class 0 datapoints have been created.
Let's see how the datapoints look graphically.
sns.scatterplot(x="x1", y="x2", hue="class", data=smotesampled_balanced_data)
In the graph above, you can see the new, synthetically created blue datapoints.
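Unlike the duplicated points from random oversampling, these points sit at new locations. The core idea of SMOTE is to pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. Below is a minimal pure-Python sketch of that interpolation step. It is an illustration under assumptions, not the imblearn implementation: the helper name smote_sketch and the toy coordinates are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2D minority points standing in for the 10 class 0 rows
minority = [(-2.7 + 0.1 * i, 0.3 - 0.05 * i) for i in range(10)]
synthetic = smote_sketch(minority, n_new=40)

print(len(synthetic), "synthetic points created")
print(synthetic[:3])
```

Because each synthetic point lies between two existing minority points, the new points fill in the region already occupied by the minority class instead of stacking on top of the originals.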