Data Exploration

How to find outliers:

- Box plot
- Summary statistics
  - describe() method for numeric columns
  - value_counts() for categorical columns

1. Box Plot

import matplotlib
%matplotlib inline
import seaborn as sns
# Box plot of the age column; points drawn beyond the whiskers are candidate outliers
sns.boxplot(x=adult_data['age']);

An age of 90 is an extreme value. It should most likely be removed from the dataset, as it may distort our analysis. However, whether to treat it as an outlier depends on business knowledge; you may need to study these people separately if your problem statement requires it. Here, we will remove this observation from the data.

adult_data = adult_data[adult_data['age'] < 90]

adult_data.shape

(921, 15)
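As a rough numeric counterpart to reading the plot by eye: with default settings, a box plot draws points farther than 1.5 × IQR from the quartiles as individual markers. The sketch below computes those fences directly. The `ages` values and the `iqr_fences` helper are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical ages standing in for adult_data['age']
ages = pd.Series([17, 22, 28, 31, 36, 36, 40, 46, 52, 60, 90])

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(ages)

# Values outside the fences are the ones a default box plot shows as points
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper)
print(outliers.tolist())
```

A value flagged this way is only a candidate; whether it is removed remains a business decision, as noted above.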

2. Summary statistics

- describe() method for numeric columns
- value_counts() for categorical columns

describe()

adult_data.describe()

Look at the spread of the data (min, max, median, etc.) and decide whether to keep or remove data beyond some threshold. For example, keep the data between the 0.5th and 99.5th percentiles and remove the rest.
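One possible way to apply such a percentile threshold is sketched here. The `trim_by_percentile` helper and the `demo` frame are hypothetical; on the real data you would pass `adult_data` and a numeric column name instead.

```python
import pandas as pd

def trim_by_percentile(df, column, lower_q=0.005, upper_q=0.995):
    """Keep rows whose `column` value lies within the given quantile range."""
    low = df[column].quantile(lower_q)
    high = df[column].quantile(upper_q)
    # between() is inclusive on both ends by default
    return df[df[column].between(low, high)]

# Hypothetical data: the integers 1..1000 in a single numeric column
demo = pd.DataFrame({"hours-per-week": range(1, 1001)})
trimmed = trim_by_percentile(demo, "hours-per-week")
print(len(demo), len(trimmed))
```

The cut-offs are computed from the data itself, so the number of rows dropped scales with the dataset rather than depending on a hand-picked value.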

value_counts() - Let's draw a bar plot for 'relationship' and see whether any category looks peculiar.

adult_data['relationship'].value_counts().plot(kind='bar')


We can keep all the values here, as nothing looks odd.

We can do the same analysis on the other columns.
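To repeat the check across columns without plotting each one, a sketch like the following lists categories that occur very rarely. The `rare_categories` helper, the 1% threshold, and the `demo` data (including the misspelled 'Hsband') are assumptions made for illustration; treating every non-numeric column as categorical is also an assumption.

```python
import pandas as pd

def rare_categories(df, min_share=0.01):
    """Map each non-numeric column to the categories whose share is below min_share."""
    rare = {}
    # Assumption: every non-numeric column is treated as categorical
    for col in df.select_dtypes(exclude="number").columns:
        shares = df[col].value_counts(normalize=True)
        low = shares[shares < min_share]
        if not low.empty:
            rare[col] = low.index.tolist()
    return rare

# Hypothetical data with one suspicious, misspelled category
demo = pd.DataFrame({
    "relationship": ["Husband"] * 120 + ["Wife"] * 79 + ["Hsband"],
    "sex": ["Male"] * 121 + ["Female"] * 79,
    "age": range(200),
})
print(rare_categories(demo))
```

A rare category is not necessarily wrong; it is only a prompt to look more closely, in the same spirit as the bar plot.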

Data Preparation

Domain knowledge comes into the picture here. If extreme values really are outliers, or behave differently from the rest of the data, decide whether to:
- Leave them as is in the data
- Remove them from the data
- Create a completely new model for such extreme values
- Go back to the data collection team for clarification about such values
- Delete outliers or replace them with mean values
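The last option, replacing outliers with mean values, could be sketched as follows. The `replace_outliers_with_mean` helper, the sample `ages`, and the bounds 17 and 89 are hypothetical. Computing the mean from the in-range values only is a choice made here so that the outliers do not inflate their own replacement; the text above does not specify which mean to use.

```python
import pandas as pd

def replace_outliers_with_mean(values, lower, upper):
    """Replace values outside [lower, upper] with the mean of the in-range values."""
    in_range = values.between(lower, upper)
    # where() keeps entries where the mask is True and substitutes the rest
    return values.where(in_range, values[in_range].mean())

# Hypothetical ages; 90 falls outside the chosen bounds
ages = pd.Series([20, 30, 40, 50, 90])
cleaned = replace_outliers_with_mean(ages, lower=17, upper=89)
print(cleaned.tolist())
```

Unlike dropping rows, this keeps the row count unchanged, at the cost of introducing a value that was never observed.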

The removal of the age 90 observation above was itself a data preparation step.

Output of adult_data.describe():

             age        fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
count  921.00000  9.210000e+02     921.000000    921.000000    921.000000      921.000000
mean    37.80456  1.934025e+05      10.171553    576.691640     92.868621       40.504886
std     12.77033  1.077761e+05       2.523131   2459.580653    412.614565       11.735316
min     17.00000  2.117400e+04       1.000000      0.000000      0.000000        1.000000
25%     28.00000  1.149370e+05       9.000000      0.000000      0.000000       40.000000
50%     36.00000  1.816590e+05      10.000000      0.000000      0.000000       40.000000
75%     46.00000  2.494090e+05      13.000000      0.000000      0.000000       45.000000
max     81.00000  1.033222e+06      16.000000  25236.000000   2415.000000       99.000000