Data Exploration

How to find outliers:

  1. Box plot
  2. Summary statistics
    • describe() method for numeric columns
    • value_counts() for categorical columns

1. Box Plot

In [20]:
import matplotlib
%matplotlib inline
import seaborn as sns
# Pass the column as x explicitly; points beyond the whiskers are potential outliers
sns.boxplot(x=adult_data['age']);

An age of 90 is an extreme value. It should most likely be removed from the dataset, as it may distort our analysis. However, whether to treat it as an outlier depends on business knowledge; you may need to study these people separately if your problem statement requires it. Here, we will remove this observation from the data.

In [21]:
adult_data = adult_data[adult_data['age'] < 90]
In [22]:
adult_data.shape
Out[22]:
(921, 15)
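
As a numeric counterpart to the box plot, the sketch below computes the whisker limits by hand. It assumes the common 1.5 × IQR rule, which is the default for seaborn's `whis` parameter. The `ages` series is made-up illustration data, not values from `adult_data`.

```python
import pandas as pd

# Hypothetical stand-in for adult_data['age']; the values are invented.
ages = pd.Series([17, 25, 28, 33, 36, 40, 46, 52, 60, 90], name="age")

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points a box plot draws individually
flagged = ages[(ages < lower_fence) | (ages > upper_fence)]
print(flagged.tolist())
```

With these sample values, only 90 falls outside the fences. Whether a flagged value is truly an outlier remains a business decision, as noted above.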

2. Summary statistics

  • describe() method for numeric columns
  • value_counts() for categorical columns

describe()

In [23]:
adult_data.describe()
Out[23]:
             age        fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
count  921.00000  9.210000e+02     921.000000    921.000000    921.000000      921.000000
mean    37.80456  1.934025e+05      10.171553    576.691640     92.868621       40.504886
std     12.77033  1.077761e+05       2.523131   2459.580653    412.614565       11.735316
min     17.00000  2.117400e+04       1.000000      0.000000      0.000000        1.000000
25%     28.00000  1.149370e+05       9.000000      0.000000      0.000000       40.000000
50%     36.00000  1.816590e+05      10.000000      0.000000      0.000000       40.000000
75%     46.00000  2.494090e+05      13.000000      0.000000      0.000000       45.000000
max     81.00000  1.033222e+06      16.000000  25236.000000   2415.000000       99.000000

You can look at the spread of the data (min, max, median, and so on) and decide whether to keep or remove values beyond some threshold. For example, keep the values between the 0.5th and 99.5th percentiles and remove the rest.
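
A minimal sketch of that percentile rule, assuming a single numeric column is trimmed at a time. The data frame here is randomly generated for illustration and only borrows a column name from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 invented values standing in for a numeric column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"hours-per-week": rng.normal(40, 12, size=1000)})

col = "hours-per-week"

# 0.5th and 99.5th percentiles of the column
low, high = df[col].quantile([0.005, 0.995])

# Keep rows whose value lies inside [low, high]; between() is inclusive by default
trimmed = df[df[col].between(low, high)]

print(len(df), "rows before,", len(trimmed), "rows after")
```

The thresholds are a judgment call: tighter percentiles remove more rows, so it is worth checking how many observations are dropped before committing to a cut-off.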

value_counts() - Let's draw a bar plot for 'relationship' and see whether any category looks peculiar.

In [24]:
adult_data['relationship'].value_counts().plot(kind='bar')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x2475688d390>

We can keep all the values here, as nothing looks odd.

We can do the same analysis on the other columns.
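
One way to repeat the check across columns is sketched below. It assumes every non-numeric column is categorical, and it uses a small invented frame in place of `adult_data`. Relative frequencies (`normalize=True`) make rare categories easier to spot than raw counts.

```python
import pandas as pd

# Hypothetical mini-frame standing in for adult_data; the values are invented.
df = pd.DataFrame({
    "relationship": ["Husband", "Wife", "Husband", "Own-child", "Husband", "Not-in-family"],
    "sex": ["Male", "Female", "Male", "Male", "Male", "Female"],
    "age": [39, 31, 52, 19, 45, 28],
})

# Assumption: every non-numeric column is treated as categorical.
categorical_cols = df.select_dtypes(exclude="number").columns

shares_by_col = {}
for col in categorical_cols:
    # Share of rows in each category, largest first
    shares_by_col[col] = df[col].value_counts(normalize=True)
    print(f"\n{col}")
    print(shares_by_col[col])
```

In a notebook, each `shares_by_col[col]` can be followed by `.plot(kind='bar')` to get the same kind of chart as above.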

Data Preparation

  1. Domain knowledge comes into play here. If extreme values really are outliers, or behave differently from the rest of the data, decide whether to:
    • leave them as is in the data
    • remove them from the data
    • build a completely new model for such extreme values
    • go back to the data collection team for clarification about such values
  2. Delete outliers or replace them with mean values

Removing the row with age = 90 above was a data preparation step.
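
The two options in item 2 can be sketched side by side. The example uses invented ages and an assumed threshold of 90. It also assumes the replacement mean is computed from the non-outlier rows only, which the text does not specify.

```python
import pandas as pd

# Hypothetical ages; 90 plays the role of the extreme value.
df = pd.DataFrame({"age": [22, 35, 41, 58, 90]})

# Assumed threshold for illustration
is_outlier = df["age"] >= 90

# Option A: delete the outlier rows
deleted = df[~is_outlier]

# Option B: replace outliers with the mean of the remaining values
mean_age = df.loc[~is_outlier, "age"].mean()
replaced = df.copy()
replaced["age"] = replaced["age"].astype(float)  # the mean may not be an integer
replaced.loc[is_outlier, "age"] = mean_age

print(deleted["age"].tolist())
print(replaced["age"].tolist())
```

Deleting keeps the remaining values untouched but shrinks the data; replacing keeps the row count but pulls the value toward the centre, which reduces the column's variance.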