Data Exploration

In [12]:
adult_data.isnull().sum() # it only shows non-na values
Out[12]:
age                0
workclass         62
fnlwgt             0
education          0
education-num      0
marital-status     0
occupation        62
relationship       0
race               0
sex                0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country    18
salary             0
dtype: int64
In [13]:
adult_data.count() # it only shows non-na values
Out[13]:
age               1000
workclass          938
fnlwgt            1000
education         1000
education-num     1000
marital-status    1000
occupation         938
relationship      1000
race              1000
sex               1000
capital-gain      1000
capital-loss      1000
hours-per-week    1000
native-country     982
salary            1000
dtype: int64

To fetch all rows with null values

In [14]:
adult_data[adult_data.isnull().any(axis=1)].head(5)  # fetching only first 5 rows
Out[14]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
14 40 Private 121772 Assoc-voc 11 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male 0 0 40 NaN >50K
27 54 NaN 180211 Some-college 10 Married-civ-spouse NaN Husband Asian-Pac-Islander Male 0 0 60 South >50K
38 31 Private 84154 Some-college 10 Married-civ-spouse Sales Husband White Male 0 0 38 NaN >50K
51 18 Private 226956 HS-grad 9 Never-married Other-service Own-child White Female 0 0 30 NaN <=50K
61 32 NaN 293936 7th-8th 4 Married-spouse-absent NaN Not-in-family White Male 0 0 40 NaN <=50K

To fetch all rows with one specific column having null values

In [15]:
adult_data[adult_data["workclass"].isnull()].head(5) # fetching only first 5 rows
Out[15]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
27 54 NaN 180211 Some-college 10 Married-civ-spouse NaN Husband Asian-Pac-Islander Male 0 0 60 South >50K
61 32 NaN 293936 7th-8th 4 Married-spouse-absent NaN Not-in-family White Male 0 0 40 NaN <=50K
69 25 NaN 200681 Some-college 10 Never-married NaN Own-child White Male 0 0 40 United-States <=50K
77 67 NaN 212759 10th 6 Married-civ-spouse NaN Husband White Male 0 0 2 United-States <=50K
106 17 NaN 304873 10th 6 Never-married NaN Own-child White Female 34095 0 32 United-States <=50K

Data Preparation

1. Go back to data collection team.

Go back to data collection team and try to find out if we can get the missing data somehow. There may be a chance that data was lost because of some reason while passing it to you. Try to find out if Missing data follows a pattern(like only certain area's data is missing, data is missing for salary of certain type of individual because they prefer not to share it) because there may be a important infirmation hidden.

2. Impute Missing Value

Impute the missing value with a mean value of similar records. If you find a record with missing value, try to find similar records. and see what these records are having as a value. You can try to impute mean value or the value which is occuring maximum number of times.

For example in this case if see row index 88th, maritarl-status is NaN. You can replace it with value which records with similar age range and relationship status as 'Not-in-family' are having.

Let's impute value for null values in workclass column for adults who have eductaion till 10th grade and are Never-married

In [16]:
adult_data[(adult_data['marital-status']=='Never-married') & (adult_data['education']=='10th')]
Out[16]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
106 17 NaN 304873 10th 6 Never-married NaN Own-child White Female 34095 0 32 United-States <=50K
219 25 Private 255004 10th 6 Never-married Craft-repair Not-in-family White Male 0 0 40 United-States <=50K
424 33 Private 228528 10th 6 Never-married Craft-repair Unmarried White Female 0 0 35 United-States <=50K
478 27 Private 161155 10th 6 Never-married Other-service Not-in-family White Male 0 0 40 United-States <=50K
609 17 Private 169658 10th 6 Never-married Other-service Own-child White Female 0 0 21 United-States <=50K
814 17 Private 32607 10th 6 Never-married Farming-fishing Own-child White Male 0 0 20 United-States <=50K
866 21 Private 62339 10th 6 Never-married Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K

If you see row with index 106 is having workclass NaN and all other similar records with marital status as 'Never-married' and education as 10th are having workclass as 'Private'. So, we will impute workclass value as 'Privare' for null vaue records.

In [17]:
adult_data.loc[106, 'workclass'] = 'Private'
In [18]:
adult_data[(adult_data['marital-status']=='Never-married') & (adult_data['education']=='10th')]
Out[18]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
106 17 Private 304873 10th 6 Never-married NaN Own-child White Female 34095 0 32 United-States <=50K
219 25 Private 255004 10th 6 Never-married Craft-repair Not-in-family White Male 0 0 40 United-States <=50K
424 33 Private 228528 10th 6 Never-married Craft-repair Unmarried White Female 0 0 35 United-States <=50K
478 27 Private 161155 10th 6 Never-married Other-service Not-in-family White Male 0 0 40 United-States <=50K
609 17 Private 169658 10th 6 Never-married Other-service Own-child White Female 0 0 21 United-States <=50K
814 17 Private 32607 10th 6 Never-married Farming-fishing Own-child White Male 0 0 20 United-States <=50K
866 21 Private 62339 10th 6 Never-married Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K

3. Drop rows with null values

  1. Any : df.dropna(axis=0, how='any') # Drop all rows where any of the values in row is na
  2. All : df.dropna(axis=0, how='all') # Drop all rows where all the values in row are na
  3. Threshold : Keep only the rows with at least 2 non-na values: df.dropna(axis=0, thresh=2) # Drop all rows where non-na values are less than 2
In [19]:
# for simplicity, removing all rows with null values. 
# You can drop rows if they are very less in comparision with total data you have
adult_data.dropna(axis=0, how='any', inplace=True)