Data Exploration¶

adult_data.isnull().sum() # it only shows non-na values

age                0
workclass         62
fnlwgt             0
education          0
education-num      0
marital-status     0
occupation        62
relationship       0
race               0
sex                0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country    18
salary             0
dtype: int64

adult_data.count() # it only shows non-na values

age               1000
workclass          938
fnlwgt            1000
education         1000
education-num     1000
marital-status    1000
occupation         938
relationship      1000
race              1000
sex               1000
capital-gain      1000
capital-loss      1000
hours-per-week    1000
native-country     982
salary            1000
dtype: int64

To fetch all rows with null values

adult_data[adult_data.isnull().any(axis=1)].head(5)  # fetching only first 5 rows

To fetch all rows with one specific column having null values

adult_data[adult_data["workclass"].isnull()].head(5) # fetching only first 5 rows

Data Preparation¶

1. Go back to data collection team.¶

Go back to data collection team and try to find out if we can get the missing data somehow. There may be a chance that data was lost because of some reason while passing it to you. Try to find out if Missing data follows a pattern(like only certain area's data is missing, data is missing for salary of certain type of individual because they prefer not to share it) because there may be a important infirmation hidden.

2. Impute Missing Value¶

Impute the missing value with a mean value of similar records. If you find a record with missing value, try to find similar records. and see what these records are having as a value. You can try to impute mean value or the value which is occuring maximum number of times.

For example in this case if see row index 88th, maritarl-status is NaN. You can replace it with value which records with similar age range and relationship status as 'Not-in-family' are having.

Let's impute value for null values in workclass column for adults who have eductaion till 10th grade and are Never-married

adult_data[(adult_data['marital-status']=='Never-married') & (adult_data['education']=='10th')]

If you see row with index 106 is having workclass NaN and all other similar records with marital status as 'Never-married' and education as 10th are having workclass as 'Private'. So, we will impute workclass value as 'Privare' for null vaue records.

adult_data.loc[106, 'workclass'] = 'Private'

adult_data[(adult_data['marital-status']=='Never-married') & (adult_data['education']=='10th')]

3. Drop rows with null values¶

Any : df.dropna(axis=0, how='any') # Drop all rows where any of the values in row is na
All : df.dropna(axis=0, how='all') # Drop all rows where all the values in row are na
Threshold : Keep only the rows with at least 2 non-na values: df.dropna(axis=0, thresh=2) # Drop all rows where non-na values are less than 2

# for simplicity, removing all rows with null values. 
# You can drop rows if they are very less in comparision with total data you have
adult_data.dropna(axis=0, how='any', inplace=True)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	hours-per-week	native-country	salary
14	40	Private	121772	Assoc-voc	11	Married-civ-spouse	Craft-repair	Husband	Asian-Pac-Islander	Male	40	NaN	>50K
27	54	NaN	180211	Some-college	10	Married-civ-spouse	NaN	Husband	Asian-Pac-Islander	Male	60	South	>50K
38	31	Private	84154	Some-college	10	Married-civ-spouse	Sales	Husband	White	Male	38	NaN	>50K
51	18	Private	226956	HS-grad	9	Never-married	Other-service	Own-child	White	Female	30	NaN	<=50K
61	32	NaN	293936	7th-8th	4	Married-spouse-absent	NaN	Not-in-family	White	Male	40	NaN	<=50K

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
27	54	NaN	180211	Some-college	10	Married-civ-spouse	NaN	Husband	Asian-Pac-Islander	Male	0	60	South	>50K
61	32	NaN	293936	7th-8th	4	Married-spouse-absent	NaN	Not-in-family	White	Male	0	40	NaN	<=50K
69	25	NaN	200681	Some-college	10	Never-married	NaN	Own-child	White	Male	0	40	United-States	<=50K
77	67	NaN	212759	10th	6	Married-civ-spouse	NaN	Husband	White	Male	0	2	United-States	<=50K
106	17	NaN	304873	10th	6	Never-married	NaN	Own-child	White	Female	34095	32	United-States	<=50K

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
106	17	NaN	304873	10th	6	Never-married	NaN	Own-child	White	Female	34095	32	United-States	<=50K
219	25	Private	255004	10th	6	Never-married	Craft-repair	Not-in-family	White	Male	0	40	United-States	<=50K
424	33	Private	228528	10th	6	Never-married	Craft-repair	Unmarried	White	Female	0	35	United-States	<=50K
478	27	Private	161155	10th	6	Never-married	Other-service	Not-in-family	White	Male	0	40	United-States	<=50K
609	17	Private	169658	10th	6	Never-married	Other-service	Own-child	White	Female	0	21	United-States	<=50K
814	17	Private	32607	10th	6	Never-married	Farming-fishing	Own-child	White	Male	0	20	United-States	<=50K
866	21	Private	62339	10th	6	Never-married	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
106	17	Private	304873	10th	6	Never-married	NaN	Own-child	White	Female	34095	32	United-States	<=50K
219	25	Private	255004	10th	6	Never-married	Craft-repair	Not-in-family	White	Male	0	40	United-States	<=50K
424	33	Private	228528	10th	6	Never-married	Craft-repair	Unmarried	White	Female	0	35	United-States	<=50K
478	27	Private	161155	10th	6	Never-married	Other-service	Not-in-family	White	Male	0	40	United-States	<=50K
609	17	Private	169658	10th	6	Never-married	Other-service	Own-child	White	Female	0	21	United-States	<=50K
814	17	Private	32607	10th	6	Never-married	Farming-fishing	Own-child	White	Male	0	20	United-States	<=50K
866	21	Private	62339	10th	6	Never-married	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K