First, we need to prepare the data by converting non-numeric columns to numeric ones. Otherwise, an error like the one below appears, because scikit-learn's machine learning algorithms do not work on non-numeric data.

ValueError: could not convert string to float: ' United-States'
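
A message of this kind can be reproduced without fitting any model. The snippet below is a minimal sketch, assuming the error originates from the input being converted to a float array; the two-row array `raw` is made up for illustration and is not read from the dataset.

```python
import numpy as np

# Hypothetical two-row sample mixing a numeric value with a text value.
raw = np.array([[39, " United-States"], [28, " Cuba"]], dtype=object)

message = None
try:
    # Converting text to float fails, much as it would inside an estimator.
    np.asarray(raw, dtype=float)
except ValueError as exc:
    message = str(exc)

print(message)
```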

There are two ways to convert a non-numeric column into a numeric one:

  1. Assign a number to each value of the column. For example, take the workclass column: we can assign 1 to 'State-gov', 2 to 'Private', and so on. The problem is that the model will treat 'Private' as being 1 more than 'State-gov', whereas neither is less or more than the other. These two work classes are simply different categories that cannot be compared numerically. This method is not recommended.

  2. Create a new column for each distinct value of the column, and put 1 where a record has that value and 0 otherwise. For example, the column 'sex' has two values, Male and Female. We can create two new columns, Female and Male, from it. In records where sex is Male, the Male column is 1 and the Female column is 0. This is called creating dummy variables, or one-hot encoding.

To create dummy variables, we will use the pandas.get_dummies() function.
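
The two methods can be compared on a small example. This is only a sketch: the frame `toy` and the mapping `codes` are made up for illustration, and `dtype=int` is passed so that the dummy columns hold 0/1 (newer pandas versions return True/False by default).

```python
import pandas as pd

# Toy data for illustration only; not read from the adult dataset.
toy = pd.DataFrame({
    "workclass": ["State-gov", "Private", "Self-emp-not-inc", "Private"],
    "sex": ["Male", "Female", "Male", "Male"],
})

# Method 1: arbitrary integer codes. The numbers imply an order
# (1 < 2 < 3) that does not exist between work classes.
codes = {"State-gov": 1, "Private": 2, "Self-emp-not-inc": 3}
workclass_code = toy["workclass"].map(codes)

# Method 2: one-hot encoding. One 0/1 column per distinct value.
sex_dummies = pd.get_dummies(toy["sex"], prefix="sex", dtype=int)

print(workclass_code.tolist())
print(sex_dummies)
```

Each row of `sex_dummies` contains exactly one 1, in the column that matches the original value.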

Notice, however, that having two separate columns for 'sex' does not make sense, because the column has only two values, Female and Male. If the Male column is 0, the record must be Female, and vice versa, so the extra column adds no information. We can generalise this: if a column has k distinct values, k-1 dummy variables are enough.

To drop the extra column, pass the parameter drop_first=True to the get_dummies() function.

In [25]:
adult_data = pd.get_dummies(adult_data, columns=['sex'], drop_first=True)
In [26]:
adult_data.head(5)  # Note the sex_Male column and the absence of sex_Female, because drop_first=True was used
Out[26]:
   age         workclass  fnlwgt  education  education-num      marital-status         occupation   relationship   race  capital-gain  capital-loss  hours-per-week  native-country  salary  sex_Male
0   39         State-gov   77516  Bachelors             13       Never-married       Adm-clerical  Not-in-family  White          2174             0              40   United-States   <=50K         1
1   50  Self-emp-not-inc   83311  Bachelors             13  Married-civ-spouse    Exec-managerial        Husband  White             0             0              13   United-States   <=50K         1
2   38           Private  215646    HS-grad              9            Divorced  Handlers-cleaners  Not-in-family  White             0             0              40   United-States   <=50K         1
3   53           Private  234721       11th              7  Married-civ-spouse  Handlers-cleaners        Husband  Black             0             0              40   United-States   <=50K         1
4   28           Private  338409  Bachelors             13  Married-civ-spouse     Prof-specialty           Wife  Black             0             0              40            Cuba   <=50K         0
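
The k-1 rule is easier to see on a column with more than two values. The sketch below uses a made-up three-value workclass column (`toy` is hypothetical, not the adult dataset) and again assumes `dtype=int` to get 0/1 output.

```python
import pandas as pd

# Toy column with k = 3 distinct values, for illustration only.
toy = pd.DataFrame(
    {"workclass": ["State-gov", "Private", "Self-emp-not-inc", "Private"]}
)

# All k columns versus k-1 columns.
full = pd.get_dummies(toy, columns=["workclass"], dtype=int)
reduced = pd.get_dummies(toy, columns=["workclass"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

The dropped value is the first one in sorted order, 'Private' here. A row with 0 in every remaining column therefore stands for 'Private', so no information is lost.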

You can apply the same one-hot encoding to other columns as well. If a column has so many distinct values that encoding it would produce too many new columns, you can club some of the values together. For example, the 'education' column has many values; education up to the 12th standard can be clubbed into a single value, 'Schooling'.
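
Clubbing values before encoding could look like the sketch below. It runs on a made-up frame `toy`, and the list `school_levels` is an assumption about which labels count as "up to 12th standard"; adjust it to the labels actually present in your data. The `str.strip()` call is there because raw values may carry a leading space, as in the error message shown earlier.

```python
import pandas as pd

# Toy data for illustration only; not read from the adult dataset.
toy = pd.DataFrame(
    {"education": ["11th", "Bachelors", " HS-grad", "9th", "Masters", "Preschool"]}
)

# Assumed set of labels to club together (hypothetical choice).
school_levels = ["Preschool", "1st-4th", "5th-6th", "7th-8th",
                 "9th", "10th", "11th", "12th"]

# Remove stray whitespace, then replace every school-level label
# with the single value 'Schooling'.
education = toy["education"].str.strip()
toy["education"] = education.mask(education.isin(school_levels), "Schooling")

# Fewer distinct values means fewer dummy columns.
encoded = pd.get_dummies(toy, columns=["education"], drop_first=True, dtype=int)

print(toy["education"].tolist())
print(list(encoded.columns))
```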