In [1]:
import pandas as pd
In [2]:
data_link = r'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

If you look at the dataset, it has a few missing values, which are represented by '?'. We need to replace '?' with NaN (not a number) while reading the data so that pandas recognises them as null/missing values.

We will use the read_csv function of pandas to read the data from the web.

  • ',\s' is used as the separator; it is a regular expression that matches a comma followed by a single whitespace character
  • the na_values parameter is set to ["?"] so that '?' is treated as a null value
In [3]:
adult_data = pd.read_csv(data_link, header=None, sep=r',\s', na_values=["?"])  # raw string keeps the regex backslash intact
C:\Users\crackthetech\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  """Entry point for launching an IPython kernel.

The warning can be ignored. It only says that pandas fell back to the 'python' parsing engine because the 'c' engine does not support regex separators such as ',\s'. As the message itself notes, specifying engine='python' avoids the warning.

Let's look at the first 5 rows of the data.
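
The separator and na_values settings described above can be tried on a small in-memory sample without downloading anything. This is only a sketch: the two-row `sample` string and the names `sample_df` and `alt_df` are made up for illustration and are not part of the original notebook. It also shows one possible alternative, a plain ',' separator with skipinitialspace=True, which stays on the default 'c' engine.

```python
import io

import pandas as pd

# Hypothetical two-row sample laid out like adult.data (", " between fields).
sample = (
    "39, State-gov, 77516, Bachelors\n"
    "54, ?, 180211, Some-college\n"
)

# Same settings as the notebook, with the engine named explicitly
# so that no ParserWarning is raised for the regex separator.
sample_df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    sep=r",\s",        # regex: a comma followed by one whitespace character
    na_values=["?"],   # treat '?' as missing
    engine="python",
)

# Alternative sketch: a plain comma separator, skipping the space after it.
alt_df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    sep=",",
    skipinitialspace=True,
    na_values=["?"],
)

print(sample_df)
print(sample_df.isna().sum())  # missing values per column
```

Both calls should turn the '?' in the second row into NaN; which variant to prefer is a matter of taste.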

In [4]:
adult_data.head()
Out[4]:
    0                 1       2          3   4                   5                  6              7      8       9    10  11  12             13     14
0  39         State-gov   77516  Bachelors  13       Never-married       Adm-clerical  Not-in-family  White    Male  2174   0  40  United-States  <=50K
1  50  Self-emp-not-inc   83311  Bachelors  13  Married-civ-spouse    Exec-managerial        Husband  White    Male     0   0  13  United-States  <=50K
2  38           Private  215646    HS-grad   9            Divorced  Handlers-cleaners  Not-in-family  White    Male     0   0  40  United-States  <=50K
3  53           Private  234721       11th   7  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male     0   0  40  United-States  <=50K
4  28           Private  338409  Bachelors  13  Married-civ-spouse     Prof-specialty           Wife  Black  Female     0   0  40           Cuba  <=50K

The columns are currently labelled with integers, so we need to give the dataframe proper column names. The column names are listed on the dataset's page. Collect them in a list as below.

In [5]:
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship',
                'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'salary']

Assign the column names to the dataframe.

In [6]:
adult_data.columns = column_names
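
As a side note, the names can also be supplied while reading, through the names parameter of read_csv, instead of being assigned afterwards. The sketch below uses a made-up three-column sample; `sample`, `sample_names` and `named_df` are illustrative names, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical sample with only the first three fields of adult.data.
sample = (
    "39, State-gov, 77516\n"
    "50, Self-emp-not-inc, 83311\n"
)
sample_names = ["age", "workclass", "fnlwgt"]  # first three of the column names

# names= labels the columns at read time; header=None says the data has no header row.
named_df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=sample_names,
    sep=r",\s",
    engine="python",
)

print(named_df.columns.tolist())
```

Either approach should give the same result; assigning to .columns afterwards, as done here, keeps the reading step and the naming step separate.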

Now, let's look at the data type of each column.

In [7]:
adult_data.dtypes
Out[7]:
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
salary            object
dtype: object
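
If it helps to see the numeric and text columns separately, select_dtypes can split them by type. This is a minimal sketch on a hand-built frame; `demo`, `numeric_cols` and `text_cols` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Small hand-built frame mimicking three of the columns above.
demo = pd.DataFrame(
    {
        "age": [39, 50],
        "workclass": ["State-gov", None],
        "fnlwgt": [77516, 83311],
    }
)

# Columns with a numeric dtype (int64 here).
numeric_cols = demo.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric, i.e. the text columns in this frame.
text_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```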
In [8]:
adult_data.head(5)
Out[8]:
   age         workclass  fnlwgt  education  education-num      marital-status         occupation   relationship   race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary
0   39         State-gov   77516  Bachelors             13       Never-married       Adm-clerical  Not-in-family  White    Male          2174             0              40   United-States   <=50K
1   50  Self-emp-not-inc   83311  Bachelors             13  Married-civ-spouse    Exec-managerial        Husband  White    Male             0             0              13   United-States   <=50K
2   38           Private  215646    HS-grad              9            Divorced  Handlers-cleaners  Not-in-family  White    Male             0             0              40   United-States   <=50K
3   53           Private  234721       11th              7  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male             0             0              40   United-States   <=50K
4   28           Private  338409  Bachelors             13  Married-civ-spouse     Prof-specialty           Wife  Black  Female             0             0              40            Cuba   <=50K

Check the shape of the dataframe.

In [9]:
adult_data.shape
Out[9]:
(32561, 15)
In [10]:
adult_data = adult_data.iloc[:1000]  # for simplicity, keep only the first 1000 rows
In [11]:
adult_data.shape
Out[11]:
(1000, 15)
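
The positional slice used above can be illustrated on a tiny frame. The sketch below assumes nothing beyond pandas itself; `demo` and `first_three` are made-up names. It adds .copy(), which the notebook does not use, as one possible way to make the subset independent of the original frame.

```python
import pandas as pd

# Ten-row frame standing in for the full dataset.
demo = pd.DataFrame({"value": range(10)})

# iloc slices by position: rows 0, 1 and 2. The .copy() is an optional
# addition so that later edits to the subset do not touch `demo`.
first_three = demo.iloc[:3].copy()

print(first_three.shape)
print(first_three.equals(demo.head(3)))  # head(n) selects the same rows
```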