import pandas as pd
data_link = r'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
If you look at the dataset, it has a few missing values, which are represented by '?'. We need to replace '?' with NaN (not a number) while reading the data, so that pandas recognises them as null/missing values.
We will use the read_csv function of pandas to read the data from the web.
adult_data = pd.read_csv(data_link, header=None, sep=r',\s', na_values=["?"])  # raw string keeps the regex '\s' intact
You can ignore the warning. It only says that the 'c' engine is not being used to read the data because it does not support regex separators such as the one used here, so pandas falls back to the 'python' engine.
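The effect of na_values and the regex separator can be checked without touching the network. The sketch below is only an illustration: the two rows are made up in the same ", "-separated layout as the dataset, and engine='python' is an optional addition (not part of the call above) that tells pandas up front which engine to use, so there is no fallback to warn about.

```python
import io
import pandas as pd

# Two made-up rows in the same ", "-separated layout as the dataset (illustrative only).
sample = io.StringIO(
    "39, State-gov, 77516\n"
    "54, ?, 140359\n"
)

# Naming the python engine explicitly means pandas does not need to fall back to it.
demo = pd.read_csv(sample, header=None, sep=r',\s', na_values=["?"], engine='python')

# Count missing values per column; the '?' in the second row is counted here.
print(demo.isna().sum())
```

If the '?' had not been listed in na_values, the second column would simply contain the string '?' and isna() would report no missing values.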
Let's have a look at the top 5 rows of the data.
adult_data.head()
Because we read the file with header=None, the columns are currently labelled with integers (0, 1, 2, ...). We need to give the dataframe proper column names. The column names are given in the dataset description at the link; put them in a list as below.
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship',
'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'salary']
Assign the column names to the dataframe.
adult_data.columns = column_names
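As an alternative to assigning the columns afterwards, read_csv also accepts the names through its names parameter, which does both steps in one call. This is a minimal sketch on made-up rows; the three-column list demo_names and the sample data are hypothetical and only stand in for the full list above.

```python
import io
import pandas as pd

# Hypothetical three-column stand-in for the full column list.
demo_names = ['age', 'workclass', 'fnlwgt']

# Made-up rows in the same ", "-separated layout as the dataset.
sample = io.StringIO(
    "39, State-gov, 77516\n"
    "50, Private, 83311\n"
)

# names= labels the columns at read time; header=None says the file has no header row.
named = pd.read_csv(sample, header=None, names=demo_names,
                    sep=r',\s', na_values=["?"], engine='python')

print(named.columns.tolist())
```

Both approaches give the same dataframe; setting the names at read time just avoids the intermediate integer labels.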
Now, let's have a look at the data type of each column.
adult_data.dtypes
adult_data.head(5)
Check the shape of the dataframe.
adult_data.shape
adult_data = adult_data.iloc[:1000].copy()  # for simplicity, keep only the first 1000 rows (copy() makes the subset independent of the original)
adult_data.shape