import pandas as pd

data_link = r'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

The dataset has a few missing values, which are represented by '?'. While reading the data, we need to replace '?' with NaN (not a number) so that pandas recognises them as null/missing values.

We will use the read_csv function of pandas to read the data from the web.

sep=',\s' is a regular expression that matches a comma followed by a single whitespace character, which is how the fields are separated in this file.
na_values=["?"] tells read_csv to treat '?' as a null value.

adult_data = pd.read_csv(data_link, header=None, sep=r',\s', na_values=["?"])

C:\Users\crackthetech\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  """Entry point for launching an IPython kernel.

You can ignore this warning. It only says that the 'c' engine does not support regex separators (sep=',\s'), so pandas falls back to the 'python' engine to read the data. Passing engine='python' explicitly avoids the warning.
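As a rough sketch of how the separator and na_values work together, the snippet below parses a tiny in-memory sample instead of the real URL. The sample text and the names sample_text, regex_df and alt_df are made up for illustration. It also shows one possible alternative, sep=',' combined with skipinitialspace=True, which stays on the default engine and so does not trigger the warning.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same "comma + space" layout as adult.data.
sample_text = (
    "39, State-gov, 77516, Bachelors\n"
    "54, ?, 180211, Some-college\n"
)

# Regex separator: a comma followed by one whitespace character.
# engine='python' is requested explicitly, so no ParserWarning is raised.
regex_df = pd.read_csv(io.StringIO(sample_text), header=None,
                       sep=r',\s', na_values=["?"], engine='python')

# Alternative: plain comma separator, skipping the space after each comma.
alt_df = pd.read_csv(io.StringIO(sample_text), header=None,
                     sep=',', skipinitialspace=True, na_values=["?"])

print(regex_df)
print(regex_df.isna().sum())  # number of missing values per column
```

In both dataframes the '?' in the second row is read as NaN rather than as a string.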

Let's look at the top 5 rows of the data.

adult_data.head()

Because we passed header=None, the columns are currently labelled with integers. We need to give the dataframe proper column names. The names are given at the dataset's link; put them in a list as below.

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship',
                'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'salary']

Assign the column names to the dataframe.

adult_data.columns = column_names
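The names can also be supplied at read time through the names parameter of read_csv, which saves the separate assignment step. Here is a minimal sketch on a made-up in-memory sample; sample_text, sample_names and named_df are illustrative names, and only the first three columns are used to keep it short.

```python
import io
import pandas as pd

# Hypothetical sample with only the first three fields of each row.
sample_text = (
    "39, State-gov, 77516\n"
    "50, Self-emp-not-inc, 83311\n"
)
sample_names = ['age', 'workclass', 'fnlwgt']

# When names is given and header is not, pandas treats the file as having no header row.
named_df = pd.read_csv(io.StringIO(sample_text), names=sample_names,
                       skipinitialspace=True)

print(named_df.columns.tolist())
```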

Now, let's look at the data type of each column.

adult_data.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
salary            object
dtype: object
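The dtypes listing shows a mix of integer columns and text columns. If you want those two groups as lists, one option is select_dtypes. The sketch below uses a small made-up dataframe (toy) rather than adult_data, so the column values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature dataframe with the same kind of mix as adult_data.
toy = pd.DataFrame({
    'age': [39, 50],
    'workclass': ['State-gov', 'Private'],
    'hours-per-week': [40, 13],
})

# Numeric columns, and everything that is not numeric (the text columns here).
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
text_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(text_cols)
```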

adult_data.head(5)

Check the shape of the dataframe.

adult_data.shape

(32561, 15)

adult_data = adult_data.iloc[:1000]  # for simplicity taking only first 1000 rows

adult_data.shape

(1000, 15)
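If only the first rows are needed, another option is to limit the read itself with the nrows parameter of read_csv instead of slicing afterwards. The sketch below compares the two approaches on a made-up four-row sample; sample_text, first_two and sliced are illustrative names.

```python
import io
import pandas as pd

# Hypothetical four-row sample with two fields per row.
sample_text = (
    "39, State-gov\n"
    "50, Self-emp-not-inc\n"
    "38, Private\n"
    "53, Private\n"
)

# Read only the first two rows.
first_two = pd.read_csv(io.StringIO(sample_text), header=None,
                        skipinitialspace=True, nrows=2)

# Read everything, then keep the first two rows by position.
sliced = pd.read_csv(io.StringIO(sample_text), header=None,
                     skipinitialspace=True).iloc[:2]

print(first_two.shape)
print(sliced.shape)
```

With nrows the remaining rows are never parsed, which can matter for large files.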

Output of adult_data.head() before the column names were assigned:

	0	1	2	3	4	5	6	7	8	9	10	12	13	14
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

Output of adult_data.head(5) after the column names were assigned:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K