Pandas: data manipulation¶
It is often said that 80% of data analysis is spent on cleaning and preparing data. This section focuses on a small, but important, aspect of that work: data manipulation and cleaning with Pandas.
Sources:
Kevin Markham: https://github.com/justmarkham
Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html
Data structures
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call pd.Series([1,3,5,np.nan,6,8])
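A minimal sketch (the explicit index labels in the second Series are an illustrative assumption, not from the text above):
import pandas as pd
import numpy as np

s = pd.Series([1, 3, 5, np.nan, 6, 8])  # default integer index 0..5
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # hypothetical labeled index
print(s2['b'])  # 20, access by label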
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the R data.frame() object.
import pandas as pd
import numpy as np
Create DataFrame¶
columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, "F", "student"],
                      ['john', 26, "M", "student"]],
                     columns=columns)
user2 = pd.DataFrame([['eric', 22, "M", "student"],
                      ['paul', 58, "F", "manager"]],
                     columns=columns)
user3 = pd.DataFrame(dict(name=['peter', 'julie'],
                          age=[33, 44], gender=['M', 'F'],
                          job=['engineer', 'scientist']))
print(user3)
name age gender job
0 peter 33 M engineer
1 julie 44 F scientist
Combining DataFrames¶
Concatenate DataFrame¶
Concatenate columns (axis = 1).
height = pd.DataFrame(dict(height=[1.65, 1.8]))
print(user1, "\n", height)
print(pd.concat([user1, height], axis=1))
name age gender job
0 alice 19 F student
1 john 26 M student
height
0 1.65
1 1.80
name age gender job height
0 alice 19 F student 1.65
1 john 26 M student 1.80
Concatenate rows (default: axis = 0)
users = pd.concat([user1, user2, user3])
print(users)
name age gender job
0 alice 19 F student
1 john 26 M student
0 eric 22 M student
1 paul 58 F manager
0 peter 33 M engineer
1 julie 44 F scientist
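Note that concat preserves the per-frame row indices (0, 1 repeat above). If the index is meaningless, ignore_index=True renumbers the rows; a minimal sketch:
users_reindexed = pd.concat([user1, user2, user3], ignore_index=True)
print(users_reindexed.index)  # RangeIndex(start=0, stop=6, step=1)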
Join DataFrame¶
user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'],
                          height=[165, 180, 175, 171]))
print(user4)
name height
0 alice 165
1 john 180
2 eric 175
3 julie 171
Use intersection of keys from both frames
merge_inter = pd.merge(users, user4)
print(merge_inter)
name age gender job height
0 alice 19 F student 165
1 john 26 M student 180
2 eric 22 M student 175
3 julie 44 F scientist 171
Use union of keys from both frames
users = pd.merge(users, user4, on="name", how='outer')
print(users)
name age gender job height
0 alice 19 F student 165.0
1 eric 22 M student 175.0
2 john 26 M student 180.0
3 julie 44 F scientist 171.0
4 paul 58 F manager NaN
5 peter 33 M engineer NaN
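Other join types follow the same pattern; for instance, how='left' keeps every row of the left frame. A minimal sketch with two small hypothetical frames:
left = pd.DataFrame(dict(name=['alice', 'paul'], age=[19, 58]))
right = pd.DataFrame(dict(name=['alice', 'eric'], height=[165, 175]))
print(pd.merge(left, right, on="name", how="left"))
#     name  age  height
# 0  alice   19   165.0
# 1   paul   58     NaN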
Reshaping by pivoting¶
“Unpivots” a DataFrame from wide format to long (stacked) format:
stacked = pd.melt(users, id_vars="name", var_name="variable", value_name="value")
print(stacked)
name variable value
0 alice age 19
1 eric age 22
2 john age 26
3 julie age 44
4 paul age 58
5 peter age 33
6 alice gender F
7 eric gender M
8 john gender M
9 julie gender F
10 paul gender F
11 peter gender M
12 alice job student
13 eric job student
14 john job student
15 julie job scientist
16 paul job manager
17 peter job engineer
18 alice height 165.0
19 eric height 175.0
20 john height 180.0
21 julie height 171.0
22 paul height NaN
23 peter height NaN
“Pivots” a DataFrame from long (stacked) format to wide format:
wide = stacked.pivot(index='name', columns='variable', values='value')
print(wide)
variable age gender height job
name
alice 19 F 165.0 student
eric 22 M 175.0 student
john 26 M 180.0 student
julie 44 F 171.0 scientist
paul 58 F NaN manager
peter 33 M NaN engineer
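Since pivot moved name into the index, reset_index() recovers it as a regular column (a minimal sketch):
flat = wide.reset_index()
print(flat.columns.tolist())  # ['name', 'age', 'gender', 'height', 'job']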
Summarizing¶
users # print the first 30 and last 30 rows
type(users) # DataFrame
users.head() # print the first 5 rows
users.tail() # print the last 5 rows
Meta-information
users.columns # Column names
users.index # Row index (labels)
users.shape # number of rows and columns
users.dtypes # data types of each column
users.values # underlying numpy array
array([['alice', 19, 'F', 'student', 165.0],
['eric', 22, 'M', 'student', 175.0],
['john', 26, 'M', 'student', 180.0],
['julie', 44, 'F', 'scientist', 171.0],
['paul', 58, 'F', 'manager', nan],
['peter', 33, 'M', 'engineer', nan]], dtype=object)
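Note that pandas now recommends to_numpy() over the values attribute for extracting the underlying array:
users.to_numpy()  # same object-dtype array as users.values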
Columns selection¶
print(users.columns)
users['gender'] # select one column
type(users['gender']) # Series
users.gender # select one column using attribute access
# select multiple columns
users[['age', 'gender']] # select two columns
my_cols = ['age', 'gender'] # or, create a list...
users[my_cols] # ...and use that list to select columns
type(users[my_cols]) # DataFrame
Index(['name', 'age', 'gender', 'job', 'height'], dtype='object')
iloc is strictly integer position based
users.iloc[:, 2] # select third column
0 F
1 M
2 M
3 F
4 F
5 M
Name: gender, dtype: object
Rows selection (basic)¶
iloc is strictly integer position based
df = users.copy()
df.iloc[0] # first row
df.iloc[0, :] # first row
df.iloc[[0, 1], :] # first two rows
df.iloc[0, 0] # first item of first row
df.iloc[0, 0] = 55
loc supports mixed integer and label based access.
df.loc[0] # first row
df.loc[0, :] # first row
df.loc[0, "age"] # age item of first row
df.loc[0, "age"] = 55
Selection and index
Select females into a new DataFrame
df = users[users.gender == "F"]
print(df)
name age gender job height
0 alice 19 F student 165.0
3 julie 44 F scientist 171.0
4 paul 58 F manager NaN
Reset index, useful when index is meaningless
df = df.reset_index(drop=True) # Watch the index
print(df)
name age gender job height
0 alice 19 F student 165.0
1 julie 44 F scientist 171.0
2 paul 58 F manager NaN
Rows iteration¶
df = users[:2].copy()
iterrows(): slow, get series, read-only
Returns (index, Series) pairs.
Slow because iterrows boxes the data into a Series.
Retrieve fields with column name
Don’t modify something you are iterating over. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
for idx, row in df.iterrows():
    print(row["name"], row["age"])
alice 19
eric 22
itertuples(): fast, get namedtuples, read-only
Returns namedtuples of the values, and is generally faster than iterrows.
Fast, because itertuples does not box the data into a Series.
Retrieve fields with integer index starting from 0.
Names will be renamed to positional names if they are invalid Python identifiers.
for tup in df.itertuples():
    print(tup[1], tup[2])
alice 19
eric 22
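Fields can also be retrieved by attribute, since each row is a namedtuple (tup.Index holds the row index):
for tup in df.itertuples():
    print(tup.name, tup.age)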
iter using loc[i, …]: read and write
for i in range(df.shape[0]):
    df.loc[i, "age"] *= 10  # df is modified
Rows selection (filtering)¶
Simple logical filtering on numerical values
users[users.age < 20] # only show users with age < 20
young_bool = users.age < 20 # or, create a Series of booleans...
young = users[young_bool] # ...and use that Series to filter rows
users[users.age < 20].job # select one column from the filtered results
print(young)
name age gender job height
0 alice 19 F student 165.0
Simple logical filtering on categorical values
users[users.job == 'student']
users[users.job.isin(['student', 'engineer'])]
users[users['job'].str.contains("stu|scient")]
Advanced logical filtering
users[users.age < 20][['age', 'job']] # select multiple columns
users[(users.age > 20) & (users.gender == 'M')] # use multiple conditions
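The query() method expresses the same filters as a string, which some find more readable; a sketch of the same multiple-condition filter:
users.query("age > 20 and gender == 'M'")  # equivalent to the boolean-mask version above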
Sorting¶
df = users.copy()
df.age.sort_values() # only works for a Series
df.sort_values(by='age') # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age']) # sort by multiple columns
df.sort_values(by=['job', 'age'], inplace=True) # modify df
print(df)
name age gender job height
5 peter 33 M engineer NaN
4 paul 58 F manager NaN
3 julie 44 F scientist 171.0
0 alice 19 F student 165.0
1 eric 22 M student 175.0
2 john 26 M student 180.0
Descriptive statistics¶
Summarize all numeric columns
print(df.describe())
age height
count 6.000000 4.000000
mean 33.666667 172.750000
std 14.895189 6.344289
min 19.000000 165.000000
25% 23.000000 169.500000
50% 29.500000 173.000000
75% 41.250000 176.250000
max 58.000000 180.000000
Summarize all columns
print(df.describe(include='all'))
print(df.describe(include=['object'])) # limit to one (or more) types
name age gender job height
count 6 6.000000 6 6 4.000000
unique 6 NaN 2 4 NaN
top peter NaN M student NaN
freq 1 NaN 3 3 NaN
mean NaN 33.666667 NaN NaN 172.750000
std NaN 14.895189 NaN NaN 6.344289
min NaN 19.000000 NaN NaN 165.000000
25% NaN 23.000000 NaN NaN 169.500000
50% NaN 29.500000 NaN NaN 173.000000
75% NaN 41.250000 NaN NaN 176.250000
max NaN 58.000000 NaN NaN 180.000000
name gender job
count 6 6 6
unique 6 2 4
top peter M student
freq 1 3 3
Categorical columns: count and proportions of values
df['job'].value_counts()
df['job'].value_counts(normalize=True).round(2)
job
student 0.50
engineer 0.17
manager 0.17
scientist 0.17
Name: proportion, dtype: float64
Categorical columns: length of strings
df['job'].str.len()
5 8
4 7
3 9
0 7
1 7
2 7
Name: job, dtype: int64
Statistics per group (groupby)
print(df.groupby("job")["age"].mean())
# print(df.groupby("job").describe(include='all'))
job
engineer 33.000000
manager 58.000000
scientist 44.000000
student 22.333333
Name: age, dtype: float64
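Several statistics per group can be computed at once with agg(); a minimal sketch:
print(df.groupby("job")["age"].agg(["mean", "std", "count"]))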
Groupby in a loop
for grp, data in df.groupby("job"):
    print(grp, data)
engineer name age gender job height
5 peter 33 M engineer NaN
manager name age gender job height
4 paul 58 F manager NaN
scientist name age gender job height
3 julie 44 F scientist 171.0
student name age gender job height
0 alice 19 F student 165.0
1 eric 22 M student 175.0
2 john 26 M student 180.0
Quality check¶
Remove duplicate data¶
df = users.copy()
# Create a duplicate: Append the first at the end
df.loc[len(df.index)] = users.iloc[0]
print(df.duplicated()) # Series of booleans
# (True if a row is identical to a previous row)
df.duplicated().sum() # count of duplicates
df[df.duplicated()] # only show duplicates
df.age.duplicated() # check a single column for duplicates
df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates
df = df.drop_duplicates() # drop duplicate rows
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
Missing data¶
# Missing values are often just excluded
df = users.copy()
df.describe(include='all')
# find missing values in a Series
df.height.isnull() # True if NaN, False otherwise
df.height.notnull() # False if NaN, True otherwise
df[df.height.notnull()] # only show rows where height is not NaN
df.height.isnull().sum() # count the missing values
# find missing values in a DataFrame
df.isnull() # DataFrame of booleans
df.isnull().sum() # calculate the sum of each column
name 0
age 0
gender 0
job 0
height 2
dtype: int64
Strategy 1: drop missing values
df.dropna() # drop a row if ANY values are missing
df.dropna(how='all') # drop a row only if ALL values are missing
Strategy 2: fill in missing values
df.height.mean()
df = users.copy()
df.loc[df.height.isnull(), "height"] = df["height"].mean()
print(df)
name age gender job height
0 alice 19 F student 165.00
1 eric 22 M student 175.00
2 john 26 M student 180.00
3 julie 44 F scientist 171.00
4 paul 58 F manager 172.75
5 peter 33 M engineer 172.75
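The same imputation can be written more idiomatically with fillna():
df = users.copy()
df["height"] = df["height"].fillna(df["height"].mean())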
Operation: multiplication¶
Element-wise multiplication of a DataFrame by a Series, aligned on the index.
df = users.dropna()
df.insert(0, 'random', np.arange(df.shape[0]))
print(df)
df[["age", "height"]].multiply(df["random"], axis="index")
random name age gender job height
0 0 alice 19 F student 165.0
1 1 eric 22 M student 175.0
2 2 john 26 M student 180.0
3 3 julie 44 F scientist 171.0
Renaming¶
Rename columns
df = users.copy()
df.rename(columns={'name': 'NAME'})  # returns a renamed copy; df is unchanged
Rename values
df.job = df.job.map({'student': 'etudiant', 'manager': 'manager',
                     'engineer': 'ingenieur', 'scientist': 'scientific'})
Dealing with outliers¶
size = pd.Series(np.random.normal(loc=175, size=20, scale=10))
# Corrupt the first 3 measures
size[:3] += 500
Based on parametric statistics: use the mean¶
Assume the random variable follows the normal distribution. Exclude data outside 3 standard deviations:
- Probability that a sample lies within 1 sd: 68.27%
- Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 * 15.73)
size_outlr_mean = size.copy()
size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean()
print(size_outlr_mean.mean())
248.48963819938044
Based on non-parametric statistics: use the median¶
The median absolute deviation (MAD) is a robust non-parametric statistic based on the median. The factor 1.4826 below scales the MAD so that it is consistent with the standard deviation for normally distributed data.
mad = 1.4826 * np.median(np.abs(size - size.median()))
size_outlr_mad = size.copy()
size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median()
print(size_outlr_mad.mean(), size_outlr_mad.median())
173.80000467192673 178.7023568870694
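The same scaled MAD is also available in SciPy (assuming scipy >= 1.5 for the scale='normal' option):
from scipy import stats
mad_scipy = stats.median_abs_deviation(size, scale='normal')  # approximately equal to mad above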
File I/O¶
csv¶
import tempfile, os.path
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
other = pd.read_csv(csv_filename)
Read csv from url¶
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
Excel¶
Package openpyxl is required. To install it, type:
conda install -c conda-forge openpyxl
xls_filename = os.path.join(tmpdir, "users.xlsx")
# Write
users.to_excel(xls_filename, sheet_name='users', index=False)
# Read
pd.read_excel(xls_filename, sheet_name='users')
# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
    users.to_excel(writer, sheet_name='users', index=False)
    df.to_excel(writer, sheet_name='salary', index=False)
pd.read_excel(xls_filename, sheet_name='users')
pd.read_excel(xls_filename, sheet_name='salary')
SQL (SQLite)¶
import pandas as pd
import sqlite3
db_filename = os.path.join(tmpdir, "users.db")
Connect
conn = sqlite3.connect(db_filename)
Creating tables with pandas
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
salary.to_sql("salary", conn, if_exists="replace")
46
Push modifications
cur = conn.cursor()
values = (100, 14000, 5, 'Bachelor', 'N')
cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
conn.commit()
Reading results into a pandas DataFrame
salary_sql = pd.read_sql_query("select * from salary;", conn)
print(salary_sql.head())
pd.read_sql_query("select * from salary;", conn).tail()
pd.read_sql_query('select * from salary where salary>25000;', conn)
pd.read_sql_query('select * from salary where experience=16;', conn)
pd.read_sql_query('select * from salary where education="Master";', conn)
index salary experience education management
0 0 13876 1 Bachelor Y
1 1 11608 1 Ph.D N
2 2 18701 1 Ph.D Y
3 3 11283 1 Master N
4 4 11767 1 Ph.D N
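When filter values come from Python variables, read_sql_query accepts a params argument, which avoids string interpolation (sqlite3 uses ? placeholders):
pd.read_sql_query("select * from salary where education = ?", conn,
                  params=("Master",))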
Exercises¶
Data Frame¶
Read the iris dataset at 'https://github.com/duchesnay/pystatsml/raw/master/datasets/iris.csv'
Print column names
Get numerical columns
For each species, compute the mean of numerical columns and store it in a stats table like:
species sepal_length sepal_width petal_length petal_width
0 setosa 5.006 3.428 1.462 0.246
1 versicolor 5.936 2.770 4.260 1.326
2 virginica 6.588 2.974 5.552 2.026
Missing data¶
Add some missing data to the previous users table:
df = users.copy()
df.loc[[0, 2], "age"] = None
df.loc[[1, 3], "gender"] = None
1. Write a function fillmissing_with_mean(df) that fills all missing values of numerical columns with the mean of the current column.
2. Save the original users and “imputed” frame in a single excel file “users.xlsx” with 2 sheets: original, imputed.