Pandas: data manipulation¶
It is often said that 80% of data analysis is spent on cleaning and preparing data. This section focuses on a small, but important, aspect of that work: data manipulation and cleaning with Pandas.
Sources:
Kevin Markham: https://github.com/justmarkham
Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html
Data structures
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call pd.Series([1,3,5,np.nan,6,8])
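A minimal sketch (the explicit index labels in the second Series are an illustrative assumption, not from the text above):
import pandas as pd
import numpy as np

s = pd.Series([1, 3, 5, np.nan, 6, 8])  # default integer index 0..5
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # hypothetical labeled index
print(s2['b'])  # 20, access by label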
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the R data.frame() object.
import pandas as pd
import numpy as np
Create DataFrame¶
columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, "F", "student"],
                      ['john', 26, "M", "student"]],
                     columns=columns)
user2 = pd.DataFrame([['eric', 22, "M", "student"],
                      ['paul', 58, "F", "manager"]],
                     columns=columns)
user3 = pd.DataFrame(dict(name=['peter', 'julie'],
                          age=[33, 44], gender=['M', 'F'],
                          job=['engineer', 'scientist']))
print(user3)
name age gender job
0 peter 33 M engineer
1 julie 44 F scientist
Combining DataFrames¶
Concatenate DataFrame¶
Concatenate columns (axis = 1).
height = pd.DataFrame(dict(height=[1.65, 1.8]))
print(user1, "\n", height)
print(pd.concat([user1, height], axis=1))
name age gender job
0 alice 19 F student
1 john 26 M student
height
0 1.65
1 1.80
name age gender job height
0 alice 19 F student 1.65
1 john 26 M student 1.80
Concatenate rows (default: axis = 0)
users = pd.concat([user1, user2, user3])
print(users)
name age gender job
0 alice 19 F student
1 john 26 M student
0 eric 22 M student
1 paul 58 F manager
0 peter 33 M engineer
1 julie 44 F scientist
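Note that concat preserves the per-frame row indices (0, 1 repeat above). If the index is meaningless, ignore_index=True renumbers the rows; a minimal sketch:
users_reindexed = pd.concat([user1, user2, user3], ignore_index=True)
print(users_reindexed.index)  # RangeIndex(start=0, stop=6, step=1)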
Join DataFrame¶
user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'],
                          height=[165, 180, 175, 171]))
print(user4)
name height
0 alice 165
1 john 180
2 eric 175
3 julie 171
Use intersection of keys from both frames
merge_inter = pd.merge(users, user4)
print(merge_inter)
name age gender job height
0 alice 19 F student 165
1 john 26 M student 180
2 eric 22 M student 175
3 julie 44 F scientist 171
Use union of keys from both frames
users = pd.merge(users, user4, on="name", how='outer')
print(users)
name age gender job height
0 alice 19 F student 165.0
1 eric 22 M student 175.0
2 john 26 M student 180.0
3 julie 44 F scientist 171.0
4 paul 58 F manager NaN
5 peter 33 M engineer NaN
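Other join types follow the same pattern; for instance, how='left' keeps every row of the left frame. A minimal sketch with two small hypothetical frames:
left = pd.DataFrame(dict(name=['alice', 'paul'], age=[19, 58]))
right = pd.DataFrame(dict(name=['alice', 'eric'], height=[165, 175]))
print(pd.merge(left, right, on="name", how="left"))
#     name  age  height
# 0  alice   19   165.0
# 1   paul   58     NaN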
Reshaping by pivoting¶
“Unpivots” a DataFrame from wide format to long (stacked) format:
stacked = pd.melt(users, id_vars="name", var_name="variable", value_name="value")
print(stacked)
name variable value
0 alice age 19
1 eric age 22
2 john age 26
3 julie age 44
4 paul age 58
5 peter age 33
6 alice gender F
7 eric gender M
8 john gender M
9 julie gender F
10 paul gender F
11 peter gender M
12 alice job student
13 eric job student
14 john job student
15 julie job scientist
16 paul job manager
17 peter job engineer
18 alice height 165.0
19 eric height 175.0
20 john height 180.0
21 julie height 171.0
22 paul height NaN
23 peter height NaN
“Pivots” a DataFrame from long (stacked) format to wide format:
wide = stacked.pivot(index='name', columns='variable', values='value')
print(wide)
variable age gender height job
name
alice 19 F 165.0 student
eric 22 M 175.0 student
john 26 M 180.0 student
julie 44 F 171.0 scientist
paul 58 F NaN manager
peter 33 M NaN engineer
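Since pivot moved name into the index, reset_index() recovers it as a regular column (a minimal sketch):
flat = wide.reset_index()
print(flat.columns.tolist())  # ['name', 'age', 'gender', 'height', 'job']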
Summarizing¶
users # print the first 30 and last 30 rows
type(users) # DataFrame
users.head() # print the first 5 rows
users.tail() # print the last 5 rows
Meta-information
users.columns # Column names
users.index # Row index (labels)
users.shape # number of rows and columns
users.dtypes # data types of each column
users.values # underlying numpy array
array([['alice', 19, 'F', 'student', 165.0],
['eric', 22, 'M', 'student', 175.0],
['john', 26, 'M', 'student', 180.0],
['julie', 44, 'F', 'scientist', 171.0],
['paul', 58, 'F', 'manager', nan],
['peter', 33, 'M', 'engineer', nan]], dtype=object)
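Note that pandas now recommends to_numpy() over the values attribute for extracting the underlying array:
users.to_numpy()  # same object-dtype array as users.values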
Columns selection¶
print(users.columns)
users['gender'] # select one column
type(users['gender']) # Series
users.gender # select one column using attribute access
# select multiple columns
users[['age', 'gender']] # select two columns
my_cols = ['age', 'gender'] # or, create a list...
users[my_cols] # ...and use that list to select columns
type(users[my_cols]) # DataFrame
Index(['name', 'age', 'gender', 'job', 'height'], dtype='object')
iloc is strictly integer position based
users.iloc[:, 2] # select third column
0 F
1 M
2 M
3 F
4 F
5 M
Name: gender, dtype: object
Rows selection (basic)¶
iloc is strictly integer position based
df = users.copy()
df.iloc[0] # first row
df.iloc[0, :] # first row
df.iloc[[0, 1], :] # first two rows
df.iloc[0, 0] # first item of first row
df.iloc[0, 0] = 55
loc supports mixed integer and label based access.
df.loc[0] # first row
df.loc[0, :] # first row
df.loc[0, "age"] # age item of first row
df.loc[0, "age"] = 55
Selection and index
Select females into a new DataFrame
df = users[users.gender == "F"]
print(df)
name age gender job height
0 alice 19 F student 165.0
3 julie 44 F scientist 171.0
4 paul 58 F manager NaN
Reset index, useful when index is meaningless
df = df.reset_index(drop=True) # Watch the index
print(df)
name age gender job height
0 alice 19 F student 165.0
1 julie 44 F scientist 171.0
2 paul 58 F manager NaN
Rows iteration¶
df = users[:2].copy()
iterrows(): slow, get series, read-only
Returns (index, Series) pairs.
Slow because iterrows boxes the data into a Series.
Retrieve fields with column name
Don’t modify something you are iterating over. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
for idx, row in df.iterrows():
    print(row["name"], row["age"])
alice 19
eric 22
itertuples(): fast, get namedtuples, read-only
Returns namedtuples of the values, and is generally faster than iterrows.
Fast, because itertuples does not box the data into a Series.
Retrieve fields with integer index starting from 0.
Names will be renamed to positional names if they are invalid Python identifiers.
for tup in df.itertuples():
    print(tup[1], tup[2])
alice 19
eric 22
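Fields can also be retrieved by attribute, since each row is a namedtuple (tup.Index holds the row index):
for tup in df.itertuples():
    print(tup.name, tup.age)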
iter using loc[i, …]: read and write
for i in range(df.shape[0]):
    df.loc[i, "age"] *= 10  # df is modified
Rows selection (filtering)¶
Simple logical filtering on numerical values
users[users.age < 20] # only show users with age < 20
young_bool = users.age < 20 # or, create a Series of booleans...
young = users[young_bool] # ...and use that Series to filter rows
users[users.age < 20].job # select one column from the filtered results
print(young)
name age gender job height
0 alice 19 F student 165.0
Simple logical filtering on categorical values
users[users.job == 'student']
users[users.job.isin(['student', 'engineer'])]
users[users['job'].str.contains("stu|scient")]
Advanced logical filtering
users[users.age < 20][['age', 'job']] # select multiple columns
users[(users.age > 20) & (users.gender == 'M')] # use multiple conditions
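The query() method expresses the same filters as a string, which some find more readable; a sketch of the same multiple-condition filter:
users.query("age > 20 and gender == 'M'")  # equivalent to the boolean-mask version above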
Sorting¶
df = users.copy()
df.age.sort_values() # only works for a Series
df.sort_values(by='age') # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age']) # sort by multiple columns
df.sort_values(by=['job', 'age'], inplace=True) # modify df
print(df)
name age gender job height
5 peter 33 M engineer NaN
4 paul 58 F manager NaN
3 julie 44 F scientist 171.0
0 alice 19 F student 165.0
1 eric 22 M student 175.0
2 john 26 M student 180.0
Descriptive statistics¶
Summarize all numeric columns
print(df.describe())
age height
count 6.000000 4.000000
mean 33.666667 172.750000
std 14.895189 6.344289
min 19.000000 165.000000
25% 23.000000 169.500000
50% 29.500000 173.000000
75% 41.250000 176.250000
max 58.000000 180.000000
Summarize all columns
print(df.describe(include='all'))
print(df.describe(include=['object'])) # limit to one (or more) types
name age gender job height
count 6 6.000000 6 6 4.000000
unique 6 NaN 2 4 NaN
top peter NaN M student NaN
freq 1 NaN 3 3 NaN
mean NaN 33.666667 NaN NaN 172.750000
std NaN 14.895189 NaN NaN 6.344289
min NaN 19.000000 NaN NaN 165.000000
25% NaN 23.000000 NaN NaN 169.500000
50% NaN 29.500000 NaN NaN 173.000000
75% NaN 41.250000 NaN NaN 176.250000
max NaN 58.000000 NaN NaN 180.000000
name gender job
count 6 6 6
unique 6 2 4
top peter M student
freq 1 3 3
Categorical columns: count and proportions of values
df['job'].value_counts()
df['job'].value_counts(normalize=True).round(2)
job
student 0.50
engineer 0.17
manager 0.17
scientist 0.17
Name: proportion, dtype: float64
Categorical columns: length of strings
df['job'].str.len()
5 8
4 7
3 9
0 7
1 7
2 7
Name: job, dtype: int64
Statistics per group (groupby)
print(df.groupby("job")["age"].mean())
# print(df.groupby("job").describe(include='all'))
job
engineer 33.000000
manager 58.000000
scientist 44.000000
student 22.333333
Name: age, dtype: float64
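Several statistics per group can be computed at once with agg(); a minimal sketch:
print(df.groupby("job")["age"].agg(["mean", "std", "count"]))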
Groupby in a loop
for grp, data in df.groupby("job"):
    print(grp, data)
engineer name age gender job height
5 peter 33 M engineer NaN
manager name age gender job height
4 paul 58 F manager NaN
scientist name age gender job height
3 julie 44 F scientist 171.0
student name age gender job height
0 alice 19 F student 165.0
1 eric 22 M student 175.0
2 john 26 M student 180.0
Quality check¶
Remove duplicate data¶
df = users.copy()
# Create a duplicate: Append the first at the end
df.loc[len(df.index)] = users.iloc[0]
print(df.duplicated()) # Series of booleans
# (True if a row is identical to a previous row)
df.duplicated().sum() # count of duplicates
df[df.duplicated()] # only show duplicates
df.age.duplicated() # check a single column for duplicates
df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates
df = df.drop_duplicates() # drop duplicate rows
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
Missing data¶
# Missing values are often just excluded
df = users.copy()
df.describe(include='all')
# find missing values in a Series
df.height.isnull() # True if NaN, False otherwise
df.height.notnull() # False if NaN, True otherwise
df[df.height.notnull()] # only show rows where height is not NaN
df.height.isnull().sum() # count the missing values
# find missing values in a DataFrame
df.isnull() # DataFrame of booleans
df.isnull().sum() # calculate the sum of each column
name 0
age 0
gender 0
job 0
height 2
dtype: int64
Strategy 1: drop missing values
df.dropna() # drop a row if ANY values are missing
df.dropna(how='all') # drop a row only if ALL values are missing
Strategy 2: fill in missing values
df.height.mean()
df = users.copy()
df.loc[df.height.isnull(), "height"] = df["height"].mean()
print(df)
name age gender job height
0 alice 19 F student 165.00
1 eric 22 M student 175.00
2 john 26 M student 180.00
3 julie 44 F scientist 171.00
4 paul 58 F manager 172.75
5 peter 33 M engineer 172.75
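The same imputation can be written more idiomatically with fillna():
df = users.copy()
df["height"] = df["height"].fillna(df["height"].mean())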
Operation: multiplication¶
Element-wise multiplication of a DataFrame by a Series, aligned on the index.
df = users.dropna()
df.insert(0, 'random', np.arange(df.shape[0]))
print(df)
df[["age", "height"]].multiply(df["random"], axis="index")
random name age gender job height
0 0 alice 19 F student 165.0
1 1 eric 22 M student 175.0
2 2 john 26 M student 180.0
3 3 julie 44 F scientist 171.0
Renaming¶
Rename columns
df = users.copy()
df.rename(columns={'name': 'NAME'})  # returns a renamed copy; df is unchanged
Rename values
df.job = df.job.map({'student': 'etudiant', 'manager': 'manager',
                     'engineer': 'ingenieur', 'scientist': 'scientific'})
Dealing with outliers¶
size = pd.Series(np.random.normal(loc=175, size=20, scale=10))
# Corrupt the first 3 measures
size[:3] += 500
Based on parametric statistics: use the mean¶
Assume the random variable follows the normal distribution. Exclude data outside 3 standard deviations:
- Probability that a sample lies within 1 sd: 68.27%
- Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 * 15.73)
size_outlr_mean = size.copy()
size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean()
print(size_outlr_mean.mean())
248.48963819938044
Based on non-parametric statistics: use the median¶
The median absolute deviation (MAD) is a robust non-parametric statistic based on the median. The factor 1.4826 below scales the MAD so that it is consistent with the standard deviation for normally distributed data.
mad = 1.4826 * np.median(np.abs(size - size.median()))
size_outlr_mad = size.copy()
size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median()
print(size_outlr_mad.mean(), size_outlr_mad.median())
173.80000467192673 178.7023568870694
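The same scaled MAD is also available in SciPy (assuming scipy >= 1.5 for the scale='normal' option):
from scipy import stats
mad_scipy = stats.median_abs_deviation(size, scale='normal')  # approximately equal to mad above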
File I/O¶
csv¶
import tempfile, os.path
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
other = pd.read_csv(csv_filename)
Read csv from url¶
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
Excel¶
Package openpyxl is required. To install it, type:
conda install -c conda-forge openpyxl
xls_filename = os.path.join(tmpdir, "users.xlsx")
# Write
users.to_excel(xls_filename, sheet_name='users', index=False)
# Read
pd.read_excel(xls_filename, sheet_name='users')
# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
    users.to_excel(writer, sheet_name='users', index=False)
    df.to_excel(writer, sheet_name='salary', index=False)
pd.read_excel(xls_filename, sheet_name='users')
pd.read_excel(xls_filename, sheet_name='salary')
SQL (SQLite)¶
import pandas as pd
import sqlite3
db_filename = os.path.join(tmpdir, "users.db")
Connect
conn = sqlite3.connect(db_filename)
Creating tables with pandas
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
salary.to_sql("salary", conn, if_exists="replace")
46
Push modifications
cur = conn.cursor()
values = (100, 14000, 5, 'Bachelor', 'N')
cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
conn.commit()
Reading results into a pandas DataFrame
salary_sql = pd.read_sql_query("select * from salary;", conn)
print(salary_sql.head())
pd.read_sql_query("select * from salary;", conn).tail()
pd.read_sql_query('select * from salary where salary>25000;', conn)
pd.read_sql_query('select * from salary where experience=16;', conn)
pd.read_sql_query('select * from salary where education="Master";', conn)
index salary experience education management
0 0 13876 1 Bachelor Y
1 1 11608 1 Ph.D N
2 2 18701 1 Ph.D Y
3 3 11283 1 Master N
4 4 11767 1 Ph.D N
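When filter values come from Python variables, read_sql_query accepts a params argument, which avoids string interpolation (sqlite3 uses ? placeholders):
pd.read_sql_query("select * from salary where education = ?", conn,
                  params=("Master",))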
Exercises¶
Data Frame¶
Read the iris dataset at 'https://github.com/duchesnay/pystatsml/raw/master/datasets/iris.csv'
Print column names
Get numerical columns
For each species, compute the mean of numerical columns and store it in a stats table like:
species sepal_length sepal_width petal_length petal_width
0 setosa 5.006 3.428 1.462 0.246
1 versicolor 5.936 2.770 4.260 1.326
2 virginica 6.588 2.974 5.552 2.026
Missing data¶
Add some missing data to the previous users table:
df = users.copy()
df.loc[[0, 2], "age"] = None
df.loc[[1, 3], "gender"] = None
1. Write a function fillmissing_with_mean(df) that fills all missing values of numerical columns with the mean of the current column.
2. Save the original users and “imputed” frame in a single excel file “users.xlsx” with 2 sheets: original, imputed.