Pandas: data manipulation¶
It is often said that 80% of data analysis is spent on cleaning and preparing data. To get a handle on the problem, this section focuses on a small but important aspect of data manipulation and cleaning with Pandas.
Sources:
Kevin Markham: https://github.com/justmarkham
Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html
Data structures
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call pd.Series([1,3,5,np.nan,6,8])
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the R data.frame() object.
import pandas as pd
import numpy as np
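As a quick illustration of the Series constructor mentioned above (a minimal sketch; the variable name s is ours):
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)        # values paired with an automatic integer index (0 to 5)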
Create DataFrame¶
columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, "F", "student"],
['john', 26, "M", "student"]],
columns=columns)
user2 = pd.DataFrame([['eric', 22, "M", "student"],
['paul', 58, "F", "manager"]],
columns=columns)
user3 = pd.DataFrame(dict(name=['peter', 'julie'],
age=[33, 44], gender=['M', 'F'],
job=['engineer', 'scientist']))
print(user3)
Out:
name age gender job
0 peter 33 M engineer
1 julie 44 F scientist
Combining DataFrames¶
Concatenate DataFrame¶
Concatenate columns (axis = 1).
height = pd.DataFrame(dict(height=[1.65, 1.8]))
print(user1, "\n", height)
print(pd.concat([user1, height], axis=1))
Out:
name age gender job
0 alice 19 F student
1 john 26 M student
height
0 1.65
1 1.80
name age gender job height
0 alice 19 F student 1.65
1 john 26 M student 1.80
Concatenate rows (default: axis = 0)
users = pd.concat([user1, user2, user3])
print(users)
Out:
name age gender job
0 alice 19 F student
1 john 26 M student
0 eric 22 M student
1 paul 58 F manager
0 peter 33 M engineer
1 julie 44 F scientist
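Note the repeated index values (0, 1, 0, 1, ...): concat keeps each frame's original index. As an aside (leaving users unchanged), ignore_index=True renumbers the rows:
print(pd.concat([user1, user2, user3], ignore_index=True))  # index becomes 0..5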
Concatenate rows: append (removed in pandas 2.0; use pd.concat instead)
pd.concat([user1, user2])
Join DataFrame¶
user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'],
height=[165, 180, 175, 171]))
print(user4)
Out:
name height
0 alice 165
1 john 180
2 eric 175
3 julie 171
Merge on the intersection of keys from both frames (inner join, the default)
merge_inter = pd.merge(users, user4)
print(merge_inter)
Out:
name age gender job height
0 alice 19 F student 165
1 john 26 M student 180
2 eric 22 M student 175
3 julie 44 F scientist 171
Use the union of keys from both frames (outer join)
users = pd.merge(users, user4, on="name", how='outer')
print(users)
Out:
name age gender job height
0 alice 19 F student 165.0
1 john 26 M student 180.0
2 eric 22 M student 175.0
3 paul 58 F manager NaN
4 peter 33 M engineer NaN
5 julie 44 F scientist 171.0
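Other join types follow the same pattern; for instance, a sketch of a left join, which keeps every row of the left frame (using user1 and user4 from above):
pd.merge(user1, user4, on="name", how='left')  # alice and john, with heights filled from user4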
Reshaping by pivoting¶
“Unpivots” a DataFrame from wide format to long (stacked) format:
stacked = pd.melt(users, id_vars="name", var_name="variable", value_name="value")
print(stacked)
Out:
name variable value
0 alice age 19
1 john age 26
2 eric age 22
3 paul age 58
4 peter age 33
5 julie age 44
6 alice gender F
7 john gender M
8 eric gender M
9 paul gender F
10 peter gender M
11 julie gender F
12 alice job student
13 john job student
14 eric job student
15 paul job manager
16 peter job engineer
17 julie job scientist
18 alice height 165
19 john height 180
20 eric height 175
21 paul height NaN
22 peter height NaN
23 julie height 171
“Pivots” a DataFrame from long (stacked) format back to wide format:
print(stacked.pivot(index='name', columns='variable', values='value'))
Out:
variable age gender height job
name
alice 19 F 165 student
eric 22 M 175 student
john 26 M 180 student
julie 44 F 171 scientist
paul 58 F NaN manager
peter 33 M NaN engineer
Summarizing¶
users # print the DataFrame (only the first and last 30 rows for large frames)
type(users) # DataFrame
users.head() # print the first 5 rows
users.tail() # print the last 5 rows
Descriptive statistics
users.describe(include="all")
Meta-information
users.index # "Row names"
users.columns # column names
users.dtypes # data types of each column
users.values # underlying numpy array
users.shape # number of rows and columns
Out:
(6, 5)
Columns selection¶
users['gender'] # select one column
type(users['gender']) # Series
users.gender # select one column using attribute access
# select multiple columns
users[['age', 'gender']] # select two columns
my_cols = ['age', 'gender'] # or, create a list...
users[my_cols] # ...and use that list to select columns
type(users[my_cols]) # DataFrame
Rows selection (basic)¶
iloc is strictly integer position based
df = users.copy()
df.iloc[0] # first row
df.iloc[0, :] # first row
df.iloc[0, 0] # first item of first row
df.iloc[0, 0] = 55
loc is label based; here the row labels happen to be integers.
df.loc[0] # first row
df.loc[0, :] # first row
df.loc[0, "age"] # age item of first row
df.loc[0, "age"] = 55
Selection and index
Select females into a new DataFrame
df = users[users.gender == "F"]
print(df)
Out:
name age gender job height
0 alice 19 F student 165.0
3 paul 58 F manager NaN
5 julie 44 F scientist 171.0
Get the first two rows using iloc (strictly positional)
df.iloc[[0, 1], :] # Ok, but watch the index: 0, 3
Using loc fails here, because label 1 is not in the index (0, 3, 5)
try:
    df.loc[[0, 1], :]  # fails: label-based lookup with a missing label
except KeyError as err:
    print(err)
Out:
"Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([1], dtype='int64'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
Reset index
df = df.reset_index(drop=True) # Watch the index
print(df)
print(df.loc[[0, 1], :])
Out:
name age gender job height
0 alice 19 F student 165.0
1 paul 58 F manager NaN
2 julie 44 F scientist 171.0
name age gender job height
0 alice 19 F student 165.0
1 paul 58 F manager NaN
Rows iteration¶
df = users[:2].copy()
iterrows(): slow, get series, read-only
Returns (index, Series) pairs.
Slow because iterrows boxes the data into a Series.
Retrieve fields with column name
Don’t modify something you are iterating over. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
for idx, row in df.iterrows():
    print(row["name"], row["age"])
Out:
alice 19
john 26
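To illustrate the warning above: with this frame's mixed dtypes each row is a copy, so writing to it has no effect on df (a small sketch):
for idx, row in df.iterrows():
    row["age"] = 0    # modifies the temporary Series only
print(df["age"])      # ages are unchanged (19, 26)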
itertuples(): fast, get namedtuples, read-only
Returns namedtuples of the values, which is generally faster than iterrows.
Fast, because itertuples does not box the data into a Series.
Retrieve fields by integer position; position 0 holds the index.
Column names are replaced by positional names if they are not valid Python identifiers.
for tup in df.itertuples():
    print(tup[1], tup[2])
Out:
alice 19
john 26
Iterate using loc[i, ...]: read and write
for i in range(df.shape[0]):
    df.loc[i, "age"] *= 10  # df is modified
Rows selection (filtering)¶
Simple logical filtering on numerical values
users[users.age < 20] # only show users with age < 20
young_bool = users.age < 20 # or, create a Series of booleans...
young = users[young_bool] # ...and use that Series to filter rows
users[users.age < 20].job # select one column from the filtered results
print(young)
Out:
name age gender job height
0 alice 19 F student 165.0
Simple logical filtering on categorical values
users[users.job == 'student']
users[users.job.isin(['student', 'engineer'])]
users[users['job'].str.contains("stu|scient")]  # regular-expression match
Advanced logical filtering
users[users.age < 20][['age', 'job']] # select multiple columns
users[(users.age > 20) & (users.gender == 'M')] # use multiple conditions (parentheses required: & binds tighter than comparisons)
Sorting¶
df = users.copy()
df.age.sort_values() # only works for a Series
df.sort_values(by='age') # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age']) # sort by multiple columns
df.sort_values(by=['job', 'age'], inplace=True) # modify df
print(df)
Out:
name age gender job height
4 peter 33 M engineer NaN
3 paul 58 F manager NaN
5 julie 44 F scientist 171.0
0 alice 19 F student 165.0
2 eric 22 M student 175.0
1 john 26 M student 180.0
Descriptive statistics¶
Summarize all numeric columns
print(df.describe())
Out:
age height
count 6.000000 4.000000
mean 33.666667 172.750000
std 14.895189 6.344289
min 19.000000 165.000000
25% 23.000000 169.500000
50% 29.500000 173.000000
75% 41.250000 176.250000
max 58.000000 180.000000
Summarize all columns
print(df.describe(include='all'))
print(df.describe(include=['object'])) # limit to one (or more) types
Out:
name age gender job height
count 6 6.000000 6 6 4.000000
unique 6 NaN 2 4 NaN
top peter NaN F student NaN
freq 1 NaN 3 3 NaN
mean NaN 33.666667 NaN NaN 172.750000
std NaN 14.895189 NaN NaN 6.344289
min NaN 19.000000 NaN NaN 165.000000
25% NaN 23.000000 NaN NaN 169.500000
50% NaN 29.500000 NaN NaN 173.000000
75% NaN 41.250000 NaN NaN 176.250000
max NaN 58.000000 NaN NaN 180.000000
name gender job
count 6 6 6
unique 6 2 4
top peter F student
freq 1 3 3
Statistics per group (groupby)
print(df.groupby("job").mean())
print(df.groupby("job")["age"].mean())
print(df.groupby("job").describe(include='all'))
Out:
age height
job
engineer 33.000000 NaN
manager 58.000000 NaN
scientist 44.000000 171.000000
student 22.333333 173.333333
job
engineer 33.000000
manager 58.000000
scientist 44.000000
student 22.333333
Name: age, dtype: float64
name ... height
count unique top freq mean std ... std min 25% 50% 75% max
job ...
engineer 1 1 peter 1 NaN NaN ... NaN NaN NaN NaN NaN NaN
manager 1 1 paul 1 NaN NaN ... NaN NaN NaN NaN NaN NaN
scientist 1 1 julie 1 NaN NaN ... NaN 171.0 171.0 171.0 171.0 171.0
student 3 3 eric 1 NaN NaN ... 7.637626 165.0 170.0 175.0 177.5 180.0
[4 rows x 44 columns]
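Different statistics per column can be combined with agg; the column-to-function mapping below is our own illustrative choice:
print(df.groupby("job").agg({'age': 'mean', 'height': 'max'}))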
Groupby in a loop
for grp, data in df.groupby("job"):
    print(grp, data)
Out:
engineer name age gender job height
4 peter 33 M engineer NaN
manager name age gender job height
3 paul 58 F manager NaN
scientist name age gender job height
5 julie 44 F scientist 171.0
student name age gender job height
0 alice 19 F student 165.0
2 eric 22 M student 175.0
1 john 26 M student 180.0
Quality check¶
Remove duplicate data¶
df = pd.concat([users, users.iloc[[0]]], ignore_index=True)  # duplicate the first row (DataFrame.append was removed in pandas 2.0)
print(df.duplicated()) # Series of booleans
# (True if a row is identical to a previous row)
df.duplicated().sum() # count of duplicates
df[df.duplicated()] # only show duplicates
df.age.duplicated() # check a single column for duplicates
df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates
df = df.drop_duplicates() # drop duplicate rows
Out:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
Missing data¶
# Missing values are often just excluded
df = users.copy()
df.describe(include='all')
# find missing values in a Series
df.height.isnull() # True if NaN, False otherwise
df.height.notnull() # False if NaN, True otherwise
df[df.height.notnull()] # only show rows where height is not NaN
df.height.isnull().sum() # count the missing values
# find missing values in a DataFrame
df.isnull() # DataFrame of booleans
df.isnull().sum() # calculate the sum of each column
Out:
name 0
age 0
gender 0
job 0
height 2
dtype: int64
Strategy 1: drop missing values
df.dropna() # drop a row if ANY values are missing
df.dropna(how='all') # drop a row only if ALL values are missing
Strategy 2: fill in missing values
df.height.mean()
df = users.copy()
df.loc[df.height.isnull(), "height"] = df["height"].mean()
print(df)
Out:
name age gender job height
0 alice 19 F student 165.00
1 john 26 M student 180.00
2 eric 22 M student 175.00
3 paul 58 F manager 172.75
4 peter 33 M engineer 172.75
5 julie 44 F scientist 171.00
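The same imputation can be written with fillna, an equivalent and more idiomatic form:
df = users.copy()
df["height"] = df["height"].fillna(df["height"].mean())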
Operation: multiplication¶
Element-wise multiplication of a DataFrame with another object (here, a Series aligned on the index)
df = users.dropna()
df.insert(0, 'random', np.arange(df.shape[0]))  # insert a counter column at position 0
print(df)
df[["age", "height"]].multiply(df["random"], axis="index")
Out:
random name age gender job height
0 0 alice 19 F student 165.0
1 1 john 26 M student 180.0
2 2 eric 22 M student 175.0
5 3 julie 44 F scientist 171.0
Renaming¶
Rename columns
df = users.copy()
df.rename(columns={'name': 'NAME'})  # returns a renamed copy; df itself is unchanged
Rename values (values absent from the mapping become NaN)
df.job = df.job.map({'student': 'etudiant', 'manager': 'manager',
'engineer': 'ingenieur', 'scientist': 'scientific'})
Dealing with outliers¶
size = pd.Series(np.random.normal(loc=175, size=20, scale=10))
# Corrupt the first 3 measures
size[:3] += 500
Based on parametric statistics: use the mean¶
Assume the random variable follows a normal distribution and exclude data outside 3 standard deviations:
- Probability that a sample lies within 1 sd: 68.27%
- Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 * 15.73)
size_outlr_mean = size.copy()
size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean()
print(size_outlr_mean.mean())
Out:
248.48963819938044
The mean remains corrupted: the outliers inflate both the mean and the standard deviation, so the 3-sd rule fails to exclude them.
Based on non-parametric statistics: use the median¶
Median absolute deviation (MAD), based on the median, is a robust non-parametric statistic. https://en.wikipedia.org/wiki/Median_absolute_deviation
mad = 1.4826 * np.median(np.abs(size - size.median()))  # 1.4826 makes MAD consistent with the std of a normal distribution
size_outlr_mad = size.copy()
size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median()
print(size_outlr_mad.mean(), size_outlr_mad.median())
Out:
173.80000467192673 178.7023568870694
File I/O¶
CSV¶
import tempfile, os.path
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
other = pd.read_csv(csv_filename)
Read CSV from a URL¶
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
Excel¶
xls_filename = os.path.join(tmpdir, "users.xlsx")
users.to_excel(xls_filename, sheet_name='users', index=False)
pd.read_excel(xls_filename, sheet_name='users')
# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
    users.to_excel(writer, sheet_name='users', index=False)
    df.to_excel(writer, sheet_name='salary', index=False)
pd.read_excel(xls_filename, sheet_name='users')
pd.read_excel(xls_filename, sheet_name='salary')
SQL (SQLite)¶
import pandas as pd
import sqlite3
db_filename = os.path.join(tmpdir, "users.db")
Connect
conn = sqlite3.connect(db_filename)
Creating tables with pandas
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
salary.to_sql("salary", conn, if_exists="replace")
Push modifications
cur = conn.cursor()
values = (100, 14000, 5, 'Bachelor', 'N')
cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
conn.commit()
Reading results into a pandas DataFrame
salary_sql = pd.read_sql_query("select * from salary;", conn)
print(salary_sql.head())
pd.read_sql_query("select * from salary;", conn).tail()
pd.read_sql_query('select * from salary where salary>25000;', conn)
pd.read_sql_query('select * from salary where experience=16;', conn)
pd.read_sql_query("select * from salary where education='Master';", conn)  # standard SQL string literals use single quotes
Out:
index salary experience education management
0 0 13876 1 Bachelor Y
1 1 11608 1 Ph.D N
2 2 18701 1 Ph.D Y
3 3 11283 1 Master N
4 4 11767 1 Ph.D N
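Query parameters can also be passed safely through the params argument (using sqlite3's qmark placeholders):
pd.read_sql_query("select * from salary where education=?;", conn, params=('Master',))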
Exercises¶
Data Frame¶
Read the iris dataset at ‘https://github.com/neurospin/pystatsml/tree/master/datasets/iris.csv’ (with pd.read_csv, use the corresponding raw file URL)
Print column names
Get numerical columns
For each species, compute the mean of the numerical columns and store it in a stats table like:
species sepal_length sepal_width petal_length petal_width
0 setosa 5.006 3.428 1.462 0.246
1 versicolor 5.936 2.770 4.260 1.326
2 virginica 6.588 2.974 5.552 2.026
Missing data¶
Add some missing data to the previous table users:
df = users.copy()
df.loc[[0, 2], "age"] = None
df.loc[[1, 3], "gender"] = None
1. Write a function fillmissing_with_mean(df) that fills all missing values of numerical columns with the mean of the current column.
2. Save the original users and “imputed” frame in a single Excel file “users.xlsx” with 2 sheets: original, imputed.