Pandas: data manipulation

It is often said that 80% of data analysis is spent on cleaning and preparing data. This section therefore focuses on a small, but important, aspect of that work: data manipulation and cleaning with Pandas.

Data structures

  • Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call pd.Series([1,3,5,np.nan,6,8])

  • DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the R data.frame() object.

import pandas as pd
import numpy as np
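
To make the notion of index concrete, here is a minimal Series sketch; the labels 'a', 'b', 'c' are arbitrary illustrative choices:

s = pd.Series([1, 3, 5, np.nan, 6, 8])                  # default integer index 0..5
s2 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])  # explicit axis labels
print(s2['b'])                                          # access by label: 2.0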

Create DataFrame

columns = ['name', 'age', 'gender', 'job']

user1 = pd.DataFrame([['alice', 19, "F", "student"],
                      ['john', 26, "M", "student"]],
                     columns=columns)

user2 = pd.DataFrame([['eric', 22, "M", "student"],
                      ['paul', 58, "F", "manager"]],
                     columns=columns)

user3 = pd.DataFrame(dict(name=['peter', 'julie'],
                          age=[33, 44], gender=['M', 'F'],
                          job=['engineer', 'scientist']))

print(user3)
    name  age gender        job
0  peter   33      M   engineer
1  julie   44      F  scientist

Combining DataFrames

Concatenate DataFrame

Concatenate columns (axis = 1).

height = pd.DataFrame(dict(height=[1.65, 1.8]))
print(user1, "\n", height)

print(pd.concat([user1, height], axis=1))
    name  age gender      job
0  alice   19      F  student
1   john   26      M  student
    height
0    1.65
1    1.80
    name  age gender      job  height
0  alice   19      F  student    1.65
1   john   26      M  student    1.80

Concatenate rows (default: axis = 0)

users = pd.concat([user1, user2, user3])
print(users)
    name  age gender        job
0  alice   19      F    student
1   john   26      M    student
0   eric   22      M    student
1   paul   58      F    manager
0  peter   33      M   engineer
1  julie   44      F  scientist
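
The index values 0 and 1 repeat because each source frame keeps its own index. If the index is meaningless, the standard ignore_index=True option of pd.concat rebuilds a fresh 0..n-1 index, as in this sketch:

print(pd.concat([user1, user2, user3], ignore_index=True))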

Join DataFrame

user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'],
                          height=[165, 180, 175, 171]))
print(user4)
    name  height
0  alice     165
1   john     180
2   eric     175
3  julie     171

Use intersection of keys from both frames

merge_inter = pd.merge(users, user4)
print(merge_inter)
    name  age gender        job  height
0  alice   19      F    student     165
1   john   26      M    student     180
2   eric   22      M    student     175
3  julie   44      F  scientist     171

Use union of keys from both frames

users = pd.merge(users, user4, on="name", how='outer')
print(users)
    name  age gender        job  height
0  alice   19      F    student   165.0
1   eric   22      M    student   175.0
2   john   26      M    student   180.0
3  julie   44      F  scientist   171.0
4   paul   58      F    manager     NaN
5  peter   33      M   engineer     NaN
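
Other join types follow the same pattern. A minimal sketch with the frames defined above, keeping all keys of the left (resp. right) frame:

print(pd.merge(user1, user4, on="name", how='left'))   # all rows of user1
print(pd.merge(user1, user4, on="name", how='right'))  # all rows of user4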

Reshaping by pivoting

melt “unpivots” a DataFrame from wide format to long (stacked) format.

stacked = pd.melt(users, id_vars="name", var_name="variable", value_name="value")
print(stacked)
     name variable      value
0   alice      age         19
1    eric      age         22
2    john      age         26
3   julie      age         44
4    paul      age         58
5   peter      age         33
6   alice   gender          F
7    eric   gender          M
8    john   gender          M
9   julie   gender          F
10   paul   gender          F
11  peter   gender          M
12  alice      job    student
13   eric      job    student
14   john      job    student
15  julie      job  scientist
16   paul      job    manager
17  peter      job   engineer
18  alice   height      165.0
19   eric   height      175.0
20   john   height      180.0
21  julie   height      171.0
22   paul   height        NaN
23  peter   height        NaN

pivot “pivots” a DataFrame from long (stacked) format back to wide format.

wide = stacked.pivot(index='name', columns='variable', values='value')
print(wide)
variable age gender height        job
name
alice     19      F  165.0    student
eric      22      M  175.0    student
john      26      M  180.0    student
julie     44      F  171.0  scientist
paul      58      F    NaN    manager
peter     33      M    NaN   engineer
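
One caveat: after the melt/pivot round-trip the columns hold object dtype, because the single value column mixed numbers and strings. A sketch of one possible fix with infer_objects():

wide = wide.infer_objects()  # convert 'age' and 'height' back to numeric dtypes
print(wide.dtypes)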

Summarizing

users                   # print the first 30 and last 30 rows
type(users)             # DataFrame
users.head()            # print the first 5 rows
users.tail()            # print the last 5 rows
    name  age gender        job  height
1   eric   22      M    student   175.0
2   john   26      M    student   180.0
3  julie   44      F  scientist   171.0
4   paul   58      F    manager     NaN
5  peter   33      M   engineer     NaN


Meta-information

users.columns           # Column names
users.index             # Row names
users.shape             # number of rows and columns
users.dtypes            # data types of each column
users.values            # underlying numpy array
array([['alice', 19, 'F', 'student', 165.0],
       ['eric', 22, 'M', 'student', 175.0],
       ['john', 26, 'M', 'student', 180.0],
       ['julie', 44, 'F', 'scientist', 171.0],
       ['paul', 58, 'F', 'manager', nan],
       ['peter', 33, 'M', 'engineer', nan]], dtype=object)

Columns selection

print(users.columns)

users['gender']         # select one column
type(users['gender'])   # Series
users.gender            # select one column with attribute access

# select multiple columns
users[['age', 'gender']]        # select two columns
my_cols = ['age', 'gender']     # or, create a list...
users[my_cols]                  # ...and use that list to select columns
type(users[my_cols])            # DataFrame
Index(['name', 'age', 'gender', 'job', 'height'], dtype='object')

iloc is strictly integer position based

users.iloc[:, 2] # select third column
0    F
1    M
2    M
3    F
4    F
5    M
Name: gender, dtype: object

Rows selection (basic)

iloc is strictly integer position based

df = users.copy()
df.iloc[0]     # first row
df.iloc[0, :]  # first row
df.iloc[[0, 1], :]  # first two rows

df.iloc[0, 0]  # first item of first row
df.iloc[0, 0] = 55

loc is primarily label based: integer access works here because the row labels are integers.

df.loc[0]         # first row
df.loc[0, :]      # first row
df.loc[0, "age"]  # age item of first row
df.loc[0, "age"] = 55

Selection and index

Select females into a new DataFrame

df = users[users.gender == "F"]
print(df)
    name  age gender        job  height
0  alice   19      F    student   165.0
3  julie   44      F  scientist   171.0
4   paul   58      F    manager     NaN

Reset index, useful when index is meaningless

df = df.reset_index(drop=True)  # Watch the index
print(df)
    name  age gender        job  height
0  alice   19      F    student   165.0
1  julie   44      F  scientist   171.0
2   paul   58      F    manager     NaN

Rows iteration

df = users[:2].copy()

iterrows(): slow, get series, read-only

  • Returns (index, Series) pairs.

  • Slow because iterrows boxes the data into a Series.

  • Retrieve fields with column name

  • Don’t modify something you are iterating over. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

for idx, row in df.iterrows():
    print(row["name"], row["age"])
alice 19
eric 22

itertuples(): fast, get namedtuples, read-only

  • Returns namedtuples of the values, which is generally faster than iterrows.

  • Fast, because itertuples does not box the data into a Series.

  • Retrieve fields by integer position: tup[0] is the row index, data fields start at 1.

  • Column names will be renamed to positional names if they are invalid Python identifiers.

for tup in df.itertuples():
    print(tup[1], tup[2])
alice 19
eric 22
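
Since itertuples() yields namedtuples, fields whose column names are valid Python identifiers can also be retrieved by attribute; a minimal sketch:

for tup in df.itertuples():
    print(tup.name, tup.age)  # same output as above, by attribute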

iterate using loc[i, …]: read and write

for i in range(df.shape[0]):
    df.loc[i, "age"] *= 10  # df is modified

Rows selection (filtering)

Simple logical filtering on numerical values

users[users.age < 20]        # only show users with age < 20
young_bool = users.age < 20  # or, create a Series of booleans...
young = users[young_bool]            # ...and use that Series to filter rows
users[users.age < 20].job    # select one column from the filtered results
print(young)
    name  age gender      job  height
0  alice   19      F  student   165.0

Simple logical filtering on categorical values

users[users.job == 'student']
users[users.job.isin(['student', 'engineer'])]
users[users['job'].str.contains("stu|scient")]
    name  age gender        job  height
0  alice   19      F    student   165.0
1   eric   22      M    student   175.0
2   john   26      M    student   180.0
3  julie   44      F  scientist   171.0


Advanced logical filtering

users[users.age < 20][['age', 'job']]            # select multiple columns
users[(users.age > 20) & (users.gender == 'M')]  # use multiple conditions
    name  age gender       job  height
1   eric   22      M   student   175.0
2   john   26      M   student   180.0
5  peter   33      M  engineer     NaN


Sorting

df = users.copy()

df.age.sort_values()                      # only works for a Series
df.sort_values(by='age')                  # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age'])         # sort by multiple columns
df.sort_values(by=['job', 'age'], inplace=True) # modify df

print(df)
    name  age gender        job  height
5  peter   33      M   engineer     NaN
4   paul   58      F    manager     NaN
3  julie   44      F  scientist   171.0
0  alice   19      F    student   165.0
1   eric   22      M    student   175.0
2   john   26      M    student   180.0

Descriptive statistics

Summarize all numeric columns

print(df.describe())
             age      height
count   6.000000    4.000000
mean   33.666667  172.750000
std    14.895189    6.344289
min    19.000000  165.000000
25%    23.000000  169.500000
50%    29.500000  173.000000
75%    41.250000  176.250000
max    58.000000  180.000000

Summarize all columns

print(df.describe(include='all'))
print(df.describe(include=['object']))  # limit to one (or more) types
         name        age gender      job      height
count       6   6.000000      6        6    4.000000
unique      6        NaN      2        4         NaN
top     peter        NaN      M  student         NaN
freq        1        NaN      3        3         NaN
mean      NaN  33.666667    NaN      NaN  172.750000
std       NaN  14.895189    NaN      NaN    6.344289
min       NaN  19.000000    NaN      NaN  165.000000
25%       NaN  23.000000    NaN      NaN  169.500000
50%       NaN  29.500000    NaN      NaN  173.000000
75%       NaN  41.250000    NaN      NaN  176.250000
max       NaN  58.000000    NaN      NaN  180.000000
         name gender      job
count       6      6        6
unique      6      2        4
top     peter      M  student
freq        1      3        3

Categorical columns: count and proportions of values

df['job'].value_counts()
df['job'].value_counts(normalize=True).round(2)
job
student      0.50
engineer     0.17
manager      0.17
scientist    0.17
Name: proportion, dtype: float64

Categorical columns: length of strings

df['job'].str.len()
5    8
4    7
3    9
0    7
1    7
2    7
Name: job, dtype: int64

Statistics per group (groupby)

print(df.groupby("job")["age"].mean())
# print(df.groupby("job").describe(include='all'))
job
engineer     33.000000
manager      58.000000
scientist    44.000000
student      22.333333
Name: age, dtype: float64

Groupby in a loop

for grp, data in df.groupby("job"):
    print(grp, data)
engineer     name  age gender       job  height
5  peter   33      M  engineer     NaN
manager    name  age gender      job  height
4  paul   58      F  manager     NaN
scientist     name  age gender        job  height
3  julie   44      F  scientist   171.0
student     name  age gender      job  height
0  alice   19      F  student   165.0
1   eric   22      M  student   175.0
2   john   26      M  student   180.0
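
Several statistics per group can be computed in one call with agg; a minimal sketch (the list of statistics is an arbitrary choice):

print(df.groupby("job")["age"].agg(['mean', 'std', 'count']))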

Quality check

Remove duplicate data

df = users.copy()

# Create a duplicate: append the first row at the end
df.loc[len(df.index)] = users.iloc[0]

print(df.duplicated())                 # Series of booleans
# (True if a row is identical to a previous row)
df.duplicated().sum()                  # count of duplicates
df[df.duplicated()]                    # only show duplicates
df.age.duplicated()                    # check a single column for duplicates
df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates
df = df.drop_duplicates()              # drop duplicate rows
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Missing data

# Missing values are often just excluded
df = users.copy()

df.describe(include='all')

# find missing values in a Series
df.height.isnull()           # True if NaN, False otherwise
df.height.notnull()          # False if NaN, True otherwise
df[df.height.notnull()]      # only show rows where height is not NaN
df.height.isnull().sum()     # count the missing values

# find missing values in a DataFrame
df.isnull()             # DataFrame of booleans
df.isnull().sum()       # calculate the sum of each column
name      0
age       0
gender    0
job       0
height    2
dtype: int64

Strategy 1: drop missing values

df.dropna()             # drop a row if ANY values are missing
df.dropna(how='all')    # drop a row only if ALL values are missing
    name  age gender        job  height
0  alice   19      F    student   165.0
1   eric   22      M    student   175.0
2   john   26      M    student   180.0
3  julie   44      F  scientist   171.0
4   paul   58      F    manager     NaN
5  peter   33      M   engineer     NaN


Strategy 2: fill in missing values

df = users.copy()
df.height.mean()   # mean of the non-missing values
df.loc[df.height.isnull(), "height"] = df["height"].mean()

print(df)
    name  age gender        job  height
0  alice   19      F    student  165.00
1   eric   22      M    student  175.00
2   john   26      M    student  180.00
3  julie   44      F  scientist  171.00
4   paul   58      F    manager  172.75
5  peter   33      M   engineer  172.75
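
The same imputation can be written more compactly with fillna; an equivalent sketch on a fresh copy (df2 is a hypothetical name):

df2 = users.copy()
df2["height"] = df2["height"].fillna(df2["height"].mean())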

Operation: multiplication

Element-wise multiplication of a DataFrame by another object (here a Series), aligned on the index.

df = users.dropna()
df.insert(0, 'random', np.arange(df.shape[0]))
print(df)
df[["age", "height"]].multiply(df["random"], axis="index")
   random   name  age gender        job  height
0       0  alice   19      F    student   165.0
1       1   eric   22      M    student   175.0
2       2   john   26      M    student   180.0
3       3  julie   44      F  scientist   171.0
   age  height
0    0     0.0
1   22   175.0
2   52   360.0
3  132   513.0


Renaming

Rename columns

df = users.copy()
df.rename(columns={'name': 'NAME'})
    NAME  age gender        job  height
0  alice   19      F    student   165.0
1   eric   22      M    student   175.0
2   john   26      M    student   180.0
3  julie   44      F  scientist   171.0
4   paul   58      F    manager     NaN
5  peter   33      M   engineer     NaN


Rename values

df.job = df.job.map({'student': 'etudiant', 'manager': 'manager',
                     'engineer': 'ingenieur', 'scientist': 'scientific'})
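
Note that map returns NaN for any value absent from the mapping dict. If unmapped values should be kept as-is, replace is an alternative; a sketch on a fresh copy (df2 is a hypothetical name):

df2 = users.copy()
df2.job = df2.job.replace({'student': 'etudiant', 'engineer': 'ingenieur'})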

Dealing with outliers

size = pd.Series(np.random.normal(loc=175, size=20, scale=10))
# Corrupt the first 3 measures
size[:3] += 500

Based on parametric statistics: use the mean

Assume the random variable follows the normal distribution. Exclude data outside 3 standard deviations:

  • Probability that a sample lies within 1 sd: 68.27%

  • Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 × 15.73)

size_outlr_mean = size.copy()
size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean()
print(size_outlr_mean.mean())
248.48963819938044

Note that the corrupted samples inflate both the mean and the standard deviation so much that no sample exceeds the 3-sd threshold: nothing is replaced, and the mean remains severely biased.

Based on non-parametric statistics: use the median

The median absolute deviation (MAD), based on the median, is a robust non-parametric statistic. The factor 1.4826 scales the MAD so that it is a consistent estimator of the standard deviation for normally distributed data.

mad = 1.4826 * np.median(np.abs(size - size.median()))
size_outlr_mad = size.copy()

size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median()
print(size_outlr_mad.mean(), size_outlr_mad.median())
173.80000467192673 178.7023568870694

File I/O

csv

import tempfile, os.path

tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
other = pd.read_csv(csv_filename)

Read csv from url

url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)

Excel

Package openpyxl is required. To install type:

conda install -c conda-forge openpyxl

xls_filename = os.path.join(tmpdir, "users.xlsx")

# Write
users.to_excel(xls_filename, sheet_name='users', index=False)

# Read
pd.read_excel(xls_filename, sheet_name='users')

# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
    users.to_excel(writer, sheet_name='users', index=False)
    df.to_excel(writer, sheet_name='salary', index=False)

pd.read_excel(xls_filename, sheet_name='users')
pd.read_excel(xls_filename, sheet_name='salary')
    name  age gender         job  height
0  alice   19      F    etudiant   165.0
1   eric   22      M    etudiant   175.0
2   john   26      M    etudiant   180.0
3  julie   44      F  scientific   171.0
4   paul   58      F     manager     NaN
5  peter   33      M   ingenieur     NaN
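
All sheets can also be loaded in one call: sheet_name=None makes read_excel return a dict of DataFrames keyed by sheet name, as in this sketch:

sheets = pd.read_excel(xls_filename, sheet_name=None)
print(sheets.keys())  # dict_keys(['users', 'salary'])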


SQL (SQLite)

import pandas as pd
import sqlite3

db_filename = os.path.join(tmpdir, "users.db")

Connect

conn = sqlite3.connect(db_filename)

Creating tables with pandas

url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)

salary.to_sql("salary", conn, if_exists="replace")
46

Push modifications

cur = conn.cursor()
values = (100, 14000, 5,  'Bachelor', 'N')
cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
conn.commit()

Reading results into a pandas DataFrame

salary_sql = pd.read_sql_query("select * from salary;", conn)
print(salary_sql.head())

pd.read_sql_query("select * from salary;", conn).tail()
pd.read_sql_query('select * from salary where salary>25000;', conn)
pd.read_sql_query('select * from salary where experience=16;', conn)
pd.read_sql_query('select * from salary where education="Master";', conn)
   index  salary  experience education management
0      0   13876           1  Bachelor          Y
1      1   11608           1      Ph.D          N
2      2   18701           1      Ph.D          Y
3      3   11283           1    Master          N
4      4   11767           1      Ph.D          N
    index  salary  experience education management
0       3   11283           1     Master          N
1       5   20872           2     Master          Y
2       6   11772           2     Master          N
3       9   12313           3     Master          N
4      11   21371           3     Master          Y
5      16   12884           4     Master          N
6      17   13245           5     Master          N
7      22   13839           6     Master          N
8      23   22884           6     Master          Y
9      25   14803           8     Master          N
10     30   15942          10     Master          N
11     32   23780          10     Master          Y
12     33   25410          11     Master          Y
13     35   16882          12     Master          N
14     38   26330          13     Master          Y
15     39   17949          14     Master          N
16     41   27837          16     Master          Y
17     42   18838          16     Master          N
18     44   19207          17     Master          N
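
Note that in standard SQL double quotes delimit identifiers, not strings; the last query above relies on SQLite's leniency. Passing values through the params argument avoids quoting issues altogether; a minimal sketch:

pd.read_sql_query("select * from salary where education=?", conn, params=("Master",))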


Exercises

Data Frame

  1. Read the iris dataset at https://github.com/duchesnay/pystatsml/raw/master/datasets/iris.csv

  2. Print column names

  3. Get numerical columns

  4. For each species compute the mean of numerical columns and store it in a stats table like:

      species  sepal_length  sepal_width  petal_length  petal_width
0      setosa         5.006        3.428         1.462        0.246
1  versicolor         5.936        2.770         4.260        1.326
2   virginica         6.588        2.974         5.552        2.026

Missing data

Add some missing data to the previous table users:

df = users.copy()
df.loc[[0, 2], "age"] = None
df.loc[[1, 3], "gender"] = None

1. Write a function fillmissing_with_mean(df) that fills all missing values of numerical columns with the mean of the corresponding column.

2. Save the original users and the “imputed” frame in a single Excel file “users.xlsx” with 2 sheets: original, imputed.
