Pandas: data manipulation¶
It is often said that 80% of data analysis is spent on cleaning and preparing data. To get a handle on the problem, this section focuses on a small but important aspect of data manipulation and cleaning with Pandas.
Sources:
Kevin Markham: https://github.com/justmarkham
Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html
Data structures
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call pd.Series([1,3,5,np.nan,6,8])
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the R data.frame() object.
import pandas as pd
import numpy as np
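As a quick illustration of the Series constructor mentioned above (a minimal sketch; the variable name s is ours):
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)        # values paired with an automatic integer index (0 to 5)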
Create DataFrame¶
columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, "F", "student"],
['john', 26, "M", "student"]],
columns=columns)
user2 = pd.DataFrame([['eric', 22, "M", "student"],
['paul', 58, "F", "manager"]],
columns=columns)
user3 = pd.DataFrame(dict(name=['peter', 'julie'],
age=[33, 44], gender=['M', 'F'],
job=['engineer', 'scientist']))
print(user3)
Out:
name age gender job
0 peter 33 M engineer
1 julie 44 F scientist
Combining DataFrames¶
Concatenate DataFrame¶
Concatenate columns (axis = 1).
height = pd.DataFrame(dict(height=[1.65, 1.8]))
print(user1, "\n", height)
print(pd.concat([user1, height], axis=1))
Out:
name age gender job
0 alice 19 F student
1 john 26 M student
height
0 1.65
1 1.80
name age gender job height
0 alice 19 F student 1.65
1 john 26 M student 1.80
Concatenate rows (default: axis = 0)
users = pd.concat([user1, user2, user3])
print(users)
Out:
name age gender job
0 alice 19 F student
1 john 26 M student
0 eric 22 M student
1 paul 58 F manager
0 peter 33 M engineer
1 julie 44 F scientist
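Note the repeated index values (0, 1, 0, 1, ...): concat keeps each frame's original index. As an aside (leaving users unchanged), ignore_index=True renumbers the rows:
print(pd.concat([user1, user2, user3], ignore_index=True))  # index becomes 0..5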
Concatenate rows: append (removed in pandas 2.0; use pd.concat instead)
pd.concat([user1, user2])
Join DataFrame¶
user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'],
height=[165, 180, 175, 171]))
print(user4)
Out:
name height
0 alice 165
1 john 180
2 eric 175
3 julie 171
Merge on the intersection of keys from both frames (inner join, the default)
merge_inter = pd.merge(users, user4)
print(merge_inter)
Out:
name age gender job height
0 alice 19 F student 165
1 john 26 M student 180
2 eric 22 M student 175
3 julie 44 F scientist 171
Use the union of keys from both frames (outer join)
users = pd.merge(users, user4, on="name", how='outer')
print(users)
Out:
name age gender job height
0 alice 19 F student 165.0
1 john 26 M student 180.0
2 eric 22 M student 175.0
3 paul 58 F manager NaN
4 peter 33 M engineer NaN
5 julie 44 F scientist 171.0
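Other join types follow the same pattern; for instance, a sketch of a left join, which keeps every row of the left frame (using user1 and user4 from above):
pd.merge(user1, user4, on="name", how='left')  # alice and john, with heights filled from user4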
Reshaping by pivoting¶
“Unpivots” a DataFrame from wide format to long (stacked) format:
stacked = pd.melt(users, id_vars="name", var_name="variable", value_name="value")
print(stacked)
Out:
name variable value
0 alice age 19
1 john age 26
2 eric age 22
3 paul age 58
4 peter age 33
5 julie age 44
6 alice gender F
7 john gender M
8 eric gender M
9 paul gender F
10 peter gender M
11 julie gender F
12 alice job student
13 john job student
14 eric job student
15 paul job manager
16 peter job engineer
17 julie job scientist
18 alice height 165
19 john height 180
20 eric height 175
21 paul height NaN
22 peter height NaN
23 julie height 171
“Pivots” a DataFrame from long (stacked) format back to wide format:
print(stacked.pivot(index='name', columns='variable', values='value'))
Out:
variable age gender height job
name
alice 19 F 165 student
eric 22 M 175 student
john 26 M 180 student
julie 44 F 171 scientist
paul 58 F NaN manager
peter 33 M NaN engineer
Summarizing¶
users # print the DataFrame (only the first and last 30 rows for large frames)
type(users) # DataFrame
users.head() # print the first 5 rows
users.tail() # print the last 5 rows
Descriptive statistics
users.describe(include="all")
Meta-information
users.index # "Row names"
users.columns # column names
users.dtypes # data types of each column
users.values # underlying numpy array
users.shape # number of rows and columns
Out:
(6, 5)
Columns selection¶
users['gender'] # select one column
type(users['gender']) # Series
users.gender # select one column using attribute access
# select multiple columns
users[['age', 'gender']] # select two columns
my_cols = ['age', 'gender'] # or, create a list...
users[my_cols] # ...and use that list to select columns
type(users[my_cols]) # DataFrame
Rows selection (basic)¶
iloc is strictly integer position based
df = users.copy()
df.iloc[0] # first row
df.iloc[0, :] # first row
df.iloc[0, 0] # first item of first row
df.iloc[0, 0] = 55
loc is label based; here the row labels happen to be integers.
df.loc[0] # first row
df.loc[0, :] # first row
df.loc[0, "age"] # age item of first row
df.loc[0, "age"] = 55
Selection and index
Select females into a new DataFrame
df = users[users.gender == "F"]
print(df)
Out:
name age gender job height
0 alice 19 F student 165.0
3 paul 58 F manager NaN
5 julie 44 F scientist 171.0
Get the first two rows using iloc (strictly positional)
df.iloc[[0, 1], :] # Ok, but watch the index: 0, 3
Using loc fails here, because label 1 is not in the index (0, 3, 5)
try:
    df.loc[[0, 1], :]  # fails: label-based lookup with a missing label
except KeyError as err:
    print(err)
Out:
"Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([1], dtype='int64'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
Reset index
df = df.reset_index(drop=True) # Watch the index
print(df)
print(df.loc[[0, 1], :])
Out:
name age gender job height
0 alice 19 F student 165.0
1 paul 58 F manager NaN
2 julie 44 F scientist 171.0
name age gender job height
0 alice 19 F student 165.0
1 paul 58 F manager NaN
Rows iteration¶
df = users[:2].copy()
iterrows(): slow, get series, read-only
Returns (index, Series) pairs.
Slow because iterrows boxes the data into a Series.
Retrieve fields with column name
Don’t modify something you are iterating over. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
for idx, row in df.iterrows():
    print(row["name"], row["age"])
Out:
alice 19
john 26
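To illustrate the warning above: with this frame's mixed dtypes each row is a copy, so writing to it has no effect on df (a small sketch):
for idx, row in df.iterrows():
    row["age"] = 0    # modifies the temporary Series only
print(df["age"])      # ages are unchanged (19, 26)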
itertuples(): fast, get namedtuples, read-only
Returns namedtuples of the values, which is generally faster than iterrows.
Fast, because itertuples does not box the data into a Series.
Retrieve fields by integer position; position 0 holds the index.
Column names are replaced by positional names if they are not valid Python identifiers.
for tup in df.itertuples():
    print(tup[1], tup[2])
Out:
alice 19
john 26
Iterate using loc[i, ...]: read and write
for i in range(df.shape[0]):
    df.loc[i, "age"] *= 10  # df is modified
Rows selection (filtering)¶
Simple logical filtering on numerical values
users[users.age < 20] # only show users with age < 20
young_bool = users.age < 20 # or, create a Series of booleans...
young = users[young_bool] # ...and use that Series to filter rows
users[users.age < 20].job # select one column from the filtered results
print(young)
Out:
name age gender job height
0 alice 19 F student 165.0
Simple logical filtering on categorical values
users[users.job == 'student']
users[users.job.isin(['student', 'engineer'])]
users[users['job'].str.contains("stu|scient")]  # regular-expression match
Advanced logical filtering
users[users.age < 20][['age', 'job']] # select multiple columns
users[(users.age > 20) & (users.gender == 'M')] # use multiple conditions (parentheses required: & binds tighter than comparisons)
Sorting¶
df = users.copy()
df.age.sort_values() # only works for a Series
df.sort_values(by='age') # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age']) # sort by multiple columns
df.sort_values(by=['job', 'age'], inplace=True) # modify df
print(df)
Out:
name age gender job height
4 peter 33 M engineer NaN
3 paul 58 F manager NaN
5 julie 44 F scientist 171.0
0 alice 19 F student 165.0
2 eric 22 M student 175.0
1 john 26 M student 180.0
Descriptive statistics¶
Summarize all numeric columns
print(df.describe())
Out:
age height
count 6.000000 4.000000
mean 33.666667 172.750000
std 14.895189 6.344289
min 19.000000 165.000000
25% 23.000000 169.500000
50% 29.500000 173.000000
75% 41.250000 176.250000
max 58.000000 180.000000
Summarize all columns
print(df.describe(include='all'))
print(df.describe(include=['object'])) # limit to one (or more) types
Out:
name age gender job height
count 6 6.000000 6 6 4.000000
unique 6 NaN 2 4 NaN
top peter NaN F student NaN
freq 1 NaN 3 3 NaN
mean NaN 33.666667 NaN NaN 172.750000
std NaN 14.895189 NaN NaN 6.344289
min NaN 19.000000 NaN NaN 165.000000
25% NaN 23.000000 NaN NaN 169.500000
50% NaN 29.500000 NaN NaN 173.000000
75% NaN 41.250000 NaN NaN 176.250000
max NaN 58.000000 NaN NaN 180.000000
name gender job
count 6 6 6
unique 6 2 4
top peter F student
freq 1 3 3
Statistics per group (groupby)
print(df.groupby("job").mean())
print(df.groupby("job")["age"].mean())
print(df.groupby("job").describe(include='all'))
Out:
age height
job
engineer 33.000000 NaN
manager 58.000000 NaN
scientist 44.000000 171.000000
student 22.333333 173.333333
job
engineer 33.000000
manager 58.000000
scientist 44.000000
student 22.333333
Name: age, dtype: float64
name ... height
count unique top freq mean std ... std min 25% 50% 75% max
job ...
engineer 1 1 peter 1 NaN NaN ... NaN NaN NaN NaN NaN NaN
manager 1 1 paul 1 NaN NaN ... NaN NaN NaN NaN NaN NaN
scientist 1 1 julie 1 NaN NaN ... NaN 171.0 171.0 171.0 171.0 171.0
student 3 3 eric 1 NaN NaN ... 7.637626 165.0 170.0 175.0 177.5 180.0
[4 rows x 44 columns]
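Different statistics per column can be combined with agg; the column-to-function mapping below is our own illustrative choice:
print(df.groupby("job").agg({'age': 'mean', 'height': 'max'}))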
Groupby in a loop
for grp, data in df.groupby("job"):
    print(grp, data)
Out:
engineer name age gender job height
4 peter 33 M engineer NaN
manager name age gender job height
3 paul 58 F manager NaN
scientist name age gender job height
5 julie 44 F scientist 171.0
student name age gender job height
0 alice 19 F student 165.0
2 eric 22 M student 175.0
1 john 26 M student 180.0
Quality check¶
Remove duplicate data¶
df = pd.concat([users, users.iloc[[0]]], ignore_index=True)  # duplicate the first row (DataFrame.append was removed in pandas 2.0)
print(df.duplicated()) # Series of booleans
# (True if a row is identical to a previous row)
df.duplicated().sum() # count of duplicates
df[df.duplicated()] # only show duplicates
df.age.duplicated() # check a single column for duplicates
df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates
df = df.drop_duplicates() # drop duplicate rows
Out:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
Missing data¶
# Missing values are often just excluded
df = users.copy()
df.describe(include='all')
# find missing values in a Series
df.height.isnull() # True if NaN, False otherwise
df.height.notnull() # False if NaN, True otherwise
df[df.height.notnull()] # only show rows where height is not NaN
df.height.isnull().sum() # count the missing values
# find missing values in a DataFrame
df.isnull() # DataFrame of booleans
df.isnull().sum() # calculate the sum of each column
Out:
name 0
age 0
gender 0
job 0
height 2
dtype: int64
Strategy 1: drop missing values
df.dropna() # drop a row if ANY values are missing
df.dropna(how='all') # drop a row only if ALL values are missing
Strategy 2: fill in missing values
df.height.mean()
df = users.copy()
df.loc[df.height.isnull(), "height"] = df["height"].mean()
print(df)
Out:
name age gender job height
0 alice 19 F student 165.00
1 john 26 M student 180.00
2 eric 22 M student 175.00
3 paul 58 F manager 172.75
4 peter 33 M engineer 172.75
5 julie 44 F scientist 171.00
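The same imputation can be written with fillna, an equivalent and more idiomatic form:
df = users.copy()
df["height"] = df["height"].fillna(df["height"].mean())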
Operation: multiplication¶
Element-wise multiplication of a DataFrame with another object (here, a Series aligned on the index)
df = users.dropna()
df.insert(0, 'random', np.arange(df.shape[0]))  # insert a counter column at position 0
print(df)
df[["age", "height"]].multiply(df["random"], axis="index")
Out:
random name age gender job height
0 0 alice 19 F student 165.0
1 1 john 26 M student 180.0
2 2 eric 22 M student 175.0
5 3 julie 44 F scientist 171.0
Renaming¶
Rename columns
df = users.copy()
df.rename(columns={'name': 'NAME'})  # returns a renamed copy; df itself is unchanged
Rename values (values absent from the mapping become NaN)
df.job = df.job.map({'student': 'etudiant', 'manager': 'manager',
'engineer': 'ingenieur', 'scientist': 'scientific'})
Dealing with outliers¶
size = pd.Series(np.random.normal(loc=175, size=20, scale=10))
# Corrupt the first 3 measures
size[:3] += 500
Based on parametric statistics: use the mean¶
Assume the random variable follows a normal distribution and exclude data outside 3 standard deviations:
- Probability that a sample lies within 1 sd: 68.27%
- Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 * 15.73)
size_outlr_mean = size.copy()
size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean()
print(size_outlr_mean.mean())
Out:
248.48963819938044
The mean remains corrupted: the outliers inflate both the mean and the standard deviation, so the 3-sd rule fails to exclude them.
Based on non-parametric statistics: use the median¶
Median absolute deviation (MAD), based on the median, is a robust non-parametric statistic. https://en.wikipedia.org/wiki/Median_absolute_deviation
mad = 1.4826 * np.median(np.abs(size - size.median()))  # 1.4826 makes MAD consistent with the std of a normal distribution
size_outlr_mad = size.copy()
size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median()
print(size_outlr_mad.mean(), size_outlr_mad.median())
Out:
173.80000467192673 178.7023568870694
File I/O¶
CSV¶
import tempfile, os.path
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
other = pd.read_csv(csv_filename)
Read CSV from a URL¶
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
Excel¶
xls_filename = os.path.join(tmpdir, "users.xlsx")
users.to_excel(xls_filename, sheet_name='users', index=False)
pd.read_excel(xls_filename, sheet_name='users')
# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
    users.to_excel(writer, sheet_name='users', index=False)
    df.to_excel(writer, sheet_name='salary', index=False)
pd.read_excel(xls_filename, sheet_name='users')
pd.read_excel(xls_filename, sheet_name='salary')
SQL (SQLite)¶
import pandas as pd
import sqlite3
db_filename = os.path.join(tmpdir, "users.db")
Connect
conn = sqlite3.connect(db_filename)
Creating tables with pandas
url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
salary = pd.read_csv(url)
salary.to_sql("salary", conn, if_exists="replace")
Push modifications
cur = conn.cursor()
values = (100, 14000, 5, 'Bachelor', 'N')
cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
conn.commit()
Reading results into a pandas DataFrame
salary_sql = pd.read_sql_query("select * from salary;", conn)
print(salary_sql.head())
pd.read_sql_query("select * from salary;", conn).tail()
pd.read_sql_query('select * from salary where salary>25000;', conn)
pd.read_sql_query('select * from salary where experience=16;', conn)
pd.read_sql_query("select * from salary where education='Master';", conn)  # standard SQL string literals use single quotes
Out:
index salary experience education management
0 0 13876 1 Bachelor Y
1 1 11608 1 Ph.D N
2 2 18701 1 Ph.D Y
3 3 11283 1 Master N
4 4 11767 1 Ph.D N
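Query parameters can also be passed safely through the params argument (using sqlite3's qmark placeholders):
pd.read_sql_query("select * from salary where education=?;", conn, params=('Master',))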
Exercises¶
Data Frame¶
Read the iris dataset at ‘https://github.com/neurospin/pystatsml/tree/master/datasets/iris.csv’ (with pd.read_csv, use the corresponding raw file URL)
Print column names
Get numerical columns
For each species, compute the mean of the numerical columns and store it in a stats table like:
species sepal_length sepal_width petal_length petal_width
0 setosa 5.006 3.428 1.462 0.246
1 versicolor 5.936 2.770 4.260 1.326
2 virginica 6.588 2.974 5.552 2.026
Missing data¶
Add some missing data to the previous table users:
df = users.copy()
df.loc[[0, 2], "age"] = None
df.loc[[1, 3], "gender"] = None
1. Write a function fillmissing_with_mean(df) that fills all missing values of numerical columns with the mean of the current column.
2. Save the original users and “imputed” frame in a single Excel file “users.xlsx” with 2 sheets: original, imputed.