.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_gallery/scipy_pandas.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_gallery_scipy_pandas.py: Pandas: data manipulation ========================= It is often said that 80% of data analysis is spent on the cleaning and small, but important, aspect of data manipulation and cleaning with Pandas. **Sources**: - Kevin Markham: https://github.com/justmarkham - Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html **Data structures** - **Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call `pd.Series([1,3,5,np.nan,6,8])` - **DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the `R data.frame()` object. .. GENERATED FROM PYTHON SOURCE LINES 25-29 .. code-block:: default import pandas as pd import numpy as np .. GENERATED FROM PYTHON SOURCE LINES 30-32 Create DataFrame ---------------- .. GENERATED FROM PYTHON SOURCE LINES 32-49 .. code-block:: default columns = ['name', 'age', 'gender', 'job'] user1 = pd.DataFrame([['alice', 19, "F", "student"], ['john', 26, "M", "student"]], columns=columns) user2 = pd.DataFrame([['eric', 22, "M", "student"], ['paul', 58, "F", "manager"]], columns=columns) user3 = pd.DataFrame(dict(name=['peter', 'julie'], age=[33, 44], gender=['M', 'F'], job=['engineer', 'scientist'])) print(user3) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job 0 peter 33 M engineer 1 julie 44 F scientist .. GENERATED FROM PYTHON SOURCE LINES 50-52 Combining DataFrames -------------------- .. GENERATED FROM PYTHON SOURCE LINES 54-56 Concatenate DataFrame ~~~~~~~~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 58-59 Concatenate columns (axis = 1). .. GENERATED FROM PYTHON SOURCE LINES 59-65 .. code-block:: default height = pd.DataFrame(dict(height=[1.65, 1.8])) print(user1, "\n", height) print(pd.concat([user1, height], axis=1)) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job 0 alice 19 F student 1 john 26 M student height 0 1.65 1 1.80 name age gender job height 0 alice 19 F student 1.65 1 john 26 M student 1.80 .. GENERATED FROM PYTHON SOURCE LINES 66-67 Concatenate rows (default: axis = 0) .. GENERATED FROM PYTHON SOURCE LINES 67-71 .. code-block:: default users = pd.concat([user1, user2, user3]) print(users) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job 0 alice 19 F student 1 john 26 M student 0 eric 22 M student 1 paul 58 F manager 0 peter 33 M engineer 1 julie 44 F scientist .. GENERATED FROM PYTHON SOURCE LINES 72-73 Concatenate rows: append .. GENERATED FROM PYTHON SOURCE LINES 73-77 .. code-block:: default user1.append(user2) .. raw:: html

	name	age	gender	job
0	alice	19	F	student
1	john	26	M	student
0	eric	22	M	student
1	paul	58	F	manager

.. GENERATED FROM PYTHON SOURCE LINES 78-80 Join DataFrame ~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 80-86 .. code-block:: default user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'], height=[165, 180, 175, 171])) print(user4) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name height 0 alice 165 1 john 180 2 eric 175 3 julie 171 .. GENERATED FROM PYTHON SOURCE LINES 87-88 Use intersection of keys from both frames .. GENERATED FROM PYTHON SOURCE LINES 88-94 .. code-block:: default merge_inter = pd.merge(users, user4) print(merge_inter) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job height 0 alice 19 F student 165 1 john 26 M student 180 2 eric 22 M student 175 3 julie 44 F scientist 171 .. GENERATED FROM PYTHON SOURCE LINES 95-96 Use union of keys from both frames .. GENERATED FROM PYTHON SOURCE LINES 96-101 .. code-block:: default users = pd.merge(users, user4, on="name", how='outer') print(users) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job height 0 alice 19 F student 165.0 1 john 26 M student 180.0 2 eric 22 M student 175.0 3 paul 58 F manager NaN 4 peter 33 M engineer NaN 5 julie 44 F scientist 171.0 .. GENERATED FROM PYTHON SOURCE LINES 102-106 Reshaping by pivoting ~~~~~~~~~~~~~~~~~~~~~ “Unpivots” a DataFrame from wide format to long (stacked) format, .. GENERATED FROM PYTHON SOURCE LINES 106-111 .. code-block:: default staked = pd.melt(users, id_vars="name", var_name="variable", value_name="value") print(staked) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name variable value 0 alice age 19 1 john age 26 2 eric age 22 3 paul age 58 4 peter age 33 5 julie age 44 6 alice gender F 7 john gender M 8 eric gender M 9 paul gender F 10 peter gender M 11 julie gender F 12 alice job student 13 john job student 14 eric job student 15 paul job manager 16 peter job engineer 17 julie job scientist 18 alice height 165 19 john height 180 20 eric height 175 21 paul height NaN 22 peter height NaN 23 julie height 171 .. GENERATED FROM PYTHON SOURCE LINES 112-113 “pivots” a DataFrame from long (stacked) format to wide format, .. GENERATED FROM PYTHON SOURCE LINES 113-117 .. code-block:: default print(staked.pivot(index='name', columns='variable', values='value')) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none variable age gender height job name alice 19 F 165 student eric 22 M 175 student john 26 M 180 student julie 44 F 171 scientist paul 58 F NaN manager peter 33 M NaN engineer .. GENERATED FROM PYTHON SOURCE LINES 118-121 Summarizing ----------- .. GENERATED FROM PYTHON SOURCE LINES 121-128 .. code-block:: default users # print the first 30 and last 30 rows type(users) # DataFrame users.head() # print the first 5 rows users.tail() # print the last 5 rows .. raw:: html

	name	age	gender	job	height
1	john	26	M	student	180.0
2	eric	22	M	student	175.0
3	paul	58	F	manager	NaN
4	peter	33	M	engineer	NaN
5	julie	44	F	scientist	171.0

.. GENERATED FROM PYTHON SOURCE LINES 129-130 Descriptive statistics .. GENERATED FROM PYTHON SOURCE LINES 130-133 .. code-block:: default users.describe(include="all") .. raw:: html

	name	age	gender	job	height
count	6	6.000000	6	6	4.000000
unique	6	NaN	2	4	NaN
top	julie	NaN	F	student	NaN
freq	1	NaN	3	3	NaN
mean	NaN	33.666667	NaN	NaN	172.750000
std	NaN	14.895189	NaN	NaN	6.344289
min	NaN	19.000000	NaN	NaN	165.000000
25%	NaN	23.000000	NaN	NaN	169.500000
50%	NaN	29.500000	NaN	NaN	173.000000
75%	NaN	41.250000	NaN	NaN	176.250000
max	NaN	58.000000	NaN	NaN	180.000000

.. GENERATED FROM PYTHON SOURCE LINES 134-135 Meta-information .. GENERATED FROM PYTHON SOURCE LINES 135-142 .. code-block:: default users.index # "Row names" users.columns # column names users.dtypes # data types of each column users.values # underlying numpy array users.shape # number of rows and columns .. rst-class:: sphx-glr-script-out Out: .. code-block:: none (6, 5) .. GENERATED FROM PYTHON SOURCE LINES 143-145 Columns selection ----------------- .. GENERATED FROM PYTHON SOURCE LINES 145-156 .. code-block:: default users['gender'] # select one column type(users['gender']) # Series users.gender # select one column using the DataFrame # select multiple columns users[['age', 'gender']] # select two columns my_cols = ['age', 'gender'] # or, create a list... users[my_cols] # ...and use that list to select columns type(users[my_cols]) # DataFrame .. GENERATED FROM PYTHON SOURCE LINES 157-159 Rows selection (basic) ---------------------- .. GENERATED FROM PYTHON SOURCE LINES 161-162 `iloc` is strictly integer position based .. GENERATED FROM PYTHON SOURCE LINES 162-169 .. code-block:: default df = users.copy() df.iloc[0] # first row df.iloc[0, :] # first row df.iloc[0, 0] # first item of first row df.iloc[0, 0] = 55 .. GENERATED FROM PYTHON SOURCE LINES 170-171 `loc` supports mixed integer and label based access. .. GENERATED FROM PYTHON SOURCE LINES 171-177 .. code-block:: default df.loc[0] # first row df.loc[0, :] # first row df.loc[0, "age"] # age item of first row df.loc[0, "age"] = 55 .. GENERATED FROM PYTHON SOURCE LINES 178-181 Selection and index Select females into a new DataFrame .. GENERATED FROM PYTHON SOURCE LINES 181-185 .. code-block:: default df = users[users.gender == "F"] print(df) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job height 0 alice 19 F student 165.0 3 paul 58 F manager NaN 5 julie 44 F scientist 171.0 .. GENERATED FROM PYTHON SOURCE LINES 186-187 Get the two first rows using `iloc` (strictly integer position) .. GENERATED FROM PYTHON SOURCE LINES 187-190 .. code-block:: default df.iloc[[0, 1], :] # Ok, but watch the index: 0, 3 .. raw:: html

	name	age	gender	job	height
0	alice	19	F	student	165.0
3	paul	58	F	manager	NaN

.. GENERATED FROM PYTHON SOURCE LINES 191-192 Use `loc` .. GENERATED FROM PYTHON SOURCE LINES 192-198 .. code-block:: default try: df.loc[[0, 1], :] # Failed except KeyError as err: print(err) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([1], dtype='int64'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike" .. GENERATED FROM PYTHON SOURCE LINES 199-200 Reset index .. GENERATED FROM PYTHON SOURCE LINES 200-206 .. code-block:: default df = df.reset_index(drop=True) # Watch the index print(df) print(df.loc[[0, 1], :]) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job height 0 alice 19 F student 165.0 1 paul 58 F manager NaN 2 julie 44 F scientist 171.0 name age gender job height 0 alice 19 F student 165.0 1 paul 58 F manager NaN .. GENERATED FROM PYTHON SOURCE LINES 207-209 Sorting ------- .. GENERATED FROM PYTHON SOURCE LINES 211-213 Rows iteration -------------- .. GENERATED FROM PYTHON SOURCE LINES 213-216 .. code-block:: default df = users[:2].copy() .. GENERATED FROM PYTHON SOURCE LINES 217-225 `iterrows()`: slow, get series, **read-only** - Returns (index, Series) pairs. - Slow because iterrows boxes the data into a Series. - Retrieve fields with column name - **Don't modify something you are iterating over**. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect. .. GENERATED FROM PYTHON SOURCE LINES 225-229 .. code-block:: default for idx, row in df.iterrows(): print(row["name"], row["age"]) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none alice 19 john 26 .. GENERATED FROM PYTHON SOURCE LINES 230-237 `itertuples()`: fast, get namedtuples, **read-only** - Returns namedtuples of the values and which is generally faster than iterrows. - Fast, because itertuples does not box the data into a Series. - Retrieve fields with integer index starting from 0. - Names will be renamed to positional names if they are invalid Python identifier .. GENERATED FROM PYTHON SOURCE LINES 237-242 .. code-block:: default for tup in df.itertuples(): print(tup[1], tup[2]) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none alice 19 john 26 .. GENERATED FROM PYTHON SOURCE LINES 243-244 iter using `loc[i, ...]`: read and **write** .. GENERATED FROM PYTHON SOURCE LINES 244-249 .. code-block:: default for i in range(df.shape[0]): df.loc[i, "age"] *= 10 # df is modified .. GENERATED FROM PYTHON SOURCE LINES 250-252 Rows selection (filtering) -------------------------- .. GENERATED FROM PYTHON SOURCE LINES 254-255 simple logical filtering on numerical values .. GENERATED FROM PYTHON SOURCE LINES 255-263 .. code-block:: default users[users.age < 20] # only show users with age < 20 young_bool = users.age < 20 # or, create a Series of booleans... young = users[young_bool] # ...and use that Series to filter rows users[users.age < 20].job # select one column from the filtered results print(young) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job height 0 alice 19 F student 165.0 .. GENERATED FROM PYTHON SOURCE LINES 264-265 simple logical filtering on categorial values .. GENERATED FROM PYTHON SOURCE LINES 265-271 .. code-block:: default users[users.job == 'student'] users[users.job.isin(['student', 'engineer'])] users[users['job'].str.contains("stu|scient")] .. raw:: html

	name	age	gender	job	height
0	alice	19	F	student	165.0
1	john	26	M	student	180.0
2	eric	22	M	student	175.0
5	julie	44	F	scientist	171.0

.. GENERATED FROM PYTHON SOURCE LINES 272-273 Advanced logical filtering .. GENERATED FROM PYTHON SOURCE LINES 273-278 .. code-block:: default users[users.age < 20][['age', 'job']] # select multiple columns users[(users.age > 20) & (users.gender == 'M')] # use multiple conditions .. raw:: html

	name	age	gender	job	height
1	john	26	M	student	180.0
2	eric	22	M	student	175.0
4	peter	33	M	engineer	NaN

.. GENERATED FROM PYTHON SOURCE LINES 279-281 Sorting ------- .. GENERATED FROM PYTHON SOURCE LINES 281-293 .. code-block:: default df = users.copy() df.age.sort_values() # only works for a Series df.sort_values(by='age') # sort rows by a specific column df.sort_values(by='age', ascending=False) # use descending order instead df.sort_values(by=['job', 'age']) # sort by multiple columns df.sort_values(by=['job', 'age'], inplace=True) # modify df print(df) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job height 4 peter 33 M engineer NaN 3 paul 58 F manager NaN 5 julie 44 F scientist 171.0 0 alice 19 F student 165.0 2 eric 22 M student 175.0 1 john 26 M student 180.0 .. GENERATED FROM PYTHON SOURCE LINES 294-298 Descriptive statistics ---------------------- Summarize all numeric columns .. GENERATED FROM PYTHON SOURCE LINES 298-301 .. code-block:: default print(df.describe()) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none age height count 6.000000 4.000000 mean 33.666667 172.750000 std 14.895189 6.344289 min 19.000000 165.000000 25% 23.000000 169.500000 50% 29.500000 173.000000 75% 41.250000 176.250000 max 58.000000 180.000000 .. GENERATED FROM PYTHON SOURCE LINES 302-303 Summarize all columns .. GENERATED FROM PYTHON SOURCE LINES 303-307 .. code-block:: default print(df.describe(include='all')) print(df.describe(include=['object'])) # limit to one (or more) types .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job height count 6 6.000000 6 6 4.000000 unique 6 NaN 2 4 NaN top peter NaN F student NaN freq 1 NaN 3 3 NaN mean NaN 33.666667 NaN NaN 172.750000 std NaN 14.895189 NaN NaN 6.344289 min NaN 19.000000 NaN NaN 165.000000 25% NaN 23.000000 NaN NaN 169.500000 50% NaN 29.500000 NaN NaN 173.000000 75% NaN 41.250000 NaN NaN 176.250000 max NaN 58.000000 NaN NaN 180.000000 name gender job count 6 6 6 unique 6 2 4 top peter F student freq 1 3 3 .. GENERATED FROM PYTHON SOURCE LINES 308-309 Statistics per group (groupby) .. GENERATED FROM PYTHON SOURCE LINES 309-317 .. code-block:: default print(df.groupby("job").mean()) print(df.groupby("job")["age"].mean()) print(df.groupby("job").describe(include='all')) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none age height job engineer 33.000000 NaN manager 58.000000 NaN scientist 44.000000 171.000000 student 22.333333 173.333333 job engineer 33.000000 manager 58.000000 scientist 44.000000 student 22.333333 Name: age, dtype: float64 name ... height count unique top freq mean std ... std min 25% 50% 75% max job ... engineer 1 1 peter 1 NaN NaN ... NaN NaN NaN NaN NaN NaN manager 1 1 paul 1 NaN NaN ... NaN NaN NaN NaN NaN NaN scientist 1 1 julie 1 NaN NaN ... NaN 171.0 171.0 171.0 171.0 171.0 student 3 3 eric 1 NaN NaN ... 7.637626 165.0 170.0 175.0 177.5 180.0 [4 rows x 44 columns] .. GENERATED FROM PYTHON SOURCE LINES 318-319 Groupby in a loop .. GENERATED FROM PYTHON SOURCE LINES 319-324 .. code-block:: default for grp, data in df.groupby("job"): print(grp, data) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none engineer name age gender job height 4 peter 33 M engineer NaN manager name age gender job height 3 paul 58 F manager NaN scientist name age gender job height 5 julie 44 F scientist 171.0 student name age gender job height 0 alice 19 F student 165.0 2 eric 22 M student 175.0 1 john 26 M student 180.0 .. GENERATED FROM PYTHON SOURCE LINES 325-330 Quality check ------------- Remove duplicate data ~~~~~~~~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 330-343 .. code-block:: default df = users.append(users.iloc[0], ignore_index=True) print(df.duplicated()) # Series of booleans # (True if a row is identical to a previous row) df.duplicated().sum() # count of duplicates df[df.duplicated()] # only show duplicates df.age.duplicated() # check a single column for duplicates df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates df = df.drop_duplicates() # drop duplicate rows .. rst-class:: sphx-glr-script-out Out: .. code-block:: none 0 False 1 False 2 False 3 False 4 False 5 False 6 True dtype: bool .. GENERATED FROM PYTHON SOURCE LINES 344-346 Missing data ~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 346-363 .. code-block:: default # Missing values are often just excluded df = users.copy() df.describe(include='all') # find missing values in a Series df.height.isnull() # True if NaN, False otherwise df.height.notnull() # False if NaN, True otherwise df[df.height.notnull()] # only show rows where age is not NaN df.height.isnull().sum() # count the missing values # find missing values in a DataFrame df.isnull() # DataFrame of booleans df.isnull().sum() # calculate the sum of each column .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name 0 age 0 gender 0 job 0 height 2 dtype: int64 .. GENERATED FROM PYTHON SOURCE LINES 364-365 Strategy 1: drop missing values .. GENERATED FROM PYTHON SOURCE LINES 365-370 .. code-block:: default df.dropna() # drop a row if ANY values are missing df.dropna(how='all') # drop a row only if ALL values are missing .. raw:: html

	name	age	gender	job	height
0	alice	19	F	student	165.0
1	john	26	M	student	180.0
2	eric	22	M	student	175.0
3	paul	58	F	manager	NaN
4	peter	33	M	engineer	NaN
5	julie	44	F	scientist	171.0

.. GENERATED FROM PYTHON SOURCE LINES 371-372 Strategy 2: fill in missing values .. GENERATED FROM PYTHON SOURCE LINES 372-380 .. code-block:: default df.height.mean() df = users.copy() df.loc[df.height.isnull(), "height"] = df["height"].mean() print(df) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none name age gender job height 0 alice 19 F student 165.00 1 john 26 M student 180.00 2 eric 22 M student 175.00 3 paul 58 F manager 172.75 4 peter 33 M engineer 172.75 5 julie 44 F scientist 171.00 .. GENERATED FROM PYTHON SOURCE LINES 381-385 Operation: multiplication ------------------------- Multiplication of dataframe and other, element-wise .. GENERATED FROM PYTHON SOURCE LINES 385-392 .. code-block:: default df = users.dropna() df.insert(0, 'random', np.arange(df.shape[0])) print(df) df[["age", "height"]].multiply(df["random"], axis="index") .. rst-class:: sphx-glr-script-out Out: .. code-block:: none random name age gender job height 0 0 alice 19 F student 165.0 1 1 john 26 M student 180.0 2 2 eric 22 M student 175.0 5 3 julie 44 F scientist 171.0 .. raw:: html

	age	height
0	0	0.0
1	26	180.0
2	44	350.0
5	132	513.0

.. GENERATED FROM PYTHON SOURCE LINES 393-397 Renaming -------- Rename columns .. GENERATED FROM PYTHON SOURCE LINES 397-401 .. code-block:: default df = users.copy() df.rename(columns={'name': 'NAME'}) .. raw:: html

	NAME	age	gender	job	height
0	alice	19	F	student	165.0
1	john	26	M	student	180.0
2	eric	22	M	student	175.0
3	paul	58	F	manager	NaN
4	peter	33	M	engineer	NaN
5	julie	44	F	scientist	171.0

.. GENERATED FROM PYTHON SOURCE LINES 402-403 Rename values .. GENERATED FROM PYTHON SOURCE LINES 403-408 .. code-block:: default df.job = df.job.map({'student': 'etudiant', 'manager': 'manager', 'engineer': 'ingenieur', 'scientist': 'scientific'}) .. GENERATED FROM PYTHON SOURCE LINES 409-411 Dealing with outliers --------------------- .. GENERATED FROM PYTHON SOURCE LINES 411-416 .. code-block:: default size = pd.Series(np.random.normal(loc=175, size=20, scale=10)) # Corrupt the first 3 measures size[:3] += 500 .. GENERATED FROM PYTHON SOURCE LINES 417-424 Based on parametric statistics: use the mean ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Assume random variable follows the normal distribution Exclude data outside 3 standard-deviations: - Probability that a sample lies within 1 sd: 68.27% - Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 * 15.73) .. GENERATED FROM PYTHON SOURCE LINES 424-430 .. code-block:: default size_outlr_mean = size.copy() size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean() print(size_outlr_mean.mean()) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none 248.48963819938044 .. GENERATED FROM PYTHON SOURCE LINES 431-436 Based on non-parametric statistics: use the median ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Median absolute deviation (MAD), based on the median, is a robust non-parametric statistics. https://en.wikipedia.org/wiki/Median_absolute_deviation .. GENERATED FROM PYTHON SOURCE LINES 436-444 .. code-block:: default mad = 1.4826 * np.median(np.abs(size - size.median())) size_outlr_mad = size.copy() size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median() print(size_outlr_mad.mean(), size_outlr_mad.median()) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none 173.80000467192673 178.7023568870694 .. GENERATED FROM PYTHON SOURCE LINES 445-450 File I/O -------- csv ~~~ .. GENERATED FROM PYTHON SOURCE LINES 450-458 .. code-block:: default import tempfile, os.path tmpdir = tempfile.gettempdir() csv_filename = os.path.join(tmpdir, "users.csv") users.to_csv(csv_filename, index=False) other = pd.read_csv(csv_filename) .. GENERATED FROM PYTHON SOURCE LINES 459-461 Read csv from url ~~~~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 461-466 .. code-block:: default url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv' salary = pd.read_csv(url) .. GENERATED FROM PYTHON SOURCE LINES 467-469 Excel ~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 469-484 .. code-block:: default xls_filename = os.path.join(tmpdir, "users.xlsx") users.to_excel(xls_filename, sheet_name='users', index=False) pd.read_excel(xls_filename, sheet_name='users') # Multiple sheets with pd.ExcelWriter(xls_filename) as writer: users.to_excel(writer, sheet_name='users', index=False) df.to_excel(writer, sheet_name='salary', index=False) pd.read_excel(xls_filename, sheet_name='users') pd.read_excel(xls_filename, sheet_name='salary') .. raw:: html

	name	age	gender	job	height
0	alice	19	F	etudiant	165.0
1	john	26	M	etudiant	180.0
2	eric	22	M	etudiant	175.0
3	paul	58	F	manager	NaN
4	peter	33	M	ingenieur	NaN
5	julie	44	F	scientific	171.0

.. GENERATED FROM PYTHON SOURCE LINES 485-487 SQL (SQLite) ~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 487-493 .. code-block:: default import pandas as pd import sqlite3 db_filename = os.path.join(tmpdir, "users.db") .. GENERATED FROM PYTHON SOURCE LINES 494-495 Connect .. GENERATED FROM PYTHON SOURCE LINES 495-498 .. code-block:: default conn = sqlite3.connect(db_filename) .. GENERATED FROM PYTHON SOURCE LINES 499-500 Creating tables with pandas .. GENERATED FROM PYTHON SOURCE LINES 500-506 .. code-block:: default url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv' salary = pd.read_csv(url) salary.to_sql("salary", conn, if_exists="replace") .. GENERATED FROM PYTHON SOURCE LINES 507-508 Push modifications .. GENERATED FROM PYTHON SOURCE LINES 508-515 .. code-block:: default cur = conn.cursor() values = (100, 14000, 5, 'Bachelor', 'N') cur.execute("insert into salary values (?, ?, ?, ?, ?)", values) conn.commit() .. GENERATED FROM PYTHON SOURCE LINES 516-517 Reading results into a pandas DataFrame .. GENERATED FROM PYTHON SOURCE LINES 517-527 .. code-block:: default salary_sql = pd.read_sql_query("select * from salary;", conn) print(salary_sql.head()) pd.read_sql_query("select * from salary;", conn).tail() pd.read_sql_query('select * from salary where salary>25000;', conn) pd.read_sql_query('select * from salary where experience=16;', conn) pd.read_sql_query('select * from salary where education="Master";', conn) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none index salary experience education management 0 0 13876 1 Bachelor Y 1 1 11608 1 Ph.D N 2 2 18701 1 Ph.D Y 3 3 11283 1 Master N 4 4 11767 1 Ph.D N .. raw:: html

	index	salary	experience	education	management
0	3	11283	1	Master	N
1	5	20872	2	Master	Y
2	6	11772	2	Master	N
3	9	12313	3	Master	N
4	11	21371	3	Master	Y
5	16	12884	4	Master	N
6	17	13245	5	Master	N
7	22	13839	6	Master	N
8	23	22884	6	Master	Y
9	25	14803	8	Master	N
10	30	15942	10	Master	N
11	32	23780	10	Master	Y
12	33	25410	11	Master	Y
13	35	16882	12	Master	N
14	38	26330	13	Master	Y
15	39	17949	14	Master	N
16	41	27837	16	Master	Y
17	42	18838	16	Master	N
18	44	19207	17	Master	N

.. GENERATED FROM PYTHON SOURCE LINES 528-554 Exercises --------- Data Frame ~~~~~~~~~~ 1. Read the iris dataset at 'https://github.com/neurospin/pystatsml/tree/master/datasets/iris.csv' 2. Print column names 3. Get numerical columns 4. For each species compute the mean of numerical columns and store it in a ``stats`` table like: :: species sepal_length sepal_width petal_length petal_width 0 setosa 5.006 3.428 1.462 0.246 1 versicolor 5.936 2.770 4.260 1.326 2 virginica 6.588 2.974 5.552 2.026 Missing data ~~~~~~~~~~~~ Add some missing data to the previous table ``users``: .. GENERATED FROM PYTHON SOURCE LINES 554-559 .. code-block:: default df = users.copy() df.loc[[0, 2], "age"] = None df.loc[[1, 3], "gender"] = None .. GENERATED FROM PYTHON SOURCE LINES 560-565 1. Write a function ``fillmissing_with_mean(df)`` that fill all missing value of numerical column with the mean of the current columns. 2. Save the original users and "imputed" frame in a single excel file "users.xlsx" with 2 sheets: original, imputed. .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 1.125 seconds) .. _sphx_glr_download_auto_gallery_scipy_pandas.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: scipy_pandas.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: scipy_pandas.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_