.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_gallery/data_pandas.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_gallery_data_pandas.py: Pandas: data manipulation ========================= It is often said that 80% of data analysis is spent on the cleaning and small, but important, aspect of data manipulation and cleaning with Pandas. **Sources**: - Kevin Markham: https://github.com/justmarkham - Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html **Data structures** - **Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call `pd.Series([1,3,5,np.nan,6,8])` - **DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It stems from the `R data.frame()` object. .. GENERATED FROM PYTHON SOURCE LINES 25-29 .. code-block:: Python import pandas as pd import numpy as np .. GENERATED FROM PYTHON SOURCE LINES 30-32 Create DataFrame ---------------- .. GENERATED FROM PYTHON SOURCE LINES 32-49 .. code-block:: Python columns = ['name', 'age', 'gender', 'job'] user1 = pd.DataFrame([['alice', 19, "F", "student"], ['john', 26, "M", "student"]], columns=columns) user2 = pd.DataFrame([['eric', 22, "M", "student"], ['paul', 58, "F", "manager"]], columns=columns) user3 = pd.DataFrame(dict(name=['peter', 'julie'], age=[33, 44], gender=['M', 'F'], job=['engineer', 'scientist'])) print(user3) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job 0 peter 33 M engineer 1 julie 44 F scientist .. GENERATED FROM PYTHON SOURCE LINES 50-52 Combining DataFrames -------------------- .. GENERATED FROM PYTHON SOURCE LINES 54-56 Concatenate DataFrame ~~~~~~~~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 58-59 Concatenate columns (axis = 1). .. GENERATED FROM PYTHON SOURCE LINES 59-65 .. code-block:: Python height = pd.DataFrame(dict(height=[1.65, 1.8])) print(user1, "\n", height) print(pd.concat([user1, height], axis=1)) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job 0 alice 19 F student 1 john 26 M student height 0 1.65 1 1.80 name age gender job height 0 alice 19 F student 1.65 1 john 26 M student 1.80 .. GENERATED FROM PYTHON SOURCE LINES 66-67 Concatenate rows (default: axis = 0) .. GENERATED FROM PYTHON SOURCE LINES 67-72 .. code-block:: Python users = pd.concat([user1, user2, user3]) print(users) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job 0 alice 19 F student 1 john 26 M student 0 eric 22 M student 1 paul 58 F manager 0 peter 33 M engineer 1 julie 44 F scientist .. GENERATED FROM PYTHON SOURCE LINES 73-75 Join DataFrame ~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 75-81 .. code-block:: Python user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'], height=[165, 180, 175, 171])) print(user4) .. rst-class:: sphx-glr-script-out .. code-block:: none name height 0 alice 165 1 john 180 2 eric 175 3 julie 171 .. GENERATED FROM PYTHON SOURCE LINES 82-83 Use intersection of keys from both frames .. GENERATED FROM PYTHON SOURCE LINES 83-88 .. code-block:: Python merge_inter = pd.merge(users, user4) print(merge_inter) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job height 0 alice 19 F student 165 1 john 26 M student 180 2 eric 22 M student 175 3 julie 44 F scientist 171 .. GENERATED FROM PYTHON SOURCE LINES 89-90 Use union of keys from both frames .. GENERATED FROM PYTHON SOURCE LINES 90-95 .. code-block:: Python users = pd.merge(users, user4, on="name", how='outer') print(users) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job height 0 alice 19 F student 165.0 1 eric 22 M student 175.0 2 john 26 M student 180.0 3 julie 44 F scientist 171.0 4 paul 58 F manager NaN 5 peter 33 M engineer NaN .. GENERATED FROM PYTHON SOURCE LINES 96-100 Reshaping by pivoting ~~~~~~~~~~~~~~~~~~~~~ “Unpivots” a DataFrame from wide format to long (stacked) format, .. GENERATED FROM PYTHON SOURCE LINES 100-105 .. code-block:: Python staked = pd.melt(users, id_vars="name", var_name="variable", value_name="value") print(staked) .. rst-class:: sphx-glr-script-out .. code-block:: none name variable value 0 alice age 19 1 eric age 22 2 john age 26 3 julie age 44 4 paul age 58 5 peter age 33 6 alice gender F 7 eric gender M 8 john gender M 9 julie gender F 10 paul gender F 11 peter gender M 12 alice job student 13 eric job student 14 john job student 15 julie job scientist 16 paul job manager 17 peter job engineer 18 alice height 165.0 19 eric height 175.0 20 john height 180.0 21 julie height 171.0 22 paul height NaN 23 peter height NaN .. GENERATED FROM PYTHON SOURCE LINES 106-107 “pivots” a DataFrame from long (stacked) format to wide format, .. GENERATED FROM PYTHON SOURCE LINES 107-111 .. code-block:: Python wide = staked.pivot(index='name', columns='variable', values='value') print(wide) .. rst-class:: sphx-glr-script-out .. code-block:: none variable age gender height job name alice 19 F 165.0 student eric 22 M 175.0 student john 26 M 180.0 student julie 44 F 171.0 scientist paul 58 F NaN manager peter 33 M NaN engineer .. GENERATED FROM PYTHON SOURCE LINES 112-115 Summarizing ----------- .. GENERATED FROM PYTHON SOURCE LINES 115-122 .. code-block:: Python users # print the first 30 and last 30 rows type(users) # DataFrame users.head() # print the first 5 rows users.tail() # print the last 5 rows .. raw:: html
name age gender job height
1 eric 22 M student 175.0
2 john 26 M student 180.0
3 julie 44 F scientist 171.0
4 paul 58 F manager NaN
5 peter 33 M engineer NaN


.. GENERATED FROM PYTHON SOURCE LINES 123-124 Meta-information .. GENERATED FROM PYTHON SOURCE LINES 124-131 .. code-block:: Python users.columns # Column names users.index # Row name" users.shape # number of rows and columns users.dtypes # data types of each column users.values # underlying numpy array .. rst-class:: sphx-glr-script-out .. code-block:: none array([['alice', 19, 'F', 'student', 165.0], ['eric', 22, 'M', 'student', 175.0], ['john', 26, 'M', 'student', 180.0], ['julie', 44, 'F', 'scientist', 171.0], ['paul', 58, 'F', 'manager', nan], ['peter', 33, 'M', 'engineer', nan]], dtype=object) .. GENERATED FROM PYTHON SOURCE LINES 132-134 Columns selection ----------------- .. GENERATED FROM PYTHON SOURCE LINES 134-147 .. code-block:: Python print(users.columns) users['gender'] # select one column type(users['gender']) # Series users.gender # select one column using the DataFrame # select multiple columns users[['age', 'gender']] # select two columns my_cols = ['age', 'gender'] # or, create a list... users[my_cols] # ...and use that list to select columns type(users[my_cols]) # DataFrame .. rst-class:: sphx-glr-script-out .. code-block:: none Index(['name', 'age', 'gender', 'job', 'height'], dtype='object') .. GENERATED FROM PYTHON SOURCE LINES 148-149 `iloc` is strictly integer position based .. GENERATED FROM PYTHON SOURCE LINES 149-153 .. code-block:: Python users.iloc[:, 2] # select third column .. rst-class:: sphx-glr-script-out .. code-block:: none 0 F 1 M 2 M 3 F 4 F 5 M Name: gender, dtype: object .. GENERATED FROM PYTHON SOURCE LINES 154-156 Rows selection (basic) ---------------------- .. GENERATED FROM PYTHON SOURCE LINES 158-159 `iloc` is strictly integer position based .. GENERATED FROM PYTHON SOURCE LINES 159-169 .. code-block:: Python df = users.copy() df.iloc[0] # first row df.iloc[0, :] # first row df.iloc[[0, 1], :] # Two first row df.iloc[0, 0] # first item of first row df.iloc[0, 0] = 55 .. GENERATED FROM PYTHON SOURCE LINES 170-171 `loc` supports mixed integer and label based access. .. GENERATED FROM PYTHON SOURCE LINES 171-177 .. code-block:: Python df.loc[0] # first row df.loc[0, :] # first row df.loc[0, "age"] # age item of first row df.loc[0, "age"] = 55 .. GENERATED FROM PYTHON SOURCE LINES 178-181 Selection and index Select females into a new DataFrame .. GENERATED FROM PYTHON SOURCE LINES 181-186 .. code-block:: Python df = users[users.gender == "F"] print(df) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job height 0 alice 19 F student 165.0 3 julie 44 F scientist 171.0 4 paul 58 F manager NaN .. GENERATED FROM PYTHON SOURCE LINES 187-188 Reset index, useful when index is meaningless .. GENERATED FROM PYTHON SOURCE LINES 188-192 .. code-block:: Python df = df.reset_index(drop=True) # Watch the index print(df) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job height 0 alice 19 F student 165.0 1 julie 44 F scientist 171.0 2 paul 58 F manager NaN .. GENERATED FROM PYTHON SOURCE LINES 193-195 Rows iteration -------------- .. GENERATED FROM PYTHON SOURCE LINES 195-198 .. code-block:: Python df = users[:2].copy() .. GENERATED FROM PYTHON SOURCE LINES 199-207 `iterrows()`: slow, get series, **read-only** - Returns (index, Series) pairs. - Slow because iterrows boxes the data into a Series. - Retrieve fields with column name - **Don't modify something you are iterating over**. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect. .. GENERATED FROM PYTHON SOURCE LINES 207-211 .. code-block:: Python for idx, row in df.iterrows(): print(row["name"], row["age"]) .. rst-class:: sphx-glr-script-out .. code-block:: none alice 19 eric 22 .. GENERATED FROM PYTHON SOURCE LINES 212-219 `itertuples()`: fast, get namedtuples, **read-only** - Returns namedtuples of the values and which is generally faster than iterrows. - Fast, because itertuples does not box the data into a Series. - Retrieve fields with integer index starting from 0. - Names will be renamed to positional names if they are invalid Python identifier .. GENERATED FROM PYTHON SOURCE LINES 219-224 .. code-block:: Python for tup in df.itertuples(): print(tup[1], tup[2]) .. rst-class:: sphx-glr-script-out .. code-block:: none alice 19 eric 22 .. GENERATED FROM PYTHON SOURCE LINES 225-226 iter using `loc[i, ...]`: read and **write** .. GENERATED FROM PYTHON SOURCE LINES 226-231 .. code-block:: Python for i in range(df.shape[0]): df.loc[i, "age"] *= 10 # df is modified .. GENERATED FROM PYTHON SOURCE LINES 232-234 Rows selection (filtering) -------------------------- .. GENERATED FROM PYTHON SOURCE LINES 236-237 simple logical filtering on numerical values .. GENERATED FROM PYTHON SOURCE LINES 237-245 .. code-block:: Python users[users.age < 20] # only show users with age < 20 young_bool = users.age < 20 # or, create a Series of booleans... young = users[young_bool] # ...and use that Series to filter rows users[users.age < 20].job # select one column from the filtered results print(young) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job height 0 alice 19 F student 165.0 .. GENERATED FROM PYTHON SOURCE LINES 246-247 simple logical filtering on categorial values .. GENERATED FROM PYTHON SOURCE LINES 247-253 .. code-block:: Python users[users.job == 'student'] users[users.job.isin(['student', 'engineer'])] users[users['job'].str.contains("stu|scient")] .. raw:: html
name age gender job height
0 alice 19 F student 165.0
1 eric 22 M student 175.0
2 john 26 M student 180.0
3 julie 44 F scientist 171.0


.. GENERATED FROM PYTHON SOURCE LINES 254-255 Advanced logical filtering .. GENERATED FROM PYTHON SOURCE LINES 255-260 .. code-block:: Python users[users.age < 20][['age', 'job']] # select multiple columns users[(users.age > 20) & (users.gender == 'M')] # use multiple conditions .. raw:: html
name age gender job height
1 eric 22 M student 175.0
2 john 26 M student 180.0
5 peter 33 M engineer NaN


.. GENERATED FROM PYTHON SOURCE LINES 261-263 Sorting ------- .. GENERATED FROM PYTHON SOURCE LINES 263-275 .. code-block:: Python df = users.copy() df.age.sort_values() # only works for a Series df.sort_values(by='age') # sort rows by a specific column df.sort_values(by='age', ascending=False) # use descending order instead df.sort_values(by=['job', 'age']) # sort by multiple columns df.sort_values(by=['job', 'age'], inplace=True) # modify df print(df) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job height 5 peter 33 M engineer NaN 4 paul 58 F manager NaN 3 julie 44 F scientist 171.0 0 alice 19 F student 165.0 1 eric 22 M student 175.0 2 john 26 M student 180.0 .. GENERATED FROM PYTHON SOURCE LINES 276-280 Descriptive statistics ---------------------- Summarize all numeric columns .. GENERATED FROM PYTHON SOURCE LINES 280-283 .. code-block:: Python print(df.describe()) .. rst-class:: sphx-glr-script-out .. code-block:: none age height count 6.000000 4.000000 mean 33.666667 172.750000 std 14.895189 6.344289 min 19.000000 165.000000 25% 23.000000 169.500000 50% 29.500000 173.000000 75% 41.250000 176.250000 max 58.000000 180.000000 .. GENERATED FROM PYTHON SOURCE LINES 284-285 Summarize all columns .. GENERATED FROM PYTHON SOURCE LINES 285-290 .. code-block:: Python print(df.describe(include='all')) print(df.describe(include=['object'])) # limit to one (or more) types .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job height count 6 6.000000 6 6 4.000000 unique 6 NaN 2 4 NaN top peter NaN M student NaN freq 1 NaN 3 3 NaN mean NaN 33.666667 NaN NaN 172.750000 std NaN 14.895189 NaN NaN 6.344289 min NaN 19.000000 NaN NaN 165.000000 25% NaN 23.000000 NaN NaN 169.500000 50% NaN 29.500000 NaN NaN 173.000000 75% NaN 41.250000 NaN NaN 176.250000 max NaN 58.000000 NaN NaN 180.000000 name gender job count 6 6 6 unique 6 2 4 top peter M student freq 1 3 3 .. GENERATED FROM PYTHON SOURCE LINES 291-292 Categorical columns: count and proportions of values .. GENERATED FROM PYTHON SOURCE LINES 292-297 .. code-block:: Python df['job'].value_counts() df['job'].value_counts(normalize=True).round(2) .. rst-class:: sphx-glr-script-out .. code-block:: none job student 0.50 engineer 0.17 manager 0.17 scientist 0.17 Name: proportion, dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 298-299 Categorical columns: length of strings .. GENERATED FROM PYTHON SOURCE LINES 299-303 .. code-block:: Python df['job'].str.len() .. rst-class:: sphx-glr-script-out .. code-block:: none 5 8 4 7 3 9 0 7 1 7 2 7 Name: job, dtype: int64 .. GENERATED FROM PYTHON SOURCE LINES 304-305 Statistics per group (groupby) .. GENERATED FROM PYTHON SOURCE LINES 305-310 .. code-block:: Python print(df.groupby("job")["age"].mean()) # print(df.groupby("job").describe(include='all')) .. rst-class:: sphx-glr-script-out .. code-block:: none job engineer 33.000000 manager 58.000000 scientist 44.000000 student 22.333333 Name: age, dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 311-312 Groupby in a loop .. GENERATED FROM PYTHON SOURCE LINES 312-317 .. code-block:: Python for grp, data in df.groupby("job"): print(grp, data) .. rst-class:: sphx-glr-script-out .. code-block:: none engineer name age gender job height 5 peter 33 M engineer NaN manager name age gender job height 4 paul 58 F manager NaN scientist name age gender job height 3 julie 44 F scientist 171.0 student name age gender job height 0 alice 19 F student 165.0 1 eric 22 M student 175.0 2 john 26 M student 180.0 .. GENERATED FROM PYTHON SOURCE LINES 318-323 Quality check ------------- Remove duplicate data ~~~~~~~~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 323-338 .. code-block:: Python df = users.copy() # Create a duplicate: Append the first at the end df.loc[len(df.index)] = users.iloc[0] print(df.duplicated()) # Series of booleans # (True if a row is identical to a previous row) df.duplicated().sum() # count of duplicates df[df.duplicated()] # only show duplicates df.age.duplicated() # check a single column for duplicates df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates df = df.drop_duplicates() # drop duplicate rows .. rst-class:: sphx-glr-script-out .. code-block:: none 0 False 1 False 2 False 3 False 4 False 5 False 6 True dtype: bool .. GENERATED FROM PYTHON SOURCE LINES 339-341 Missing data ~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 341-358 .. code-block:: Python # Missing values are often just excluded df = users.copy() df.describe(include='all') # find missing values in a Series df.height.isnull() # True if NaN, False otherwise df.height.notnull() # False if NaN, True otherwise df[df.height.notnull()] # only show rows where age is not NaN df.height.isnull().sum() # count the missing values # find missing values in a DataFrame df.isnull() # DataFrame of booleans df.isnull().sum() # calculate the sum of each column .. rst-class:: sphx-glr-script-out .. code-block:: none name 0 age 0 gender 0 job 0 height 2 dtype: int64 .. GENERATED FROM PYTHON SOURCE LINES 359-360 Strategy 1: drop missing values .. GENERATED FROM PYTHON SOURCE LINES 360-365 .. code-block:: Python df.dropna() # drop a row if ANY values are missing df.dropna(how='all') # drop a row only if ALL values are missing .. raw:: html
name age gender job height
0 alice 19 F student 165.0
1 eric 22 M student 175.0
2 john 26 M student 180.0
3 julie 44 F scientist 171.0
4 paul 58 F manager NaN
5 peter 33 M engineer NaN


.. GENERATED FROM PYTHON SOURCE LINES 366-367 Strategy 2: fill in missing values .. GENERATED FROM PYTHON SOURCE LINES 367-375 .. code-block:: Python df.height.mean() df = users.copy() df.loc[df.height.isnull(), "height"] = df["height"].mean() print(df) .. rst-class:: sphx-glr-script-out .. code-block:: none name age gender job height 0 alice 19 F student 165.00 1 eric 22 M student 175.00 2 john 26 M student 180.00 3 julie 44 F scientist 171.00 4 paul 58 F manager 172.75 5 peter 33 M engineer 172.75 .. GENERATED FROM PYTHON SOURCE LINES 376-380 Operation: multiplication ------------------------- Multiplication of dataframe and other, element-wise .. GENERATED FROM PYTHON SOURCE LINES 380-387 .. code-block:: Python df = users.dropna() df.insert(0, 'random', np.arange(df.shape[0])) print(df) df[["age", "height"]].multiply(df["random"], axis="index") .. rst-class:: sphx-glr-script-out .. code-block:: none random name age gender job height 0 0 alice 19 F student 165.0 1 1 eric 22 M student 175.0 2 2 john 26 M student 180.0 3 3 julie 44 F scientist 171.0 .. raw:: html
age height
0 0 0.0
1 22 175.0
2 52 360.0
3 132 513.0


.. GENERATED FROM PYTHON SOURCE LINES 388-392 Renaming -------- Rename columns .. GENERATED FROM PYTHON SOURCE LINES 392-396 .. code-block:: Python df = users.copy() df.rename(columns={'name': 'NAME'}) .. raw:: html
NAME age gender job height
0 alice 19 F student 165.0
1 eric 22 M student 175.0
2 john 26 M student 180.0
3 julie 44 F scientist 171.0
4 paul 58 F manager NaN
5 peter 33 M engineer NaN


.. GENERATED FROM PYTHON SOURCE LINES 397-398 Rename values .. GENERATED FROM PYTHON SOURCE LINES 398-403 .. code-block:: Python df.job = df.job.map({'student': 'etudiant', 'manager': 'manager', 'engineer': 'ingenieur', 'scientist': 'scientific'}) .. GENERATED FROM PYTHON SOURCE LINES 404-406 Dealing with outliers --------------------- .. GENERATED FROM PYTHON SOURCE LINES 406-411 .. code-block:: Python size = pd.Series(np.random.normal(loc=175, size=20, scale=10)) # Corrupt the first 3 measures size[:3] += 500 .. GENERATED FROM PYTHON SOURCE LINES 412-419 Based on parametric statistics: use the mean ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Assume random variable follows the normal distribution Exclude data outside 3 standard-deviations: - Probability that a sample lies within 1 sd: 68.27% - Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 * 15.73) .. GENERATED FROM PYTHON SOURCE LINES 419-425 .. code-block:: Python size_outlr_mean = size.copy() size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean() print(size_outlr_mean.mean()) .. rst-class:: sphx-glr-script-out .. code-block:: none 248.48963819938044 .. GENERATED FROM PYTHON SOURCE LINES 426-431 Based on non-parametric statistics: use the median ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ `Median absolute deviation (MAD) `_ is based on the median, is a robust non-parametric statistics. .. GENERATED FROM PYTHON SOURCE LINES 431-440 .. code-block:: Python mad = 1.4826 * np.median(np.abs(size - size.median())) size_outlr_mad = size.copy() size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median() print(size_outlr_mad.mean(), size_outlr_mad.median()) .. rst-class:: sphx-glr-script-out .. code-block:: none 173.80000467192673 178.7023568870694 .. GENERATED FROM PYTHON SOURCE LINES 441-446 File I/O -------- csv ~~~ .. GENERATED FROM PYTHON SOURCE LINES 446-454 .. code-block:: Python import tempfile, os.path tmpdir = tempfile.gettempdir() csv_filename = os.path.join(tmpdir, "users.csv") users.to_csv(csv_filename, index=False) other = pd.read_csv(csv_filename) .. GENERATED FROM PYTHON SOURCE LINES 455-457 Read csv from url ~~~~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 457-462 .. code-block:: Python url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv' salary = pd.read_csv(url) .. GENERATED FROM PYTHON SOURCE LINES 463-471 Excel ~~~~~ Package `openpyxl` is required. To install type: :: conda install -c conda-forge openpyxl .. GENERATED FROM PYTHON SOURCE LINES 471-489 .. code-block:: Python xls_filename = os.path.join(tmpdir, "users.xlsx") # Write users.to_excel(xls_filename, sheet_name='users', index=False) # Read pd.read_excel(xls_filename, sheet_name='users') # Multiple sheets with pd.ExcelWriter(xls_filename) as writer: users.to_excel(writer, sheet_name='users', index=False) df.to_excel(writer, sheet_name='salary', index=False) pd.read_excel(xls_filename, sheet_name='users') pd.read_excel(xls_filename, sheet_name='salary') .. raw:: html
name age gender job height
0 alice 19 F etudiant 165.0
1 eric 22 M etudiant 175.0
2 john 26 M etudiant 180.0
3 julie 44 F scientific 171.0
4 paul 58 F manager NaN
5 peter 33 M ingenieur NaN


.. GENERATED FROM PYTHON SOURCE LINES 490-492 SQL (SQLite) ~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 492-498 .. code-block:: Python import pandas as pd import sqlite3 db_filename = os.path.join(tmpdir, "users.db") .. GENERATED FROM PYTHON SOURCE LINES 499-500 Connect .. GENERATED FROM PYTHON SOURCE LINES 500-503 .. code-block:: Python conn = sqlite3.connect(db_filename) .. GENERATED FROM PYTHON SOURCE LINES 504-505 Creating tables with pandas .. GENERATED FROM PYTHON SOURCE LINES 505-511 .. code-block:: Python url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv' salary = pd.read_csv(url) salary.to_sql("salary", conn, if_exists="replace") .. rst-class:: sphx-glr-script-out .. code-block:: none 46 .. GENERATED FROM PYTHON SOURCE LINES 512-513 Push modifications .. GENERATED FROM PYTHON SOURCE LINES 513-520 .. code-block:: Python cur = conn.cursor() values = (100, 14000, 5, 'Bachelor', 'N') cur.execute("insert into salary values (?, ?, ?, ?, ?)", values) conn.commit() .. GENERATED FROM PYTHON SOURCE LINES 521-522 Reading results into a pandas DataFrame .. GENERATED FROM PYTHON SOURCE LINES 522-532 .. code-block:: Python salary_sql = pd.read_sql_query("select * from salary;", conn) print(salary_sql.head()) pd.read_sql_query("select * from salary;", conn).tail() pd.read_sql_query('select * from salary where salary>25000;', conn) pd.read_sql_query('select * from salary where experience=16;', conn) pd.read_sql_query('select * from salary where education="Master";', conn) .. rst-class:: sphx-glr-script-out .. code-block:: none index salary experience education management 0 0 13876 1 Bachelor Y 1 1 11608 1 Ph.D N 2 2 18701 1 Ph.D Y 3 3 11283 1 Master N 4 4 11767 1 Ph.D N .. raw:: html
index salary experience education management
0 3 11283 1 Master N
1 5 20872 2 Master Y
2 6 11772 2 Master N
3 9 12313 3 Master N
4 11 21371 3 Master Y
5 16 12884 4 Master N
6 17 13245 5 Master N
7 22 13839 6 Master N
8 23 22884 6 Master Y
9 25 14803 8 Master N
10 30 15942 10 Master N
11 32 23780 10 Master Y
12 33 25410 11 Master Y
13 35 16882 12 Master N
14 38 26330 13 Master Y
15 39 17949 14 Master N
16 41 27837 16 Master Y
17 42 18838 16 Master N
18 44 19207 17 Master N


.. GENERATED FROM PYTHON SOURCE LINES 533-559 Exercises --------- Data Frame ~~~~~~~~~~ 1. Read the iris dataset at 'https://github.com/duchesnay/pystatsml/raw/master/datasets/iris.csv' 2. Print column names 3. Get numerical columns 4. For each species compute the mean of numerical columns and store it in a ``stats`` table like: :: species sepal_length sepal_width petal_length petal_width 0 setosa 5.006 3.428 1.462 0.246 1 versicolor 5.936 2.770 4.260 1.326 2 virginica 6.588 2.974 5.552 2.026 Missing data ~~~~~~~~~~~~ Add some missing data to the previous table ``users``: .. GENERATED FROM PYTHON SOURCE LINES 559-564 .. code-block:: Python df = users.copy() df.loc[[0, 2], "age"] = None df.loc[[1, 3], "gender"] = None .. GENERATED FROM PYTHON SOURCE LINES 565-570 1. Write a function ``fillmissing_with_mean(df)`` that fill all missing value of numerical column with the mean of the current columns. 2. Save the original users and "imputed" frame in a single excel file "users.xlsx" with 2 sheets: original, imputed. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 0.854 seconds) .. _sphx_glr_download_auto_gallery_data_pandas.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: data_pandas.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: data_pandas.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: data_pandas.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_