.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_gallery/data_pandas.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_gallery_data_pandas.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_gallery_data_pandas.py:


Pandas: data manipulation
=========================

It is often said that 80% of data analysis is spent on the cleaning and
small, but important, aspect of data manipulation and cleaning with Pandas.

**Sources**:

- Kevin Markham: https://github.com/justmarkham
- Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html

**Data structures**

- **Series** is a one-dimensional labeled array capable of holding any data
  type (integers, strings, floating point numbers, Python objects, etc.).
  The axis labels are collectively referred to as the index. The basic method
  to create a Series is to call `pd.Series([1,3,5,np.nan,6,8])`

- **DataFrame** is a 2-dimensional labeled data structure with columns of
  potentially different types. You can think of it like a spreadsheet or SQL
  table, or a dict of Series objects. It stems from the `R data.frame()`
  object.

.. GENERATED FROM PYTHON SOURCE LINES 25-29

.. code-block:: Python


    import pandas as pd
    import numpy as np


.. GENERATED FROM PYTHON SOURCE LINES 30-32

Create DataFrame
----------------

.. GENERATED FROM PYTHON SOURCE LINES 32-49

.. code-block:: Python


    columns = ['name', 'age', 'gender', 'job']

    user1 = pd.DataFrame([['alice', 19, "F", "student"],
                          ['john', 26, "M", "student"]],
                         columns=columns)

    user2 = pd.DataFrame([['eric', 22, "M", "student"],
                          ['paul', 58, "F", "manager"]],
                         columns=columns)

    user3 = pd.DataFrame(dict(name=['peter', 'julie'],
                              age=[33, 44], gender=['M', 'F'],
                              job=['engineer', 'scientist']))

    print(user3)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender        job
    0  peter   33      M   engineer
    1  julie   44      F  scientist


.. GENERATED FROM PYTHON SOURCE LINES 50-52

Combining DataFrames
--------------------

.. GENERATED FROM PYTHON SOURCE LINES 54-56

Concatenate DataFrame
~~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 58-59

Concatenate columns (axis = 1).

.. GENERATED FROM PYTHON SOURCE LINES 59-65

.. code-block:: Python


    height = pd.DataFrame(dict(height=[1.65, 1.8]))
    print(user1, "\n", height)

    print(pd.concat([user1, height], axis=1))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender      job
    0  alice   19      F  student
    1   john   26      M  student 
        height
    0    1.65
    1    1.80
        name  age gender      job  height
    0  alice   19      F  student    1.65
    1   john   26      M  student    1.80


.. GENERATED FROM PYTHON SOURCE LINES 66-67

Concatenate rows (default: axis = 0)

.. GENERATED FROM PYTHON SOURCE LINES 67-72

.. code-block:: Python


    users = pd.concat([user1, user2, user3])
    print(users)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender        job
    0  alice   19      F    student
    1   john   26      M    student
    0   eric   22      M    student
    1   paul   58      F    manager
    0  peter   33      M   engineer
    1  julie   44      F  scientist


.. GENERATED FROM PYTHON SOURCE LINES 73-75

Join DataFrame
~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 75-81

.. code-block:: Python


    user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'],
                              height=[165, 180, 175, 171]))
    print(user4)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  height
    0  alice     165
    1   john     180
    2   eric     175
    3  julie     171


.. GENERATED FROM PYTHON SOURCE LINES 82-83

Use intersection of keys from both frames

.. GENERATED FROM PYTHON SOURCE LINES 83-88

.. code-block:: Python


    merge_inter = pd.merge(users, user4)
    print(merge_inter)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender        job  height
    0  alice   19      F    student     165
    1   john   26      M    student     180
    2   eric   22      M    student     175
    3  julie   44      F  scientist     171


.. GENERATED FROM PYTHON SOURCE LINES 89-90

Use union of keys from both frames

.. GENERATED FROM PYTHON SOURCE LINES 90-95

.. code-block:: Python


    users = pd.merge(users, user4, on="name", how='outer')
    print(users)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender        job  height
    0  alice   19      F    student   165.0
    1   eric   22      M    student   175.0
    2   john   26      M    student   180.0
    3  julie   44      F  scientist   171.0
    4   paul   58      F    manager     NaN
    5  peter   33      M   engineer     NaN


.. GENERATED FROM PYTHON SOURCE LINES 96-100

Reshaping by pivoting
~~~~~~~~~~~~~~~~~~~~~

“Unpivots” a DataFrame from wide format to long (stacked) format,

.. GENERATED FROM PYTHON SOURCE LINES 100-105

.. code-block:: Python


    staked = pd.melt(users, id_vars="name", var_name="variable", value_name="value")
    print(staked)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

         name variable      value
    0   alice      age         19
    1    eric      age         22
    2    john      age         26
    3   julie      age         44
    4    paul      age         58
    5   peter      age         33
    6   alice   gender          F
    7    eric   gender          M
    8    john   gender          M
    9   julie   gender          F
    10   paul   gender          F
    11  peter   gender          M
    12  alice      job    student
    13   eric      job    student
    14   john      job    student
    15  julie      job  scientist
    16   paul      job    manager
    17  peter      job   engineer
    18  alice   height      165.0
    19   eric   height      175.0
    20   john   height      180.0
    21  julie   height      171.0
    22   paul   height        NaN
    23  peter   height        NaN


.. GENERATED FROM PYTHON SOURCE LINES 106-107

“pivots” a DataFrame from long (stacked) format to wide format,

.. GENERATED FROM PYTHON SOURCE LINES 107-111

.. code-block:: Python

    wide = staked.pivot(index='name', columns='variable', values='value')
    print(wide)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    variable age gender height        job
    name                                 
    alice     19      F  165.0    student
    eric      22      M  175.0    student
    john      26      M  180.0    student
    julie     44      F  171.0  scientist
    paul      58      F    NaN    manager
    peter     33      M    NaN   engineer


.. GENERATED FROM PYTHON SOURCE LINES 112-115

Summarizing
-----------


.. GENERATED FROM PYTHON SOURCE LINES 115-122

.. code-block:: Python


    users                   # print the first 30 and last 30 rows
    type(users)             # DataFrame
    users.head()            # print the first 5 rows
    users.tail()            # print the last 5 rows


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>name</th>
          <th>age</th>
          <th>gender</th>
          <th>job</th>
          <th>height</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>1</th>
          <td>eric</td>
          <td>22</td>
          <td>M</td>
          <td>student</td>
          <td>175.0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>john</td>
          <td>26</td>
          <td>M</td>
          <td>student</td>
          <td>180.0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>julie</td>
          <td>44</td>
          <td>F</td>
          <td>scientist</td>
          <td>171.0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>paul</td>
          <td>58</td>
          <td>F</td>
          <td>manager</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>5</th>
          <td>peter</td>
          <td>33</td>
          <td>M</td>
          <td>engineer</td>
          <td>NaN</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 123-124

Meta-information

.. GENERATED FROM PYTHON SOURCE LINES 124-131

.. code-block:: Python


    users.columns           # Column names
    users.index             # Row name"
    users.shape             # number of rows and columns
    users.dtypes            # data types of each column
    users.values            # underlying numpy array


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([['alice', 19, 'F', 'student', 165.0],
           ['eric', 22, 'M', 'student', 175.0],
           ['john', 26, 'M', 'student', 180.0],
           ['julie', 44, 'F', 'scientist', 171.0],
           ['paul', 58, 'F', 'manager', nan],
           ['peter', 33, 'M', 'engineer', nan]], dtype=object)


.. GENERATED FROM PYTHON SOURCE LINES 132-134

Columns selection
-----------------

.. GENERATED FROM PYTHON SOURCE LINES 134-147

.. code-block:: Python


    print(users.columns)

    users['gender']         # select one column
    type(users['gender'])   # Series
    users.gender            # select one column using the DataFrame

    # select multiple columns
    users[['age', 'gender']]        # select two columns
    my_cols = ['age', 'gender']     # or, create a list...
    users[my_cols]                  # ...and use that list to select columns
    type(users[my_cols])            # DataFrame


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Index(['name', 'age', 'gender', 'job', 'height'], dtype='object')


.. GENERATED FROM PYTHON SOURCE LINES 148-149

`iloc` is strictly integer position based

.. GENERATED FROM PYTHON SOURCE LINES 149-153

.. code-block:: Python


    users.iloc[:, 2] # select third column


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0    F
    1    M
    2    M
    3    F
    4    F
    5    M
    Name: gender, dtype: object


.. GENERATED FROM PYTHON SOURCE LINES 154-156

Rows selection (basic)
----------------------

.. GENERATED FROM PYTHON SOURCE LINES 158-159

`iloc` is strictly integer position based

.. GENERATED FROM PYTHON SOURCE LINES 159-169

.. code-block:: Python


    df = users.copy()
    df.iloc[0]     # first row
    df.iloc[0, :]  # first row
    df.iloc[[0, 1], :]  # Two first row

    df.iloc[0, 0]  # first item of first row
    df.iloc[0, 0] = 55


.. GENERATED FROM PYTHON SOURCE LINES 170-171

`loc` supports mixed integer and label based access.

.. GENERATED FROM PYTHON SOURCE LINES 171-177

.. code-block:: Python


    df.loc[0]         # first row
    df.loc[0, :]      # first row
    df.loc[0, "age"]  # age item of first row
    df.loc[0, "age"] = 55


.. GENERATED FROM PYTHON SOURCE LINES 178-181

Selection and index

Select females into a new DataFrame

.. GENERATED FROM PYTHON SOURCE LINES 181-186

.. code-block:: Python


    df = users[users.gender == "F"]
    print(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender        job  height
    0  alice   19      F    student   165.0
    3  julie   44      F  scientist   171.0
    4   paul   58      F    manager     NaN


.. GENERATED FROM PYTHON SOURCE LINES 187-188

Reset index, useful when index is meaningless

.. GENERATED FROM PYTHON SOURCE LINES 188-192

.. code-block:: Python


    df = df.reset_index(drop=True)  # Watch the index
    print(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender        job  height
    0  alice   19      F    student   165.0
    1  julie   44      F  scientist   171.0
    2   paul   58      F    manager     NaN


.. GENERATED FROM PYTHON SOURCE LINES 193-195

Rows iteration
--------------

.. GENERATED FROM PYTHON SOURCE LINES 195-198

.. code-block:: Python


    df = users[:2].copy()


.. GENERATED FROM PYTHON SOURCE LINES 199-207

`iterrows()`: slow, get series, **read-only**

- Returns (index, Series) pairs.
- Slow because iterrows boxes the data into a Series.
- Retrieve fields with column name
- **Don't modify something you are iterating over**. Depending on the data types,
  the iterator returns a copy and not a view, and writing to it will have no
  effect.

.. GENERATED FROM PYTHON SOURCE LINES 207-211

.. code-block:: Python


    for idx, row in df.iterrows():
        print(row["name"], row["age"])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    alice 19
    eric 22


.. GENERATED FROM PYTHON SOURCE LINES 212-219

`itertuples()`: fast, get namedtuples, **read-only**

- Returns namedtuples of the values and which is generally faster than iterrows.
- Fast, because itertuples does not box the data into a Series.
- Retrieve fields with integer index starting from 0.
- Names will be renamed to positional names if they are invalid Python
identifier

.. GENERATED FROM PYTHON SOURCE LINES 219-224

.. code-block:: Python


    for tup in df.itertuples():
        print(tup[1], tup[2])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    alice 19
    eric 22


.. GENERATED FROM PYTHON SOURCE LINES 225-226

iter using `loc[i, ...]`: read and **write**

.. GENERATED FROM PYTHON SOURCE LINES 226-231

.. code-block:: Python


    for i in range(df.shape[0]):
        df.loc[i, "age"] *= 10  # df is modified


.. GENERATED FROM PYTHON SOURCE LINES 232-234

Rows selection (filtering)
--------------------------

.. GENERATED FROM PYTHON SOURCE LINES 236-237

simple logical filtering on numerical values

.. GENERATED FROM PYTHON SOURCE LINES 237-245

.. code-block:: Python


    users[users.age < 20]        # only show users with age < 20
    young_bool = users.age < 20  # or, create a Series of booleans...
    young = users[young_bool]            # ...and use that Series to filter rows
    users[users.age < 20].job    # select one column from the filtered results
    print(young)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender      job  height
    0  alice   19      F  student   165.0


.. GENERATED FROM PYTHON SOURCE LINES 246-247

simple logical filtering on categorial values

.. GENERATED FROM PYTHON SOURCE LINES 247-253

.. code-block:: Python


    users[users.job == 'student']
    users[users.job.isin(['student', 'engineer'])]
    users[users['job'].str.contains("stu|scient")]


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>name</th>
          <th>age</th>
          <th>gender</th>
          <th>job</th>
          <th>height</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>alice</td>
          <td>19</td>
          <td>F</td>
          <td>student</td>
          <td>165.0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>eric</td>
          <td>22</td>
          <td>M</td>
          <td>student</td>
          <td>175.0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>john</td>
          <td>26</td>
          <td>M</td>
          <td>student</td>
          <td>180.0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>julie</td>
          <td>44</td>
          <td>F</td>
          <td>scientist</td>
          <td>171.0</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 254-255

Advanced logical filtering

.. GENERATED FROM PYTHON SOURCE LINES 255-260

.. code-block:: Python


    users[users.age < 20][['age', 'job']]            # select multiple columns
    users[(users.age > 20) & (users.gender == 'M')]  # use multiple conditions


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>name</th>
          <th>age</th>
          <th>gender</th>
          <th>job</th>
          <th>height</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>1</th>
          <td>eric</td>
          <td>22</td>
          <td>M</td>
          <td>student</td>
          <td>175.0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>john</td>
          <td>26</td>
          <td>M</td>
          <td>student</td>
          <td>180.0</td>
        </tr>
        <tr>
          <th>5</th>
          <td>peter</td>
          <td>33</td>
          <td>M</td>
          <td>engineer</td>
          <td>NaN</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 261-263

Sorting
-------

.. GENERATED FROM PYTHON SOURCE LINES 263-275

.. code-block:: Python


    df = users.copy()

    df.age.sort_values()                      # only works for a Series
    df.sort_values(by='age')                  # sort rows by a specific column
    df.sort_values(by='age', ascending=False) # use descending order instead
    df.sort_values(by=['job', 'age'])         # sort by multiple columns
    df.sort_values(by=['job', 'age'], inplace=True) # modify df

    print(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender        job  height
    5  peter   33      M   engineer     NaN
    4   paul   58      F    manager     NaN
    3  julie   44      F  scientist   171.0
    0  alice   19      F    student   165.0
    1   eric   22      M    student   175.0
    2   john   26      M    student   180.0


.. GENERATED FROM PYTHON SOURCE LINES 276-280

Descriptive statistics
----------------------

Summarize all numeric columns

.. GENERATED FROM PYTHON SOURCE LINES 280-283

.. code-block:: Python


    print(df.describe())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                 age      height
    count   6.000000    4.000000
    mean   33.666667  172.750000
    std    14.895189    6.344289
    min    19.000000  165.000000
    25%    23.000000  169.500000
    50%    29.500000  173.000000
    75%    41.250000  176.250000
    max    58.000000  180.000000


.. GENERATED FROM PYTHON SOURCE LINES 284-285

Summarize all columns

.. GENERATED FROM PYTHON SOURCE LINES 285-290

.. code-block:: Python


    print(df.describe(include='all'))
    print(df.describe(include=['object']))  # limit to one (or more) types


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

             name        age gender      job      height
    count       6   6.000000      6        6    4.000000
    unique      6        NaN      2        4         NaN
    top     peter        NaN      M  student         NaN
    freq        1        NaN      3        3         NaN
    mean      NaN  33.666667    NaN      NaN  172.750000
    std       NaN  14.895189    NaN      NaN    6.344289
    min       NaN  19.000000    NaN      NaN  165.000000
    25%       NaN  23.000000    NaN      NaN  169.500000
    50%       NaN  29.500000    NaN      NaN  173.000000
    75%       NaN  41.250000    NaN      NaN  176.250000
    max       NaN  58.000000    NaN      NaN  180.000000
             name gender      job
    count       6      6        6
    unique      6      2        4
    top     peter      M  student
    freq        1      3        3


.. GENERATED FROM PYTHON SOURCE LINES 291-292

Categorical columns: count and proportions of values

.. GENERATED FROM PYTHON SOURCE LINES 292-297

.. code-block:: Python


    df['job'].value_counts()
    df['job'].value_counts(normalize=True).round(2)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    job
    student      0.50
    engineer     0.17
    manager      0.17
    scientist    0.17
    Name: proportion, dtype: float64


.. GENERATED FROM PYTHON SOURCE LINES 298-299

Categorical columns: length of strings

.. GENERATED FROM PYTHON SOURCE LINES 299-303

.. code-block:: Python


    df['job'].str.len()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    5    8
    4    7
    3    9
    0    7
    1    7
    2    7
    Name: job, dtype: int64


.. GENERATED FROM PYTHON SOURCE LINES 304-305

Statistics per group (groupby)

.. GENERATED FROM PYTHON SOURCE LINES 305-310

.. code-block:: Python


    print(df.groupby("job")["age"].mean())
    # print(df.groupby("job").describe(include='all'))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    job
    engineer     33.000000
    manager      58.000000
    scientist    44.000000
    student      22.333333
    Name: age, dtype: float64


.. GENERATED FROM PYTHON SOURCE LINES 311-312

Groupby in a loop

.. GENERATED FROM PYTHON SOURCE LINES 312-317

.. code-block:: Python


    for grp, data in df.groupby("job"):
        print(grp, data)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    engineer     name  age gender       job  height
    5  peter   33      M  engineer     NaN
    manager    name  age gender      job  height
    4  paul   58      F  manager     NaN
    scientist     name  age gender        job  height
    3  julie   44      F  scientist   171.0
    student     name  age gender      job  height
    0  alice   19      F  student   165.0
    1   eric   22      M  student   175.0
    2   john   26      M  student   180.0


.. GENERATED FROM PYTHON SOURCE LINES 318-323

Quality check
-------------

Remove duplicate data
~~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 323-338

.. code-block:: Python


    df = users.copy()

    # Create a duplicate: Append the first at the end
    df.loc[len(df.index)] = users.iloc[0]

    print(df.duplicated())                 # Series of booleans
    # (True if a row is identical to a previous row)
    df.duplicated().sum()                  # count of duplicates
    df[df.duplicated()]                    # only show duplicates
    df.age.duplicated()                    # check a single column for duplicates
    df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates
    df = df.drop_duplicates()              # drop duplicate rows


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0    False
    1    False
    2    False
    3    False
    4    False
    5    False
    6     True
    dtype: bool


.. GENERATED FROM PYTHON SOURCE LINES 339-341

Missing data
~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 341-358

.. code-block:: Python


    # Missing values are often just excluded
    df = users.copy()

    df.describe(include='all')

    # find missing values in a Series
    df.height.isnull()           # True if NaN, False otherwise
    df.height.notnull()          # False if NaN, True otherwise
    df[df.height.notnull()]      # only show rows where age is not NaN
    df.height.isnull().sum()     # count the missing values

    # find missing values in a DataFrame
    df.isnull()             # DataFrame of booleans
    df.isnull().sum()       # calculate the sum of each column


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    name      0
    age       0
    gender    0
    job       0
    height    2
    dtype: int64


.. GENERATED FROM PYTHON SOURCE LINES 359-360

Strategy 1: drop missing values

.. GENERATED FROM PYTHON SOURCE LINES 360-365

.. code-block:: Python


    df.dropna()             # drop a row if ANY values are missing
    df.dropna(how='all')    # drop a row only if ALL values are missing


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>name</th>
          <th>age</th>
          <th>gender</th>
          <th>job</th>
          <th>height</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>alice</td>
          <td>19</td>
          <td>F</td>
          <td>student</td>
          <td>165.0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>eric</td>
          <td>22</td>
          <td>M</td>
          <td>student</td>
          <td>175.0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>john</td>
          <td>26</td>
          <td>M</td>
          <td>student</td>
          <td>180.0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>julie</td>
          <td>44</td>
          <td>F</td>
          <td>scientist</td>
          <td>171.0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>paul</td>
          <td>58</td>
          <td>F</td>
          <td>manager</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>5</th>
          <td>peter</td>
          <td>33</td>
          <td>M</td>
          <td>engineer</td>
          <td>NaN</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 366-367

Strategy 2: fill in missing values

.. GENERATED FROM PYTHON SOURCE LINES 367-375

.. code-block:: Python


    df.height.mean()
    df = users.copy()
    df.loc[df.height.isnull(), "height"] = df["height"].mean()

    print(df)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        name  age gender        job  height
    0  alice   19      F    student  165.00
    1   eric   22      M    student  175.00
    2   john   26      M    student  180.00
    3  julie   44      F  scientist  171.00
    4   paul   58      F    manager  172.75
    5  peter   33      M   engineer  172.75


.. GENERATED FROM PYTHON SOURCE LINES 376-380

Operation: multiplication
-------------------------

Multiplication of dataframe and other, element-wise

.. GENERATED FROM PYTHON SOURCE LINES 380-387

.. code-block:: Python


    df = users.dropna()
    df.insert(0, 'random', np.arange(df.shape[0]))
    print(df)
    df[["age", "height"]].multiply(df["random"], axis="index")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

       random   name  age gender        job  height
    0       0  alice   19      F    student   165.0
    1       1   eric   22      M    student   175.0
    2       2   john   26      M    student   180.0
    3       3  julie   44      F  scientist   171.0


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>age</th>
          <th>height</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>0</td>
          <td>0.0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>22</td>
          <td>175.0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>52</td>
          <td>360.0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>132</td>
          <td>513.0</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 388-392

Renaming
--------

Rename columns

.. GENERATED FROM PYTHON SOURCE LINES 392-396

.. code-block:: Python


    df = users.copy()
    df.rename(columns={'name': 'NAME'})


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>NAME</th>
          <th>age</th>
          <th>gender</th>
          <th>job</th>
          <th>height</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>alice</td>
          <td>19</td>
          <td>F</td>
          <td>student</td>
          <td>165.0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>eric</td>
          <td>22</td>
          <td>M</td>
          <td>student</td>
          <td>175.0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>john</td>
          <td>26</td>
          <td>M</td>
          <td>student</td>
          <td>180.0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>julie</td>
          <td>44</td>
          <td>F</td>
          <td>scientist</td>
          <td>171.0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>paul</td>
          <td>58</td>
          <td>F</td>
          <td>manager</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>5</th>
          <td>peter</td>
          <td>33</td>
          <td>M</td>
          <td>engineer</td>
          <td>NaN</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 397-398

Rename values

.. GENERATED FROM PYTHON SOURCE LINES 398-403

.. code-block:: Python


    df.job = df.job.map({'student': 'etudiant', 'manager': 'manager',
                         'engineer': 'ingenieur', 'scientist': 'scientific'})


.. GENERATED FROM PYTHON SOURCE LINES 404-406

Dealing with outliers
---------------------

.. GENERATED FROM PYTHON SOURCE LINES 406-411

.. code-block:: Python


    size = pd.Series(np.random.normal(loc=175, size=20, scale=10))
    # Corrupt the first 3 measures
    size[:3] += 500


.. GENERATED FROM PYTHON SOURCE LINES 412-419

Based on parametric statistics: use the mean
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Assume random variable follows the normal distribution
Exclude data outside 3 standard-deviations:
- Probability that a sample lies within 1 sd: 68.27%
- Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 * 15.73)

.. GENERATED FROM PYTHON SOURCE LINES 419-425

.. code-block:: Python


    size_outlr_mean = size.copy()
    size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean()
    print(size_outlr_mean.mean())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    248.48963819938044


.. GENERATED FROM PYTHON SOURCE LINES 426-431

Based on non-parametric statistics: use the median
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`Median absolute deviation (MAD) <https://en.wikipedia.org/wiki/Median_absolute_deviation>`_
is based on the median, is a robust non-parametric statistics.

.. GENERATED FROM PYTHON SOURCE LINES 431-440

.. code-block:: Python


    mad = 1.4826 * np.median(np.abs(size - size.median()))
    size_outlr_mad = size.copy()

    size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median()
    print(size_outlr_mad.mean(), size_outlr_mad.median())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    173.80000467192673 178.7023568870694


.. GENERATED FROM PYTHON SOURCE LINES 441-446

File I/O
--------

csv
~~~

.. GENERATED FROM PYTHON SOURCE LINES 446-454

.. code-block:: Python


    import tempfile, os.path

    tmpdir = tempfile.gettempdir()
    csv_filename = os.path.join(tmpdir, "users.csv")
    users.to_csv(csv_filename, index=False)
    other = pd.read_csv(csv_filename)


.. GENERATED FROM PYTHON SOURCE LINES 455-457

Read csv from url
~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 457-462

.. code-block:: Python


    url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
    salary = pd.read_csv(url)


.. GENERATED FROM PYTHON SOURCE LINES 463-471

Excel
~~~~~

Package `openpyxl` is required. To install type:

::

   conda install -c conda-forge openpyxl

.. GENERATED FROM PYTHON SOURCE LINES 471-489

.. code-block:: Python


    xls_filename = os.path.join(tmpdir, "users.xlsx")

    # Write
    users.to_excel(xls_filename, sheet_name='users', index=False)

    # Read
    pd.read_excel(xls_filename, sheet_name='users')

    # Multiple sheets
    with pd.ExcelWriter(xls_filename) as writer:
        users.to_excel(writer, sheet_name='users', index=False)
        df.to_excel(writer, sheet_name='salary', index=False)

    pd.read_excel(xls_filename, sheet_name='users')
    pd.read_excel(xls_filename, sheet_name='salary')


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>name</th>
          <th>age</th>
          <th>gender</th>
          <th>job</th>
          <th>height</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>alice</td>
          <td>19</td>
          <td>F</td>
          <td>etudiant</td>
          <td>165.0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>eric</td>
          <td>22</td>
          <td>M</td>
          <td>etudiant</td>
          <td>175.0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>john</td>
          <td>26</td>
          <td>M</td>
          <td>etudiant</td>
          <td>180.0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>julie</td>
          <td>44</td>
          <td>F</td>
          <td>scientific</td>
          <td>171.0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>paul</td>
          <td>58</td>
          <td>F</td>
          <td>manager</td>
          <td>NaN</td>
        </tr>
        <tr>
          <th>5</th>
          <td>peter</td>
          <td>33</td>
          <td>M</td>
          <td>ingenieur</td>
          <td>NaN</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 490-492

SQL (SQLite)
~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 492-498

.. code-block:: Python


    import pandas as pd
    import sqlite3

    db_filename = os.path.join(tmpdir, "users.db")


.. GENERATED FROM PYTHON SOURCE LINES 499-500

Connect

.. GENERATED FROM PYTHON SOURCE LINES 500-503

.. code-block:: Python


    conn = sqlite3.connect(db_filename)


.. GENERATED FROM PYTHON SOURCE LINES 504-505

Creating tables with pandas

.. GENERATED FROM PYTHON SOURCE LINES 505-511

.. code-block:: Python


    url = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
    salary = pd.read_csv(url)

    salary.to_sql("salary", conn, if_exists="replace")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    46


.. GENERATED FROM PYTHON SOURCE LINES 512-513

Push modifications

.. GENERATED FROM PYTHON SOURCE LINES 513-520

.. code-block:: Python


    cur = conn.cursor()
    values = (100, 14000, 5,  'Bachelor', 'N')
    cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
    conn.commit()


.. GENERATED FROM PYTHON SOURCE LINES 521-522

Reading results into a pandas DataFrame

.. GENERATED FROM PYTHON SOURCE LINES 522-532

.. code-block:: Python


    salary_sql = pd.read_sql_query("select * from salary;", conn)
    print(salary_sql.head())

    pd.read_sql_query("select * from salary;", conn).tail()
    pd.read_sql_query('select * from salary where salary>25000;', conn)
    pd.read_sql_query('select * from salary where experience=16;', conn)
    pd.read_sql_query('select * from salary where education="Master";', conn)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

       index  salary  experience education management
    0      0   13876           1  Bachelor          Y
    1      1   11608           1      Ph.D          N
    2      2   18701           1      Ph.D          Y
    3      3   11283           1    Master          N
    4      4   11767           1      Ph.D          N


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>index</th>
          <th>salary</th>
          <th>experience</th>
          <th>education</th>
          <th>management</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>3</td>
          <td>11283</td>
          <td>1</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>1</th>
          <td>5</td>
          <td>20872</td>
          <td>2</td>
          <td>Master</td>
          <td>Y</td>
        </tr>
        <tr>
          <th>2</th>
          <td>6</td>
          <td>11772</td>
          <td>2</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>3</th>
          <td>9</td>
          <td>12313</td>
          <td>3</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>4</th>
          <td>11</td>
          <td>21371</td>
          <td>3</td>
          <td>Master</td>
          <td>Y</td>
        </tr>
        <tr>
          <th>5</th>
          <td>16</td>
          <td>12884</td>
          <td>4</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>6</th>
          <td>17</td>
          <td>13245</td>
          <td>5</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>7</th>
          <td>22</td>
          <td>13839</td>
          <td>6</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>8</th>
          <td>23</td>
          <td>22884</td>
          <td>6</td>
          <td>Master</td>
          <td>Y</td>
        </tr>
        <tr>
          <th>9</th>
          <td>25</td>
          <td>14803</td>
          <td>8</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>10</th>
          <td>30</td>
          <td>15942</td>
          <td>10</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>11</th>
          <td>32</td>
          <td>23780</td>
          <td>10</td>
          <td>Master</td>
          <td>Y</td>
        </tr>
        <tr>
          <th>12</th>
          <td>33</td>
          <td>25410</td>
          <td>11</td>
          <td>Master</td>
          <td>Y</td>
        </tr>
        <tr>
          <th>13</th>
          <td>35</td>
          <td>16882</td>
          <td>12</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>14</th>
          <td>38</td>
          <td>26330</td>
          <td>13</td>
          <td>Master</td>
          <td>Y</td>
        </tr>
        <tr>
          <th>15</th>
          <td>39</td>
          <td>17949</td>
          <td>14</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>16</th>
          <td>41</td>
          <td>27837</td>
          <td>16</td>
          <td>Master</td>
          <td>Y</td>
        </tr>
        <tr>
          <th>17</th>
          <td>42</td>
          <td>18838</td>
          <td>16</td>
          <td>Master</td>
          <td>N</td>
        </tr>
        <tr>
          <th>18</th>
          <td>44</td>
          <td>19207</td>
          <td>17</td>
          <td>Master</td>
          <td>N</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 533-559

Exercises
---------

Data Frame
~~~~~~~~~~

1. Read the iris dataset at 'https://github.com/duchesnay/pystatsml/raw/master/datasets/iris.csv'

2. Print column names

3. Get numerical columns

4. For each species compute the mean of numerical columns and store it in  a ``stats`` table like:

::

          species  sepal_length  sepal_width  petal_length  petal_width
    0      setosa         5.006        3.428         1.462        0.246
    1  versicolor         5.936        2.770         4.260        1.326
    2   virginica         6.588        2.974         5.552        2.026


Missing data
~~~~~~~~~~~~

Add some missing data to the previous table ``users``:

.. GENERATED FROM PYTHON SOURCE LINES 559-564

.. code-block:: Python


    df = users.copy()
    df.loc[[0, 2], "age"] = None
    df.loc[[1, 3], "gender"] = None


.. GENERATED FROM PYTHON SOURCE LINES 565-570

1. Write a function ``fillmissing_with_mean(df)`` that fill all missing
value of numerical column with the mean of the current columns.

2. Save the original users and "imputed" frame in a single excel file
"users.xlsx" with 2 sheets: original, imputed.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.683 seconds)


.. _sphx_glr_download_auto_gallery_data_pandas.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: data_pandas.ipynb <data_pandas.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: data_pandas.py <data_pandas.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: data_pandas.zip <data_pandas.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_