{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0b45c495",
   "metadata": {},
   "source": [
    "# Lecture 03. Data Loading and Cleaning\n",
    "\n",
    "### Instructor: Luping Yu\n",
    "\n",
    "### Mar 12, 2024\n",
    "\n",
    "***\n",
    "\n",
    "Accessing data is a necessary first step for using most of the tools in this course. We are going to be focused on data input and output using pandas.\n",
    "\n",
    "***\n",
    "## Reading and writing data in text format\n",
    "\n",
    "<code>pandas</code> features a number of functions for reading **tabular data** as a <code>DataFrame</code> object.\n",
    "\n",
    "![avatar](https://raw.githubusercontent.com/lazydingding/gallery/main/Screen%20Shot%202022-03-07%20at%2020.25.29.png)\n",
    "\n",
    "\n",
    "The following table summarizes some of them, though <code>read_csv</code> is likely the ones you'll use the most.\n",
    "\n",
    "|Function | Description |\n",
    "|:- | :- | \n",
    "|**read_csv** | Load delimited data from a file, URL, or file-like object; use comma as default delimiter\n",
    "|**read_excel** | Read tabular data from an Excel XLS or XLSX file\n",
    "|**read_stata** | Read a dataset from Stata file format\n",
    "|**read_sas** | Read a SAS dataset stored in one of the SAS system's custom storage formats\n",
    "|read_html | Read all tables found in the given HTML document\n",
    "|read_json | Read data from a JSON (JavaScript Object Notation) string representation\n",
    "|read_pickle | Read an arbitrary object stored in Python pickle format\n",
    "|read_sql | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame\n",
    "\n",
    "***\n",
    "\n",
    "### Reading and Writing .csv (comma-separated values)\n",
    "\n",
    "<code>.csv</code> is a delimited text file that uses a **comma** to separate values. A <code>.csv</code> file typically stores **tabular data** (numbers and text) in **plain text**.\n",
    "\n",
    "Let‘s start with a small <code>.csv</code> text file: <u>ex1.csv</u>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6f0d159d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>message</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>hello</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>world</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "      <td>12</td>\n",
       "      <td>foo</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b   c   d message\n",
       "0  1   2   3   4   hello\n",
       "1  5   6   7   8   world\n",
       "2  9  10  11  12     foo"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.read_csv('examples/ex1.csv') # relative path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9d0a5112",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>message</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>hello</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>world</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "      <td>12</td>\n",
       "      <td>foo</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b   c   d message\n",
       "0  1   2   3   4   hello\n",
       "1  5   6   7   8   world\n",
       "2  9  10  11  12     foo"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# absolute path  (absolute path differs between Windows and Mac)\n",
    "pd.read_csv('/Users/luping/desktop/teaching/2024_fdap/examples/ex1.csv') "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "746c6213",
   "metadata": {},
   "source": [
    "<code>pandas.read_csv()</code> perform type inference. That means you don't necessarily have to specify which columns are numeric, integer, boolean, or string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ade494cd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a           int64\n",
       "b           int64\n",
       "c           int64\n",
       "d           int64\n",
       "message    object\n",
       "dtype: object"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('examples/ex1.csv')\n",
    "\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "110442cf",
   "metadata": {},
   "source": [
    "A file will not always have a **header row**. Consider this file: <u>ex2.csv</u>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "99d8859b-28c2-4977-97c0-57250bde0449",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>hello</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>world</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "      <td>12</td>\n",
       "      <td>foo</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   1   2   3   4  hello\n",
       "0  5   6   7   8  world\n",
       "1  9  10  11  12    foo"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('examples/ex2.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc13b36d",
   "metadata": {},
   "source": [
    "To read this file, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b9e2a695",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>message</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>hello</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>world</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "      <td>12</td>\n",
       "      <td>foo</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b   c   d message\n",
       "0  1   2   3   4   hello\n",
       "1  5   6   7   8   world\n",
       "2  9  10  11  12     foo"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66161810",
   "metadata": {
    "scrolled": true
   },
   "source": [
    "Suppose you wanted the message column to be the index of the returned <code>DataFrame</code>. You can use the <code>index_col</code> argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "623cae19",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>message</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>hello</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>world</th>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>foo</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         a   b   c   d\n",
       "message               \n",
       "hello    1   2   3   4\n",
       "world    5   6   7   8\n",
       "foo      9  10  11  12"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'], index_col='message')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c612fe7",
   "metadata": {},
   "source": [
    "Handling missing values is an important and frequently nuanced part of the file parsing process. Consider this file: <u>ex3.csv</u>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2c41ff50",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>something</th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>message</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8</td>\n",
       "      <td>world</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>three</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11.0</td>\n",
       "      <td>12</td>\n",
       "      <td>foo</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  something  a   b     c   d message\n",
       "0       one  1   2   3.0   4     NaN\n",
       "1       two  5   6   NaN   8   world\n",
       "2     three  9  10  11.0  12     foo"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('examples/ex3.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce1cb635",
   "metadata": {},
   "source": [
    "Missing data is usually either not present (empty string) or marked by some **sentinel** value, such as **NA** and **NULL**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9c6e56d7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>something</th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>d</th>\n",
       "      <th>message</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   something     a     b      c     d  message\n",
       "0       True  True  True   True  True    False\n",
       "1       True  True  True  False  True     True\n",
       "2       True  True  True   True  True     True"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('examples/ex3.csv')\n",
    "\n",
    "pd.notnull(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23b59c0f",
   "metadata": {},
   "source": [
    "Using <code>to_csv()</code> method, we can write the data out to a comma-separated file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "de1d01b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_csv('examples/out1.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92dd432a",
   "metadata": {},
   "source": [
    "With no other options specified, both the row and column labels are written. Both of these can be disabled:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2cb0e34f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_csv('examples/out2.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10ee9c54",
   "metadata": {},
   "source": [
    "***\n",
    "### Parameters of data loading functions\n",
    "\n",
    "Because of how messy data in the real world can be, data loading functions (especially <code>read_csv()</code>) have grown very complex in their options over time. The **online pandas documentation** has many examples about how each of them works.\n",
    "\n",
    "API reference (pandas documentation) of <code>read_csv()</code>: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html\n",
    "\n",
    "\n",
    "***\n",
    "\n",
    "### Reading Microsoft excel files\n",
    "\n",
    "<code>pandas</code> also supports reading tabular data stored in Excel 2003 (and higher) files using <code>pandas.read_excel()</code> function:\n",
    "\n",
    "Internally these tools use the add-on packages **xlrd** and **openpyxl** to read XLS and XLSX files, respectively. You may need to install these manually with pip."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60202b36",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df = pd.read_excel('examples/ex1.xlsx', 'Sheet1')\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4dbc40f",
   "metadata": {},
   "source": [
    "To write pandas data to Excel format, you can pass a file path to <code>to_excel()</code>:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d04c3dbb",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_excel('examples/out1.xlsx')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65c236ad",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "## Data cleaning and preparation\n",
    "\n",
    "During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: **loading**, **cleaning**, **transforming**, and **rearranging**. Such tasks are often reported to take up 80% or more of an analyst's time.\n",
    "\n",
    "Sometimes the way that data is stored in files or databases is not in the right format for a particular task. Fortunately, <code>pandas</code> provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form.\n",
    "\n",
    "***\n",
    "### Handling Missing Data\n",
    "\n",
    "Missing data (**NA**, which stands for **not available**) occurs commonly in many data analysis applications. For numeric data, pandas uses the floating-point value **NaN** (not a number) to represent missing data.\n",
    "\n",
    "With <code>DataFrame</code> objects, you may want to drop rows or columns that are all NA or only those containing any NAs.\n",
    "\n",
    "<code>dropna()</code> by default drops any row containing a missing value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "b1526cc1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  1.0  NaN  NaN\n",
       "2  NaN  NaN  NaN\n",
       "3  NaN  6.5  3.0"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame([[1., 6.5, 3.],\n",
    "                   [1., None, None],\n",
    "                   [None, None, None],\n",
    "                   [None, 6.5, 3.]])\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "4f12887a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "770b2bb6",
   "metadata": {},
   "source": [
    "Passing <code>how='all'</code> will only drop rows that are all NA:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8e5719cb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  1.0  NaN  NaN\n",
       "3  NaN  6.5  3.0"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(how='all')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c3a332e",
   "metadata": {},
   "source": [
    "To drop columns in the same way, pass <code>axis=1</code>:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "c943629f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2     3\n",
       "0  1.0  6.5  3.0  None\n",
       "1  1.0  NaN  NaN  None\n",
       "2  NaN  NaN  NaN  None\n",
       "3  NaN  6.5  3.0  None"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[3] = None\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "b2c3eebe",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  1.0  NaN  NaN\n",
       "2  NaN  NaN  NaN\n",
       "3  NaN  6.5  3.0"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(axis=1, how='all')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd30cdd5",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "### Filling In Missing Data\n",
    "\n",
    "Rather than filtering out missing data, you may want to fill in the \"holes\" in any number of ways. The <code>fillna</code> method is the function to use. \n",
    "\n",
    "Calling <code>fillna</code> with a constant replaces missing values with that value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3edb6bc9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class participation</th>\n",
       "      <th>homework</th>\n",
       "      <th>midterm</th>\n",
       "      <th>final</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>40.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>15.0</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   class participation  homework  midterm  final\n",
       "0                 10.0      30.0     20.0   40.0\n",
       "1                  8.0      25.0     15.0   35.0\n",
       "2                  6.0      20.0     10.0    NaN\n",
       "3                  NaN       NaN      NaN    NaN\n",
       "4                  NaN       NaN     10.0   30.0"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame([[10, 30, 20, 40],\n",
    "                   [8, 25, 15, 35],\n",
    "                   [6, 20, 10, None],\n",
    "                   [None, None, None, None], \n",
    "                   [None, None, 10, 30]],\n",
    "                  columns=['class participation', 'homework', 'midterm', 'final'])\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "dcae23ac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class participation</th>\n",
       "      <th>homework</th>\n",
       "      <th>midterm</th>\n",
       "      <th>final</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>40.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>15.0</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   class participation  homework  midterm  final\n",
       "0                 10.0      30.0     20.0   40.0\n",
       "1                  8.0      25.0     15.0   35.0\n",
       "2                  6.0      20.0     10.0    5.0\n",
       "3                  5.0       5.0      5.0    5.0\n",
       "4                  5.0       5.0     10.0   30.0"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a033aee",
   "metadata": {},
   "source": [
    "Calling <code>fillna()</code> with a dict, you can use a different fill value for each column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "f516242f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class participation</th>\n",
       "      <th>homework</th>\n",
       "      <th>midterm</th>\n",
       "      <th>final</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>40.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>15.0</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   class participation  homework  midterm  final\n",
       "0                 10.0      30.0     20.0   40.0\n",
       "1                  8.0      25.0     15.0   35.0\n",
       "2                  6.0      20.0     10.0   30.0\n",
       "3                  5.0       NaN      NaN   30.0\n",
       "4                  5.0       NaN     10.0   30.0"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna({'class participation': 5, 'final': 30})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a66a5875",
   "metadata": {},
   "source": [
    "The **interpolation methods** can be used with fillna:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "c63182de",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class participation</th>\n",
       "      <th>homework</th>\n",
       "      <th>midterm</th>\n",
       "      <th>final</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>40.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>15.0</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   class participation  homework  midterm  final\n",
       "0                 10.0      30.0     20.0   40.0\n",
       "1                  8.0      25.0     15.0   35.0\n",
       "2                  6.0      20.0     10.0   35.0\n",
       "3                  6.0      20.0     10.0   35.0\n",
       "4                  6.0      20.0     10.0   30.0"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(method='ffill')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a772446b",
   "metadata": {},
   "source": [
    "With fillna you can do lots of other things with a little creativity. For example, you might pass the mean or median value of a Series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "83c49612",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class participation</th>\n",
       "      <th>homework</th>\n",
       "      <th>midterm</th>\n",
       "      <th>final</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>8.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>13.750000</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>4.787136</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>6.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>7.0</td>\n",
       "      <td>22.5</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>32.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>8.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>12.500000</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>9.0</td>\n",
       "      <td>27.5</td>\n",
       "      <td>16.250000</td>\n",
       "      <td>37.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>40.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       class participation  homework    midterm  final\n",
       "count                  3.0       3.0   4.000000    3.0\n",
       "mean                   8.0      25.0  13.750000   35.0\n",
       "std                    2.0       5.0   4.787136    5.0\n",
       "min                    6.0      20.0  10.000000   30.0\n",
       "25%                    7.0      22.5  10.000000   32.5\n",
       "50%                    8.0      25.0  12.500000   35.0\n",
       "75%                    9.0      27.5  16.250000   37.5\n",
       "max                   10.0      30.0  20.000000   40.0"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.describe() #summary statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "c4a33cfa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class participation</th>\n",
       "      <th>homework</th>\n",
       "      <th>midterm</th>\n",
       "      <th>final</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>20.00</td>\n",
       "      <td>40.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>15.00</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>10.00</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>8.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>13.75</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>10.00</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   class participation  homework  midterm  final\n",
       "0                 10.0      30.0    20.00   40.0\n",
       "1                  8.0      25.0    15.00   35.0\n",
       "2                  6.0      20.0    10.00   35.0\n",
       "3                  8.0      25.0    13.75   35.0\n",
       "4                  8.0      25.0    10.00   30.0"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.fillna(df.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d30d19d",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "### Removing Duplicates\n",
    "\n",
    "Duplicate rows may be found in a <code>DataFrame</code> for any number of reasons. Here is an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "bdfbabd1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>two</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2\n",
       "0  one   1\n",
       "1  two   1\n",
       "2  one   2\n",
       "3  two   3\n",
       "4  one   3\n",
       "5  two   4\n",
       "6  two   4"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],\n",
    "                     'k2': [1, 1, 2, 3, 3, 4, 4]})\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6628633d",
   "metadata": {},
   "source": [
    "The <code>DataFrame</code> method <code>duplicated()</code> returns a boolean <code>Series</code> indicating whether each row is a duplicate (has been observed in a previous row) or not:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "c24a3e2b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2    False\n",
       "3    False\n",
       "4    False\n",
       "5    False\n",
       "6     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.duplicated()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "101da634",
   "metadata": {},
   "source": [
    "Relatedly, <code>drop_duplicates()</code> returns a <code>DataFrame</code> where the duplicated array is False:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "6ebf0302",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>two</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2\n",
       "0  one   1\n",
       "1  two   1\n",
       "2  one   2\n",
       "3  two   3\n",
       "4  one   3\n",
       "5  two   4"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1382c471",
   "metadata": {},
   "source": [
    "<code>drop_duplicates()</code> considers all of the columns; alternatively, you can specify any **subset** of them to detect duplicates.\n",
    "\n",
    "Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "620473a2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "      <th>k3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>two</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2  k3\n",
       "0  one   1   0\n",
       "1  two   1   1\n",
       "2  one   2   2\n",
       "3  two   3   3\n",
       "4  one   3   4\n",
       "5  two   4   5\n",
       "6  two   4   6"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['k3'] = range(7)\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "88613841",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "      <th>k3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2  k3\n",
       "0  one   1   0\n",
       "1  two   1   1"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.drop_duplicates(['k1'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e9c1b34",
   "metadata": {},
   "source": [
    "<code>duplicated()</code> and <code>drop_duplicates()</code> by default keep the **first** observed value combination. Passing <code>keep='last'</code> will return the last one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "08c442c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>k1</th>\n",
       "      <th>k2</th>\n",
       "      <th>k3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>one</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>two</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>two</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>one</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>two</td>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    k1  k2  k3\n",
       "0  one   1   0\n",
       "1  two   1   1\n",
       "2  one   2   2\n",
       "3  two   3   3\n",
       "4  one   3   4\n",
       "6  two   4   6"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.drop_duplicates(['k1', 'k2'], keep='last')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b01c3321",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "### Vectorized string functions in pandas\n",
    "\n",
    "Cleaning up a messy dataset for analysis often requires a lot of string **regularization**. For example, a column containing strings will sometimes have missing data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "0d3090a7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     dave@google.com\n",
       "Jack     jack@xmu.edu.cn\n",
       "Steve    steve@gmail.com\n",
       "Rose     rose@xmu.edu.cn\n",
       "Tony                None\n",
       "dtype: object"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.Series({'Dave': 'dave@google.com',\n",
    "                'Jack': 'jack@xmu.edu.cn',\n",
    "                'Steve': 'steve@gmail.com',\n",
    "                'Rose': 'rose@xmu.edu.cn',\n",
    "                'Tony': None})\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51651957",
   "metadata": {},
   "source": [
    "<code>Series</code> has array-oriented methods for string operations that skip NA values. These are accessed through Series's <code>str</code> attribute.\n",
    "\n",
    "For example, we could check whether each email address has 'xmu.edu' in it with <code>str.contains</code>:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "2d0ec00d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     False\n",
       "Jack      True\n",
       "Steve    False\n",
       "Rose      True\n",
       "Tony      None\n",
       "dtype: object"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.str.contains('xmu.edu')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71c9b6cd",
   "metadata": {},
   "source": [
    "You can similarly **slice** strings using this syntax:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "3fbc0d6d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     dave@\n",
       "Jack     jack@\n",
       "Steve    steve\n",
       "Rose     rose@\n",
       "Tony      None\n",
       "dtype: object"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.str[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "288f88a1",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave     [dave, google.com]\n",
       "Jack     [jack, xmu.edu.cn]\n",
       "Steve    [steve, gmail.com]\n",
       "Rose     [rose, xmu.edu.cn]\n",
       "Tony                   None\n",
       "dtype: object"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.str.split('@')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "a6c9f11b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dave      dave\n",
       "Jack      jack\n",
       "Steve    steve\n",
       "Rose      rose\n",
       "Tony      None\n",
       "dtype: object"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.str.split('@').str.get(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95eceec9",
   "metadata": {},
   "source": [
    "Partial listing of vectorized string methods.\n",
    "\n",
    "|Method|Description|\n",
    "|:- | :- | \n",
    "|cat|Concatenate strings element-wise with optional delimiter\n",
    "|contains|Return boolean array if each string contains pattern/regex\n",
    "|count|Count occurrences of pattern\n",
    "|extract|Use a regular expression with groups to extract one or more strings from a Series of strings\n",
    "|endswith|Equivalent to x.endswith(pattern) for each element\n",
    "|startswith|Equivalent to x.startswith(pattern) for each element\n",
    "|findall|Compute list of all occurrences of pattern/regex for each string\n",
    "|get|Index into each element (retrieve i-th element)\n",
    "|join|Join strings in each element of the Series with passed separator\n",
    "|len|Compute length of each string\n",
    "|lower,upper|Convert cases;equivalent to x.lower() or x.upper() for each element\n",
    "|match|Use re.match with the passed regular expression on each element\n",
    "|replace|Replace occurrences of pattern/regex with some other string\n",
    "|slice|Slice each string in the Series\n",
    "|split|Split strings on delimiter or regular expression\n",
    "|strip|Trim whitespace from both sides, including newlines"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}