{ "cells": [ { "cell_type": "markdown", "id": "0b45c495", "metadata": {}, "source": [ "# Lecture 03. Data Loading and Cleaning\n", "\n", "### Instructor: Luping Yu\n", "\n", "### Mar 12, 2024\n", "\n", "***\n", "\n", "Accessing data is a necessary first step for using most of the tools in this course. We are going to be focused on data input and output using pandas.\n", "\n", "***\n", "## Reading and writing data in text format\n", "\n", "pandas features a number of functions for reading **tabular data** as a DataFrame object.\n", "\n", "![avatar](https://raw.githubusercontent.com/lazydingding/gallery/main/Screen%20Shot%202022-03-07%20at%2020.25.29.png)\n", "\n", "\n", "The following table summarizes some of them, though read_csv is likely the ones you'll use the most.\n", "\n", "|Function | Description |\n", "|:- | :- | \n", "|**read_csv** | Load delimited data from a file, URL, or file-like object; use comma as default delimiter\n", "|**read_excel** | Read tabular data from an Excel XLS or XLSX file\n", "|**read_stata** | Read a dataset from Stata file format\n", "|**read_sas** | Read a SAS dataset stored in one of the SAS system's custom storage formats\n", "|read_html | Read all tables found in the given HTML document\n", "|read_json | Read data from a JSON (JavaScript Object Notation) string representation\n", "|read_pickle | Read an arbitrary object stored in Python pickle format\n", "|read_sql | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame\n", "\n", "***\n", "\n", "### Reading and Writing .csv (comma-separated values)\n", "\n", ".csv is a delimited text file that uses a **comma** to separate values. A .csv file typically stores **tabular data** (numbers and text) in **plain text**.\n", "\n", "Let‘s start with a small .csv text file: ex1.csv" ] }, { "cell_type": "code", "execution_count": 1, "id": "6f0d159d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdmessage
01234hello
15678world
29101112foo
\n", "
" ], "text/plain": [ " a b c d message\n", "0 1 2 3 4 hello\n", "1 5 6 7 8 world\n", "2 9 10 11 12 foo" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "pd.read_csv('examples/ex1.csv') # relative path" ] }, { "cell_type": "code", "execution_count": 2, "id": "9d0a5112", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdmessage
01234hello
15678world
29101112foo
\n", "
" ], "text/plain": [ " a b c d message\n", "0 1 2 3 4 hello\n", "1 5 6 7 8 world\n", "2 9 10 11 12 foo" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# absolute path (absolute path differs between Windows and Mac)\n", "pd.read_csv('/Users/luping/desktop/teaching/2024_fdap/examples/ex1.csv') " ] }, { "cell_type": "markdown", "id": "746c6213", "metadata": {}, "source": [ "pandas.read_csv() perform type inference. That means you don't necessarily have to specify which columns are numeric, integer, boolean, or string:" ] }, { "cell_type": "code", "execution_count": 3, "id": "ade494cd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a int64\n", "b int64\n", "c int64\n", "d int64\n", "message object\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('examples/ex1.csv')\n", "\n", "df.dtypes" ] }, { "cell_type": "markdown", "id": "110442cf", "metadata": {}, "source": [ "A file will not always have a **header row**. Consider this file: ex2.csv" ] }, { "cell_type": "code", "execution_count": 4, "id": "99d8859b-28c2-4977-97c0-57250bde0449", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1234hello
05678world
19101112foo
\n", "
" ], "text/plain": [ " 1 2 3 4 hello\n", "0 5 6 7 8 world\n", "1 9 10 11 12 foo" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv('examples/ex2.csv')" ] }, { "cell_type": "markdown", "id": "fc13b36d", "metadata": {}, "source": [ "To read this file, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:" ] }, { "cell_type": "code", "execution_count": 5, "id": "b9e2a695", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdmessage
01234hello
15678world
29101112foo
\n", "
" ], "text/plain": [ " a b c d message\n", "0 1 2 3 4 hello\n", "1 5 6 7 8 world\n", "2 9 10 11 12 foo" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])" ] }, { "cell_type": "markdown", "id": "66161810", "metadata": { "scrolled": true }, "source": [ "Suppose you wanted the message column to be the index of the returned DataFrame. You can use the index_col argument:" ] }, { "cell_type": "code", "execution_count": 6, "id": "623cae19", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
message
hello1234
world5678
foo9101112
\n", "
" ], "text/plain": [ " a b c d\n", "message \n", "hello 1 2 3 4\n", "world 5 6 7 8\n", "foo 9 10 11 12" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'], index_col='message')" ] }, { "cell_type": "markdown", "id": "1c612fe7", "metadata": {}, "source": [ "Handling missing values is an important and frequently nuanced part of the file parsing process. Consider this file: ex3.csv" ] }, { "cell_type": "code", "execution_count": 7, "id": "2c41ff50", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
somethingabcdmessage
0one123.04NaN
1two56NaN8world
2three91011.012foo
\n", "
" ], "text/plain": [ " something a b c d message\n", "0 one 1 2 3.0 4 NaN\n", "1 two 5 6 NaN 8 world\n", "2 three 9 10 11.0 12 foo" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv('examples/ex3.csv')" ] }, { "cell_type": "markdown", "id": "ce1cb635", "metadata": {}, "source": [ "Missing data is usually either not present (empty string) or marked by some **sentinel** value, such as **NA** and **NULL**." ] }, { "cell_type": "code", "execution_count": 8, "id": "9c6e56d7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
somethingabcdmessage
0TrueTrueTrueTrueTrueFalse
1TrueTrueTrueFalseTrueTrue
2TrueTrueTrueTrueTrueTrue
\n", "
" ], "text/plain": [ " something a b c d message\n", "0 True True True True True False\n", "1 True True True False True True\n", "2 True True True True True True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('examples/ex3.csv')\n", "\n", "pd.notnull(df)" ] }, { "cell_type": "markdown", "id": "23b59c0f", "metadata": {}, "source": [ "Using to_csv() method, we can write the data out to a comma-separated file:" ] }, { "cell_type": "code", "execution_count": 9, "id": "de1d01b3", "metadata": {}, "outputs": [], "source": [ "df.to_csv('examples/out1.csv')" ] }, { "cell_type": "markdown", "id": "92dd432a", "metadata": {}, "source": [ "With no other options specified, both the row and column labels are written. Both of these can be disabled:" ] }, { "cell_type": "code", "execution_count": 10, "id": "2cb0e34f", "metadata": {}, "outputs": [], "source": [ "df.to_csv('examples/out2.csv', index=False)" ] }, { "cell_type": "markdown", "id": "10ee9c54", "metadata": {}, "source": [ "***\n", "### Parameters of data loading functions\n", "\n", "Because of how messy data in the real world can be, data loading functions (especially read_csv()) have grown very complex in their options over time. The **online pandas documentation** has many examples about how each of them works.\n", "\n", "API reference (pandas documentation) of read_csv(): https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html\n", "\n", "\n", "***\n", "\n", "### Reading Microsoft excel files\n", "\n", "pandas also supports reading tabular data stored in Excel 2003 (and higher) files using pandas.read_excel() function:\n", "\n", "Internally these tools use the add-on packages **xlrd** and **openpyxl** to read XLS and XLSX files, respectively. You may need to install these manually with pip." ] }, { "cell_type": "code", "execution_count": null, "id": "60202b36", "metadata": { "scrolled": true }, "outputs": [], "source": [ "df = pd.read_excel('examples/ex1.xlsx', 'Sheet1')\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "e4dbc40f", "metadata": {}, "source": [ "To write pandas data to Excel format, you can pass a file path to to_excel():" ] }, { "cell_type": "code", "execution_count": null, "id": "d04c3dbb", "metadata": {}, "outputs": [], "source": [ "df.to_excel('examples/out1.xlsx')" ] }, { "cell_type": "markdown", "id": "65c236ad", "metadata": {}, "source": [ "***\n", "\n", "## Data cleaning and preparation\n", "\n", "During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: **loading**, **cleaning**, **transforming**, and **rearranging**. Such tasks are often reported to take up 80% or more of an analyst's time.\n", "\n", "Sometimes the way that data is stored in files or databases is not in the right format for a particular task. Fortunately, pandas provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form.\n", "\n", "***\n", "### Handling Missing Data\n", "\n", "Missing data (**NA**, which stands for **not available**) occurs commonly in many data analysis applications. For numeric data, pandas uses the floating-point value **NaN** (not a number) to represent missing data.\n", "\n", "With DataFrame objects, you may want to drop rows or columns that are all NA or only those containing any NAs.\n", "\n", "dropna() by default drops any row containing a missing value:" ] }, { "cell_type": "code", "execution_count": 13, "id": "b1526cc1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([[1., 6.5, 3.],\n", " [1., None, None],\n", " [None, None, None],\n", " [None, 6.5, 3.]])\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 14, "id": "4f12887a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "markdown", "id": "770b2bb6", "metadata": {}, "source": [ "Passing how='all' will only drop rows that are all NA:" ] }, { "cell_type": "code", "execution_count": 15, "id": "8e5719cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(how='all')" ] }, { "cell_type": "markdown", "id": "8c3a332e", "metadata": {}, "source": [ "To drop columns in the same way, pass axis=1:" ] }, { "cell_type": "code", "execution_count": 16, "id": "c943629f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01.06.53.0None
11.0NaNNaNNone
2NaNNaNNaNNone
3NaN6.53.0None
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1.0 6.5 3.0 None\n", "1 1.0 NaN NaN None\n", "2 NaN NaN NaN None\n", "3 NaN 6.5 3.0 None" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[3] = None\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 17, "id": "b2c3eebe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
01.06.53.0
11.0NaNNaN
2NaNNaNNaN
3NaN6.53.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1.0 6.5 3.0\n", "1 1.0 NaN NaN\n", "2 NaN NaN NaN\n", "3 NaN 6.5 3.0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(axis=1, how='all')" ] }, { "cell_type": "markdown", "id": "bd30cdd5", "metadata": {}, "source": [ "***\n", "\n", "### Filling In Missing Data\n", "\n", "Rather than filtering out missing data, you may want to fill in the \"holes\" in any number of ways. The fillna method is the function to use. \n", "\n", "Calling fillna with a constant replaces missing values with that value:" ] }, { "cell_type": "code", "execution_count": 18, "id": "3edb6bc9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
class participationhomeworkmidtermfinal
010.030.020.040.0
18.025.015.035.0
26.020.010.0NaN
3NaNNaNNaNNaN
4NaNNaN10.030.0
\n", "
" ], "text/plain": [ " class participation homework midterm final\n", "0 10.0 30.0 20.0 40.0\n", "1 8.0 25.0 15.0 35.0\n", "2 6.0 20.0 10.0 NaN\n", "3 NaN NaN NaN NaN\n", "4 NaN NaN 10.0 30.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([[10, 30, 20, 40],\n", " [8, 25, 15, 35],\n", " [6, 20, 10, None],\n", " [None, None, None, None], \n", " [None, None, 10, 30]],\n", " columns=['class participation', 'homework', 'midterm', 'final'])\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 19, "id": "dcae23ac", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
class participationhomeworkmidtermfinal
010.030.020.040.0
18.025.015.035.0
26.020.010.05.0
35.05.05.05.0
45.05.010.030.0
\n", "
" ], "text/plain": [ " class participation homework midterm final\n", "0 10.0 30.0 20.0 40.0\n", "1 8.0 25.0 15.0 35.0\n", "2 6.0 20.0 10.0 5.0\n", "3 5.0 5.0 5.0 5.0\n", "4 5.0 5.0 10.0 30.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(5)" ] }, { "cell_type": "markdown", "id": "9a033aee", "metadata": {}, "source": [ "Calling fillna() with a dict, you can use a different fill value for each column:" ] }, { "cell_type": "code", "execution_count": 20, "id": "f516242f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
class participationhomeworkmidtermfinal
010.030.020.040.0
18.025.015.035.0
26.020.010.030.0
35.0NaNNaN30.0
45.0NaN10.030.0
\n", "
" ], "text/plain": [ " class participation homework midterm final\n", "0 10.0 30.0 20.0 40.0\n", "1 8.0 25.0 15.0 35.0\n", "2 6.0 20.0 10.0 30.0\n", "3 5.0 NaN NaN 30.0\n", "4 5.0 NaN 10.0 30.0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna({'class participation': 5, 'final': 30})" ] }, { "cell_type": "markdown", "id": "a66a5875", "metadata": {}, "source": [ "The **interpolation methods** can be used with fillna:" ] }, { "cell_type": "code", "execution_count": 21, "id": "c63182de", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
class participationhomeworkmidtermfinal
010.030.020.040.0
18.025.015.035.0
26.020.010.035.0
36.020.010.035.0
46.020.010.030.0
\n", "
" ], "text/plain": [ " class participation homework midterm final\n", "0 10.0 30.0 20.0 40.0\n", "1 8.0 25.0 15.0 35.0\n", "2 6.0 20.0 10.0 35.0\n", "3 6.0 20.0 10.0 35.0\n", "4 6.0 20.0 10.0 30.0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill')" ] }, { "cell_type": "markdown", "id": "a772446b", "metadata": {}, "source": [ "With fillna you can do lots of other things with a little creativity. For example, you might pass the mean or median value of a Series:" ] }, { "cell_type": "code", "execution_count": 22, "id": "83c49612", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
class participationhomeworkmidtermfinal
count3.03.04.0000003.0
mean8.025.013.75000035.0
std2.05.04.7871365.0
min6.020.010.00000030.0
25%7.022.510.00000032.5
50%8.025.012.50000035.0
75%9.027.516.25000037.5
max10.030.020.00000040.0
\n", "
" ], "text/plain": [ " class participation homework midterm final\n", "count 3.0 3.0 4.000000 3.0\n", "mean 8.0 25.0 13.750000 35.0\n", "std 2.0 5.0 4.787136 5.0\n", "min 6.0 20.0 10.000000 30.0\n", "25% 7.0 22.5 10.000000 32.5\n", "50% 8.0 25.0 12.500000 35.0\n", "75% 9.0 27.5 16.250000 37.5\n", "max 10.0 30.0 20.000000 40.0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe() #summary statistics" ] }, { "cell_type": "code", "execution_count": 23, "id": "c4a33cfa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
class participationhomeworkmidtermfinal
010.030.020.0040.0
18.025.015.0035.0
26.020.010.0035.0
38.025.013.7535.0
48.025.010.0030.0
\n", "
" ], "text/plain": [ " class participation homework midterm final\n", "0 10.0 30.0 20.00 40.0\n", "1 8.0 25.0 15.00 35.0\n", "2 6.0 20.0 10.00 35.0\n", "3 8.0 25.0 13.75 35.0\n", "4 8.0 25.0 10.00 30.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(df.mean())" ] }, { "cell_type": "markdown", "id": "9d30d19d", "metadata": {}, "source": [ "***\n", "\n", "### Removing Duplicates\n", "\n", "Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:" ] }, { "cell_type": "code", "execution_count": 24, "id": "bdfbabd1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
6two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4\n", "6 two 4" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],\n", " 'k2': [1, 1, 2, 3, 3, 4, 4]})\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "6628633d", "metadata": {}, "source": [ "The DataFrame method duplicated() returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:" ] }, { "cell_type": "code", "execution_count": 25, "id": "c24a3e2b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 True\n", "dtype: bool" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.duplicated()" ] }, { "cell_type": "markdown", "id": "101da634", "metadata": {}, "source": [ "Relatedly, drop_duplicates() returns a DataFrame where the duplicated array is False:" ] }, { "cell_type": "code", "execution_count": 26, "id": "6ebf0302", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2
0one1
1two1
2one2
3two3
4one3
5two4
\n", "
" ], "text/plain": [ " k1 k2\n", "0 one 1\n", "1 two 1\n", "2 one 2\n", "3 two 3\n", "4 one 3\n", "5 two 4" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates()" ] }, { "cell_type": "markdown", "id": "1382c471", "metadata": {}, "source": [ "drop_duplicates() considers all of the columns; alternatively, you can specify any **subset** of them to detect duplicates.\n", "\n", "Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:" ] }, { "cell_type": "code", "execution_count": 27, "id": "620473a2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2k3
0one10
1two11
2one22
3two33
4one34
5two45
6two46
\n", "
" ], "text/plain": [ " k1 k2 k3\n", "0 one 1 0\n", "1 two 1 1\n", "2 one 2 2\n", "3 two 3 3\n", "4 one 3 4\n", "5 two 4 5\n", "6 two 4 6" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['k3'] = range(7)\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 28, "id": "88613841", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2k3
0one10
1two11
\n", "
" ], "text/plain": [ " k1 k2 k3\n", "0 one 1 0\n", "1 two 1 1" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates(['k1'])" ] }, { "cell_type": "markdown", "id": "6e9c1b34", "metadata": {}, "source": [ "duplicated() and drop_duplicates() by default keep the **first** observed value combination. Passing keep='last' will return the last one:" ] }, { "cell_type": "code", "execution_count": 29, "id": "08c442c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
k1k2k3
0one10
1two11
2one22
3two33
4one34
6two46
\n", "
" ], "text/plain": [ " k1 k2 k3\n", "0 one 1 0\n", "1 two 1 1\n", "2 one 2 2\n", "3 two 3 3\n", "4 one 3 4\n", "6 two 4 6" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates(['k1', 'k2'], keep='last')" ] }, { "cell_type": "markdown", "id": "b01c3321", "metadata": {}, "source": [ "***\n", "\n", "### Vectorized string functions in pandas\n", "\n", "Cleaning up a messy dataset for analysis often requires a lot of string **regularization**. For example, a column containing strings will sometimes have missing data:" ] }, { "cell_type": "code", "execution_count": 30, "id": "0d3090a7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dave dave@google.com\n", "Jack jack@xmu.edu.cn\n", "Steve steve@gmail.com\n", "Rose rose@xmu.edu.cn\n", "Tony None\n", "dtype: object" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.Series({'Dave': 'dave@google.com',\n", " 'Jack': 'jack@xmu.edu.cn',\n", " 'Steve': 'steve@gmail.com',\n", " 'Rose': 'rose@xmu.edu.cn',\n", " 'Tony': None})\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "51651957", "metadata": {}, "source": [ "Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute.\n", "\n", "For example, we could check whether each email address has 'xmu.edu' in it with str.contains:" ] }, { "cell_type": "code", "execution_count": 31, "id": "2d0ec00d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dave False\n", "Jack True\n", "Steve False\n", "Rose True\n", "Tony None\n", "dtype: object" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.str.contains('xmu.edu')" ] }, { "cell_type": "markdown", "id": "71c9b6cd", "metadata": {}, "source": [ "You can similarly **slice** strings using this syntax:" ] }, { "cell_type": "code", "execution_count": 32, "id": "3fbc0d6d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dave dave@\n", "Jack jack@\n", "Steve steve\n", "Rose rose@\n", "Tony None\n", "dtype: object" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.str[:5]" ] }, { "cell_type": "code", "execution_count": 33, "id": "288f88a1", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Dave [dave, google.com]\n", "Jack [jack, xmu.edu.cn]\n", "Steve [steve, gmail.com]\n", "Rose [rose, xmu.edu.cn]\n", "Tony None\n", "dtype: object" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.str.split('@')" ] }, { "cell_type": "code", "execution_count": 34, "id": "a6c9f11b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dave dave\n", "Jack jack\n", "Steve steve\n", "Rose rose\n", "Tony None\n", "dtype: object" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.str.split('@').str.get(0)" ] }, { "cell_type": "markdown", "id": "95eceec9", "metadata": {}, "source": [ "Partial listing of vectorized string methods.\n", "\n", "|Method|Description|\n", "|:- | :- | \n", "|cat|Concatenate strings element-wise with optional delimiter\n", "|contains|Return boolean array if each string contains pattern/regex\n", "|count|Count occurrences of pattern\n", "|extract|Use a regular expression with groups to extract one or more strings from a Series of strings\n", "|endswith|Equivalent to x.endswith(pattern) for each element\n", "|startswith|Equivalent to x.startswith(pattern) for each element\n", "|findall|Compute list of all occurrences of pattern/regex for each string\n", "|get|Index into each element (retrieve i-th element)\n", "|join|Join strings in each element of the Series with passed separator\n", "|len|Compute length of each string\n", "|lower,upper|Convert cases;equivalent to x.lower() or x.upper() for each element\n", "|match|Use re.match with the passed regular expression on each element\n", "|replace|Replace occurrences of pattern/regex with some other string\n", "|slice|Slice each string in the Series\n", "|split|Split strings on delimiter or regular expression\n", "|strip|Trim whitespace from both sides, including newlines" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }