{ "cells": [ { "cell_type": "markdown", "id": "1e25f29b", "metadata": {}, "source": [ "# Lecture 04. Data Aggregation and Group Operations\n", "\n", "### Instructor: Luping Yu\n", "\n", "### Mar 19, 2024\n", "\n", "***\n", "\n", "Categorizing a dataset and applying a function to each group, whether an **aggregation** or **transformation**, is often a critical component of a data analysis workflow. After loading and preparing a dataset, you may need to compute group statistics for reporting or visualization purposes.\n", "\n", "pandas provides a flexible groupby() interface, enabling you to slice, dice, and summarize datasets in a natural way.\n", "\n", "***\n", "### GroupBy Mechanics\n", "\n", "Punchline: **split-apply-combine (拆分-应用-合并)** \n", "* In the first stage of the process, data is **split** into groups based on one or more keys that you provide.\n", "* Once this is done, a function is **applied** to each group, producing a new value.\n", "* Finally, the results of all those function applications are **combined** into a result object.\n", "\n", "See the following figure for a mockup of a simple group aggregation:\n", "\n", "![avatar](https://raw.githubusercontent.com/lazydingding/gallery/main/Screen%20Shot%202022-03-14%20at%2022.58.22.png)\n", "***\n", "\n", "To get started, here is a small tabular dataset as a DataFrame:" ] }, { "cell_type": "code", "execution_count": 1, "id": "7e30f646", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key1key2data1data2
0aone-1.137953-1.741685
1atwo0.014705-0.937821
2bone0.140717-0.420972
3btwo-0.0984410.548059
4aone-1.203151-0.664293
\n", "
" ], "text/plain": [ " key1 key2 data1 data2\n", "0 a one -1.137953 -1.741685\n", "1 a two 0.014705 -0.937821\n", "2 b one 0.140717 -0.420972\n", "3 b two -0.098441 0.548059\n", "4 a one -1.203151 -0.664293" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "#np.random.randn: generate random numbers\n", "\n", "df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],\n", " 'key2' : ['one', 'two', 'one', 'two', 'one'],\n", " 'data1' : np.random.randn(5),\n", " 'data2' : np.random.randn(5)})\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "1c82534a", "metadata": {}, "source": [ "Suppose you wanted to compute the **mean** of the data1 column using the labels from key1.\n", "\n", "There are a number of ways to do this. One is to access data1 and call groupby() with the column at key1:" ] }, { "cell_type": "code", "execution_count": 2, "id": "4235891d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped = df['data1'].groupby(df['key1'])\n", "\n", "grouped" ] }, { "cell_type": "markdown", "id": "42d2a5c6", "metadata": {}, "source": [ "This grouped variable is now a GroupBy object.\n", "\n", "It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups.\n", "\n", "For example, to compute group means we can call the GroupBy's mean() method:" ] }, { "cell_type": "code", "execution_count": 3, "id": "2225746f", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "key1\n", "a -0.775467\n", "b 0.021138\n", "Name: data1, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.mean()" ] }, { "cell_type": "markdown", "id": "f00cb2a9", "metadata": {}, "source": [ "The important thing here is that the data has been **aggregated** according to the group key, producing a new Series that is now indexed by the **unique values** in the key1 column.\n", "\n", "If instead we had passed multiple arrays as a list, we'd get something different:" ] }, { "cell_type": "code", "execution_count": 4, "id": "ef277cb1", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "key1 key2\n", "a one -1.170552\n", " two 0.014705\n", "b one 0.140717\n", " two -0.098441\n", "Name: data1, dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['data1'].groupby([df['key1'], df['key2']]).mean()" ] }, { "cell_type": "markdown", "id": "dac70643", "metadata": {}, "source": [ "Here we grouped the data using **two keys**, and the resulting Series now has a **hierarchical index** consisting of the **unique pairs** of keys observed.\n", "\n", "A generally useful GroupBy method is size(), which returns a Series containing group sizes:" ] }, { "cell_type": "code", "execution_count": 5, "id": "3fecc78e", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "key1 key2\n", "a one 2\n", " two 1\n", "b one 1\n", " two 1\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['key1', 'key2']).size()" ] }, { "cell_type": "markdown", "id": "71211471", "metadata": {}, "source": [ "For large datasets, it may be desirable to aggregate only a few columns. For example, in the preceding dataset, to compute means for just the data2 column, we could write:" ] }, { "cell_type": "code", "execution_count": 6, "id": "cffb413c", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "key1 key2\n", "a one -1.202989\n", " two -0.937821\n", "b one -0.420972\n", " two 0.548059\n", "Name: data2, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['key1', 'key2'])['data2'].mean()" ] }, { "cell_type": "markdown", "id": "c2c45ba4", "metadata": {}, "source": [ "***\n", "\n", "### Data Aggregation\n", "\n", "**Aggregations** refer to any data transformation that produces scalar values from arrays. The preceding examples have used several of them, including mean and size. Built-in functions can be invoked using agg().\n", "\n", "|Function | Description |\n", "|:- | :- | \n", "|count | Number of non-NA values in the group\n", "|sum | Sum of non-NA values\n", "|mean | Mean of non-NA values\n", "|median | Arithmetic median of non-NA values\n", "|std, var | Unbiased (n – 1 denominator) standard deviation and variance\n", "|min, max | Minimum and maximum of non-NA values\n", "|first, last | First and last non-NA values\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "64083577", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key1data1data2
0a-1.137953-1.741685
1a0.014705-0.937821
2b0.140717-0.420972
3b-0.0984410.548059
4a-1.203151-0.664293
\n", "
" ], "text/plain": [ " key1 data1 data2\n", "0 a -1.137953 -1.741685\n", "1 a 0.014705 -0.937821\n", "2 b 0.140717 -0.420972\n", "3 b -0.098441 0.548059\n", "4 a -1.203151 -0.664293" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df[['key1','data1','data2']]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 8, "id": "2ea90a13", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
key1
a0.014705-0.664293
b0.1407170.548059
\n", "
" ], "text/plain": [ " data1 data2\n", "key1 \n", "a 0.014705 -0.664293\n", "b 0.140717 0.548059" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('key1').max()" ] }, { "cell_type": "code", "execution_count": 9, "id": "a183aa7d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
key1
a-1.203151-1.741685
b-0.098441-0.420972
\n", "
" ], "text/plain": [ " data1 data2\n", "key1 \n", "a -1.203151 -1.741685\n", "b -0.098441 -0.420972" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('key1').agg('min')" ] }, { "cell_type": "markdown", "id": "81db68a7", "metadata": {}, "source": [ "To use your own aggregation functions, pass any function that aggregates an array to the apply method:" ] }, { "cell_type": "code", "execution_count": 10, "id": "778c3d27", "metadata": {}, "outputs": [], "source": [ "def peak_to_peak(arr):\n", " return arr.max() - arr.min()" ] }, { "cell_type": "code", "execution_count": 11, "id": "3dde8c29", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data1data2
key1
a1.2178561.077392
b0.2391580.969030
\n", "
" ], "text/plain": [ " data1 data2\n", "key1 \n", "a 1.217856 1.077392\n", "b 0.239158 0.969030" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(df['key1']).apply(peak_to_peak)" ] }, { "cell_type": "markdown", "id": "c5e10b9b", "metadata": {}, "source": [ "***\n", "\n", "### General split-apply-combine\n", "\n", "* Create analysis with .groupby() and **built-in** functions (mean, sum, count, etc.)\n", "* Create analysis with .groupby() and user defined functions\n", "* Use .transform() to join group stats to the original dataframe\n", "\n", "Let's get started with the tipping dataset:" ] }, { "cell_type": "code", "execution_count": 12, "id": "3cdb9e0b", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
daysizetotal_billtip
0Sun216.991.01
1Sun310.341.66
2Sun321.013.50
3Sun223.683.31
4Sun424.593.61
...............
239Sat329.035.92
240Sat227.182.00
241Sat222.672.00
242Sat217.821.75
243Thur218.783.00
\n", "

244 rows × 4 columns

\n", "
" ], "text/plain": [ " day size total_bill tip\n", "0 Sun 2 16.99 1.01\n", "1 Sun 3 10.34 1.66\n", "2 Sun 3 21.01 3.50\n", "3 Sun 2 23.68 3.31\n", "4 Sun 4 24.59 3.61\n", ".. ... ... ... ...\n", "239 Sat 3 29.03 5.92\n", "240 Sat 2 27.18 2.00\n", "241 Sat 2 22.67 2.00\n", "242 Sat 2 17.82 1.75\n", "243 Thur 2 18.78 3.00\n", "\n", "[244 rows x 4 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('examples/tips.csv')\n", "\n", "df = df[['day','size','total_bill','tip']]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 13, "id": "38f4e5da", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sizetotal_billtip
day
Fri2.10526317.1515792.734737
Sat2.51724120.4413792.993103
Sun2.84210521.4100003.255132
Thur2.45161317.6827422.771452
\n", "
" ], "text/plain": [ " size total_bill tip\n", "day \n", "Fri 2.105263 17.151579 2.734737\n", "Sat 2.517241 20.441379 2.993103\n", "Sun 2.842105 21.410000 3.255132\n", "Thur 2.451613 17.682742 2.771452" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('day').mean()" ] }, { "cell_type": "code", "execution_count": 14, "id": "44b758df", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sizetotal_billtip
02.84210521.4100003.255132
12.84210521.4100003.255132
22.84210521.4100003.255132
32.84210521.4100003.255132
42.84210521.4100003.255132
............
2392.51724120.4413792.993103
2402.51724120.4413792.993103
2412.51724120.4413792.993103
2422.51724120.4413792.993103
2432.45161317.6827422.771452
\n", "

244 rows × 3 columns

\n", "
" ], "text/plain": [ " size total_bill tip\n", "0 2.842105 21.410000 3.255132\n", "1 2.842105 21.410000 3.255132\n", "2 2.842105 21.410000 3.255132\n", "3 2.842105 21.410000 3.255132\n", "4 2.842105 21.410000 3.255132\n", ".. ... ... ...\n", "239 2.517241 20.441379 2.993103\n", "240 2.517241 20.441379 2.993103\n", "241 2.517241 20.441379 2.993103\n", "242 2.517241 20.441379 2.993103\n", "243 2.451613 17.682742 2.771452\n", "\n", "[244 rows x 3 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('day').transform('mean')" ] }, { "cell_type": "code", "execution_count": 15, "id": "2dec97b7", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
daysizetotal_billtipday_avg_tip
0Sun216.991.013.255132
1Sun310.341.663.255132
2Sun321.013.503.255132
3Sun223.683.313.255132
4Sun424.593.613.255132
..................
239Sat329.035.922.993103
240Sat227.182.002.993103
241Sat222.672.002.993103
242Sat217.821.752.993103
243Thur218.783.002.771452
\n", "

244 rows × 5 columns

\n", "
" ], "text/plain": [ " day size total_bill tip day_avg_tip\n", "0 Sun 2 16.99 1.01 3.255132\n", "1 Sun 3 10.34 1.66 3.255132\n", "2 Sun 3 21.01 3.50 3.255132\n", "3 Sun 2 23.68 3.31 3.255132\n", "4 Sun 4 24.59 3.61 3.255132\n", ".. ... ... ... ... ...\n", "239 Sat 3 29.03 5.92 2.993103\n", "240 Sat 2 27.18 2.00 2.993103\n", "241 Sat 2 22.67 2.00 2.993103\n", "242 Sat 2 17.82 1.75 2.993103\n", "243 Thur 2 18.78 3.00 2.771452\n", "\n", "[244 rows x 5 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['day_avg_tip'] = df.groupby('day')['tip'].transform('mean')\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "bb59bd1c", "metadata": {}, "source": [ "***\n", "### Column-Wise and Multiple Function Application\n", "\n", "As you've already seen, aggregating data is a matter of using aggregate with the desired function or calling a method like mean or std.\n", "\n", "However, you may want to aggregate using a different function depending on the column, or multiple functions at once." ] }, { "cell_type": "code", "execution_count": 16, "id": "7a53da08", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesizetip_pct
016.991.01FemaleNoSunDinner20.059447
110.341.66MaleNoSunDinner30.160542
221.013.50MaleNoSunDinner30.166587
323.683.31MaleNoSunDinner20.139780
424.593.61FemaleNoSunDinner40.146808
...........................
23929.035.92MaleNoSatDinner30.203927
24027.182.00FemaleYesSatDinner20.073584
24122.672.00MaleYesSatDinner20.088222
24217.821.75MaleNoSatDinner20.098204
24318.783.00FemaleNoThurDinner20.159744
\n", "

244 rows × 8 columns

\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size tip_pct\n", "0 16.99 1.01 Female No Sun Dinner 2 0.059447\n", "1 10.34 1.66 Male No Sun Dinner 3 0.160542\n", "2 21.01 3.50 Male No Sun Dinner 3 0.166587\n", "3 23.68 3.31 Male No Sun Dinner 2 0.139780\n", "4 24.59 3.61 Female No Sun Dinner 4 0.146808\n", ".. ... ... ... ... ... ... ... ...\n", "239 29.03 5.92 Male No Sat Dinner 3 0.203927\n", "240 27.18 2.00 Female Yes Sat Dinner 2 0.073584\n", "241 22.67 2.00 Male Yes Sat Dinner 2 0.088222\n", "242 17.82 1.75 Male No Sat Dinner 2 0.098204\n", "243 18.78 3.00 Female No Thur Dinner 2 0.159744\n", "\n", "[244 rows x 8 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('examples/tips.csv')\n", "\n", "df['tip_pct'] = df['tip'] / df['total_bill']\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 17, "id": "351f27d7", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "day smoker\n", "Fri No 0.151650\n", " Yes 0.174783\n", "Sat No 0.158048\n", " Yes 0.147906\n", "Sun No 0.160113\n", " Yes 0.187250\n", "Thur No 0.160298\n", " Yes 0.163863\n", "Name: tip_pct, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['day','smoker'])['tip_pct'].agg('mean')" ] }, { "cell_type": "markdown", "id": "23be69d0", "metadata": {}, "source": [ "If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:" ] }, { "cell_type": "code", "execution_count": 18, "id": "08a685d5", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meanmedianstd
daysmoker
FriNo0.1516500.1492410.028123
Yes0.1747830.1739130.051293
SatNo0.1580480.1501520.039767
Yes0.1479060.1536240.061375
SunNo0.1601130.1616650.042347
Yes0.1872500.1381220.154134
ThurNo0.1602980.1534920.038774
Yes0.1638630.1538460.039389
\n", "
" ], "text/plain": [ " mean median std\n", "day smoker \n", "Fri No 0.151650 0.149241 0.028123\n", " Yes 0.174783 0.173913 0.051293\n", "Sat No 0.158048 0.150152 0.039767\n", " Yes 0.147906 0.153624 0.061375\n", "Sun No 0.160113 0.161665 0.042347\n", " Yes 0.187250 0.138122 0.154134\n", "Thur No 0.160298 0.153492 0.038774\n", " Yes 0.163863 0.153846 0.039389" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['day','smoker'])['tip_pct'].agg(['mean','median','std'])" ] }, { "cell_type": "markdown", "id": "2d724620", "metadata": {}, "source": [ "The most general-purpose GroupBy method is apply(). Suppose you wanted to select the top five tip_pct values by group." ] }, { "cell_type": "code", "execution_count": 19, "id": "064a0de2", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesizetip_pct
016.991.01FemaleNoSunDinner20.059447
110.341.66MaleNoSunDinner30.160542
221.013.50MaleNoSunDinner30.166587
323.683.31MaleNoSunDinner20.139780
424.593.61FemaleNoSunDinner40.146808
...........................
23929.035.92MaleNoSatDinner30.203927
24027.182.00FemaleYesSatDinner20.073584
24122.672.00MaleYesSatDinner20.088222
24217.821.75MaleNoSatDinner20.098204
24318.783.00FemaleNoThurDinner20.159744
\n", "

244 rows × 8 columns

\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size tip_pct\n", "0 16.99 1.01 Female No Sun Dinner 2 0.059447\n", "1 10.34 1.66 Male No Sun Dinner 3 0.160542\n", "2 21.01 3.50 Male No Sun Dinner 3 0.166587\n", "3 23.68 3.31 Male No Sun Dinner 2 0.139780\n", "4 24.59 3.61 Female No Sun Dinner 4 0.146808\n", ".. ... ... ... ... ... ... ... ...\n", "239 29.03 5.92 Male No Sat Dinner 3 0.203927\n", "240 27.18 2.00 Female Yes Sat Dinner 2 0.073584\n", "241 22.67 2.00 Male Yes Sat Dinner 2 0.088222\n", "242 17.82 1.75 Male No Sat Dinner 2 0.098204\n", "243 18.78 3.00 Female No Thur Dinner 2 0.159744\n", "\n", "[244 rows x 8 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "3ba1f167", "metadata": {}, "source": [ "First, write a function that selects the rows with the largest values in a particular column:" ] }, { "cell_type": "code", "execution_count": 20, "id": "ad9ab492", "metadata": {}, "outputs": [], "source": [ "def top(df, n=5, column='tip_pct'):\n", " return df.sort_values(by=column, ascending=False)[:n]" ] }, { "cell_type": "code", "execution_count": 21, "id": "72ed53bb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesizetip_pct
1727.255.15MaleYesSunDinner20.710345
1789.604.00FemaleYesSunDinner20.416667
673.071.00FemaleYesSatDinner10.325733
23211.613.39MaleNoSatDinner20.291990
18323.176.50MaleYesSunDinner40.280535
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size tip_pct\n", "172 7.25 5.15 Male Yes Sun Dinner 2 0.710345\n", "178 9.60 4.00 Female Yes Sun Dinner 2 0.416667\n", "67 3.07 1.00 Female Yes Sat Dinner 1 0.325733\n", "232 11.61 3.39 Male No Sat Dinner 2 0.291990\n", "183 23.17 6.50 Male Yes Sun Dinner 4 0.280535" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top(df)" ] }, { "cell_type": "markdown", "id": "ee702134", "metadata": {}, "source": [ "Now, if we group by gender and call apply with this function, we get the following:" ] }, { "cell_type": "code", "execution_count": 22, "id": "b52436de", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesizetip_pct
sex
Female1789.604.00FemaleYesSunDinner20.416667
673.071.00FemaleYesSatDinner10.325733
10914.314.00FemaleYesSatDinner20.279525
9316.324.30FemaleYesFriDinner20.263480
22113.423.48FemaleYesFriLunch20.259314
Male1727.255.15MaleYesSunDinner20.710345
23211.613.39MaleNoSatDinner20.291990
18323.176.50MaleYesSunDinner40.280535
1497.512.00MaleNoThurLunch20.266312
18123.335.65MaleYesSunDinner20.242177
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size tip_pct\n", "sex \n", "Female 178 9.60 4.00 Female Yes Sun Dinner 2 0.416667\n", " 67 3.07 1.00 Female Yes Sat Dinner 1 0.325733\n", " 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525\n", " 93 16.32 4.30 Female Yes Fri Dinner 2 0.263480\n", " 221 13.42 3.48 Female Yes Fri Lunch 2 0.259314\n", "Male 172 7.25 5.15 Male Yes Sun Dinner 2 0.710345\n", " 232 11.61 3.39 Male No Sat Dinner 2 0.291990\n", " 183 23.17 6.50 Male Yes Sun Dinner 4 0.280535\n", " 149 7.51 2.00 Male No Thur Lunch 2 0.266312\n", " 181 23.33 5.65 Male Yes Sun Dinner 2 0.242177" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('sex').apply(top)" ] }, { "cell_type": "markdown", "id": "a3dbb5e5", "metadata": {}, "source": [ "What has happened here? The top function is called on each row group from the DataFrame. The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame.\n", "\n", "If you pass a function to apply that takes other **arguments or keywords**, you can pass these after the function:" ] }, { "cell_type": "code", "execution_count": 23, "id": "9873434c", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesizetip_pct
sexday
FemaleFri9422.753.25FemaleNoFriDinner20.142857
Sat10244.302.50FemaleYesSatDinner30.056433
Sun1135.265.00FemaleNoSunDinner40.141804
Thur19743.115.00FemaleYesThurLunch40.115982
MaleFri9540.174.73MaleYesFriDinner40.117750
Sat17050.8110.00MaleYesSatDinner30.196812
Sun15648.175.00MaleNoSunDinner60.103799
Thur14241.195.00MaleNoThurLunch50.121389
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size \\\n", "sex day \n", "Female Fri 94 22.75 3.25 Female No Fri Dinner 2 \n", " Sat 102 44.30 2.50 Female Yes Sat Dinner 3 \n", " Sun 11 35.26 5.00 Female No Sun Dinner 4 \n", " Thur 197 43.11 5.00 Female Yes Thur Lunch 4 \n", "Male Fri 95 40.17 4.73 Male Yes Fri Dinner 4 \n", " Sat 170 50.81 10.00 Male Yes Sat Dinner 3 \n", " Sun 156 48.17 5.00 Male No Sun Dinner 6 \n", " Thur 142 41.19 5.00 Male No Thur Lunch 5 \n", "\n", " tip_pct \n", "sex day \n", "Female Fri 94 0.142857 \n", " Sat 102 0.056433 \n", " Sun 11 0.141804 \n", " Thur 197 0.115982 \n", "Male Fri 95 0.117750 \n", " Sat 170 0.196812 \n", " Sun 156 0.103799 \n", " Thur 142 0.121389 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['sex', 'day']).apply(top, n=1, column='total_bill')" ] }, { "cell_type": "markdown", "id": "f66abbc0", "metadata": {}, "source": [ "Beyond these basic usage mechanics, getting the most out of apply may require some creativity." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }