{ "cells": [ { "cell_type": "markdown", "id": "1e25f29b", "metadata": {}, "source": [ "# Lecture 05. Data Wrangling: Combine and Merge\n", "\n", "### Instructor: Luping Yu\n", "\n", "### Mar 26, 2024\n", "\n", "***\n", "\n", "In many applications, data may be spread across a number of files or databases or be arranged in a form that is not easy to analyze. This lecture focuses on tools to help **combine** and **merge** data.\n", "\n", "Data contained in pandas objects can be combined together in a number of ways:\n", "\n", "* pandas.merge() connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database **join** operations.\n", "\n", "\n", "* pandas.concat() concatenates or **stacks** together objects along an axis.\n", "\n", "***\n", "\n", "### Database-Style DataFrame Joins\n", "\n", "**Merge** or **join** operations combine datasets by **linking rows using one or more keys**. These operations are central to relational databases (e.g., SQL-based). The merge function in pandas is the main entry point for using these algorithms on your data." ] }, { "cell_type": "code", "execution_count": 1, "id": "6d73f330", "metadata": {}, "outputs": [], "source": [ "# Jupyter notebook display multiple pandas tables side by side (方便并排显示dataframe,看不懂没事)\n", "from IPython.display import display_html\n", "\n", "def display_side_by_side(*args):\n", " html_str = \"\"\n", " for df in args:\n", " html_str += df.to_html()\n", " display_html(\n", " html_str.replace(\"table\", 'table style=\"display:inline;margin-right:20px;\"'),\n", " raw=True,\n", " )" ] }, { "cell_type": "code", "execution_count": 2, "id": "2fa0fe5b", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "id": "c449027f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1
0b0
1b1
2a2
3c3
4a4
5a5
6b6
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata2
0a0
1b1
2d2
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],\n", " 'data1': range(7)})\n", "\n", "df2 = pd.DataFrame({'key': ['a', 'b', 'd'],\n", " 'data2': range(3)})\n", "\n", "display_side_by_side(df1, df2)" ] }, { "cell_type": "markdown", "id": "f7606fbc", "metadata": {}, "source": [ "This is an example of a **many-to-one** join; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge() with these objects we obtain:" ] }, { "cell_type": "code", "execution_count": 4, "id": "3dde8c29", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1
0b0
1b1
2a2
3c3
4a4
5a5
6b6
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata2
0a0
1b1
2d2
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0b01
1b11
2b61
3a20
4a40
5a50
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_merged = pd.merge(df1, df2)\n", "\n", "display_side_by_side(df1, df2, df_merged)" ] }, { "cell_type": "markdown", "id": "77259845", "metadata": {}, "source": [ "Note that we didn't specify which column to join on. If that information is not specified, merge uses the **overlapping column** names as the keys.\n", "\n", "It's a good practice to specify explicitly, though:" ] }, { "cell_type": "code", "execution_count": 5, "id": "244c3f79", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1
0b0
1b1
2a2
3c3
4a4
5a5
6b6
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata2
0a0
1b1
2d2
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0b01
1b11
2b61
3a20
4a40
5a50
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_merged = pd.merge(df1, df2, on='key')\n", "\n", "display_side_by_side(df1, df2, df_merged)" ] }, { "cell_type": "markdown", "id": "2cac3bf8", "metadata": {}, "source": [ "If the column names are different in each object, you can specify them separately:" ] }, { "cell_type": "code", "execution_count": 6, "id": "4e4ba695", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lkeydata1
0b0
1b1
2a2
3c3
4a4
5a5
6b6
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rkeydata2
0a0
1b1
2d2
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],\n", " 'data1': range(7)})\n", "\n", "df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],\n", " 'data2': range(3)})\n", "\n", "display_side_by_side(df3, df4)" ] }, { "cell_type": "code", "execution_count": 7, "id": "871c502d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lkeydata1
0b0
1b1
2a2
3c3
4a4
5a5
6b6
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rkeydata2
0a0
1b1
2d2
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lkeydata1rkeydata2
0b0b1
1b1b1
2b6b1
3a2a0
4a4a0
5a5a0
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_merged = pd.merge(df3, df4, left_on='lkey', right_on='rkey')\n", "\n", "display_side_by_side(df3, df4, df_merged)" ] }, { "cell_type": "markdown", "id": "c661bd55", "metadata": {}, "source": [ "You may notice that the 'c' and 'd' values and associated data are missing from the result.\n", "\n", "By default merge does an inner join: the keys in the result are the **intersection**, or the common set found in both tables. Other possible options are left, right, and outer join.\n", "\n", "The outer join takes the **union** of the keys, combining the effect of applying both left and right joins:" ] }, { "cell_type": "code", "execution_count": 8, "id": "6efc340d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1
0b0
1b1
2a2
3c3
4a4
5a5
6b6
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata2
0a0
1b1
2d2
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0b01
1b11
2b61
3a20
4a40
5a50
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0b0.01.0
1b1.01.0
2b6.01.0
3a2.00.0
4a4.00.0
5a5.00.0
6c3.0NaN
7dNaN2.0
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_merged_inner = pd.merge(df1, df2, how='inner')\n", "df_merged_outer = pd.merge(df1, df2, how='outer')\n", "\n", "display_side_by_side(df1, df2, df_merged_inner, df_merged_outer)" ] }, { "cell_type": "code", "execution_count": 9, "id": "cd5d7aed", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1
0b0
1b1
2a2
3c3
4a4
5a5
6b6
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata2
0a0
1b1
2d2
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0b01.0
1b11.0
2a20.0
3c3NaN
4a40.0
5a50.0
6b61.0
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0a2.00
1a4.00
2a5.00
3b0.01
4b1.01
5b6.01
6dNaN2
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_merged_left = pd.merge(df1, df2, how='left')\n", "df_merged_right = pd.merge(df1, df2, how='right')\n", "\n", "display_side_by_side(df1, df2, df_merged_left, df_merged_right)" ] }, { "cell_type": "markdown", "id": "6e64e4c9", "metadata": {}, "source": [ "See following table for a summary of the options for how=:\n", "\n", "|Option | Behavior |\n", "|:- | :- | \n", "|'inner' | Use only the key combinations observed in both tables\n", "|'left' | Use all key combinations found in the left table\n", "|'right' | Use all key combinations found in the right table\n", "|'outer' | Use all key combinations observed in both tables together\n", "\n", "**Many-to-many** merges have well-defined behavior. Here's an example:" ] }, { "cell_type": "code", "execution_count": 10, "id": "e7c5ade9", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],\n", " 'data1': range(6)})\n", "\n", "df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],\n", " 'data2': range(5)})" ] }, { "cell_type": "code", "execution_count": 11, "id": "6e32f531", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1
0b0
1b1
2a2
3c3
4a4
5b5
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata2
0a0
1b1
2a2
3b3
4d4
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keydata1data2
0b01
1b03
2b11
3b13
4b51
5b53
6a20
7a22
8a40
9a42
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_merged = pd.merge(df1, df2)\n", "\n", "display_side_by_side(df1, df2, df_merged)" ] }, { "cell_type": "markdown", "id": "9f183e2a", "metadata": {}, "source": [ "Many-to-many joins form the **Cartesian product** of the rows. Since there were three 'b' rows in df1 and two in df2, there are six 'b' rows in the result." ] }, { "cell_type": "markdown", "id": "e3fac22c", "metadata": {}, "source": [ "To merge with **multiple keys**, pass a list of column names:" ] }, { "cell_type": "code", "execution_count": 12, "id": "c7be7f8d", "metadata": { "scrolled": true }, "outputs": [], "source": [ "df1 = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],\n", " 'key2': ['one', 'two', 'one'],\n", " 'lval': [1, 2, 3]})\n", "\n", "df2 = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],\n", " 'key2': ['one', 'one', 'one', 'two'],\n", " 'rval': [4, 5, 6, 7]})" ] }, { "cell_type": "code", "execution_count": 13, "id": "abf88fff", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key1key2lval
0fooone1
1footwo2
2barone3
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key1key2rval
0fooone4
1fooone5
2barone6
3bartwo7
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
key1key2lvalrval
0fooone14
1fooone15
2barone36
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_merged = pd.merge(df1, df2, on=['key1', 'key2'])\n", "\n", "display_side_by_side(df1, df2, df_merged)" ] }, { "cell_type": "markdown", "id": "e870e4db", "metadata": {}, "source": [ "See the following table for an argument reference on merge:\n", "\n", "|Argument | Description |\n", "|:- | :- | \n", "|how | One of'inner','outer','left', or'right'; defaults to'inner'.\n", "|on | Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names inleftandrightas the join keys.\n", "|left_on | Columns in left DataFrame to use as join keys.\n", "|right_on | Analogous to left_on for left DataFrame.\n", "|left_index | Use row index in left as its join key (or keys, if a MultiIndex).\n", "|right_index | Analogous to left_index.\n", "\n", "***\n", "\n", "### Concatenating Along an Axis\n", "\n", "Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking.\n", "\n", "The concat function in pandas provides a consistent way to concat the datasets. Suppose we have three Series with no index overlap:" ] }, { "cell_type": "code", "execution_count": 14, "id": "3cdb9e0b", "metadata": { "scrolled": true }, "outputs": [], "source": [ "s1 = pd.Series([0, 1], index=['a', 'b'])\n", "s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])\n", "s3 = pd.Series([5, 6], index=['f', 'g'])" ] }, { "cell_type": "markdown", "id": "c288af3e", "metadata": {}, "source": [ "Calling concat with these objects in a list glues together the values and indexes:" ] }, { "cell_type": "code", "execution_count": 15, "id": "38f4e5da", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 1\n", "c 2\n", "d 3\n", "e 4\n", "f 5\n", "g 6\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1, s2, s3])" ] }, { "cell_type": "markdown", "id": "9e328d4d", "metadata": {}, "source": [ "By default concat works along **axis=0** (index), producing another Series.\n", "\n", "If you pass **axis=1** (columns), the result will instead be a DataFrame:" ] }, { "cell_type": "code", "execution_count": 16, "id": "eea9c421", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
a0.0NaNNaN
b1.0NaNNaN
cNaN2.0NaN
dNaN3.0NaN
eNaN4.0NaN
fNaNNaN5.0
gNaNNaN6.0
\n", "
" ], "text/plain": [ " 0 1 2\n", "a 0.0 NaN NaN\n", "b 1.0 NaN NaN\n", "c NaN 2.0 NaN\n", "d NaN 3.0 NaN\n", "e NaN 4.0 NaN\n", "f NaN NaN 5.0\n", "g NaN NaN 6.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([s1, s2, s3], axis=1)" ] }, { "cell_type": "markdown", "id": "cac61270", "metadata": {}, "source": [ "The same logic extends to DataFrame objects:" ] }, { "cell_type": "code", "execution_count": 17, "id": "af6bf6c5", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame([[0, 1],[2, 3],[4, 5]],\n", " index=['a', 'b', 'c'],\n", " columns=['one', 'two'])\n", "df2 = pd.DataFrame([[5, 6],[7, 8]],\n", " index=['a', 'c'],\n", " columns=['two', 'three'])" ] }, { "cell_type": "code", "execution_count": 18, "id": "87542bd2", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
a01
b23
c45
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
twothree
a56
c78
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwothree
a0.01NaN
b2.03NaN
c4.05NaN
aNaN56.0
cNaN78.0
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_concated = pd.concat([df1, df2])\n", "\n", "display_side_by_side(df1, df2, df_concated)" ] }, { "cell_type": "code", "execution_count": 19, "id": "93edc049", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwo
a01
b23
c45
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
twothree
a56
c78
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onetwotwothree
a015.06.0
b23NaNNaN
c457.08.0
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_concated = pd.concat([df1, df2], axis=1)\n", "\n", "display_side_by_side(df1, df2, df_concated)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }