{"cells": [{"cell_type": "markdown", "metadata": {"tags": ["module-prog", "module-dsml"]}, "source": ["(Merging_combining_grouping_sorting)=\n", "# Merging, combining, grouping and sorting\n", "[Programming for Geoscientists](module-prog) [Data Science and Machine Learning for Geoscientists](module-dsml) \n", "\n", "(Merging)=\n", "## Merging\n", "``` {index} Pandas: merging\n", "```\n", "Let's look at New Zealand earthquake data:"]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "
\n", " \n", " \n", " | \n", " year | \n", " month | \n", " day | \n", " utc_time | \n", " mag | \n", " lat | \n", " lon | \n", " depth_km | \n", " region | \n", " iris_id | \n", " timestamp | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 2009 | \n", " 7 | \n", " 15 | \n", " 09:22:31 | \n", " 7.8 | \n", " -45.8339 | \n", " 166.6363 | \n", " 20.9 | \n", " OFF W. COAST OF S. ISLAND, N.Z. | \n", " 2871698 | \n", " 1247649751 | \n", "
\n", " \n", " 1 | \n", " 2016 | \n", " 11 | \n", " 13 | \n", " 11:02:59 | \n", " 7.8 | \n", " -42.7245 | \n", " 173.0647 | \n", " 22.0 | \n", " SOUTH ISLAND, NEW ZEALAND | \n", " 5197722 | \n", " 1479034979 | \n", "
\n", " \n", " 2 | \n", " 2003 | \n", " 8 | \n", " 21 | \n", " 12:12:47 | \n", " 7.2 | \n", " -45.0875 | \n", " 167.0892 | \n", " 6.8 | \n", " SOUTH ISLAND, NEW ZEALAND | \n", " 1628007 | \n", " 1061467967 | \n", "
\n", " \n", " 3 | \n", " 2001 | \n", " 8 | \n", " 21 | \n", " 06:52:06 | \n", " 7.1 | \n", " -36.8010 | \n", " -179.7230 | \n", " 33.5 | \n", " EAST OF NORTH ISLAND, N.Z. | \n", " 1169374 | \n", " 998376726 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" year month day utc_time mag lat lon depth_km \\\n", "0 2009 7 15 09:22:31 7.8 -45.8339 166.6363 20.9 \n", "1 2016 11 13 11:02:59 7.8 -42.7245 173.0647 22.0 \n", "2 2003 8 21 12:12:47 7.2 -45.0875 167.0892 6.8 \n", "3 2001 8 21 06:52:06 7.1 -36.8010 -179.7230 33.5 \n", "\n", " region iris_id timestamp \n", "0 OFF W. COAST OF S. ISLAND, N.Z. 2871698 1247649751 \n", "1 SOUTH ISLAND, NEW ZEALAND 5197722 1479034979 \n", "2 SOUTH ISLAND, NEW ZEALAND 1628007 1061467967 \n", "3 EAST OF NORTH ISLAND, N.Z. 1169374 998376726 "]}, "execution_count": 1, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas as pd\n", "import numpy as np\n", "\n", "nz_eqs = pd.read_csv(\"../../geosciences/data/nz_largest_eq_since_1970.csv\")\n", "nz_eqs.head(4)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's create a second DataFrame that has latitude, longitude and depth (km) columns. And add an extra column with square root of depth:"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " year | \n", " month | \n", " day | \n", " utc_time | \n", " mag | \n", " lat | \n", " lon | \n", " depth_km | \n", " region | \n", " iris_id | \n", " timestamp | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 2009 | \n", " 7 | \n", " 15 | \n", " 09:22:31 | \n", " 7.8 | \n", " -45.8339 | \n", " 166.6363 | \n", " 20.9 | \n", " OFF W. COAST OF S. ISLAND, N.Z. | \n", " 2871698 | \n", " 1247649751 | \n", "
\n", " \n", " 1 | \n", " 2016 | \n", " 11 | \n", " 13 | \n", " 11:02:59 | \n", " 7.8 | \n", " -42.7245 | \n", " 173.0647 | \n", " 22.0 | \n", " SOUTH ISLAND, NEW ZEALAND | \n", " 5197722 | \n", " 1479034979 | \n", "
\n", " \n", " 2 | \n", " 2003 | \n", " 8 | \n", " 21 | \n", " 12:12:47 | \n", " 7.2 | \n", " -45.0875 | \n", " 167.0892 | \n", " 6.8 | \n", " SOUTH ISLAND, NEW ZEALAND | \n", " 1628007 | \n", " 1061467967 | \n", "
\n", " \n", " 3 | \n", " 2001 | \n", " 8 | \n", " 21 | \n", " 06:52:06 | \n", " 7.1 | \n", " -36.8010 | \n", " -179.7230 | \n", " 33.5 | \n", " EAST OF NORTH ISLAND, N.Z. | \n", " 1169374 | \n", " 998376726 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" year month day utc_time mag lat lon depth_km \\\n", "0 2009 7 15 09:22:31 7.8 -45.8339 166.6363 20.9 \n", "1 2016 11 13 11:02:59 7.8 -42.7245 173.0647 22.0 \n", "2 2003 8 21 12:12:47 7.2 -45.0875 167.0892 6.8 \n", "3 2001 8 21 06:52:06 7.1 -36.8010 -179.7230 33.5 \n", "\n", " region iris_id timestamp \n", "0 OFF W. COAST OF S. ISLAND, N.Z. 2871698 1247649751 \n", "1 SOUTH ISLAND, NEW ZEALAND 5197722 1479034979 \n", "2 SOUTH ISLAND, NEW ZEALAND 1628007 1061467967 \n", "3 EAST OF NORTH ISLAND, N.Z. 1169374 998376726 "]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["nz_eqs = pd.read_csv(\"../../geosciences/data/nz_largest_eq_since_1970.csv\")\n", "nz_eqs.head(4)"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " lat | \n", " lon | \n", " depth_km | \n", " depth_sqrt | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " -45.8339 | \n", " 166.6363 | \n", " 20.9 | \n", " 4.571652 | \n", "
\n", " \n", " 1 | \n", " -42.7245 | \n", " 173.0647 | \n", " 22.0 | \n", " 4.690416 | \n", "
\n", " \n", " 2 | \n", " -45.0875 | \n", " 167.0892 | \n", " 6.8 | \n", " 2.607681 | \n", "
\n", " \n", " 3 | \n", " -36.8010 | \n", " -179.7230 | \n", " 33.5 | \n", " 5.787918 | \n", "
\n", " \n", " 4 | \n", " -46.4964 | \n", " 164.8273 | \n", " 7.6 | \n", " 2.756810 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" lat lon depth_km depth_sqrt\n", "0 -45.8339 166.6363 20.9 4.571652\n", "1 -42.7245 173.0647 22.0 4.690416\n", "2 -45.0875 167.0892 6.8 2.607681\n", "3 -36.8010 -179.7230 33.5 5.787918\n", "4 -46.4964 164.8273 7.6 2.756810"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["nz_eqs2 = nz_eqs.iloc[:,5:8]\n", "nz_eqs2[\"depth_sqrt\"] = np.sqrt(nz_eqs2.depth_km)\n", "nz_eqs2.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, we would like to join the original 'nz_eqs' and 'nz_eqs2' DataFrames by common columns. We can use merge function that takes two DataFrames as arguments and we can decided which columns to merge with. In our case, nz_eqs and nz_eqs2 share three data columns: lat, lon, depth_km. If we choose to merge with latitude, longitude, the final DataFrame would keep depth_km columns from both DataFrames, named depth_km_x and depth_km_y. If we wanted to merge DataFrames with all three common columns, we can use:"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " year | \n", " month | \n", " day | \n", " utc_time | \n", " mag | \n", " lat | \n", " lon | \n", " depth_km | \n", " region | \n", " iris_id | \n", " timestamp | \n", " depth_sqrt | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 2009 | \n", " 7 | \n", " 15 | \n", " 09:22:31 | \n", " 7.8 | \n", " -45.8339 | \n", " 166.6363 | \n", " 20.9 | \n", " OFF W. COAST OF S. ISLAND, N.Z. | \n", " 2871698 | \n", " 1247649751 | \n", " 4.571652 | \n", "
\n", " \n", " 1 | \n", " 2016 | \n", " 11 | \n", " 13 | \n", " 11:02:59 | \n", " 7.8 | \n", " -42.7245 | \n", " 173.0647 | \n", " 22.0 | \n", " SOUTH ISLAND, NEW ZEALAND | \n", " 5197722 | \n", " 1479034979 | \n", " 4.690416 | \n", "
\n", " \n", " 2 | \n", " 2003 | \n", " 8 | \n", " 21 | \n", " 12:12:47 | \n", " 7.2 | \n", " -45.0875 | \n", " 167.0892 | \n", " 6.8 | \n", " SOUTH ISLAND, NEW ZEALAND | \n", " 1628007 | \n", " 1061467967 | \n", " 2.607681 | \n", "
\n", " \n", " 3 | \n", " 2001 | \n", " 8 | \n", " 21 | \n", " 06:52:06 | \n", " 7.1 | \n", " -36.8010 | \n", " -179.7230 | \n", " 33.5 | \n", " EAST OF NORTH ISLAND, N.Z. | \n", " 1169374 | \n", " 998376726 | \n", " 5.787918 | \n", "
\n", " \n", " 4 | \n", " 2004 | \n", " 11 | \n", " 22 | \n", " 20:26:23 | \n", " 7.1 | \n", " -46.4964 | \n", " 164.8273 | \n", " 7.6 | \n", " OFF W. COAST OF S. ISLAND, N.Z. | \n", " 1888053 | \n", " 1101155183 | \n", " 2.756810 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" year month day utc_time mag lat lon depth_km \\\n", "0 2009 7 15 09:22:31 7.8 -45.8339 166.6363 20.9 \n", "1 2016 11 13 11:02:59 7.8 -42.7245 173.0647 22.0 \n", "2 2003 8 21 12:12:47 7.2 -45.0875 167.0892 6.8 \n", "3 2001 8 21 06:52:06 7.1 -36.8010 -179.7230 33.5 \n", "4 2004 11 22 20:26:23 7.1 -46.4964 164.8273 7.6 \n", "\n", " region iris_id timestamp depth_sqrt \n", "0 OFF W. COAST OF S. ISLAND, N.Z. 2871698 1247649751 4.571652 \n", "1 SOUTH ISLAND, NEW ZEALAND 5197722 1479034979 4.690416 \n", "2 SOUTH ISLAND, NEW ZEALAND 1628007 1061467967 2.607681 \n", "3 EAST OF NORTH ISLAND, N.Z. 1169374 998376726 5.787918 \n", "4 OFF W. COAST OF S. ISLAND, N.Z. 1888053 1101155183 2.756810 "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["nz_eqs_merged = pd.merge(nz_eqs, nz_eqs2, how=\"left\",\n", " on=[\"lat\", \"lon\", \"depth_km\"])\n", "\n", "nz_eqs_merged.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["(Combining)=\n", "## Combining\n", "``` {index} Pandas: combining\n", "```\n", "If we want to join the same data from two tables, we can use the concat() function. Let's split the original data into two DataFrames and try to combine them back together:"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Shapes of two DataFrames: (15000, 11) (13000, 11)\n", "Original DataFrame shape: (25000, 11)\n", "Concatenated DataFrame with duplicates shape: (28000, 11)\n", "Concatenated DataFrame without duplicates shape: (25000, 11)\n"]}], "source": ["# Extract two overlapping DataFrames\n", "nz_eqs3 = nz_eqs.iloc[:15000,:]\n", "nz_eqs4 = nz_eqs.iloc[12000:,:]\n", "\n", "print(\"Shapes of two DataFrames:\", nz_eqs3.shape, nz_eqs4.shape)\n", "\n", "# Reset the index from original DataFrame\n", "nz_eqs4.reset_index(drop=True, inplace=True)\n", "\n", "# Concatenate DataFrames\n", "# Use ignore_index to create a new index\n", "nz_eqs_concat = pd.concat([nz_eqs3, nz_eqs4], ignore_index=True)\n", "\n", "# Drop duplicates in the new DataFrame\n", "nz_eqs_concat_unique = nz_eqs_concat.drop_duplicates()\n", "\n", "print(\"Original DataFrame shape:\", nz_eqs.shape)\n", "print(\"Concatenated DataFrame with duplicates shape:\",\n", " nz_eqs_concat.shape)\n", "print(\"Concatenated DataFrame without duplicates shape:\",\n", " nz_eqs_concat_unique.shape)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["(Grouping)=\n", "## Grouping\n", "``` {index} Pandas: grouping\n", "```\n", "If we wanted to count how many times specific regions in New Zealand were hit by an earthquake, we can use groupby() function and count():"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["region\n", "CHATHAM ISLANDS, N.Z. REGION 4\n", "COOK STRAIT, NEW ZEALAND 1381\n", "EAST OF NORTH ISLAND, N.Z. 1676\n", "NORTH ISLAND, NEW ZEALAND 6790\n", "NORTH OF NEW ZEALAND 49\n", "NORTHWEST OF NEW ZEALAND 6\n", "OFF E. COAST OF N. ISLAND, N.Z. 7741\n", "OFF E. COAST OF S. ISLAND, N.Z. 182\n", "OFF W. COAST OF N. ISLAND, N.Z. 229\n", "OFF W. COAST OF S. ISLAND, N.Z. 1768\n", "SOUTH ISLAND, NEW ZEALAND 3119\n", "SOUTH OF CHATHAM ISLANDS 1\n", "SOUTH OF KERMADEC ISLANDS 2053\n", "SOUTHEAST OF AUSTRALIA 1\n", "Name: region, dtype: int64"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["nz_eqs.groupby('region').region.count()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This analysis is also equivalent to pandas built-in value_counts() function:"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["OFF E. COAST OF N. ISLAND, N.Z. 7741\n", "NORTH ISLAND, NEW ZEALAND 6790\n", "SOUTH ISLAND, NEW ZEALAND 3119\n", "SOUTH OF KERMADEC ISLANDS 2053\n", "OFF W. COAST OF S. ISLAND, N.Z. 1768\n", "EAST OF NORTH ISLAND, N.Z. 1676\n", "COOK STRAIT, NEW ZEALAND 1381\n", "OFF W. COAST OF N. ISLAND, N.Z. 229\n", "OFF E. COAST OF S. ISLAND, N.Z. 182\n", "NORTH OF NEW ZEALAND 49\n", "NORTHWEST OF NEW ZEALAND 6\n", "CHATHAM ISLANDS, N.Z. REGION 4\n", "SOUTH OF CHATHAM ISLANDS 1\n", "SOUTHEAST OF AUSTRALIA 1\n", "Name: region, dtype: int64"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["nz_eqs.region.value_counts()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We can also groupby two columns, e.g. region and year based on count, maximum and minimum depth in that year and region:"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " | \n", " len | \n", " min | \n", " max | \n", "
\n", " \n", " region | \n", " year | \n", " | \n", " | \n", " | \n", "
\n", " \n", " \n", " \n", " CHATHAM ISLANDS, N.Z. REGION | \n", " 1992 | \n", " 1.0 | \n", " 33.0 | \n", " 33.0 | \n", "
\n", " \n", " 1993 | \n", " 1.0 | \n", " 33.0 | \n", " 33.0 | \n", "
\n", " \n", " 2003 | \n", " 2.0 | \n", " 33.0 | \n", " 33.0 | \n", "
\n", " \n", " COOK STRAIT, NEW ZEALAND | \n", " 1971 | \n", " 22.0 | \n", " 12.0 | \n", " 205.5 | \n", "
\n", " \n", " 1972 | \n", " 30.0 | \n", " 10.8 | \n", " 220.2 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" len min max\n", "region year \n", "CHATHAM ISLANDS, N.Z. REGION 1992 1.0 33.0 33.0\n", " 1993 1.0 33.0 33.0\n", " 2003 2.0 33.0 33.0\n", "COOK STRAIT, NEW ZEALAND 1971 22.0 12.0 205.5\n", " 1972 30.0 10.8 220.2"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["nz_eqs.groupby(['region', 'year']).depth_km.agg([len, min, max]).head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["(Sorting)=\n", "## Sorting\n", "``` {index} Pandas: sorting\n", "```\n", "Groupby() function returns the values in the index order. Suppose we would like to know count of earthquakes in each year in descending order. We can use function sort_values() for that:"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " len | \n", " min | \n", " max | \n", "
\n", " \n", " year | \n", " | \n", " | \n", " | \n", "
\n", " \n", " \n", " \n", " 1995 | \n", " 1631.0 | \n", " 0.0 | \n", " 403.5 | \n", "
\n", " \n", " 2001 | \n", " 941.0 | \n", " 0.0 | \n", " 430.7 | \n", "
\n", " \n", " 1985 | \n", " 892.0 | \n", " 0.0 | \n", " 362.4 | \n", "
\n", " \n", " 1997 | \n", " 794.0 | \n", " 0.0 | \n", " 600.0 | \n", "
\n", " \n", " 1984 | \n", " 724.0 | \n", " 0.0 | \n", " 448.6 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" len min max\n", "year \n", "1995 1631.0 0.0 403.5\n", "2001 941.0 0.0 430.7\n", "1985 892.0 0.0 362.4\n", "1997 794.0 0.0 600.0\n", "1984 724.0 0.0 448.6"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["nz_eqs_years = nz_eqs.groupby(['year']).depth_km.agg([len, min, max])\n", "nz_eqs_years_sorted = nz_eqs_years.sort_values(by=\"len\", ascending=False)\n", "nz_eqs_years_sorted.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["# References\n", "The notebook was compiled based on:\n", "* [Pandas official Getting Started tutorials](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)\n", "* [Kaggle tutorial](https://www.kaggle.com/learn/pandas)"]}], "metadata": {"celltoolbar": "Tags", "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8"}}, "nbformat": 4, "nbformat_minor": 2}