Merging, combining, grouping and sorting#
Merging#
Let’s look at New Zealand earthquake data:
import pandas as pd
import numpy as np
nz_eqs = pd.read_csv("../../geosciences/data/nz_largest_eq_since_1970.csv")
nz_eqs.head(4)
 | year | month | day | utc_time | mag | lat | lon | depth_km | region | iris_id | timestamp
---|---|---|---|---|---|---|---|---|---|---|---
0 | 2009 | 7 | 15 | 09:22:31 | 7.8 | -45.8339 | 166.6363 | 20.9 | OFF W. COAST OF S. ISLAND, N.Z. | 2871698 | 1247649751
1 | 2016 | 11 | 13 | 11:02:59 | 7.8 | -42.7245 | 173.0647 | 22.0 | SOUTH ISLAND, NEW ZEALAND | 5197722 | 1479034979
2 | 2003 | 8 | 21 | 12:12:47 | 7.2 | -45.0875 | 167.0892 | 6.8 | SOUTH ISLAND, NEW ZEALAND | 1628007 | 1061467967
3 | 2001 | 8 | 21 | 06:52:06 | 7.1 | -36.8010 | -179.7230 | 33.5 | EAST OF NORTH ISLAND, N.Z. | 1169374 | 998376726
Let’s create a second DataFrame that has only the latitude, longitude and depth (km) columns, and add an extra column with the square root of the depth:
# Select the lat, lon and depth_km columns (positions 5 to 7)
nz_eqs2 = nz_eqs.iloc[:, 5:8].copy()
nz_eqs2["depth_sqrt"] = np.sqrt(nz_eqs2.depth_km)
nz_eqs2.head()
 | lat | lon | depth_km | depth_sqrt
---|---|---|---|---
0 | -45.8339 | 166.6363 | 20.9 | 4.571652
1 | -42.7245 | 173.0647 | 22.0 | 4.690416
2 | -45.0875 | 167.0892 | 6.8 | 2.607681
3 | -36.8010 | -179.7230 | 33.5 | 5.787918
4 | -46.4964 | 164.8273 | 7.6 | 2.756810
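An equivalent, label-based way to build the same DataFrame is sketched below (the variable name nz_eqs2_alt is illustrative); selecting columns by name avoids relying on column positions:

# Select columns by name instead of position; .copy() avoids chained-assignment warnings
nz_eqs2_alt = nz_eqs[["lat", "lon", "depth_km"]].copy()
nz_eqs2_alt["depth_sqrt"] = np.sqrt(nz_eqs2_alt["depth_km"])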
Now, we would like to join the original nz_eqs and nz_eqs2 DataFrames by their common columns. We can use the merge() function, which takes two DataFrames as arguments and lets us decide which columns to merge on. In our case, nz_eqs and nz_eqs2 share three columns: lat, lon and depth_km. If we merged on latitude and longitude only, the final DataFrame would keep the depth_km columns from both DataFrames, renamed depth_km_x and depth_km_y. To merge on all three common columns, we can use:
nz_eqs_merged = pd.merge(nz_eqs, nz_eqs2, how="left",
on=["lat", "lon", "depth_km"])
nz_eqs_merged.head()
 | year | month | day | utc_time | mag | lat | lon | depth_km | region | iris_id | timestamp | depth_sqrt
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2009 | 7 | 15 | 09:22:31 | 7.8 | -45.8339 | 166.6363 | 20.9 | OFF W. COAST OF S. ISLAND, N.Z. | 2871698 | 1247649751 | 4.571652
1 | 2016 | 11 | 13 | 11:02:59 | 7.8 | -42.7245 | 173.0647 | 22.0 | SOUTH ISLAND, NEW ZEALAND | 5197722 | 1479034979 | 4.690416
2 | 2003 | 8 | 21 | 12:12:47 | 7.2 | -45.0875 | 167.0892 | 6.8 | SOUTH ISLAND, NEW ZEALAND | 1628007 | 1061467967 | 2.607681
3 | 2001 | 8 | 21 | 06:52:06 | 7.1 | -36.8010 | -179.7230 | 33.5 | EAST OF NORTH ISLAND, N.Z. | 1169374 | 998376726 | 5.787918
4 | 2004 | 11 | 22 | 20:26:23 | 7.1 | -46.4964 | 164.8273 | 7.6 | OFF W. COAST OF S. ISLAND, N.Z. | 1888053 | 1101155183 | 2.756810
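To illustrate the depth_km_x/depth_km_y behaviour mentioned above, here is a small sketch (not part of the original notebook) that merges on latitude and longitude only:

# Merging on lat/lon only keeps both depth columns, suffixed _x and _y by default
nz_eqs_merged_latlon = pd.merge(nz_eqs, nz_eqs2, how="left", on=["lat", "lon"])
print([c for c in nz_eqs_merged_latlon.columns if c.startswith("depth_km")])
# Expected: ['depth_km_x', 'depth_km_y']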
Combining#
If we want to stack DataFrames that share the same columns, we can use the concat() function. Let’s split the original data into two overlapping DataFrames and combine them back together:
# Extract two overlapping DataFrames
nz_eqs3 = nz_eqs.iloc[:15000,:]
nz_eqs4 = nz_eqs.iloc[12000:,:]
print("Shapes of two DataFrames:", nz_eqs3.shape, nz_eqs4.shape)
# Reset the index of the second DataFrame so it starts from 0
nz_eqs4 = nz_eqs4.reset_index(drop=True)
# Concatenate DataFrames
# Use ignore_index to create a new index
nz_eqs_concat = pd.concat([nz_eqs3, nz_eqs4], ignore_index=True)
# Drop duplicates in the new DataFrame
nz_eqs_concat_unique = nz_eqs_concat.drop_duplicates()
print("Original DataFrame shape:", nz_eqs.shape)
print("Concatenated DataFrame with duplicates shape:",
nz_eqs_concat.shape)
print("Concatenated DataFrame without duplicates shape:",
nz_eqs_concat_unique.shape)
Shapes of two DataFrames: (15000, 11) (13000, 11)
Original DataFrame shape: (25000, 11)
Concatenated DataFrame with duplicates shape: (28000, 11)
Concatenated DataFrame without duplicates shape: (25000, 11)
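As a quick check, a sketch assuming nz_eqs has the default RangeIndex and contains no fully duplicated rows, we can confirm that dropping the duplicates recovers the original data:

# Compare the deduplicated result with the original, ignoring the index labels
print(nz_eqs_concat_unique.reset_index(drop=True).equals(nz_eqs.reset_index(drop=True)))
# This should print True under the assumptions above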
Grouping#
If we want to count how many times each region in New Zealand was hit by an earthquake, we can use the groupby() function together with count():
nz_eqs.groupby('region').region.count()
region
CHATHAM ISLANDS, N.Z. REGION 4
COOK STRAIT, NEW ZEALAND 1381
EAST OF NORTH ISLAND, N.Z. 1676
NORTH ISLAND, NEW ZEALAND 6790
NORTH OF NEW ZEALAND 49
NORTHWEST OF NEW ZEALAND 6
OFF E. COAST OF N. ISLAND, N.Z. 7741
OFF E. COAST OF S. ISLAND, N.Z. 182
OFF W. COAST OF N. ISLAND, N.Z. 229
OFF W. COAST OF S. ISLAND, N.Z. 1768
SOUTH ISLAND, NEW ZEALAND 3119
SOUTH OF CHATHAM ISLANDS 1
SOUTH OF KERMADEC ISLANDS 2053
SOUTHEAST OF AUSTRALIA 1
Name: region, dtype: int64
This is equivalent to using the pandas built-in value_counts() function:
nz_eqs.region.value_counts()
OFF E. COAST OF N. ISLAND, N.Z. 7741
NORTH ISLAND, NEW ZEALAND 6790
SOUTH ISLAND, NEW ZEALAND 3119
SOUTH OF KERMADEC ISLANDS 2053
OFF W. COAST OF S. ISLAND, N.Z. 1768
EAST OF NORTH ISLAND, N.Z. 1676
COOK STRAIT, NEW ZEALAND 1381
OFF W. COAST OF N. ISLAND, N.Z. 229
OFF E. COAST OF S. ISLAND, N.Z. 182
NORTH OF NEW ZEALAND 49
NORTHWEST OF NEW ZEALAND 6
CHATHAM ISLANDS, N.Z. REGION 4
SOUTH OF CHATHAM ISLANDS 1
SOUTHEAST OF AUSTRALIA 1
Name: region, dtype: int64
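The same counts can also be obtained with groupby(...).size(); sorting them in descending order reproduces the value_counts() ordering, as in this short sketch:

# size() counts rows per group; sorting descending matches value_counts()
nz_eqs.groupby("region").size().sort_values(ascending=False)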
We can also group by two columns, e.g. region and year, and aggregate the depth column to get the count, minimum and maximum depth for each region and year:
nz_eqs.groupby(['region', 'year']).depth_km.agg([len, min, max]).head()
region | year | len | min | max
---|---|---|---|---
CHATHAM ISLANDS, N.Z. REGION | 1992 | 1.0 | 33.0 | 33.0
CHATHAM ISLANDS, N.Z. REGION | 1993 | 1.0 | 33.0 | 33.0
CHATHAM ISLANDS, N.Z. REGION | 2003 | 2.0 | 33.0 | 33.0
COOK STRAIT, NEW ZEALAND | 1971 | 22.0 | 12.0 | 205.5
COOK STRAIT, NEW ZEALAND | 1972 | 30.0 | 10.8 | 220.2
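Note that newer pandas versions warn when Python built-ins such as len, min and max are passed to agg(); the same aggregation can be written with string names, as sketched below (the first column is then called count instead of len):

# String aggregation names avoid passing Python built-ins to agg()
nz_eqs.groupby(['region', 'year']).depth_km.agg(['count', 'min', 'max']).head()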
Sorting#
The groupby() function returns values in index order. Suppose we would like to know the count of earthquakes in each year, in descending order. We can use the sort_values() function for that:
nz_eqs_years = nz_eqs.groupby(['year']).depth_km.agg([len, min, max])
nz_eqs_years_sorted = nz_eqs_years.sort_values(by="len", ascending=False)
nz_eqs_years_sorted.head()
year | len | min | max
---|---|---|---
1995 | 1631.0 | 0.0 | 403.5
2001 | 941.0 | 0.0 | 430.7
1985 | 892.0 | 0.0 | 362.4
1997 | 794.0 | 0.0 | 600.0
1984 | 724.0 | 0.0 | 448.6
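sort_values() also accepts lists of columns, e.g. sorting by the count first and breaking ties by the maximum depth (a hypothetical ordering, chosen only to illustrate the API):

# Sort by count descending, then by maximum depth ascending to break ties
nz_eqs_years.sort_values(by=["len", "max"], ascending=[False, True]).head()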