Merging, combining, grouping and sorting#

Programming for Geoscientists — Data Science and Machine Learning for Geoscientists

Merging#

Let’s look at New Zealand earthquake data:

import pandas as pd
import numpy as np

nz_eqs = pd.read_csv("../../geosciences/data/nz_largest_eq_since_1970.csv")
nz_eqs.head(4)
year month day utc_time mag lat lon depth_km region iris_id timestamp
0 2009 7 15 09:22:31 7.8 -45.8339 166.6363 20.9 OFF W. COAST OF S. ISLAND, N.Z. 2871698 1247649751
1 2016 11 13 11:02:59 7.8 -42.7245 173.0647 22.0 SOUTH ISLAND, NEW ZEALAND 5197722 1479034979
2 2003 8 21 12:12:47 7.2 -45.0875 167.0892 6.8 SOUTH ISLAND, NEW ZEALAND 1628007 1061467967
3 2001 8 21 06:52:06 7.1 -36.8010 -179.7230 33.5 EAST OF NORTH ISLAND, N.Z. 1169374 998376726

Let’s create a second DataFrame with the latitude, longitude and depth (km) columns, and add an extra column containing the square root of the depth:

nz_eqs2 = nz_eqs.iloc[:,5:8]
nz_eqs2["depth_sqrt"] = np.sqrt(nz_eqs2.depth_km)
nz_eqs2.head()
lat lon depth_km depth_sqrt
0 -45.8339 166.6363 20.9 4.571652
1 -42.7245 173.0647 22.0 4.690416
2 -45.0875 167.0892 6.8 2.607681
3 -36.8010 -179.7230 33.5 5.787918
4 -46.4964 164.8273 7.6 2.756810

Now, we would like to join the original ‘nz_eqs’ and ‘nz_eqs2’ DataFrames on their common columns. We can use the merge() function, which takes two DataFrames as arguments and lets us decide which columns to merge on. In our case, nz_eqs and nz_eqs2 share three data columns: lat, lon and depth_km. If we chose to merge on latitude and longitude only, the final DataFrame would keep the depth_km column from both DataFrames, renamed depth_km_x and depth_km_y. To merge the DataFrames on all three common columns, we can use:

nz_eqs_merged = pd.merge(nz_eqs, nz_eqs2, how="left",
                         on=["lat", "lon", "depth_km"])

nz_eqs_merged.head()
year month day utc_time mag lat lon depth_km region iris_id timestamp depth_sqrt
0 2009 7 15 09:22:31 7.8 -45.8339 166.6363 20.9 OFF W. COAST OF S. ISLAND, N.Z. 2871698 1247649751 4.571652
1 2016 11 13 11:02:59 7.8 -42.7245 173.0647 22.0 SOUTH ISLAND, NEW ZEALAND 5197722 1479034979 4.690416
2 2003 8 21 12:12:47 7.2 -45.0875 167.0892 6.8 SOUTH ISLAND, NEW ZEALAND 1628007 1061467967 2.607681
3 2001 8 21 06:52:06 7.1 -36.8010 -179.7230 33.5 EAST OF NORTH ISLAND, N.Z. 1169374 998376726 5.787918
4 2004 11 22 20:26:23 7.1 -46.4964 164.8273 7.6 OFF W. COAST OF S. ISLAND, N.Z. 1888053 1101155183 2.756810
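To see the suffixing behaviour mentioned above in isolation, here is a minimal, self-contained sketch with two tiny synthetic DataFrames (the numbers are made up, not taken from the earthquake dataset). When the merge keys do not cover all shared columns, the leftover shared column appears twice, suffixed _x and _y:

```python
import pandas as pd

# Two frames sharing lat, lon and depth_km (synthetic values)
left = pd.DataFrame({"lat": [-45.8, -42.7], "lon": [166.6, 173.1],
                     "depth_km": [20.9, 22.0]})
right = pd.DataFrame({"lat": [-45.8, -42.7], "lon": [166.6, 173.1],
                      "depth_km": [20.9, 22.0]})

# Merge on lat/lon only: depth_km is NOT a key, so pandas keeps
# both copies and disambiguates them with the _x/_y suffixes
merged = pd.merge(left, right, how="left", on=["lat", "lon"])
print(list(merged.columns))
# → ['lat', 'lon', 'depth_km_x', 'depth_km_y']
```

The suffixes can be customised with merge()’s `suffixes` argument if _x/_y are not descriptive enough.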

Combining#

If we want to stack data with the same columns from two tables, we can use the concat() function. Let’s split the original data into two overlapping DataFrames and combine them back together:

# Extract two overlapping DataFrames
nz_eqs3 = nz_eqs.iloc[:15000,:]
nz_eqs4 = nz_eqs.iloc[12000:,:]

print("Shapes of two DataFrames:", nz_eqs3.shape, nz_eqs4.shape)

# Reset the index carried over from the original DataFrame
# (reassigning avoids the SettingWithCopyWarning that inplace
# modification of a slice can trigger)
nz_eqs4 = nz_eqs4.reset_index(drop=True)

# Concatenate DataFrames
# Use ignore_index to create a new index
nz_eqs_concat = pd.concat([nz_eqs3, nz_eqs4], ignore_index=True)

# Drop duplicates in the new DataFrame
nz_eqs_concat_unique = nz_eqs_concat.drop_duplicates()

print("Original DataFrame shape:", nz_eqs.shape)
print("Concatenated DataFrame with duplicates shape:",
      nz_eqs_concat.shape)
print("Concatenated DataFrame without duplicates shape:",
      nz_eqs_concat_unique.shape)
Shapes of two DataFrames: (15000, 11) (13000, 11)
Original DataFrame shape: (25000, 11)
Concatenated DataFrame with duplicates shape: (28000, 11)
Concatenated DataFrame without duplicates shape: (25000, 11)
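The same split/concat/deduplicate round trip can be checked end to end on a small synthetic frame (a stand-in for nz_eqs, not the real data): dropping the duplicated overlap rows recovers exactly the original DataFrame.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for nz_eqs
df = pd.DataFrame({"id": range(10),
                   "mag": [5.0 + i / 10 for i in range(10)]})

part_a = df.iloc[:7]   # rows 0-6
part_b = df.iloc[5:]   # rows 5-9, overlapping part_a on rows 5-6

# Concatenate with a fresh index, then drop the duplicated overlap rows
combined = pd.concat([part_a, part_b], ignore_index=True)
recovered = combined.drop_duplicates().reset_index(drop=True)

print(combined.shape)            # (12, 2) — two duplicated rows
print(recovered.equals(df))      # True — original frame recovered
```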

Grouping#

If we want to count how many times each region in New Zealand was hit by an earthquake, we can use the groupby() function together with count():

nz_eqs.groupby('region').region.count()
region
CHATHAM ISLANDS, N.Z. REGION          4
COOK STRAIT, NEW ZEALAND           1381
EAST OF NORTH ISLAND, N.Z.         1676
NORTH ISLAND, NEW ZEALAND          6790
NORTH OF NEW ZEALAND                 49
NORTHWEST OF NEW ZEALAND              6
OFF E. COAST OF N. ISLAND, N.Z.    7741
OFF E. COAST OF S. ISLAND, N.Z.     182
OFF W. COAST OF N. ISLAND, N.Z.     229
OFF W. COAST OF S. ISLAND, N.Z.    1768
SOUTH ISLAND, NEW ZEALAND          3119
SOUTH OF CHATHAM ISLANDS              1
SOUTH OF KERMADEC ISLANDS          2053
SOUTHEAST OF AUSTRALIA                1
Name: region, dtype: int64

This analysis is equivalent to using the pandas built-in value_counts() function:

nz_eqs.region.value_counts()
OFF E. COAST OF N. ISLAND, N.Z.    7741
NORTH ISLAND, NEW ZEALAND          6790
SOUTH ISLAND, NEW ZEALAND          3119
SOUTH OF KERMADEC ISLANDS          2053
OFF W. COAST OF S. ISLAND, N.Z.    1768
EAST OF NORTH ISLAND, N.Z.         1676
COOK STRAIT, NEW ZEALAND           1381
OFF W. COAST OF N. ISLAND, N.Z.     229
OFF E. COAST OF S. ISLAND, N.Z.     182
NORTH OF NEW ZEALAND                 49
NORTHWEST OF NEW ZEALAND              6
CHATHAM ISLANDS, N.Z. REGION          4
SOUTH OF CHATHAM ISLANDS              1
SOUTHEAST OF AUSTRALIA                1
Name: region, dtype: int64
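value_counts() can also report proportions instead of raw counts via its `normalize` argument. A minimal sketch with a made-up region column (not the real dataset):

```python
import pandas as pd

# Four synthetic events across three regions
regions = pd.Series(["NORTH ISLAND", "NORTH ISLAND",
                     "SOUTH ISLAND", "COOK STRAIT"])

counts = regions.value_counts()                 # raw counts, descending
shares = regions.value_counts(normalize=True)   # fractions summing to 1

print(counts)
print(shares)   # NORTH ISLAND accounts for half of the events
```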

We can also group by two columns, e.g. region and year, and aggregate the depth column with count, minimum and maximum:

nz_eqs.groupby(['region', 'year']).depth_km.agg([len, min, max]).head()
len min max
region year
CHATHAM ISLANDS, N.Z. REGION 1992 1.0 33.0 33.0
1993 1.0 33.0 33.0
2003 2.0 33.0 33.0
COOK STRAIT, NEW ZEALAND 1971 22.0 12.0 205.5
1972 30.0 10.8 220.2
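As a side note, pandas also supports “named aggregation”, which lets us choose the output column names explicitly instead of the len/min/max labels produced by passing bare functions. A self-contained sketch with synthetic values (not the earthquake data):

```python
import pandas as pd

# Hypothetical events: region, year and depth
df = pd.DataFrame({
    "region":   ["A", "A", "B", "B", "B"],
    "year":     [2001, 2001, 2001, 2002, 2002],
    "depth_km": [10.0, 30.0, 5.0, 8.0, 12.0],
})

# Named aggregation: output_name=(input_column, aggregation)
summary = df.groupby(["region", "year"]).agg(
    n_quakes=("depth_km", "count"),
    min_depth=("depth_km", "min"),
    max_depth=("depth_km", "max"),
)
print(summary)
```

The string aggregations ("count", "min", "max") are also the form recent pandas versions recommend over passing Python built-ins.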

Sorting#

The groupby() function returns rows ordered by the index. Suppose we would like the count of earthquakes in each year in descending order. We can use the sort_values() function for that:

nz_eqs_years = nz_eqs.groupby(['year']).depth_km.agg([len, min, max])
nz_eqs_years_sorted = nz_eqs_years.sort_values(by="len", ascending=False)
nz_eqs_years_sorted.head()
len min max
year
1995 1631.0 0.0 403.5
2001 941.0 0.0 430.7
1985 892.0 0.0 362.4
1997 794.0 0.0 600.0
1984 724.0 0.0 448.6
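sort_values() also accepts a list of columns, with a matching list of per-column directions, so ties can be broken by a second criterion. A small sketch with invented yearly statistics:

```python
import pandas as pd

# Hypothetical per-year summary: event count and maximum depth
stats = pd.DataFrame(
    {"len": [3, 5, 5], "max": [100.0, 250.0, 400.0]},
    index=[1984, 1995, 1997],
)

# Most events first; years tied on count are ordered by deepest maximum
ranked = stats.sort_values(by=["len", "max"], ascending=[False, False])
print(ranked.index.tolist())
# → [1997, 1995, 1984]
```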
