Merging, combining, grouping and sorting#
Merging#
Let’s look at New Zealand earthquake data:
import pandas as pd
import numpy as np
nz_eqs = pd.read_csv("../../geosciences/data/nz_largest_eq_since_1970.csv")
nz_eqs.head(4)
 | year | month | day | utc_time | mag | lat | lon | depth_km | region | iris_id | timestamp
---|---|---|---|---|---|---|---|---|---|---|---
0 | 2009 | 7 | 15 | 09:22:31 | 7.8 | -45.8339 | 166.6363 | 20.9 | OFF W. COAST OF S. ISLAND, N.Z. | 2871698 | 1247649751
1 | 2016 | 11 | 13 | 11:02:59 | 7.8 | -42.7245 | 173.0647 | 22.0 | SOUTH ISLAND, NEW ZEALAND | 5197722 | 1479034979
2 | 2003 | 8 | 21 | 12:12:47 | 7.2 | -45.0875 | 167.0892 | 6.8 | SOUTH ISLAND, NEW ZEALAND | 1628007 | 1061467967
3 | 2001 | 8 | 21 | 06:52:06 | 7.1 | -36.8010 | -179.7230 | 33.5 | EAST OF NORTH ISLAND, N.Z. | 1169374 | 998376726
Let’s create a second DataFrame that has only the latitude, longitude and depth (km) columns, and add an extra column with the square root of the depth:
# Select the lat, lon and depth_km columns (positions 5 to 7)
nz_eqs2 = nz_eqs.iloc[:, 5:8].copy()
nz_eqs2["depth_sqrt"] = np.sqrt(nz_eqs2.depth_km)
nz_eqs2.head()
 | lat | lon | depth_km | depth_sqrt
---|---|---|---|---
0 | -45.8339 | 166.6363 | 20.9 | 4.571652
1 | -42.7245 | 173.0647 | 22.0 | 4.690416
2 | -45.0875 | 167.0892 | 6.8 | 2.607681
3 | -36.8010 | -179.7230 | 33.5 | 5.787918
4 | -46.4964 | 164.8273 | 7.6 | 2.756810
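An equivalent, label-based way to build the same DataFrame is sketched below (the variable name nz_eqs2_alt is illustrative); selecting columns by name avoids relying on column positions:

# Select columns by name instead of position; .copy() avoids chained-assignment warnings
nz_eqs2_alt = nz_eqs[["lat", "lon", "depth_km"]].copy()
nz_eqs2_alt["depth_sqrt"] = np.sqrt(nz_eqs2_alt["depth_km"])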
Now, we would like to join the original nz_eqs and nz_eqs2 DataFrames by their common columns. We can use the merge() function, which takes two DataFrames as arguments and lets us decide which columns to merge on. In our case, nz_eqs and nz_eqs2 share three columns: lat, lon and depth_km. If we merged on latitude and longitude only, the final DataFrame would keep the depth_km columns from both DataFrames, renamed depth_km_x and depth_km_y. To merge on all three common columns, we can use:
nz_eqs_merged = pd.merge(nz_eqs, nz_eqs2, how="left",
on=["lat", "lon", "depth_km"])
nz_eqs_merged.head()
 | year | month | day | utc_time | mag | lat | lon | depth_km | region | iris_id | timestamp | depth_sqrt
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2009 | 7 | 15 | 09:22:31 | 7.8 | -45.8339 | 166.6363 | 20.9 | OFF W. COAST OF S. ISLAND, N.Z. | 2871698 | 1247649751 | 4.571652
1 | 2016 | 11 | 13 | 11:02:59 | 7.8 | -42.7245 | 173.0647 | 22.0 | SOUTH ISLAND, NEW ZEALAND | 5197722 | 1479034979 | 4.690416
2 | 2003 | 8 | 21 | 12:12:47 | 7.2 | -45.0875 | 167.0892 | 6.8 | SOUTH ISLAND, NEW ZEALAND | 1628007 | 1061467967 | 2.607681
3 | 2001 | 8 | 21 | 06:52:06 | 7.1 | -36.8010 | -179.7230 | 33.5 | EAST OF NORTH ISLAND, N.Z. | 1169374 | 998376726 | 5.787918
4 | 2004 | 11 | 22 | 20:26:23 | 7.1 | -46.4964 | 164.8273 | 7.6 | OFF W. COAST OF S. ISLAND, N.Z. | 1888053 | 1101155183 | 2.756810
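To illustrate the depth_km_x/depth_km_y behaviour mentioned above, here is a small sketch (not part of the original notebook) that merges on latitude and longitude only:

# Merging on lat/lon only keeps both depth columns, suffixed _x and _y by default
nz_eqs_merged_latlon = pd.merge(nz_eqs, nz_eqs2, how="left", on=["lat", "lon"])
print([c for c in nz_eqs_merged_latlon.columns if c.startswith("depth_km")])
# Expected: ['depth_km_x', 'depth_km_y']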
Combining#
If we want to stack DataFrames that share the same columns, we can use the concat() function. Let’s split the original data into two overlapping DataFrames and combine them back together:
# Extract two overlapping DataFrames
nz_eqs3 = nz_eqs.iloc[:15000,:]
nz_eqs4 = nz_eqs.iloc[12000:,:]
print("Shapes of two DataFrames:", nz_eqs3.shape, nz_eqs4.shape)
# Reset the index of the second DataFrame so it starts from 0
nz_eqs4 = nz_eqs4.reset_index(drop=True)
# Concatenate DataFrames
# Use ignore_index to create a new index
nz_eqs_concat = pd.concat([nz_eqs3, nz_eqs4], ignore_index=True)
# Drop duplicates in the new DataFrame
nz_eqs_concat_unique = nz_eqs_concat.drop_duplicates()
print("Original DataFrame shape:", nz_eqs.shape)
print("Concatenated DataFrame with duplicates shape:",
nz_eqs_concat.shape)
print("Concatenated DataFrame without duplicates shape:",
nz_eqs_concat_unique.shape)
Shapes of two DataFrames: (15000, 11) (13000, 11)
Original DataFrame shape: (25000, 11)
Concatenated DataFrame with duplicates shape: (28000, 11)
Concatenated DataFrame without duplicates shape: (25000, 11)
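As a quick check, a sketch assuming nz_eqs has the default RangeIndex and contains no fully duplicated rows, we can confirm that dropping the duplicates recovers the original data:

# Compare the deduplicated result with the original, ignoring the index labels
print(nz_eqs_concat_unique.reset_index(drop=True).equals(nz_eqs.reset_index(drop=True)))
# This should print True under the assumptions above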
Grouping#
If we want to count how many times each region in New Zealand was hit by an earthquake, we can use the groupby() function together with count():
nz_eqs.groupby('region').region.count()
region
CHATHAM ISLANDS, N.Z. REGION 4
COOK STRAIT, NEW ZEALAND 1381
EAST OF NORTH ISLAND, N.Z. 1676
NORTH ISLAND, NEW ZEALAND 6790
NORTH OF NEW ZEALAND 49
NORTHWEST OF NEW ZEALAND 6
OFF E. COAST OF N. ISLAND, N.Z. 7741
OFF E. COAST OF S. ISLAND, N.Z. 182
OFF W. COAST OF N. ISLAND, N.Z. 229
OFF W. COAST OF S. ISLAND, N.Z. 1768
SOUTH ISLAND, NEW ZEALAND 3119
SOUTH OF CHATHAM ISLANDS 1
SOUTH OF KERMADEC ISLANDS 2053
SOUTHEAST OF AUSTRALIA 1
Name: region, dtype: int64
This is equivalent to using the pandas built-in value_counts() function:
nz_eqs.region.value_counts()
OFF E. COAST OF N. ISLAND, N.Z. 7741
NORTH ISLAND, NEW ZEALAND 6790
SOUTH ISLAND, NEW ZEALAND 3119
SOUTH OF KERMADEC ISLANDS 2053
OFF W. COAST OF S. ISLAND, N.Z. 1768
EAST OF NORTH ISLAND, N.Z. 1676
COOK STRAIT, NEW ZEALAND 1381
OFF W. COAST OF N. ISLAND, N.Z. 229
OFF E. COAST OF S. ISLAND, N.Z. 182
NORTH OF NEW ZEALAND 49
NORTHWEST OF NEW ZEALAND 6
CHATHAM ISLANDS, N.Z. REGION 4
SOUTH OF CHATHAM ISLANDS 1
SOUTHEAST OF AUSTRALIA 1
Name: region, dtype: int64
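The same counts can also be obtained with groupby(...).size(); sorting them in descending order reproduces the value_counts() ordering, as in this short sketch:

# size() counts rows per group; sorting descending matches value_counts()
nz_eqs.groupby("region").size().sort_values(ascending=False)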
We can also group by two columns, e.g. region and year, and aggregate the depth column to get the count, minimum and maximum depth for each region and year:
nz_eqs.groupby(['region', 'year']).depth_km.agg([len, min, max]).head()
region | year | len | min | max
---|---|---|---|---
CHATHAM ISLANDS, N.Z. REGION | 1992 | 1.0 | 33.0 | 33.0
CHATHAM ISLANDS, N.Z. REGION | 1993 | 1.0 | 33.0 | 33.0
CHATHAM ISLANDS, N.Z. REGION | 2003 | 2.0 | 33.0 | 33.0
COOK STRAIT, NEW ZEALAND | 1971 | 22.0 | 12.0 | 205.5
COOK STRAIT, NEW ZEALAND | 1972 | 30.0 | 10.8 | 220.2
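Note that newer pandas versions warn when Python built-ins such as len, min and max are passed to agg(); the same aggregation can be written with string names, as sketched below (the first column is then called count instead of len):

# String aggregation names avoid passing Python built-ins to agg()
nz_eqs.groupby(['region', 'year']).depth_km.agg(['count', 'min', 'max']).head()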
Sorting#
The groupby() function returns values in index order. Suppose we would like to know the count of earthquakes in each year, in descending order. We can use the sort_values() function for that:
nz_eqs_years = nz_eqs.groupby(['year']).depth_km.agg([len, min, max])
nz_eqs_years_sorted = nz_eqs_years.sort_values(by="len", ascending=False)
nz_eqs_years_sorted.head()
year | len | min | max
---|---|---|---
1995 | 1631.0 | 0.0 | 403.5
2001 | 941.0 | 0.0 | 430.7
1985 | 892.0 | 0.0 | 362.4
1997 | 794.0 | 0.0 | 600.0
1984 | 724.0 | 0.0 | 448.6
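sort_values() also accepts lists of columns, e.g. sorting by the count first and breaking ties by the maximum depth (a hypothetical ordering, chosen only to illustrate the API):

# Sort by count descending, then by maximum depth ascending to break ties
nz_eqs_years.sort_values(by=["len", "max"], ascending=[False, True]).head()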