Filtering, selecting and assigning#

Programming for Geoscientists Data Science and Machine Learning for Geoscientists

Let’s look at New Zealand earthquake data:

import pandas as pd
nz_eqs = pd.read_csv("../../geosciences/data/nz_largest_eq_since_1970.csv")
nz_eqs.head(4)
year month day utc_time mag lat lon depth_km region iris_id timestamp
0 2009 7 15 09:22:31 7.8 -45.8339 166.6363 20.9 OFF W. COAST OF S. ISLAND, N.Z. 2871698 1247649751
1 2016 11 13 11:02:59 7.8 -42.7245 173.0647 22.0 SOUTH ISLAND, NEW ZEALAND 5197722 1479034979
2 2003 8 21 12:12:47 7.2 -45.0875 167.0892 6.8 SOUTH ISLAND, NEW ZEALAND 1628007 1061467967
3 2001 8 21 06:52:06 7.1 -36.8010 -179.7230 33.5 EAST OF NORTH ISLAND, N.Z. 1169374 998376726

We would like to filter it, so that we only use earthquakes since 2000 and with magnitude higher than 5. To do that, we can include a conditional expression:

nz_eqs_2 = nz_eqs[(nz_eqs.year >= 2000) & (nz_eqs.mag >= 5.0)]
nz_eqs_2.shape
(544, 11)

This operation reduced our DataFrame to 544 rows from 25,000. The symbol & is used for AND condition, while symbol | can be used for OR condition.

If we want to pick specific rows and colums we can use syntax:

df.iloc[row_start:row_end:step, column_start:column_end:step]

where both row end and column end are not included.

nz_eqs.iloc[1:12:3, 6:9]
lon depth_km region
1 173.0647 22.0 SOUTH ISLAND, NEW ZEALAND
4 164.8273 7.6 OFF W. COAST OF S. ISLAND, N.Z.
7 171.9023 4.0 SOUTH ISLAND, NEW ZEALAND
10 167.5146 2.0 SOUTH ISLAND, NEW ZEALAND

Pandas’ built-in function .isin([list of values we look for]) allows us to select data whose specified values are in the given column. This function returns True/False. Isin can be used with:

df.loc[df.col1.isin([value1, value2, ...])]

It will return a DataFrame filtered by the isin condition:

nz_eqs2 = nz_eqs.loc[nz_eqs.year.isin([2000])]
nz_eqs2.head()
year month day utc_time mag lat lon depth_km region iris_id timestamp
46 2000 11 1 10:35:54 6.1 -45.009 167.123 12.2 SOUTH ISLAND, NEW ZEALAND 926219 973074954
93 2000 3 4 14:09:50 5.8 -34.381 179.146 33.0 SOUTH OF KERMADEC ISLANDS 880024 952178990
125 2000 7 1 04:48:56 5.7 -35.469 178.742 258.0 OFF E. COAST OF N. ISLAND, N.Z. 871393 962426936
174 2000 12 26 15:21:31 5.6 -35.117 179.700 73.0 OFF E. COAST OF N. ISLAND, N.Z. 1100574 977844091
230 2000 4 26 14:54:05 5.5 -33.660 -178.889 15.6 SOUTH OF KERMADEC ISLANDS 964992 956760845

Another built-in selectors are .isnull([list of values we look for]) and .notnull([list of values we look for]). These selectors look for either ‘NaN’ or not ‘NaN’ values.

If we want to create a new column, e.g. with depth in m, we can easily assign value to a new column:

nz_eqs["depth_m"] = nz_eqs["depth_km"]*1000
nz_eqs.head()
year month day utc_time mag lat lon depth_km region iris_id timestamp depth_m
0 2009 7 15 09:22:31 7.8 -45.8339 166.6363 20.9 OFF W. COAST OF S. ISLAND, N.Z. 2871698 1247649751 20900.0
1 2016 11 13 11:02:59 7.8 -42.7245 173.0647 22.0 SOUTH ISLAND, NEW ZEALAND 5197722 1479034979 22000.0
2 2003 8 21 12:12:47 7.2 -45.0875 167.0892 6.8 SOUTH ISLAND, NEW ZEALAND 1628007 1061467967 6800.0
3 2001 8 21 06:52:06 7.1 -36.8010 -179.7230 33.5 EAST OF NORTH ISLAND, N.Z. 1169374 998376726 33500.0
4 2004 11 22 20:26:23 7.1 -46.4964 164.8273 7.6 OFF W. COAST OF S. ISLAND, N.Z. 1888053 1101155183 7600.0

These calculations are done at once for each row. That means that there is no need to loop over elements in the DataFrame, which makes it very efficient!

References#

The notebook was compiled based on: