Filtering, selecting and assigning#

Programming for Geoscientists Data Science and Machine Learning for Geoscientists

Let’s look at New Zealand earthquake data:

import pandas as pd
nz_eqs = pd.read_csv("../../geosciences/data/nz_largest_eq_since_1970.csv")
nz_eqs.head(4)

	year	month	day	utc_time	mag	lat	lon	depth_km	region	iris_id	timestamp
0	2009	7	15	09:22:31	7.8	-45.8339	166.6363	20.9	OFF W. COAST OF S. ISLAND, N.Z.	2871698	1247649751
1	2016	11	13	11:02:59	7.8	-42.7245	173.0647	22.0	SOUTH ISLAND, NEW ZEALAND	5197722	1479034979
2	2003	8	21	12:12:47	7.2	-45.0875	167.0892	6.8	SOUTH ISLAND, NEW ZEALAND	1628007	1061467967
3	2001	8	21	06:52:06	7.1	-36.8010	-179.7230	33.5	EAST OF NORTH ISLAND, N.Z.	1169374	998376726

We would like to filter it, so that we only use earthquakes since 2000 and with magnitude higher than 5. To do that, we can include a conditional expression:

nz_eqs_2 = nz_eqs[(nz_eqs.year >= 2000) & (nz_eqs.mag >= 5.0)]
nz_eqs_2.shape

(544, 11)

This operation reduced our DataFrame to 544 rows from 25,000. The symbol & is used for AND condition, while symbol | can be used for OR condition.

If we want to pick specific rows and colums we can use syntax:

df.iloc[row_start:row_end:step, column_start:column_end:step]

where both row end and column end are not included.

nz_eqs.iloc[1:12:3, 6:9]

	lon	depth_km	region
1	173.0647	22.0	SOUTH ISLAND, NEW ZEALAND
4	164.8273	7.6	OFF W. COAST OF S. ISLAND, N.Z.
7	171.9023	4.0	SOUTH ISLAND, NEW ZEALAND
10	167.5146	2.0	SOUTH ISLAND, NEW ZEALAND

Pandas’ built-in function .isin([list of values we look for]) allows us to select data whose specified values are in the given column. This function returns True/False. Isin can be used with:

df.loc[df.col1.isin([value1, value2, ...])]

It will return a DataFrame filtered by the isin condition:

nz_eqs2 = nz_eqs.loc[nz_eqs.year.isin([2000])]
nz_eqs2.head()

	year	month	day	utc_time	mag	lat	lon	depth_km	region	iris_id	timestamp
46	2000	11	1	10:35:54	6.1	-45.009	167.123	12.2	SOUTH ISLAND, NEW ZEALAND	926219	973074954
93	2000	3	4	14:09:50	5.8	-34.381	179.146	33.0	SOUTH OF KERMADEC ISLANDS	880024	952178990
125	2000	7	1	04:48:56	5.7	-35.469	178.742	258.0	OFF E. COAST OF N. ISLAND, N.Z.	871393	962426936
174	2000	12	26	15:21:31	5.6	-35.117	179.700	73.0	OFF E. COAST OF N. ISLAND, N.Z.	1100574	977844091
230	2000	4	26	14:54:05	5.5	-33.660	-178.889	15.6	SOUTH OF KERMADEC ISLANDS	964992	956760845

Another built-in selectors are .isnull([list of values we look for]) and .notnull([list of values we look for]). These selectors look for either ‘NaN’ or not ‘NaN’ values.

If we want to create a new column, e.g. with depth in m, we can easily assign value to a new column:

nz_eqs["depth_m"] = nz_eqs["depth_km"]*1000
nz_eqs.head()

	year	month	day	utc_time	mag	lat	lon	depth_km	region	iris_id	timestamp	depth_m
0	2009	7	15	09:22:31	7.8	-45.8339	166.6363	20.9	OFF W. COAST OF S. ISLAND, N.Z.	2871698	1247649751	20900.0
1	2016	11	13	11:02:59	7.8	-42.7245	173.0647	22.0	SOUTH ISLAND, NEW ZEALAND	5197722	1479034979	22000.0
2	2003	8	21	12:12:47	7.2	-45.0875	167.0892	6.8	SOUTH ISLAND, NEW ZEALAND	1628007	1061467967	6800.0
3	2001	8	21	06:52:06	7.1	-36.8010	-179.7230	33.5	EAST OF NORTH ISLAND, N.Z.	1169374	998376726	33500.0
4	2004	11	22	20:26:23	7.1	-46.4964	164.8273	7.6	OFF W. COAST OF S. ISLAND, N.Z.	1888053	1101155183	7600.0

These calculations are done at once for each row. That means that there is no need to loop over elements in the DataFrame, which makes it very efficient!

References#

The notebook was compiled based on:

ESE Jupyter Material

Filtering, selecting and assigning

Contents

Filtering, selecting and assigning#

References#