Filtering, selecting and assigning
Contents
Filtering, selecting and assigning#
Programming for Geoscientists Data Science and Machine Learning for Geoscientists
Let’s look at New Zealand earthquake data:
import pandas as pd
nz_eqs = pd.read_csv("../../geosciences/data/nz_largest_eq_since_1970.csv")
nz_eqs.head(4)
year | month | day | utc_time | mag | lat | lon | depth_km | region | iris_id | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2009 | 7 | 15 | 09:22:31 | 7.8 | -45.8339 | 166.6363 | 20.9 | OFF W. COAST OF S. ISLAND, N.Z. | 2871698 | 1247649751 |
1 | 2016 | 11 | 13 | 11:02:59 | 7.8 | -42.7245 | 173.0647 | 22.0 | SOUTH ISLAND, NEW ZEALAND | 5197722 | 1479034979 |
2 | 2003 | 8 | 21 | 12:12:47 | 7.2 | -45.0875 | 167.0892 | 6.8 | SOUTH ISLAND, NEW ZEALAND | 1628007 | 1061467967 |
3 | 2001 | 8 | 21 | 06:52:06 | 7.1 | -36.8010 | -179.7230 | 33.5 | EAST OF NORTH ISLAND, N.Z. | 1169374 | 998376726 |
We would like to filter it, so that we only use earthquakes since 2000 and with magnitude higher than 5. To do that, we can include a conditional expression:
nz_eqs_2 = nz_eqs[(nz_eqs.year >= 2000) & (nz_eqs.mag >= 5.0)]
nz_eqs_2.shape
(544, 11)
This operation reduced our DataFrame to 544 rows from 25,000. The symbol & is used for AND condition, while symbol | can be used for OR condition.
If we want to pick specific rows and colums we can use syntax:
df.iloc[row_start:row_end:step, column_start:column_end:step]
where both row end and column end are not included.
nz_eqs.iloc[1:12:3, 6:9]
lon | depth_km | region | |
---|---|---|---|
1 | 173.0647 | 22.0 | SOUTH ISLAND, NEW ZEALAND |
4 | 164.8273 | 7.6 | OFF W. COAST OF S. ISLAND, N.Z. |
7 | 171.9023 | 4.0 | SOUTH ISLAND, NEW ZEALAND |
10 | 167.5146 | 2.0 | SOUTH ISLAND, NEW ZEALAND |
Pandas’ built-in function .isin([list of values we look for]) allows us to select data whose specified values are in the given column. This function returns True/False. Isin can be used with:
df.loc[df.col1.isin([value1, value2, ...])]
It will return a DataFrame filtered by the isin condition:
nz_eqs2 = nz_eqs.loc[nz_eqs.year.isin([2000])]
nz_eqs2.head()
year | month | day | utc_time | mag | lat | lon | depth_km | region | iris_id | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|
46 | 2000 | 11 | 1 | 10:35:54 | 6.1 | -45.009 | 167.123 | 12.2 | SOUTH ISLAND, NEW ZEALAND | 926219 | 973074954 |
93 | 2000 | 3 | 4 | 14:09:50 | 5.8 | -34.381 | 179.146 | 33.0 | SOUTH OF KERMADEC ISLANDS | 880024 | 952178990 |
125 | 2000 | 7 | 1 | 04:48:56 | 5.7 | -35.469 | 178.742 | 258.0 | OFF E. COAST OF N. ISLAND, N.Z. | 871393 | 962426936 |
174 | 2000 | 12 | 26 | 15:21:31 | 5.6 | -35.117 | 179.700 | 73.0 | OFF E. COAST OF N. ISLAND, N.Z. | 1100574 | 977844091 |
230 | 2000 | 4 | 26 | 14:54:05 | 5.5 | -33.660 | -178.889 | 15.6 | SOUTH OF KERMADEC ISLANDS | 964992 | 956760845 |
Another built-in selectors are .isnull([list of values we look for]) and .notnull([list of values we look for]). These selectors look for either ‘NaN’ or not ‘NaN’ values.
If we want to create a new column, e.g. with depth in m, we can easily assign value to a new column:
nz_eqs["depth_m"] = nz_eqs["depth_km"]*1000
nz_eqs.head()
year | month | day | utc_time | mag | lat | lon | depth_km | region | iris_id | timestamp | depth_m | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2009 | 7 | 15 | 09:22:31 | 7.8 | -45.8339 | 166.6363 | 20.9 | OFF W. COAST OF S. ISLAND, N.Z. | 2871698 | 1247649751 | 20900.0 |
1 | 2016 | 11 | 13 | 11:02:59 | 7.8 | -42.7245 | 173.0647 | 22.0 | SOUTH ISLAND, NEW ZEALAND | 5197722 | 1479034979 | 22000.0 |
2 | 2003 | 8 | 21 | 12:12:47 | 7.2 | -45.0875 | 167.0892 | 6.8 | SOUTH ISLAND, NEW ZEALAND | 1628007 | 1061467967 | 6800.0 |
3 | 2001 | 8 | 21 | 06:52:06 | 7.1 | -36.8010 | -179.7230 | 33.5 | EAST OF NORTH ISLAND, N.Z. | 1169374 | 998376726 | 33500.0 |
4 | 2004 | 11 | 22 | 20:26:23 | 7.1 | -46.4964 | 164.8273 | 7.6 | OFF W. COAST OF S. ISLAND, N.Z. | 1888053 | 1101155183 | 7600.0 |
These calculations are done at once for each row. That means that there is no need to loop over elements in the DataFrame, which makes it very efficient!
References#
The notebook was compiled based on: