Summary statistics

Pandas has built-in methods that calculate simple summary statistics:

df.describe()

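For numeric data, this returns the count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median) and 75th percentiles. For strings, it returns count, unique, top (the most frequent value) and freq (how often that value occurs). Timestamps were summarised this way in older pandas versions; recent versions treat them like numeric data.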
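As a minimal sketch of the two kinds of summary, the example below calls `describe()` on a small made-up table. The `toy` DataFrame and its values are hypothetical and are not taken from the earthquake catalogue.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the shape of the output
toy = pd.DataFrame({
    "mag": [4.0, 4.5, 5.0, 7.0],
    "region": ["NORTH", "SOUTH", "SOUTH", "SOUTH"],
})

# With numeric columns present, describe() summarises only those by default
numeric_summary = toy.describe()
print(numeric_summary)

# Calling describe() on a string column gives count, unique, top and freq
string_summary = toy["region"].describe()
print(string_summary)
```

Here the `50%` row of the numeric summary is the median, and `top` is the most frequent string in the column.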

Let’s load New Zealand earthquake data:

import pandas as pd
nz_eqs = pd.read_csv("../../geosciences/data/nz_largest_eq_since_1970.csv")
nz_eqs.head(4)
   year  month  day  utc_time  mag       lat        lon  depth_km                           region  iris_id   timestamp
0  2009      7   15  09:22:31  7.8  -45.8339   166.6363      20.9  OFF W. COAST OF S. ISLAND, N.Z.  2871698  1247649751
1  2016     11   13  11:02:59  7.8  -42.7245   173.0647      22.0        SOUTH ISLAND, NEW ZEALAND  5197722  1479034979
2  2003      8   21  12:12:47  7.2  -45.0875   167.0892       6.8        SOUTH ISLAND, NEW ZEALAND  1628007  1061467967
3  2001      8   21  06:52:06  7.1  -36.8010  -179.7230      33.5       EAST OF NORTH ISLAND, N.Z.  1169374   998376726

nz_eqs.describe()
               year         month           day           mag           lat           lon      depth_km       iris_id     timestamp
count  25000.000000  25000.000000  25000.000000  25000.000000  25000.000000  25000.000000  25000.000000  2.500000e+04  2.500000e+04
mean    1993.862160      6.408760     15.384160      4.270952    -38.939428    130.757907     94.014232  2.285625e+06  7.684759e+08
std       12.733297      3.512482      8.814035      0.356037      3.278140    117.371409     94.284137  2.562292e+06  4.021305e+08
min     1970.000000      1.000000      1.000000      3.900000    -47.952400   -179.999000      0.000000  1.034600e+04  2.629760e+05
25%     1984.000000      3.000000      8.000000      4.000000    -40.537000    169.977150     12.000000  4.046555e+05  4.698424e+08
50%     1995.000000      7.000000     15.000000      4.200000    -38.063050    175.867700     42.000000  1.608522e+06  7.920289e+08
75%     2003.000000      9.000000     23.000000      4.400000    -36.864300    177.507000    170.100000  3.059155e+06  1.061565e+09
max     2020.000000     12.000000     31.000000      7.800000    -33.608600    180.000000    665.100000  1.124420e+07  1.590893e+09

Pandas also provides each of these statistics as a separate method. For example, to get the mean magnitude we can simply call:

nz_eqs["mag"].mean()
4.27095199999966
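The other statistics follow the same pattern. The sketch below uses a short, hypothetical Series named `mags` as a stand-in for `nz_eqs["mag"]`, so the numbers it produces do not match the catalogue values shown above.

```python
import pandas as pd

# Hypothetical magnitudes standing in for nz_eqs["mag"]
mags = pd.Series([4.0, 4.5, 5.0, 7.0], name="mag")

mean_mag = mags.mean()
median_mag = mags.median()
std_mag = mags.std()        # sample standard deviation (ddof=1 by default)
q75 = mags.quantile(0.75)   # 75th percentile, linear interpolation by default

# agg() computes several named statistics in one call
summary = mags.agg(["min", "max", "mean"])
print(summary)
```

Note that `std()` in pandas normalises by N - 1 by default, which is the same convention `describe()` uses for its `std` row.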

For a full tutorial, see the Descriptive Statistics chapter in the pandas documentation.

References

The notebook was compiled based on: