{"cells": [{"cell_type": "markdown", "metadata": {"tags": ["module-prog", "module-dsml"]}, "source": ["(Pandas_statistics)=\n", "# Summary statistics\n", "[Programming for Geoscientists](module-prog) [Data Science and Machine Learning for Geoscientists](module-dsml) \n", "``` {index} Pandas: summary statistics\n", "```\n", "Pandas has a built-in method that calculates simple summary statistics:\n", "\n", " df.describe()\n", " \n", "For numeric data, this returns the count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median) and 75th percentiles. For strings/timestamps, this returns count, unique, top (the most common value) and freq (how often that value occurs).\n", "\n", "Let's load New Zealand earthquake data:"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>year</th><th>month</th><th>day</th><th>utc_time</th><th>mag</th><th>lat</th><th>lon</th><th>depth_km</th><th>region</th><th>iris_id</th><th>timestamp</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2009</td><td>7</td><td>15</td><td>09:22:31</td><td>7.8</td><td>-45.8339</td><td>166.6363</td><td>20.9</td><td>OFF W. COAST OF S. ISLAND, N.Z.</td><td>2871698</td><td>1247649751</td></tr>\n",
"    <tr><th>1</th><td>2016</td><td>11</td><td>13</td><td>11:02:59</td><td>7.8</td><td>-42.7245</td><td>173.0647</td><td>22.0</td><td>SOUTH ISLAND, NEW ZEALAND</td><td>5197722</td><td>1479034979</td></tr>\n",
"    <tr><th>2</th><td>2003</td><td>8</td><td>21</td><td>12:12:47</td><td>7.2</td><td>-45.0875</td><td>167.0892</td><td>6.8</td><td>SOUTH ISLAND, NEW ZEALAND</td><td>1628007</td><td>1061467967</td></tr>\n",
"    <tr><th>3</th><td>2001</td><td>8</td><td>21</td><td>06:52:06</td><td>7.1</td><td>-36.8010</td><td>-179.7230</td><td>33.5</td><td>EAST OF NORTH ISLAND, N.Z.</td><td>1169374</td><td>998376726</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
"], "text/plain": [" year month day utc_time mag lat lon depth_km \\\n", "0 2009 7 15 09:22:31 7.8 -45.8339 166.6363 20.9 \n", "1 2016 11 13 11:02:59 7.8 -42.7245 173.0647 22.0 \n", "2 2003 8 21 12:12:47 7.2 -45.0875 167.0892 6.8 \n", "3 2001 8 21 06:52:06 7.1 -36.8010 -179.7230 33.5 \n", "\n", " region iris_id timestamp \n", "0 OFF W. COAST OF S. ISLAND, N.Z. 2871698 1247649751 \n", "1 SOUTH ISLAND, NEW ZEALAND 5197722 1479034979 \n", "2 SOUTH ISLAND, NEW ZEALAND 1628007 1061467967 \n", "3 EAST OF NORTH ISLAND, N.Z. 1169374 998376726 "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas as pd\n", "nz_eqs = pd.read_csv(\"../../geosciences/data/nz_largest_eq_since_1970.csv\")\n", "nz_eqs.head(4)"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>year</th><th>month</th><th>day</th><th>mag</th><th>lat</th><th>lon</th><th>depth_km</th><th>iris_id</th><th>timestamp</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>25000.000000</td><td>25000.000000</td><td>25000.000000</td><td>25000.000000</td><td>25000.000000</td><td>25000.000000</td><td>25000.000000</td><td>2.500000e+04</td><td>2.500000e+04</td></tr>\n",
"    <tr><th>mean</th><td>1993.862160</td><td>6.408760</td><td>15.384160</td><td>4.270952</td><td>-38.939428</td><td>130.757907</td><td>94.014232</td><td>2.285625e+06</td><td>7.684759e+08</td></tr>\n",
"    <tr><th>std</th><td>12.733297</td><td>3.512482</td><td>8.814035</td><td>0.356037</td><td>3.278140</td><td>117.371409</td><td>94.284137</td><td>2.562292e+06</td><td>4.021305e+08</td></tr>\n",
"    <tr><th>min</th><td>1970.000000</td><td>1.000000</td><td>1.000000</td><td>3.900000</td><td>-47.952400</td><td>-179.999000</td><td>0.000000</td><td>1.034600e+04</td><td>2.629760e+05</td></tr>\n",
"    <tr><th>25%</th><td>1984.000000</td><td>3.000000</td><td>8.000000</td><td>4.000000</td><td>-40.537000</td><td>169.977150</td><td>12.000000</td><td>4.046555e+05</td><td>4.698424e+08</td></tr>\n",
"    <tr><th>50%</th><td>1995.000000</td><td>7.000000</td><td>15.000000</td><td>4.200000</td><td>-38.063050</td><td>175.867700</td><td>42.000000</td><td>1.608522e+06</td><td>7.920289e+08</td></tr>\n",
"    <tr><th>75%</th><td>2003.000000</td><td>9.000000</td><td>23.000000</td><td>4.400000</td><td>-36.864300</td><td>177.507000</td><td>170.100000</td><td>3.059155e+06</td><td>1.061565e+09</td></tr>\n",
"    <tr><th>max</th><td>2020.000000</td><td>12.000000</td><td>31.000000</td><td>7.800000</td><td>-33.608600</td><td>180.000000</td><td>665.100000</td><td>1.124420e+07</td><td>1.590893e+09</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
"], "text/plain": [" year month day mag lat \\\n", "count 25000.000000 25000.000000 25000.000000 25000.000000 25000.000000 \n", "mean 1993.862160 6.408760 15.384160 4.270952 -38.939428 \n", "std 12.733297 3.512482 8.814035 0.356037 3.278140 \n", "min 1970.000000 1.000000 1.000000 3.900000 -47.952400 \n", "25% 1984.000000 3.000000 8.000000 4.000000 -40.537000 \n", "50% 1995.000000 7.000000 15.000000 4.200000 -38.063050 \n", "75% 2003.000000 9.000000 23.000000 4.400000 -36.864300 \n", "max 2020.000000 12.000000 31.000000 7.800000 -33.608600 \n", "\n", " lon depth_km iris_id timestamp \n", "count 25000.000000 25000.000000 2.500000e+04 2.500000e+04 \n", "mean 130.757907 94.014232 2.285625e+06 7.684759e+08 \n", "std 117.371409 94.284137 2.562292e+06 4.021305e+08 \n", "min -179.999000 0.000000 1.034600e+04 2.629760e+05 \n", "25% 169.977150 12.000000 4.046555e+05 4.698424e+08 \n", "50% 175.867700 42.000000 1.608522e+06 7.920289e+08 \n", "75% 177.507000 170.100000 3.059155e+06 1.061565e+09 \n", "max 180.000000 665.100000 1.124420e+07 1.590893e+09 "]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["nz_eqs.describe()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pandas also has separate built-in functions that can calculate these statistics, e.g. 
to get the mean magnitude we can simply call:"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["4.27095199999966"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["nz_eqs[\"mag\"].mean()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["For a full tutorial, see the [Descriptive Statistics](https://pandas.pydata.org/docs/getting_started/basics.html#descriptive-statistics) section of the Pandas documentation."]}, {"cell_type": "markdown", "metadata": {}, "source": ["# References\n", "This notebook was compiled based on:\n", "* [Pandas official Getting Started tutorials](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)\n", "* [Kaggle tutorial](https://www.kaggle.com/learn/pandas)"]}], "metadata": {"celltoolbar": "Tags", "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8"}}, "nbformat": 4, "nbformat_minor": 2}