Introduction#

Programming for Geoscientists Data Science and Machine Learning for Geoscientists

Pandas is an open source library for Python that can be used for data manipulation and analysis. If your data can be put into a spreadsheet, Pandas is exactly what you need!

Pandas is a very powerful tool with highly optimised performance. The full documentation can be found here.

To start working with pandas you can simply import it. A standard alias used for pandas is “pd”:

import pandas as pd

Pandas objects#

DataFrame is a 2D data structure with columns and rows, much like a table or Excel spreadsheet. To manually create a DataFrame, we can create a dictionary of lists. The keys are used as a table header and values in each list as rows:

df = pd.DataFrame({
     "Rock_type": ["granite",
                   "andesite",
                   "limestone"],
     "Density": [2.6, 2.8, 2.3],
     "Main_mineral": ["quartz",
                      "feldspar",
                      "calcite"]})

# Use df.head() to display
# top rows
df.head()

	Rock_type	Density	Main_mineral
0	granite	2.6	quartz
1	andesite	2.8	feldspar
2	limestone	2.3	calcite

To extract data from specific column, we can call the column name in two ways:

# First method
# Similar to calling dictionary keys
# Works for all column names
print(df["Rock_type"])

# Second method
# This will only work if the column name
# has no spaces and is not named like
# any pandas attribute, e.g. T will mean
# transpose and it won't extract a column
print(df.Rock_type)

    granite
   andesite
  limestone
Name: Rock_type, dtype: object
    granite
   andesite
  limestone
Name: Rock_type, dtype: object

pd.Series(["granite", "andesite", "limestone"])

    granite
   andesite
  limestone
dtype: object

Reading files#

Most of the time we will want to load data in different file formats, rather than manually creating a dataframe. Pandas have a very easy syntax for reading files:

pd.read_*

where * can be csv, excel, html, sql and so on. For .txt file extentions we can use

pd.read_csv(file_name, delimiter=" ") or pd.read_fwf(file_name)

In this example we will look at New Zealand earthquake data in .csv format from IRIS. With .head() we can specify how many rows to print, in this case, we want to display first 4 rows (that includes header):

nz_eqs = pd.read_csv("../../geosciences/data/nz_largest_eq_since_1970.csv")
nz_eqs.head(4)

	year	month	day	utc_time	mag	lat	lon	depth_km	region	iris_id	timestamp
0	2009	7	15	09:22:31	7.8	-45.8339	166.6363	20.9	OFF W. COAST OF S. ISLAND, N.Z.	2871698	1247649751
1	2016	11	13	11:02:59	7.8	-42.7245	173.0647	22.0	SOUTH ISLAND, NEW ZEALAND	5197722	1479034979
2	2003	8	21	12:12:47	7.2	-45.0875	167.0892	6.8	SOUTH ISLAND, NEW ZEALAND	1628007	1061467967
3	2001	8	21	06:52:06	7.1	-36.8010	-179.7230	33.5	EAST OF NORTH ISLAND, N.Z.	1169374	998376726

We can check DataFrame shape by using:

nz_eqs.shape

(25000, 11)

We have 25,000 rows and 11 columns, that’s a lot of data!

Writing files#

Let’s say we want to export this data as an excel spreadsheet but we only want to export magnitude, latitude, longitude and depth columns:

nz_eqs.to_excel("../../geosciences/data/nz_eqs.xlsx",
                sheet_name="Earthquakes",
                columns=["mag", "lat", "lon",
                        "depth_km"])

ExcelWriter object allows us to export more than one sheet into the same file:

with pd.ExcelWriter("../../geosciences/data/nz_eqs.xlsx") as writer:
    nz_eqs.to_excel(writer, sheet_name="Earthquakes",
                   columns=["mag", "lat", "lon",
                        "depth_km"])
    nz_eqs.to_excel(writer, sheet_name="Extra info",
                   columns=["region", "iris_id",
                            "timestamp"])

References#

The notebook was compiled based on these tutorials:

ESE Jupyter Material

Introduction

Contents

Introduction#

Pandas objects#

Reading files#

Writing files#

References#