Introduction#

Programming for Geoscientists Data Science and Machine Learning for Geoscientists

Pandas is an open source library for Python that can be used for data manipulation and analysis. If your data can be put into a spreadsheet, Pandas is exactly what you need!

Pandas is a very powerful tool with highly optimised performance. The full documentation can be found here.

To start working with pandas you can simply import it. A standard alias used for pandas is “pd”:

import pandas as pd

Pandas objects#

DataFrame is a 2D data structure with columns and rows, much like a table or Excel spreadsheet. To manually create a DataFrame, we can create a dictionary of lists. The keys are used as a table header and values in each list as rows:

df = pd.DataFrame({
     "Rock_type": ["granite",
                   "andesite",
                   "limestone"],
     "Density": [2.6, 2.8, 2.3],
     "Main_mineral": ["quartz",
                      "feldspar",
                      "calcite"]})

# Use df.head() to display
# top rows
df.head()
Rock_type Density Main_mineral
0 granite 2.6 quartz
1 andesite 2.8 feldspar
2 limestone 2.3 calcite

To extract data from specific column, we can call the column name in two ways:

# First method
# Similar to calling dictionary keys
# Works for all column names
print(df["Rock_type"])

# Second method
# This will only work if the column name
# has no spaces and is not named like
# any pandas attribute, e.g. T will mean
# transpose and it won't extract a column
print(df.Rock_type)
0      granite
1     andesite
2    limestone
Name: Rock_type, dtype: object
0      granite
1     andesite
2    limestone
Name: Rock_type, dtype: object
pd.Series(["granite", "andesite", "limestone"])
0      granite
1     andesite
2    limestone
dtype: object

Reading files#

Most of the time we will want to load data in different file formats, rather than manually creating a dataframe. Pandas have a very easy syntax for reading files:

pd.read_*

where * can be csv, excel, html, sql and so on. For .txt file extentions we can use

pd.read_csv(file_name, delimiter=" ") or pd.read_fwf(file_name)

In this example we will look at New Zealand earthquake data in .csv format from IRIS. With .head() we can specify how many rows to print, in this case, we want to display first 4 rows (that includes header):

nz_eqs = pd.read_csv("../../geosciences/data/nz_largest_eq_since_1970.csv")
nz_eqs.head(4)
year month day utc_time mag lat lon depth_km region iris_id timestamp
0 2009 7 15 09:22:31 7.8 -45.8339 166.6363 20.9 OFF W. COAST OF S. ISLAND, N.Z. 2871698 1247649751
1 2016 11 13 11:02:59 7.8 -42.7245 173.0647 22.0 SOUTH ISLAND, NEW ZEALAND 5197722 1479034979
2 2003 8 21 12:12:47 7.2 -45.0875 167.0892 6.8 SOUTH ISLAND, NEW ZEALAND 1628007 1061467967
3 2001 8 21 06:52:06 7.1 -36.8010 -179.7230 33.5 EAST OF NORTH ISLAND, N.Z. 1169374 998376726

We can check DataFrame shape by using:

nz_eqs.shape
(25000, 11)

We have 25,000 rows and 11 columns, that’s a lot of data!

Writing files#

Let’s say we want to export this data as an excel spreadsheet but we only want to export magnitude, latitude, longitude and depth columns:

nz_eqs.to_excel("../../geosciences/data/nz_eqs.xlsx",
                sheet_name="Earthquakes",
                columns=["mag", "lat", "lon",
                        "depth_km"])

ExcelWriter object allows us to export more than one sheet into the same file:

with pd.ExcelWriter("../../geosciences/data/nz_eqs.xlsx") as writer:
    nz_eqs.to_excel(writer, sheet_name="Earthquakes",
                   columns=["mag", "lat", "lon",
                        "depth_km"])
    nz_eqs.to_excel(writer, sheet_name="Extra info",
                   columns=["region", "iris_id",
                            "timestamp"])

References#

The notebook was compiled based on these tutorials: