File handling
Suppose we have a text file like below, from which we would like to extract temperature and density data:
# Density of water at different temperatures, at 1 atm pressure
# Column 1: temperature in Celsius degrees
# Column 2: density in kg/m^3
0.0 999.8425
4.0 999.9750
15.0 999.1026
20.0 998.2071
25.0 997.0479
37.0 993.3316
50.0 988.04
100.0 958.3665
# Source: Wikipedia (keyword Density)
Python
It is usually best to read and write files using the with statement, which keeps the syntax tidy and does some housekeeping for us automatically, such as closing the file once we are done with it.
Our file contains 3 lines at the beginning and 1 line at the end which we do not want to extract. Additionally, the temperature and density values are separated by whitespace. Here is an example of how we could extract the data into two arrays:
import numpy as np

with open('density_water.dat', 'r') as file: # r for read
    # skip extra lines by taking a slice of the file
    lines = file.readlines()[3:-1]

    # initialise 2 arrays of zeros to store our data
    temp, density = np.zeros(len(lines)), np.zeros(len(lines))

    for i in range(len(lines)):
        values = lines[i].split()
        temp[i] = float(values[0])
        density[i] = float(values[1])

print(temp)
print(density)
[ 0. 4. 15. 20. 25. 37. 50. 100.]
[999.8425 999.975 999.1026 998.2071 997.0479 993.3316 988.04 958.3665]
The lines variable in the above example is a list containing our lines of numbers, but each line is stored as a string, so its contents are not yet recognised as numbers. Note that the temperature and density arrays are initialised as arrays of a specific length, rather than as empty lists to which we then append numbers. It is good practice to initialise arrays like this when we know exactly how many elements the final array will have. The for loop cycles through all elements of the lines list. Each i-th element is a string, which we split into a new list with one element per whitespace-separated value in the line, in our case 2. Finally, we convert those values from strings to floats and assign them to elements of the temperature and density arrays.
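The slice [3:-1] relies on knowing how many comment lines the file has. A minimal sketch of an alternative, assuming every line to be ignored starts with #, is to test each line instead. A shortened copy of the file is held in a string (through io.StringIO) so the example runs on its own; the names sample and parse_columns are hypothetical. Lists are used here because the number of data lines is not known until the file has been scanned.

```python
import io

# A shortened copy of the example file, held in memory for illustration
sample = """# Column 1: temperature in Celsius degrees
# Column 2: density in kg/m^3
0.0 999.8425
4.0 999.9750
15.0 999.1026
# Source: Wikipedia (keyword Density)
"""

def parse_columns(file):
    """Return two lists of floats, skipping blank lines and '#' comments."""
    temp, density = [], []
    for line in file:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # nothing to parse on this line
        first, second = line.split()
        temp.append(float(first))
        density.append(float(second))
    return temp, density

with io.StringIO(sample) as file:
    temp, density = parse_columns(file)

print(temp)     # [0.0, 4.0, 15.0]
print(density)  # [999.8425, 999.975, 999.1026]
```

The same function would work on a real file object returned by open(), since it only iterates over lines.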
The example below shows how we could write the data to a text file, here as comma-separated values.
with open('output.txt', 'w') as file: # w for write
    file.write('# Temperature and density data\n')
    for i in range(len(temp)):
        file.write(f'{temp[i]},{density[i]}\n')
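As a quick check, a file written this way can be read back by skipping the header line and splitting each remaining line on the comma. The sketch below assumes short temp and density lists with values taken from the table above, and writes to a temporary directory so that no existing file is overwritten.

```python
import os
import tempfile

temp = [0.0, 4.0, 15.0]
density = [999.8425, 999.975, 999.1026]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'output.txt')

    # write: one header line, then one "temperature,density" pair per line
    with open(path, 'w') as file:
        file.write('# Temperature and density data\n')
        for t, d in zip(temp, density):
            file.write(f'{t},{d}\n')

    # read back: skip the header and split each line on the comma
    with open(path, 'r') as file:
        rows = [line.strip().split(',') for line in file.readlines()[1:]]

# convert the strings back to floats
restored = [(float(t), float(d)) for t, d in rows]
print(restored)  # [(0.0, 999.8425), (4.0, 999.975), (15.0, 999.1026)]
```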
NumPy
NumPy’s genfromtxt (generate from text) function gives fine control over how a text file is read, and initialises a NumPy array from the data. Some of its parameters include:
dtype - the data type of the resulting array; float by default, and if set to None the data type is determined automatically for each column
comments - the string that marks the start of a comment, everything after it on a line is ignored; # by default
skip_header - number of lines to skip at the beginning of the file
skip_footer - number of lines to skip at the end of the file
delimiter - the string used to separate values; any whitespace by default
For our file, we do not have to change anything, since comments are marked by a # at the beginning of the line and the values are separated by whitespace.
import numpy as np
data = np.genfromtxt('density_water.dat')
print(data)
[[ 0. 999.8425]
[ 4. 999.975 ]
[ 15. 999.1026]
[ 20. 998.2071]
[ 25. 997.0479]
[ 37. 993.3316]
[ 50. 988.04 ]
[100. 958.3665]]
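For a file in a different layout, the parameters listed above come into play. The following is a minimal sketch, assuming comma-separated text like that written in the Python section. The text is held in a string (through io.StringIO) so the example is self-contained, and csv_text is a hypothetical name.

```python
import io
import numpy as np

# Comma-separated text with one comment line, kept in memory for illustration
csv_text = """# Temperature and density data
0.0,999.8425
4.0,999.975
15.0,999.1026
"""

# delimiter=',' splits on commas; the '#' line is dropped by the default comments='#'
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(data.shape)  # (3, 2)

# unpack=True transposes the result so each column can go into its own array
temp, density = np.genfromtxt(io.StringIO(csv_text), delimiter=',', unpack=True)
print(temp)
print(density)
```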
We can save our array to a file in several ways, as in the examples below. If we plan to use the array in another Python program, it is probably best to save it as a .npy file, which we can later load to reconstruct the original array. The reader is encouraged to read about pickling in Python, which allows most Python objects, not only NumPy arrays, to be saved in a similar way.
np.savetxt('data.txt', data) # save data array to a text file
np.save('data.npy', data) # save data array to a .npy file
A = np.load('data.npy')
print(A)
[[ 0. 999.8425]
[ 4. 999.975 ]
[ 15. 999.1026]
[ 20. 998.2071]
[ 25. 997.0479]
[ 37. 993.3316]
[ 50. 988.04 ]
[100. 958.3665]]
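As a pointer for the pickling mentioned above, here is a minimal sketch using the standard-library pickle module. The dictionary record and the file name record.pkl are made up for illustration, and the file is written to a temporary directory. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# An ordinary Python object: a dictionary mixing strings and a list
record = {'quantity': 'density', 'units': 'kg/m^3', 'values': [999.8425, 999.975]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'record.pkl')

    with open(path, 'wb') as file:  # wb: pickle writes bytes, not text
        pickle.dump(record, file)

    with open(path, 'rb') as file:  # rb: read the bytes back
        restored = pickle.load(file)

print(restored == record)  # True
```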
Pandas
Despite its name, pandas’ read_csv can read many different types of text files, including our .dat file. The DataFrame is the primary data structure in pandas, and this function generates one. It is a very powerful object with many capabilities; our own tutorial can be found in Introduction to Pandas.
As in the NumPy example, we specify that our comment lines begin with a # and that our delimiter is whitespace. Furthermore, we can give names to the columns of the DataFrame by passing our list col_names to the names parameter. By default, read_csv() takes the column names from the first line of the file; our file has no such header line, so we set header=None.
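from pandas import read_csv
col_names = ['temperature', 'density']
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df = read_csv('density_water.dat', comment='#', sep=r'\s+', names=col_names, header=None)
print(df)
df.to_csv('data.csv', index=False) # save as a CSV file without the index column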
temperature density
0 0.0 999.8425
1 4.0 999.9750
2 15.0 999.1026
3 20.0 998.2071
4 25.0 997.0479
5 37.0 993.3316
6 50.0 988.0400
7 100.0 958.3665
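Once the data are in a DataFrame, columns can be selected by name and rows filtered with a condition. The following is a minimal sketch, assuming a shortened copy of the file held in a string so that the example runs on its own. The names sample and warm are hypothetical, and sep=r'\s+' is used to split on runs of whitespace.

```python
import io
from pandas import read_csv

# A shortened copy of the example file, held in memory for illustration
sample = """# Column 1: temperature in Celsius degrees
# Column 2: density in kg/m^3
0.0 999.8425
4.0 999.9750
15.0 999.1026
20.0 998.2071
# Source: Wikipedia (keyword Density)
"""

df = read_csv(io.StringIO(sample), comment='#', sep=r'\s+',
              names=['temperature', 'density'])

# keep only the rows where the temperature is above 10 degrees Celsius
warm = df[df['temperature'] > 10]
print(warm)

# a single column converted to a NumPy array
density = df['density'].to_numpy()
print(density)
```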
Exercises
Temperature: You are given a table with the average temperature in different countries of the world. Open the file, find the countries whose names end with stan (like Turkmenistan) and save their average temperatures in a separate file. Columns are separated by \t (tab).
The file can be found at 'Data\\TempData.txt'
HINT: Structure the data as a list of tuples.
Answer
def isStan(country):
    if len(country) < 4:
        return False
    else:
        return country[-4::] == "stan" # True or False

with open('Data\\TempData.txt', 'r') as file:
    lines = file.readlines()
    nameTemp = []
    res = 0
    stan_countries = 0

    for i in range(len(lines)):
        values = lines[i].split('\t')
        nameTemp.append((values[0].strip(), float(values[1])))
        if isStan(nameTemp[i][0]): # if True
            res += nameTemp[i][1]
            stan_countries += 1

print(res/stan_countries) # mean temperature of the selected countries
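The answer above prints the mean temperature of the selected countries. The exercise also asks for the temperatures to be saved in a separate file; a minimal sketch of that step follows, assuming nameTemp has already been built as a list of tuples. The entries and the output file name are fictitious placeholders, not values from TempData.txt.

```python
import os
import tempfile

# Fictitious (country, temperature) tuples standing in for the parsed file
nameTemp = [('Examplestan', 10.0), ('Samplia', 20.0), ('Teststan', 14.0)]

# keep only the entries whose name ends with "stan"
stans = [(name, t) for name, t in nameTemp if name.endswith('stan')]

with tempfile.TemporaryDirectory() as folder:
    # hypothetical output file name
    path = os.path.join(folder, 'stan_temperatures.txt')

    # one tab-separated "country<TAB>temperature" pair per line
    with open(path, 'w') as file:
        for name, t in stans:
            file.write(f'{name}\t{t}\n')

    # read the file back to check what was written
    with open(path, 'r') as file:
        written = file.read()

print(written)
```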
Hottest countries: Using the same file as in the previous exercise, use NumPy to find the countries with an average temperature above 27.
HINT: Set dtype=['U20', float] and remember the delimiter.
Answer
import numpy as np
data = np.genfromtxt('Data\\TempData.txt', delimiter='\t', dtype=['U20', float])
for i in data:
    if i[1] > 27:
        print(i[0])
Countries per continent: Construct a list showing how many countries and independent regions there are on each continent. Consider only the countries with a Three_Letter_Country_Code value. Use the Pandas library.
The file can be found at Data\\CountryContinent.csv.
HINT: Use the for index, row in df.iterrows(): structure of the for loop. The final result should look something like this: [['Asia', 58], ['Europe', 57], ['Antarctica', 5], ['Africa', 58], ['Oceania', 27], ['North America', 43], ['South America', 14]]
Answer
from pandas import read_csv, notna
df = read_csv('Data\\CountryContinent.csv')
continents = df['Continent_Name'].unique() # array of continent names from the file
res = [[continent, 0] for continent in continents] # initial list, not counted yet
for index, row in df.iterrows():
    # missing codes are read as NaN (a float), so test with notna rather than comparing to "nan"
    if notna(row["Three_Letter_Country_Code"]):
        for j in range(len(res)):
            if row["Continent_Name"] == res[j][0]: # find correct continent
                res[j][1] += 1 # increase country count by 1
print(res)
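The loop above follows the hint. As an alternative sketch, pandas can do the counting itself with dropna and groupby. The small DataFrame below is a hand-made stand-in for CountryContinent.csv, with the two column names used in the answer and only four rows, so the counts are not those of the real file.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for the real file; the last row has no country code
df = pd.DataFrame({
    'Continent_Name': ['Asia', 'Europe', 'Europe', 'Europe'],
    'Three_Letter_Country_Code': ['JPN', 'FRA', 'DEU', np.nan],
})

# drop rows without a three-letter code, then count the rows per continent
counts = (df.dropna(subset=['Three_Letter_Country_Code'])
            .groupby('Continent_Name')
            .size())

# convert to the list-of-lists layout asked for in the exercise
res = [[continent, int(n)] for continent, n in counts.items()]
print(res)  # [['Asia', 1], ['Europe', 2]]
```

Note that groupby sorts the continents alphabetically, whereas the loop keeps them in order of first appearance in the file.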