# How To Use The COSIMA Cookbook

This notebook is designed to help new users get to grips with the COSIMA Cookbook.

It assumes that:

• You have access to the COSIMA Cookbook (we recommend the latest version, available through the conda/analysis3-unstable module on NCI);

• You can fire up a Jupyter notebook!

Before starting, load in some standard libraries that you are likely to need:

import matplotlib.pyplot as plt
import xarray as xr
import numpy as np

import IPython.display
import cmocean as cm
import cartopy.crs as ccrs


In addition, you always need to load the cosima_cookbook module. This provides a bunch of functions that you will use:

import cosima_cookbook as cc


## 1. The Cookbook Philosophy

The COSIMA Cookbook is a framework for analysing ocean-sea ice model output. It is designed to:

• Provide examples of commonly used diagnostics;

• Write efficient, well-documented, openly accessible code;

• Encourage community input to the code;

• Ensure diagnostic results are reproducible;

• Process diagnostics directly from the model output, minimising creation of intermediate files;

• Find methods to deal with the memory limitations of analysing high-resolution model output.

### 1.1 A database of experiments

The COSIMA Cookbook relies on a database of experiments in order to load model output. This database effectively holds metadata for each experiment, as well as variable names, data ranges and so on.

There are three different ways for you to access the database:

1. You can use the default database, which is periodically refreshed automatically. This database sits in /g/data3/hh5/tmp/cosima/database/access-om2.db and should be readable for all users. It includes all experiments stored in the COSIMA data directory under project hh5 on NCI. The examples in this tutorial use this database.

2. You can use one of the databases sitting in /g/data/ik11/databases/, which may or may not be up to date; or

3. Otherwise, you can make your own database, which is stored in your own path and includes only the experiments you are interested in. Please refer to the Make_Your_Own_Database tutorial for instructions on how to create this database.

To access the default database, you need to start a database session each time you fire up a notebook:

session = cc.database.create_session()


### 1.2 Inbuilt Database Functions

We have constructed a few functions to help you operate the cookbook and access the datasets. These functions all live in the cosima_cookbook package. The following functions query the data available in the database, without loading the data itself.

They return pandas dataframes, which by default are truncated to show only the first and last 5 rows. To see more results, we need to set an option in pandas itself:

import pandas as pd
pd.set_option("display.max_rows", n)


Here n is the maximum number of rows to display without truncation; if there are more rows than that, the result will still be truncated. Pass None to never truncate (but some dataframes can be very big!).
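As a quick self-contained illustration (using a synthetic dataframe rather than a cookbook query), the option controls whether pandas elides the middle rows:

```python
import pandas as pd

# A synthetic 20-row dataframe standing in for the result of a cookbook query.
df = pd.DataFrame({"experiment": [f"expt_{i:02d}" for i in range(20)],
                   "ncfiles": range(20)})

pd.set_option("display.max_rows", 10)    # elide the middle rows beyond 10
print(df)                                # head, an ellipsis row, then tail

pd.set_option("display.max_rows", None)  # never truncate
print(df)                                # all 20 rows
```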

get_experiments lists all of the experiments that are catalogued in the database.

cc.querying.get_experiments(session)

experiment ncfiles
0 01deg_jra55v13_ryf8485_spinup6_update_ocn_f 40
1 01deg_jra55v13_ryf8485_spinup6_dt_720 21
2 01deg_jra55v13_ryf8485_spinup6_newexe_highfreq 1222
3 01deg_jra55v13_ryf8485_spinup6_newtopog 225
4 01deg_jra55v13_ryf8485_spinup6_newtopog_scalewind 96
... ... ...
152 monthly 13
153 10 12
154 10_KDS75 12
155 01 12
156 025_KDS50 12

157 rows × 2 columns

Internally, an experiment is a set of netCDF4 files as shown in the above table.

get_ncfiles provides a list of all the netCDF filenames saved for a given experiment, along with the time stamp for when each file was added to the cookbook database. Note that each of these filenames is present in some or all of the output directories – but the cookbook philosophy is that you don’t need to know about the directories in which these files are stored. To see the relevant files:

cc.querying.get_ncfiles(session, '025deg_jra55v13_ryf8485_gmredi6')

ncfile index_time
0 output000/ice/OUTPUT/iceh.1900-01.nc 2019-06-21 13:22:35.285505
1 output000/ice/OUTPUT/iceh.1900-02.nc 2019-06-21 13:22:33.693859
2 output000/ice/OUTPUT/iceh.1900-03.nc 2019-06-21 13:22:33.462804
3 output000/ice/OUTPUT/iceh.1900-04.nc 2019-06-21 13:22:38.755781
4 output000/ice/OUTPUT/iceh.1900-05.nc 2019-06-21 13:22:34.233748
... ... ...
4975 restart147/ice/mice.nc 2019-06-21 13:16:04.354172
4976 restart147/ice/monthly_sstsss.nc 2019-06-21 13:16:04.278494
4977 restart147/ice/o2i.nc 2019-06-21 13:16:04.205090
4978 restart147/ice/sicemass.nc 2019-06-21 13:16:04.057519
4979 restart147/ice/u_star.nc 2019-06-21 13:16:04.240386

4980 rows × 2 columns

More usefully, get_variables provides a list of all the variables available in a specific experiment.

cc.querying.get_variables(session, experiment='025deg_jra55v13_ryf8485_gmredi6', frequency='1 monthly')

name long_name frequency ncfile # ncfiles time_start time_end
0 ANGLE angle grid makes with latitude line on U grid 1 monthly output148/ice/OUTPUT/iceh.2197-12.nc 3600 1900-01-01 00:00:00 2198-01-01 00:00:00
1 ANGLET angle grid makes with latitude line on T grid 1 monthly output148/ice/OUTPUT/iceh.2197-12.nc 3600 1900-01-01 00:00:00 2198-01-01 00:00:00
2 HTE T cell width on East side 1 monthly output148/ice/OUTPUT/iceh.2197-12.nc 3600 1900-01-01 00:00:00 2198-01-01 00:00:00
3 HTN T cell width on North side 1 monthly output148/ice/OUTPUT/iceh.2197-12.nc 3600 1900-01-01 00:00:00 2198-01-01 00:00:00
4 NCAT category maximum thickness 1 monthly output148/ice/OUTPUT/iceh.2197-12.nc 3600 1900-01-01 00:00:00 2198-01-01 00:00:00
... ... ... ... ... ... ... ...
174 vvel_m ice velocity (y) 1 monthly output148/ice/OUTPUT/iceh.2197-12.nc 3600 1900-01-01 00:00:00 2198-01-01 00:00:00
175 xt_ocean tcell longitude 1 monthly output148/ocean/ocean_snapshot.nc 300 1900-01-01 00:00:00 2198-01-01 00:00:00
176 xu_ocean ucell longitude 1 monthly output148/ocean/ocean_snapshot.nc 300 1900-01-01 00:00:00 2198-01-01 00:00:00
177 yt_ocean tcell latitude 1 monthly output148/ocean/ocean_snapshot.nc 300 1900-01-01 00:00:00 2198-01-01 00:00:00
178 yu_ocean ucell latitude 1 monthly output148/ocean/ocean_snapshot.nc 300 1900-01-01 00:00:00 2198-01-01 00:00:00

179 rows × 7 columns

Since this is a pretty big list, we can search the dataframe for variable names which contain a specific string.

vars_025deg = cc.querying.get_variables(session, experiment='025deg_jra55v13_ryf8485_gmredi6')
vars_025deg[vars_025deg['name'].str.lower().str.contains('temp')]

name long_name frequency ncfile # ncfiles time_start time_end
1 TEMP None None restart143/ice/monthly_sstsss.nc 48 None None
204 temp Conservative temperature 1 monthly output089/ocean/ocean_month.nc 40 2002-01-01 00:00:00 2080-01-01 00:00:00
205 temp_advection cp*rho*dzt*advection tendency 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
206 temp_eta_smooth surface smoother for temp 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
207 temp_global_ave Global mean temp in liquid seawater 1 monthly output148/ocean/ocean_scalar.nc 150 1900-01-01 00:00:00 2198-01-01 00:00:00
208 temp_nonlocal_KPP cp*rho*dzt*nonlocal tendency from KPP 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
209 temp_rivermix cp*rivermix*rho_dzt*temp 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
210 temp_submeso rho*dzt*cp*submesoscale tendency (heating) 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
211 temp_surface_ave Global mass weighted mean surface temp in liqu... 1 monthly output148/ocean/ocean_scalar.nc 150 1900-01-01 00:00:00 2198-01-01 00:00:00
212 temp_tendency time tendency for tracer Conservative temperature 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
213 temp_tendency_expl explicit in time tendency for tracer Conservat... 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
214 temp_vdiffuse_diff_cbt vert diffusion of heat due to diff_cbt 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
215 temp_vdiffuse_diff_cbt_conv vert diffusion of heat due to diff_cbt_conv 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
216 temp_vdiffuse_impl implicit vert diffusion of heat 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
217 temp_vdiffuse_k33 vert diffusion of heat due to K33 from neutral... 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
218 temp_vdiffuse_sbc vert diffusion of heat due to surface flux 1 monthly output051_budget/ocean/ocean_month.nc 1 2002-01-01 00:00:00 2004-01-01 00:00:00
288 temp Conservative temperature 1 yearly output148/ocean/ocean.nc 150 1900-01-01 00:00:00 2198-01-01 00:00:00
289 temp_xflux_adv cp*rho*dzt*dyt*u*temp 1 yearly output148/ocean/ocean.nc 150 1900-01-01 00:00:00 2198-01-01 00:00:00
290 temp_yflux_adv cp*rho*dzt*dxt*v*temp 1 yearly output148/ocean/ocean.nc 150 1900-01-01 00:00:00 2198-01-01 00:00:00

Omitting the frequency would give variables at all temporal frequencies. To determine what frequencies are in a given experiment, we can use get_frequencies. Leaving off the experiment gives all possible frequencies.

cc.querying.get_frequencies(session, experiment='025deg_jra55v13_ryf8485_gmredi6')

frequency
0 None
1 1 monthly
2 1 yearly
3 static

### 1.3 Loading data with querying.getvar

Python has many ways of reading in data from a netCDF file … so we thought we would add another way. This is achieved with the querying.getvar() function, which is the most commonly used function in the cookbook. This function queries the database to find a specific variable, and loads some or all of that variable’s data. We will now take a little while to get to know this function. In its simplest form, you need just three arguments: expt, variable and session.

You can see all the available options using the inbuilt help function, which brings up the function documentation.

help(cc.querying.getvar)

Help on function getvar in module cosima_cookbook.querying:

getvar(expt, variable, session, ncfile=None, start_time=None, end_time=None, n=None, **kwargs)
For a given experiment, return an xarray DataArray containing the
specified variable.

expt - text string indicating the name of the experiment
variable - text string indicating the name of the variable to load
session - a database session created by cc.database.create_session()
ncfile -  an optional text string indicating the pattern for filenames
to load. All filenames containing this string will match, so
be specific. '/' can be used to match the start of the
filename, and '%' is a wildcard character.
start_time - only load data after this date. specify as a text string,
e.g. '1900-01-01'
end_time - only load data before this date. specify as a text string,
e.g. '1900-01-01'
n - after all other queries, restrict the total number of files to the
first n. pass a negative value to restrict to the last n

Note that if start_time and/or end_time are used, the time range
of the resulting dataset may not be bounded exactly on those
values, depending on where the underlying files start/end. Use
dataset.sel() to exactly select times from the dataset.

Other kwargs are passed through to xarray.open_mfdataset, including:

chunks - Override any chunking by passing a chunks dictionary.
decode_times - Time decoding can be disabled by passing decode_times=False


You may like to note a few things about this function:

1. The data is returned as an xarray DataArray, which includes the coordinate and attribute information from the netCDF file (more on xarray later).

2. The variable time does not start at zero, and if you don’t like that you can introduce an offset to alter the time axis.

3. By default, we load the whole dataset, but we can load a subset of the times (see below).

4. Other customisable options include setting the variable chunking and incorporating a function to operate on the data.

expt = '025deg_jra55v13_ryf8485_gmredi6'
variable = 'temp_global_ave'
darray = cc.querying.getvar(expt, variable, session)
darray

xarray.DataArray 'temp_global_ave': dimensions (time: 3576, scalar_axis: 1), float32, 14.30 kB in (1, 1) chunks.
Coordinates: scalar_axis = [0.]; time = 1900-01-16 12:00:00 ... 2197-12-16 12:00:00 (NOLEAP calendar).
Attributes: long_name: Global mean temp in liquid seawater; units: deg_C; cell_methods: time: mean; standard_name: sea_water_potential_temperature.

You can see that this operation loads the globally averaged potential temperature from the model output. The time axis runs from 1900 to 2198. For some variables (particularly 3D variables that might use a lot of memory) you may prefer to restrict yourself to a smaller time window:

darray = cc.querying.getvar(expt,variable,session,
start_time='2000-01-01',
end_time='2050-12-31')
darray

xarray.DataArray 'temp_global_ave': dimensions (time: 648, scalar_axis: 1), float32, 2.59 kB in (1, 1) chunks.
Coordinates: scalar_axis = [0.]; time = 1998-01-16 12:00:00 ... 2051-12-16 12:00:00 (NOLEAP calendar).
Attributes: long_name: Global mean temp in liquid seawater; units: deg_C; cell_methods: time: mean; standard_name: sea_water_potential_temperature.

You will see that the time boundaries are not exact here. cc.querying.getvar loads all files that include any dates within the specified range. You can use .sel(time=...) to refine this selection if required (see below).
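The refinement with .sel can be sketched on a synthetic DataArray standing in for a getvar result (the dates and coordinate values here are illustrative, not real model output):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly timeseries covering 1998-2051, mimicking a getvar()
# result that overshot the requested window because of file boundaries.
time = pd.date_range("1998-01-01", "2051-12-01", freq="MS")
darray = xr.DataArray(np.arange(len(time), dtype="float32"),
                      coords={"time": time}, dims=["time"])

# Trim to exactly the window of interest, regardless of file boundaries.
subset = darray.sel(time=slice("2000-01-01", "2050-12-31"))
print(subset.time.values[0], subset.time.values[-1])  # first/last kept timestamps
```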

### 1.4 Exercises

OK, this is a tutorial, so now you have to do some work. Your tasks are to:

• Find and load SSH from an experiment (any experiment … perhaps a 1° configuration would be best).

• Just load the last 10 files from an experiment (any variable you like).

• Load potential temperature from an experiment (again, 1° would be quickest). Can you chunk the data differently from the default?

## 2. How to manipulate and plot variables with xarray

We use the Python package xarray (which is built on dask, pandas, matplotlib and numpy) for many of our diagnostics. xarray has a lot of nice features, some of which we will try to demonstrate for you.

### 2.1 Plotting

xarray’s .plot() method does its best to figure out what you are trying to plot, and plot it for you. Let’s start by loading a 1-dimensional variable and plotting it.

expt = '025deg_jra55v13_ryf8485_gmredi6'
variable = 'temp_global_ave'
darray = cc.querying.getvar(expt, variable, session)
darray.plot();

darray

xarray.DataArray 'temp_global_ave': dimensions (time: 3576, scalar_axis: 1), float32, 14.30 kB in (1, 1) chunks.
Coordinates: scalar_axis = [0.]; time = 1900-01-16 12:00:00 ... 2197-12-16 12:00:00 (NOLEAP calendar).
Attributes: long_name: Global mean temp in liquid seawater; units: deg_C; cell_methods: time: mean; standard_name: sea_water_potential_temperature.

You should see that xarray has figured out that this data is a timeseries: the x-axis represents time and the y-axis is temp_global_ave. You can always modify aspects of your plot if you are unhappy with the default xarray behaviour:

darray.plot()
plt.xlabel('Year')
plt.ylabel('Temperature (°C)')
plt.title('Globally Averaged Temperature')



Because xarray knows about dimensions, it has plotting routines which can figure out what it should plot. By way of example, let’s load a single time slice of surface_temp and see how .plot() handles it:

expt = '025deg_jra55v13_ryf8485_gmredi6'
variable = 'surface_temp'
darray = cc.querying.getvar(expt, variable, session, n=-1)
darray.mean('time').plot();


Again, you can customise this plot as you see fit:

darray = darray - 273.15 # convert from Kelvin to Celsius
darray.mean('time').plot.contourf(levels=np.arange(-2, 32, 2), cmap=cm.cm.thermal)
plt.ylabel('latitude')
plt.xlabel('longitude');



### 2.2 Slicing and dicing

There are two different ways of subselecting from a DataArray: isel and sel. The first of these is probably what you are used to – you specify the positional index along a dimension. In the second case you specify the value of the coordinate you want to select. These two methods are demonstrated in the following example:

darray = cc.querying.getvar('1deg_jra55v13_iaf_spinup1_B1','pot_rho_2',session)
density = darray.isel(time=200).sel(st_ocean=1000, method='nearest')
density.plot();


In the above example, a 300-year dataset is loaded. We then use isel to select the 201st year (time index 200) and use sel to select a z level that is about 1000 m deep. The sel method is very flexible, allowing us to use similar code across differing model resolutions or grids. In addition, both methods allow you to slice a range of values:

darray = cc.querying.getvar('1deg_jra55v13_iaf_spinup1_B1', 'v', session, ncfile="ocean.nc")
v = darray.isel(time=100).sel(st_ocean=50, method='nearest').sel(yu_ocean=slice(-50, -20)).sel(xu_ocean=slice(-230, -180)).load()
v.plot();


Here we have taken meridional velocity and sliced out a small region of interest for our plot. Note the .load() method, which tells xarray to do the calculation now (otherwise xarray defers calculations until the values are actually needed).
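The difference between the two methods (and value-based slicing) can be sketched on a small synthetic array; the coordinate names mimic the model’s, but the values are made up:

```python
import numpy as np
import xarray as xr

# Synthetic field on depth and latitude coordinates, standing in for model output.
darray = xr.DataArray(
    np.arange(20.0).reshape(4, 5),
    coords={"st_ocean": [5.0, 50.0, 990.0, 2000.0],       # depth levels (m)
            "yu_ocean": [-60.0, -45.0, -30.0, -15.0, 0.0]},  # latitudes
    dims=["st_ocean", "yu_ocean"],
)

a = darray.isel(st_ocean=2)                      # by position: the 3rd depth level
b = darray.sel(st_ocean=1000, method="nearest")  # by value: nearest level to 1000 m
c = darray.sel(yu_ocean=slice(-50, -20))         # by value range: -50 to -20 degrees

print(float(a.st_ocean), float(b.st_ocean))  # both pick the 990 m level
```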

### 2.3 Averaging along dimensions

We often perform operations such as averaging on DataArrays. Again, knowledge of the coordinates can be a big help here, as you can instruct the mean() method to operate along given coordinates. The case below takes a temporal and zonal average of potential density.

#### IMPORTANT

To be precise, it is actually a mean in the $i$-grid direction, which is only zonal outside the tripolar region in the Arctic, i.e., south of 65°N in the ACCESS-OM2 models. To compute the zonal mean correctly one needs to be a bit more careful; see the [True_Zonal_Mean tutorial](https://cosima-recipes.readthedocs.io/en/latest/documented_examples/True_Zonal_Mean.html).

darray = cc.querying.getvar('1deg_jra55v13_iaf_spinup1_B1', 'pot_rho_2', session, n=-10)
darray.mean('time').mean('xt_ocean').plot(cmap=cm.cm.haline)
plt.gca().invert_yaxis();

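The pattern of chaining mean() over named dimensions can be sketched on synthetic data (the dimension names here are illustrative):

```python
import numpy as np
import xarray as xr

# Synthetic (time, depth, longitude) field standing in for pot_rho_2.
rng = np.random.default_rng(0)
darray = xr.DataArray(rng.random((4, 3, 5)),
                      dims=["time", "st_ocean", "xt_ocean"])

# Average over time, then over the i-grid direction, leaving a depth profile.
profile = darray.mean("time").mean("xt_ocean")
print(profile.dims)  # ('st_ocean',)
```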


### 2.4 Resampling

xarray uses datetime conventions to allow for operations such as resampling in time. This resampling is simple and powerful. Here is an example of re-plotting the figure from 2.1 with annual averaging:

darray = cc.querying.getvar('025deg_jra55v13_iaf_gmredi6', 'temp_global_ave', session)
meandata = darray.resample(time='A').mean(dim='time')
meandata.plot();
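
The effect of annual resampling is easy to verify on a synthetic monthly series (the "YS" alias means year-start; the values here are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three years of synthetic monthly data standing in for temp_global_ave.
time = pd.date_range("1900-01-01", periods=36, freq="MS")
darray = xr.DataArray(np.arange(36, dtype="float64"),
                      coords={"time": time}, dims=["time"])

# Collapse each calendar year to its mean: 36 monthly values -> 3 annual ones.
meandata = darray.resample(time="YS").mean(dim="time")
print(meandata.values)  # [ 5.5 17.5 29.5]
```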


### 2.5 Exercises

• Pick an experiment and plot a map of the temperature of the upper 100m of the ocean for one year.

• Now, take the same experiment and construct a timeseries of spatially averaged (regional or global) upper 700m temperature, resampled every 3 years.

## 3. More advanced methods

### 3.1 Making a map with cartopy

darray = cc.querying.getvar('025deg_jra55v13_iaf_gmredi6', 'temp', session, n=-1)
temp = darray.mean('time').sel(st_ocean=50, method='nearest') - 273.15
plt.figure(figsize=(8, 4))
ax = plt.axes(projection=ccrs.Robinson())
temp.plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x='xt_ocean', y='yt_ocean', cmap=cm.cm.thermal, vmin=-2, vmax=30);


### 3.2 Distributed computing

Many of our scripts use multiple cores for their calculations, usually via the following commands, which set up a local cluster on your node for distributed computation.

from dask.distributed import Client

client = Client()
client


Cluster: Workers: 8, Cores: 48, Memory: 202.49 GB

The dashboard link should allow you to access information on how your work is distributed between the cores on your local cluster.