Exploring the COSIMA Cookbook

Statement of problem

COSIMA produces a lot of data, and we need to be able to find it in order to analyse it. COSIMA outputs are currently stored in the outputs directory of the ik11 project. This directory contains a subdirectory for each model resolution, and each of those contains a subdirectory for each model configuration:

!ls /g/data/ik11/outputs
README      access-om2  access-om2-01  access-om2-025
!ls /g/data/ik11/outputs/access-om2-01
01deg_jra55v13_ryf9091                  01deg_jra55v13_ryf9091_qian_wp
01deg_jra55v13_ryf9091_5Kv      01deg_jra55v13_ryf9091_tides_control
01deg_jra55v13_ryf9091_OFAM3visc    01deg_jra55v13_ryf9091_tides_fixed
01deg_jra55v13_ryf9091_k_smag_iso3  01deg_jra55v140_iaf

All the data is contained in netCDF files, of which there are many!

!find /g/data/ik11/outputs/ -iname '*.nc' | wc -l
86068

GOAL: access data by specifying an experiment and a variable

COSIMA Cookbook solution

To achieve this goal, the COSIMA Cookbook provides tools to crawl directories looking for netCDF data files, read metadata from the files about the data they contain, and save this metadata to an SQL database.
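The crawl-and-index idea can be illustrated with a minimal sketch that uses only the Python standard library. This is not the Cookbook's implementation: the table layout and the function names index_netcdf_files and files_for_experiment are hypothetical, the experiment name is assumed to be the first directory level below the root, and only file paths are recorded, whereas the Cookbook also reads metadata from inside each netCDF file.

```python
import os
import sqlite3


def index_netcdf_files(root, conn):
    """Walk `root` and record every *.nc file in a (hypothetical) `ncfiles` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ncfiles (experiment TEXT, path TEXT UNIQUE)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            # Case-insensitive match, like `find -iname '*.nc'`.
            if not fname.lower().endswith(".nc"):
                continue
            rel = os.path.relpath(os.path.join(dirpath, fname), root)
            # Assumption: the first path component below `root` names the experiment.
            experiment = rel.split(os.sep)[0]
            # The UNIQUE constraint makes re-indexing the same tree harmless.
            conn.execute(
                "INSERT OR IGNORE INTO ncfiles VALUES (?, ?)", (experiment, rel)
            )
    conn.commit()


def files_for_experiment(conn, experiment):
    """Return the sorted relative paths indexed for one experiment."""
    rows = conn.execute(
        "SELECT path FROM ncfiles WHERE experiment = ? ORDER BY path", (experiment,)
    )
    return [path for (path,) in rows]
```

With an index of this kind, finding the files that belong to one experiment becomes a single database query rather than a search of the filesystem.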

The Cookbook also provides an API to query the database and retrieve data by experiment and variable name.

import cosima_cookbook as cc
session = cc.database.create_session()
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='u', session=session, n=1)
<xarray.DataArray 'u' (Time: 1, zaxis_1: 75, yaxis_1: 2700, xaxis_1: 3600)>
dask.array<open_dataset-8278b6c6951bdac96657a7d49ccc0ba1u, shape=(1, 75, 2700, 3600), dtype=float64, chunksize=(1, 19, 135, 180), chunktype=numpy.ndarray>
Coordinates:
  * xaxis_1  (xaxis_1) float64 1.0 2.0 3.0 4.0 ... 3.598e+03 3.599e+03 3.6e+03
  * yaxis_1  (yaxis_1) float64 1.0 2.0 3.0 4.0 ... 2.698e+03 2.699e+03 2.7e+03
  * zaxis_1  (zaxis_1) float64 1.0 2.0 3.0 4.0 5.0 ... 71.0 72.0 73.0 74.0 75.0
  * Time     (Time) float64 1.0
Attributes:
    long_name:  u
    units:      none
    checksum:   B17C5F68783DB907

The question then becomes: how do I find out which experiment to use, and which variables are available? Currently the API provides get_experiments, which returns a list of experiments, and get_variables, which returns a list of variables for a given experiment.

cc.querying.get_experiments(session, all=True)
| | experiment | contact | email | created | description | notes | root_dir | ncfiles |
|---|---|---|---|---|---|---|---|---|
| 0 | 01deg_jra55v13_ryf9091_OFAM3visc | Andrew Kiss | andrew.kiss@anu.edu.au | 2020-03-29 00:00:00 | 0.1 degree ACCESS-OM2 global model configurati... | None | /g/data/ik11/outputs/access-om2-01/01deg_jra55... | 50 |
| 1 | 01deg_jra55v13_ryf9091_tides_fixed | Adele Morrison | adele.morrison@anu.edu.au | 2020-06-11 00:00:00 | 0.1 degree ACCESS-OM2 global model configurati... | Mostly 1 month run lengths, but a couple of mo... | /g/data/ik11/outputs/access-om2-01/01deg_jra55... | 1851 |
| 2 | 01deg_jra55v13_ryf9091_k_smag_iso3 | Andrew Kiss | andrew.kiss@anu.edu.au | 2020-03-29 00:00:00 | 0.1 degree ACCESS-OM2 global model configurati... | None | /g/data/ik11/outputs/access-om2-01/01deg_jra55... | 128 |
| 3 | 01deg_jra55v13_ryf9091_5Kv | Ryan Holmes | ryan.holmes@unsw.edu.au | 2020-03-01 00:00:00 | As for 01deg_jra55v13_ryf9091 except with a ba... | None | /g/data/ik11/outputs/access-om2-01/01deg_jra55... | 19 |
| 4 | 1deg_jra55v131_ryf_nonuniform_albedo | Andrew Kiss | andrew.kiss@anu.edu.au | 2020-03-24 00:00:00 | 1 degree ACCESS-OM2 global model configuration... | None | /g/data/ik11/outputs/access-om2/1deg_jra55v131... | 260 |
| 5 | 01deg_jra55v13_ryf9091_tides_control | None | None | NaT | None | None | /g/data/ik11/outputs/access-om2-01/01deg_jra55... | 620 |
| 6 | 1deg_jra55v131_ryf_const_albedo | Andrew Kiss | andrew.kiss@anu.edu.au | 2020-03-24 00:00:00 | 1 degree ACCESS-OM2 global model configuration... | None | /g/data/ik11/outputs/access-om2/1deg_jra55v131... | 260 |
| 7 | 01deg_jra55v13_ryf9091_tides | None | None | NaT | None | None | /g/data/ik11/outputs/access-om2-01/01deg_jra55... | 2578 |
| 8 | 025deg_jra55_ryf9091_gadi_noGM | Ryan Holmes | ryan.holmes@unsw.edu.au | 2020-04-01 00:00:00 | 0.25 degree ACCESS-OM2 global model configurat... | None | /g/data/ik11/outputs/access-om2-025/025deg_jra... | 316 |
| 9 | 1deg_jra55_iaf_v2.0.0rc3_nonuniform_albedo | Andrew Kiss | andrew.kiss@anu.edu.au | 2020-05-30 00:00:00 | 1 degree ACCESS-OM2 global model configuration... | None | /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... | 4660 |
| 10 | 025deg_jra55_ryf9091_gadi_norediGM | Ryan Holmes | ryan.holmes@unsw.edu.au | 2020-04-01 00:00:00 | 0.25 degree ACCESS-OM2 global model configurat... | None | /g/data/ik11/outputs/access-om2-025/025deg_jra... | 312 |
| 11 | 01deg_jra55v140_iaf | Andrew Kiss | andrew.kiss@anu.edu.au | 2020-06-09 00:00:00 | 0.1 degree ACCESS-OM2 global model configurati... | The 0.1 degree ACCESS-OM2 model spin up using ... | /g/data/ik11/outputs/access-om2-01/01deg_jra55... | 47199 |
| 12 | 01deg_jra55v13_ryf9091 | Andy Hogg | andy.hogg@anu.edu.au | 2020-06-11 00:00:00 | 0.1 degree ACCESS-OM2 global model configurati... | Additional daily outputs saved from 1 Jan 1950... | /g/data/ik11/outputs/access-om2-01/01deg_jra55... | 9964 |
| 13 | 1deg_jra55_ryf9091_gadi | Ryan Holmes | ryan.holmes@unsw.edu.au | 2020-02-01 00:00:00 | 1 degree ACCESS-OM2 global model configuration... | None | /g/data/ik11/outputs/access-om2/1deg_jra55_ryf... | 6865 |
| 14 | 025deg_jra55_ryf9091_gadi | Ryan Holmes | ryan.holmes@unsw.edu.au | 2020-02-01 00:00:00 | 0.25 degree ACCESS-OM2 global model configurat... | None | /g/data/ik11/outputs/access-om2-025/025deg_jra... | 8838 |
| 15 | 1deg_jra55_iaf_v2.0.0rc3 | Andrew Kiss | andrew.kiss@anu.edu.au | 2020-05-30 00:00:00 | 1 degree ACCESS-OM2 global model configuration... | None | /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... | 4660 |
| 16 | 01deg_jra55v13_ryf9091_qian_wp | Qian Li | qian.li5@unsw.edu.au | 2020-03-13 00:00:00 | Wind perturbation experiment | None | /g/data/ik11/outputs/access-om2-01/01deg_jra55... | 36 |
| 17 | MRI-JRA55-do-1-4-0 | Hiroyuki Tsujino | htsujino@mri-jma.go.jp | 2019-03-08 08:53:09 | MRI JRA55-do 1.4.0 dataset prepared for input4... | Based on JRA-55 reanalysis (1958-01 to 2019-01... | /g/data/ik11/inputs/JRA-55/MRI-JRA55-do/MRI-JR... | 682 |
| 18 | JRA55-RYF-1-4 | Kial Stewart | kial.stewart@anu.edu.au | 2020-04-17 00:00:00 | This dataset is derived from JRA55-do (JRA-55 ... | Further information on source dataset availabl... | /g/data/ik11/inputs/JRA-55/RYF/indexing/JRA55-... | 10 |

variables = cc.querying.get_variables(session, experiment='01deg_jra55v140_iaf')
variables
| | name | long_name | frequency | ncfile | # ncfiles | time_start | time_end |
|---|---|---|---|---|---|---|---|
| 0 | SALT | None | None | restart239/ice/monthly_sstsss.nc | 61 | None | None |
| 1 | TEMP | None | None | restart239/ice/monthly_sstsss.nc | 61 | None | None |
| 2 | Time | Time | None | restart243/ocean/ocean_velocity_advection.res.nc | 671 | None | None |
| 3 | Tsfcn | None | None | restart227/ice/iced.2015-01-01-00000.nc | 61 | None | None |
| 4 | advectionu | advectionu | None | restart243/ocean/ocean_velocity_advection.res.nc | 61 | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 362 | time | time | static | output243/ocean/ocean-2d-drag_coeff.nc | 3660 | 1900-01-01 00:00:00 | 2019-01-01 00:00:00 |
| 363 | xt_ocean | tcell longitude | static | output034/ocean/ocean-2d-geolat_t.nc | 1708 | 1900-01-01 00:00:00 | 1900-01-01 00:00:00 |
| 364 | xu_ocean | ucell longitude | static | output243/ocean/ocean-2d-drag_coeff.nc | 1952 | 1900-01-01 00:00:00 | 2019-01-01 00:00:00 |
| 365 | yt_ocean | tcell latitude | static | output034/ocean/ocean-2d-geolat_t.nc | 1708 | 1900-01-01 00:00:00 | 1900-01-01 00:00:00 |
| 366 | yu_ocean | ucell latitude | static | output243/ocean/ocean-2d-drag_coeff.nc | 1952 | 1900-01-01 00:00:00 | 2019-01-01 00:00:00 |

367 rows × 7 columns

However, the same variable is sometimes available at more than one temporal frequency:

variables[variables.name == 'surface_salt']
| | name | long_name | frequency | ncfile | # ncfiles | time_start | time_end |
|---|---|---|---|---|---|---|---|
| 169 | surface_salt | Practical Salinity | 1 daily | output243/ocean/ocean-2d-surface_salt-1-daily-... | 244 | 1958-01-01 00:00:00 | 2019-01-01 00:00:00 |
| 304 | surface_salt | Practical Salinity | 1 monthly | output243/ocean/ocean-2d-surface_salt-1-monthl... | 244 | 1958-01-01 00:00:00 | 2019-01-01 00:00:00 |
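
Such cases can be spotted ahead of time from the table returned by get_variables. The following is a minimal sketch, assuming that table is a pandas DataFrame with name and frequency columns as displayed above; the helper name multi_frequency_variables is hypothetical and not part of the Cookbook.

```python
def multi_frequency_variables(variables):
    """Map each variable name that occurs with more than one frequency
    to the sorted list of its frequencies.

    `variables` is assumed to be a pandas DataFrame with `name` and
    `frequency` columns, like the get_variables output shown above.
    """
    # Rows without a frequency (e.g. from restart files) are ignored.
    with_freq = variables.dropna(subset=["frequency"])
    # Collect the distinct frequencies recorded for each variable name.
    freqs = with_freq.groupby("name")["frequency"].unique()
    return {name: sorted(f) for name, f in freqs.items() if len(f) > 1}
```

Called as multi_frequency_variables(variables), this would be expected to include an entry for surface_salt listing both '1 daily' and '1 monthly', given the two rows shown above.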

If you simply try to load this data you will get an error, because you would be combining data from different files with different temporal frequencies:

cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt', session=session)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-10-4c4ac916b058> in <module>
----> 1 cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt', session=session)


/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.07/lib/python3.7/site-packages/cosima_cookbook/querying.py in getvar(expt, variable, session, ncfile, start_time, end_time, n, frequency, **kwargs)
    185         if variable not in d.coords
    186         else d,
--> 187         **xr_kwargs
    188     )
    189


/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.07/lib/python3.7/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, combine, autoclose, parallel, join, attrs_file, **kwargs)
    950                 coords=coords,
    951                 join=join,
--> 952                 combine_attrs="drop",
    953             )
    954         else:


/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.07/lib/python3.7/site-packages/xarray/core/combine.py in combine_by_coords(datasets, compat, data_vars, coords, fill_value, join, combine_attrs)
    750                 raise ValueError(
    751                     "Resulting object does not have monotonic"
--> 752                     " global indexes along dimension {}".format(dim)
    753                 )
    754         concatenated_grouped_by_data_vars.append(concatenated)


ValueError: Resulting object does not have monotonic global indexes along dimension time
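
The getvar signature shown in the traceback above includes a frequency argument, which can be used to resolve the ambiguity. The following is a minimal sketch, assuming the accepted frequency strings match those in the frequency column of get_variables (here '1 monthly'). The wrapper name load_surface_salt is hypothetical, and the call needs access to the Cookbook database.

```python
def load_surface_salt(session, frequency="1 monthly"):
    """Load surface_salt at a single temporal frequency."""
    # Imported here so the function can be defined without the Cookbook installed.
    import cosima_cookbook as cc

    return cc.querying.getvar(
        expt="01deg_jra55v140_iaf",
        variable="surface_salt",
        session=session,
        frequency=frequency,  # one of the values listed by get_variables
    )
```

For example, sss = load_surface_salt(session) requests only the monthly files, so files of different frequency are no longer mixed.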

Exploring a Cookbook Database

The COSIMA Cookbook explore submodule seeks to solve the issue of how to find relevant experiments and variables within a Cookbook database and simplify the process of loading this data.

It does this by providing GUI elements that users can embed in their Jupyter notebooks to filter and query the database.

Requirements: The explore submodule requires the cosima-cookbook version found in the conda/analysis3-20.07 (or later) kernel on NCI, or your own up-to-date Cookbook installation.

from cosima_cookbook import explore

Database Explorer

The first component is DatabaseExplorer, which is used to find relevant experiments. Pass an existing session to re-use it, or omit the session argument to start with the default database.

Filtering can be applied to narrow down the list of experiments. Select one or more keywords to restrict the list to experiments that contain all of the selected keywords. To show only experiments that contain given variables, select each variable from the list of available variables in the database and push the ‘>>’ button to move it to the right-hand box. When the filter button is then pushed, only experiments that contain the variables in the right-hand box are shown. Variables can be removed from the filtering box by selecting them and pushing ‘<<’.

Note that the list of available variables contains all variables in the database, and filtering by keyword does not change it. Both filters are applied together to find the list of matching experiments, but they are otherwise independent.

Note also that the list of available variables is pre-filtered: all variables from restart files and variables that can be unambiguously identified as coordinate variables are not listed. It is possible to remove this pre-filtering by deselecting the checkboxes underneath the variable list.

By default all variables from all model components are shown in the selection box. To display only variables from one model component select the required component from the dropdown menu which defaults to “All models”.

The search box can be used to further narrow the list of available variables. When text is entered into the search box only variables that contain that text in their variable name or their long_name attribute will be displayed in the selection box.

When a variable is selected, its long_name is displayed below the variable selector box. In some cases, filtering and/or searching will automatically select a variable that may not show as highlighted in the selector box. This is undesirable, but currently unavoidable.

When an experiment is selected and the ‘Load Experiment’ button is pushed, an Experiment Explorer GUI element opens below the Database Explorer. The Experiment Explorer is explained in detail in the next section.

(Note: The widgets have been exported to be usable in an HTML page, but they will ONLY function properly when loaded in a Jupyter notebook.)

from cosima_cookbook import explore
dbx = explore.DatabaseExplorer(session=session)
dbx
DatabaseExplorer(children=(HTML(value='<style>.header p{ line-height: 1.4; margin-bottom: 10px }</style>\n    …

Experiment Explorer

The ExperimentExplorer can be used independently of the DatabaseExplorer if you already know the experiment you wish to load.

You can re-use an existing database session, or not supply that argument and a new session will be created automatically with the default database. If you pass an experiment name this experiment will be loaded by default, but it is not necessary to do so, as any experiment present in the database can be selected from a drop-down menu at the top.

The box showing the available variables is the same as the one in the filtering element from DatabaseExplorer, with exactly the same functionality to show only variables from selected models, search by variable name and long name, and filter out coordinates and restarts.

When a variable is selected the long name is displayed below the box as before, but it also populates the frequency drop down and date range slider to the right. Identical variables can be present in a data set with different temporal frequencies. It is necessary to choose a frequency in this case as those variables cannot be loaded into the same xarray.DataArray. When a frequency is selected the date range slider may change the range of available dates if they differ between the two frequencies.

If you only need data for a limited period, it is advisable to reduce the date range before loading: fewer files then need to be opened and have their metadata checked, so loading is much quicker.
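
In code, the same restriction might look like the sketch below. It assumes that the start_time and end_time arguments seen in the getvar signature earlier accept date strings such as '1970-01-01', and that the loaded data has a dimension named time; both are assumptions worth checking against your Cookbook version. The wrapper name load_time_range is hypothetical.

```python
def load_time_range(session, expt, variable, frequency, start, end):
    """Load `variable` for the period between `start` and `end` only."""
    # Imported here so the function can be defined without the Cookbook installed.
    import cosima_cookbook as cc

    da = cc.querying.getvar(
        expt=expt,
        variable=variable,
        session=session,
        frequency=frequency,
        start_time=start,  # assumed format: 'YYYY-MM-DD'
        end_time=end,
    )
    # Whole files are opened, so trim to the exact period afterwards.
    return da.sel(time=slice(start, end))
```

A call such as load_time_range(session, '01deg_jra55v140_iaf', 'surface_temp', '1 monthly', '1970-01-01', '1978-12-31') would then only touch the files that cover those years.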

Once you have selected a variable and confirmed that the frequency and date range are correct, push the “Load” button and the data will be loaded into an xarray.DataArray object. The metadata of the loaded data is then displayed at the end of the cell output.

The relevant command used to load the data is displayed, so that it can be copied, reused, and/or modified.

The loaded data is available as the .data attribute of the ExperimentExplorer object. At any time a different variable from the same or a different experiment can be loaded, and the .data attribute will be updated to reflect the new data.
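
Because the .data attribute is an ordinary xarray.DataArray, it can be passed straight to the usual xarray operations. The following is a minimal sketch, assuming the loaded variable is a temperature in K with a time dimension (as for the surface_temp data shown in this section); the function name time_mean_celsius is hypothetical.

```python
def time_mean_celsius(da):
    """Time mean of a temperature DataArray, converted from K to degrees C."""
    # Average over time, then shift from Kelvin to Celsius.
    mean = da.mean("time", keep_attrs=True) - 273.15
    # Record the new units explicitly, since arithmetic may drop attributes.
    mean.attrs["units"] = "degrees C"
    return mean
```

With an ExperimentExplorer named ee, time_mean_celsius(ee.data) returns a new DataArray; for dask-backed data the computation is deferred until the values are actually needed.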

ee = explore.ExperimentExplorer(session=session, experiment='01deg_jra55v140_iaf')
ee
ExperimentExplorer(children=(HTML(value='\n            <h3>Experiment Explorer</h3>\n\n            <p>Select a…
ee.data
<xarray.DataArray 'surface_temp' (time: 99, yt_ocean: 2700, xt_ocean: 3600)>
dask.array<concatenate, shape=(99, 2700, 3600), dtype=float32, chunksize=(1, 540, 720), chunktype=numpy.ndarray>
Coordinates:
  * xt_ocean  (xt_ocean) float64 -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
  * yt_ocean  (yt_ocean) float64 -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
  * time      (time) object 1970-10-16 12:00:00 ... 1978-12-16 12:00:00
Attributes:
    long_name:      Conservative temperature
    units:          K
    valid_range:    [-10. 500.]
    cell_methods:   time: mean
    time_avg_info:  average_T1,average_T2,average_DT
    coordinates:    geolon_t geolat_t
    standard_name:  sea_surface_conservative_temperature
