Exploring the COSIMA Cookbook

Statement of problem

COSIMA is producing a lot of data, and we need to be able to find it in order to analyse it. The data is spread across multiple locations. One of these is the outputs directory of the ik11 project, which contains a subdirectory for each model resolution; each of these in turn contains a subdirectory for each model configuration.

[1]:
!ls /g/data/ik11/outputs
access-cm2-025  access-om2-01   mom6-eac        mom6-om4-025  README
access-om2      access-om2-025  mom6-global-01  mom6-panan
[2]:
!ls /g/data/ik11/outputs/access-om2-01
01deg_jra55v13_iaf_4hourly
01deg_jra55v13_ryf9091
01deg_jra55v13_ryf9091_5Kv
01deg_jra55v13_ryf9091_easterlies_down10
01deg_jra55v13_ryf9091_easterlies_up10
01deg_jra55v13_ryf9091_easterlies_up10_meridional
01deg_jra55v13_ryf9091_easterlies_up10_noDSW
01deg_jra55v13_ryf9091_easterlies_up10_zonal
01deg_jra55v13_ryf9091_k_smag_iso3
01deg_jra55v13_ryf9091_OFAM3visc
01deg_jra55v13_ryf9091_qian_ctrl
01deg_jra55v13_ryf9091_qian_wthmp
01deg_jra55v13_ryf9091_qian_wthp
01deg_jra55v13_ryf9091_qian_wtlp
01deg_jra55v13_ryf9091_rerun_for_easterlies
01deg_jra55v13_ryf9091_weddell_down2
01deg_jra55v13_ryf9091_weddell_up1
01deg_jra55v140_iaf
01deg_jra55v140_iaf_cycle2
01deg_jra55v140_iaf_cycle3
01deg_jra55v140_iaf_cycle3_antarctic_tracers
01deg_jra55v140_iaf_cycle3_HF
01deg_jra55v140_iaf_cycle4
01deg_jra55v140_iaf_cycle4_jra55v150_extension
01deg_jra55v140_iaf_cycle4_MWpert
01deg_jra55v140_iaf_cycle4_OLD
01deg_jra55v140_iaf_cycle4_rerun_from_1980
01deg_jra55v140_iaf_cycle4-test
01deg_jra55v140_iaf_KvJ09
01deg_jra55v150_iaf_cycle1
basal_melt_outputs

All the data is contained in netCDF files, of which there are many: at the time of writing, there were 48118 netCDF files in the above directories.
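
For reference, a count like that can be reproduced with a recursive search of the outputs directory; a small sketch (it may take a while over this many files):

[ ]:
# Count the netCDF files under the outputs directory (illustrative only;
# a recursive glob over ~50k files can be slow).
from pathlib import Path

sum(1 for _ in Path('/g/data/ik11/outputs').rglob('*.nc'))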

GOAL: access data by specifying an experiment and a variable

COSIMA Cookbook solution

In order to achieve the above goal the COSIMA Cookbook provides tools to search directories for netCDF data files, read metadata from those files about the data they contain, and save this metadata to an SQL database.

The Cookbook also provides an API to query the database and retrieve data by experiment and variable name.
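
By default create_session() connects to the default database, but the same machinery can also index your own output into a personal database. A minimal sketch, assuming the current database API and using hypothetical paths:

[ ]:
# A sketch only: build a personal database over hypothetical output directories.
# create_session() opens (or creates) the given SQLite file, and build_index()
# scans the directories and records the netCDF metadata it finds.
import cosima_cookbook as cc

my_session = cc.database.create_session('/path/to/my/cookbook.db')
cc.database.build_index(['/path/to/my/model/output'], my_session)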

[3]:
import cosima_cookbook as cc
[4]:
session = cc.database.create_session()
[5]:
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='u', session=session, n=1)
[5]:
<xarray.DataArray 'u' (time: 3, st_ocean: 75, yu_ocean: 2700, xu_ocean: 3600)>
dask.array<open_dataset-bd3ff7e4cd0aac77292c3fb238623039u, shape=(3, 75, 2700, 3600), dtype=float32, chunksize=(1, 19, 135, 180), chunktype=numpy.ndarray>
Coordinates:
  * xu_ocean  (xu_ocean) float64 -279.9 -279.8 -279.7 -279.6 ... 79.8 79.9 80.0
  * yu_ocean  (yu_ocean) float64 -81.09 -81.05 -81.0 -80.96 ... 89.92 89.96 90.0
  * st_ocean  (st_ocean) float64 0.5413 1.681 2.94 ... 5.511e+03 5.709e+03
  * time      (time) datetime64[ns] 1958-01-16T12:00:00 ... 1958-03-16T12:00:00
Attributes: (12/13)
    long_name:      i-current
    units:          m/sec
    valid_range:    [-10.  10.]
    cell_methods:   time: mean
    time_avg_info:  average_T1,average_T2,average_DT
    coordinates:    geolon_c geolat_c
    ...             ...
    ncfiles:        ['/g/data/cj50/access-om2/raw-output/access-om2-01/01deg_...
    contact:        Andrew Kiss
    email:          andrew.kiss@anu.edu.au
    created:        2020-06-09
    description:    0.1 degree ACCESS-OM2 global model configuration under in...
    notes:          Source code: https://github.com/COSIMA/access-om2 License...

The question then becomes: how do I find out which experiment to use, and what variables are available? Currently the API provides get_experiments, which gives a list of experiments, and get_variables, which returns a list of variables for a given experiment.

[6]:
cc.querying.get_experiments(session, all=True)
[6]:
experiment contact email created description notes url root_dir ncfiles
0 woa18 Ocean Climate Laboratory, National Centers for... NCEI.info@noaa.gov 2019-07-29 Climatological mean state for the global ocean... These data are openly available to the public.... http://www.ncei.noaa.gov /g/data/ik11/observations/woa18 24
1 eac-zstar-v1 None None None None None None /g/data/ik11/outputs/mom6-eac/eac-zstar-v1 29
2 eac-zstar-v2 None None None None None None /g/data/ik11/outputs/mom6-eac/eac-zstar-v2 76
3 025deg_jra55_ryf9091_gadi_norediGM Ryan Holmes ryan.holmes@unsw.edu.au 2020-04-01 0.25 degree ACCESS-OM2 global model configurat... None None /g/data/ik11/outputs/access-om2-025/025deg_jra... 312
4 025deg_jra55_ryf9091_gadi_noGM Ryan Holmes ryan.holmes@unsw.edu.au 2020-04-01 0.25 degree ACCESS-OM2 global model configurat... None None /g/data/ik11/outputs/access-om2-025/025deg_jra... 316
... ... ... ... ... ... ... ... ... ...
171 1deg_jra55_iaf_omip2spunup_cycle50 Pat Wongpan pat.wongpan@utas.edu.au 2023-01-20 Continuation of omip2spunup up to 50 cycles (i... None None /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... 998
172 1deg_jra55_iaf_omip2spunup_cycle46 Pat Wongpan pat.wongpan@utas.edu.au 2023-01-20 Continuation of omip2spunup up to 50 cycles (i... None None /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... 1014
173 1deg_jra55_iaf_omip2spunup_cycle49 Pat Wongpan pat.wongpan@utas.edu.au 2023-01-20 Continuation of omip2spunup up to 50 cycles (i... None None /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... 1014
174 1deg_jra55_iaf_omip2spunup_cycle48 Pat Wongpan pat.wongpan@utas.edu.au 2023-01-20 Continuation of omip2spunup up to 50 cycles (i... None None /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... 1014
175 1deg_jra55_iaf_omip2spunup_cycle47 Pat Wongpan pat.wongpan@utas.edu.au 2023-01-20 Continuation of omip2spunup up to 50 cycles (i... None None /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... 1014

176 rows × 9 columns

[7]:
variables = cc.querying.get_variables(session, experiment='01deg_jra55v140_iaf')
variables
[7]:
name long_name units frequency ncfile cell_methods # ncfiles time_start time_end
0 pfmice_i None None None output028/ocean/o2i.nc None 244 None None
1 sslx_i None None None output028/ocean/o2i.nc None 244 None None
2 ssly_i None None None output028/ocean/o2i.nc None 244 None None
3 sss_i None None None output028/ocean/o2i.nc None 244 None None
4 sst_i None None None output028/ocean/o2i.nc None 244 None None
... ... ... ... ... ... ... ... ... ...
266 time time days since 1900-01-01 00:00:00 static output243/ocean/ocean-2d-drag_coeff.nc None 3660 1900-01-01 00:00:00 2019-01-01 00:00:00
267 xt_ocean tcell longitude degrees_E static output126/ocean/ocean-2d-ht.nc None 1708 1900-01-01 00:00:00 1900-01-01 00:00:00
268 xu_ocean ucell longitude degrees_E static output243/ocean/ocean-2d-drag_coeff.nc None 1952 1900-01-01 00:00:00 2019-01-01 00:00:00
269 yt_ocean tcell latitude degrees_N static output126/ocean/ocean-2d-ht.nc None 1708 1900-01-01 00:00:00 1900-01-01 00:00:00
270 yu_ocean ucell latitude degrees_N static output243/ocean/ocean-2d-drag_coeff.nc None 1952 1900-01-01 00:00:00 2019-01-01 00:00:00

271 rows × 9 columns
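
Since get_variables returns a pandas DataFrame, it can be filtered directly; for example, a case-insensitive search of the long_name column (any other pandas filtering works just as well):

[ ]:
# Search long_name for 'salinity' (case-insensitive); na=False skips rows
# where long_name is missing.
variables[variables.long_name.str.contains('salinity', case=False, na=False)]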

But there are sometimes duplicate variables with different frequencies:

[8]:
variables[variables.name == 'surface_salt']
[8]:
name long_name units frequency ncfile cell_methods # ncfiles time_start time_end
54 surface_salt Practical Salinity psu 1 daily output243/ocean/ocean-2d-surface_salt-1-daily-... time: mean 244 1958-01-01 00:00:00 2019-01-01 00:00:00
201 surface_salt Practical Salinity psu 1 monthly output243/ocean/ocean-2d-surface_salt-1-monthl... time: mean 244 1958-01-01 00:00:00 2019-01-01 00:00:00

If you just try to load this data you will get a QueryWarning, because you would be loading data from files with different temporal frequencies:

[9]:
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt', session=session)
---------------------------------------------------------------------------
QueryWarning                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt', session=session)

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/lib/python3.9/site-packages/cosima_cookbook/querying.py:334, in getvar(expt, variable, session, ncfile, start_time, end_time, n, frequency, attrs, attrs_unique, return_dataset, **kwargs)
    331 if attrs_unique is None:
    332     attrs_unique = {"cell_methods": "time: mean"}
--> 334 ncfiles = _ncfiles_for_variable(
    335     expt,
    336     variable,
    337     session,
    338     ncfile,
    339     start_time,
    340     end_time,
    341     n,
    342     frequency,
    343     attrs,
    344     attrs_unique,
    345 )
    347 variables = [variable]
    348 if return_dataset:
    349     # we know at least one variable was returned, so we can index ncfiles
    350     # ask for the extra variables associated with cell_methods, etc.

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/lib/python3.9/site-packages/cosima_cookbook/querying.py:529, in _ncfiles_for_variable(expt, variable, session, ncfile, start_time, end_time, n, frequency, attrs, attrs_unique)
    527 unique_freqs = set(f.NCFile.frequency for f in ncfiles)
    528 if len(unique_freqs) > 1:
--> 529     warnings.warn(
    530         f"Your query returns files with differing frequencies: {unique_freqs}. "
    531         "This could lead to unexpected behaviour! Disambiguate by passing "
    532         "frequency= to getvar, specifying the desired frequency.",
    533         QueryWarning,
    534     )
    536 return ncfiles

QueryWarning: Your query returns files with differing frequencies: {'1 daily', '1 monthly'}. This could lead to unexpected behaviour! Disambiguate by passing frequency= to getvar, specifying the desired frequency.
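
The warning also suggests the fix: pass frequency= to getvar to select a single frequency, using the frequency strings reported by get_variables (here '1 daily' or '1 monthly'):

[ ]:
# Disambiguate by requesting only the monthly means.
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt',
                   session=session, frequency='1 monthly')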

Exploring a Cookbook Database

The COSIMA Cookbook explore submodule seeks to solve the issue of how to find relevant experiments and variables within a Cookbook database and simplify the process of loading this data.

It does this by providing GUI elements that users can embed in their Jupyter notebooks to filter and query the database.

Requirements: the explore submodule requires the cosima-cookbook version found in the conda/analysis3-20.07 (or later) kernel on NCI, or your own up-to-date cookbook installation.

[10]:
from cosima_cookbook import explore

Database Explorer

The first component is DatabaseExplorer, which is used to find relevant experiments. Re-use an existing session, or omit the session argument and it will start with the default database.

Filtering can be applied to narrow down the number of experiments. Select one or more keywords to reduce the listed experiments to those that contain all the selected keywords.

To show only those experiments which contain given variables, select the variables from the list of available variables in the database and push the ‘>>’ button to move them to the right-hand box. Now when the filter button is pushed, only experiments which contain the variables in the right-hand box will be shown. Variables can be removed from the filtering box by selecting them and pushing ‘<<’. Note that the list of available variables contains all variables in the database; filtering by keyword does not change the available variables. Both filtering methods are applied to find the list of matching experiments, but are otherwise independent.

Note also that the list of available variables is pre-filtered: all variables from restart files and variables that can be unambiguously identified as coordinate variables are not listed. It is possible to remove this pre-filtering by deselecting the checkboxes underneath the variable list.

By default all variables from all model components are shown in the selection box. To display only variables from one model component select the required component from the dropdown menu which defaults to “All models”.

The search box can be used to further narrow the list of available variables. When text is entered into the search box only variables that contain that text in their variable name or their long_name attribute will be displayed in the selection box.

When a variable is selected its long_name is displayed below the variable selector box. In some cases when filtering and/or searching, a variable will be automatically selected but may not show as highlighted in the selector box. This is undesirable, but currently unavoidable.

When an experiment is selected and the ‘Load Experiment’ button pushed, an Experiment Explorer GUI element opens below the Database Explorer. A detailed explanation of the Experiment Explorer is in the next section.

(Note: The widgets have been exported to be viewable in an HTML page, but they will ONLY function properly if loaded as a jupyter notebook)

[11]:
%%time
from cosima_cookbook import explore
dbx = explore.DatabaseExplorer(session=session)
dbx
CPU times: user 58.2 s, sys: 8.46 s, total: 1min 6s
Wall time: 1min 59s

Experiment Explorer

The ExperimentExplorer can be used independently of the DatabaseExplorer if you already know the experiment you wish to load.

You can re-use an existing database session, or omit that argument and a new session will be created automatically with the default database. If you pass an experiment name, that experiment will be loaded by default, but it is not necessary to do so, as any experiment present in the database can be selected from the drop-down menu at the top.

The box showing the available variables is the same as the one in the filtering element from DatabaseExplorer, with exactly the same functionality to show only variables from selected models, search by variable name and long name, and filter out coordinates and restarts.

When a variable is selected, the long name is displayed below the box as before, and the frequency drop-down and date-range slider to the right are populated. Identical variables can be present in a dataset at different temporal frequencies; in that case it is necessary to choose a frequency, as those variables cannot be loaded into the same xarray.DataArray. When a frequency is selected, the date-range slider may update if the range of available dates differs between frequencies.

It is advisable to reduce the date range if you know you only need data for a limited time period, as loading is much quicker when fewer files need to be opened and their metadata checked.

Once you have selected a variable and confirmed the frequency and date range are correct, push the “Load” button and the data will be loaded into an xarray.DataArray object. When this is done, the metadata of the loaded data is displayed at the end of the cell output.

The relevant command used to load the data is displayed, so that it can be copied, reused, and/or modified.

The loaded data is available as the .data attribute of the ExperimentExplorer object. At any time a different variable from the same or a different experiment can be loaded, and the .data attribute will be updated to reflect the new data.

[12]:
ee = explore.ExperimentExplorer(session=session, experiment='01deg_jra55v140_iaf')
ee
[13]:
ee.data
[13]:
<xarray.DataArray 'surface_salt' (time: 1734, yt_ocean: 2700, xt_ocean: 3600)>
dask.array<concatenate, shape=(1734, 2700, 3600), dtype=float32, chunksize=(1, 540, 720), chunktype=numpy.ndarray>
Coordinates:
  * xt_ocean  (xt_ocean) float64 -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
  * yt_ocean  (yt_ocean) float64 -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
  * time      (time) datetime64[ns] 1990-01-01T12:00:00 ... 1994-09-30T12:00:00
Attributes: (12/13)
    long_name:      Practical Salinity
    units:          psu
    valid_range:    [-10. 100.]
    cell_methods:   time: mean
    time_avg_info:  average_T1,average_T2,average_DT
    coordinates:    geolon_t geolat_t
    ...             ...
    ncfiles:        ['/g/data/cj50/access-om2/raw-output/access-om2-01/01deg_...
    contact:        Andrew Kiss
    email:          andrew.kiss@anu.edu.au
    created:        2020-06-09
    description:    0.1 degree ACCESS-OM2 global model configuration under in...
    notes:          Source code: https://github.com/COSIMA/access-om2 License...
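
For reference, the GUI load above corresponds to a direct querying.getvar call. A sketch of the equivalent call for the daily surface_salt selection shown (the start_time and end_time values are illustrative, chosen to match the date range of the DataArray above):

[ ]:
# Equivalent direct call (illustrative dates matching the loaded range).
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt',
                   session=session, frequency='1 daily',
                   start_time='1990-01-01', end_time='1994-09-30')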