Exploring the COSIMA Cookbook¶
Statement of problem¶
COSIMA is producing a lot of data, and we need to be able to find it in order to analyse it. The data is spread across multiple locations. One of these is the outputs directory of the ik11 project. Within it are subdirectories for each model resolution, and within each of these are subdirectories for each model configuration.
[1]:
!ls /g/data/ik11/outputs
access-cm2-025 access-om2-01 mom6-eac mom6-om4-025 README
access-om2 access-om2-025 mom6-global-01 mom6-panan
[2]:
!ls /g/data/ik11/outputs/access-om2-01
01deg_jra55v13_iaf_4hourly
01deg_jra55v13_ryf9091
01deg_jra55v13_ryf9091_5Kv
01deg_jra55v13_ryf9091_easterlies_down10
01deg_jra55v13_ryf9091_easterlies_up10
01deg_jra55v13_ryf9091_easterlies_up10_meridional
01deg_jra55v13_ryf9091_easterlies_up10_noDSW
01deg_jra55v13_ryf9091_easterlies_up10_zonal
01deg_jra55v13_ryf9091_k_smag_iso3
01deg_jra55v13_ryf9091_OFAM3visc
01deg_jra55v13_ryf9091_qian_ctrl
01deg_jra55v13_ryf9091_qian_wthmp
01deg_jra55v13_ryf9091_qian_wthp
01deg_jra55v13_ryf9091_qian_wtlp
01deg_jra55v13_ryf9091_rerun_for_easterlies
01deg_jra55v13_ryf9091_weddell_down2
01deg_jra55v13_ryf9091_weddell_up1
01deg_jra55v140_iaf
01deg_jra55v140_iaf_cycle2
01deg_jra55v140_iaf_cycle3
01deg_jra55v140_iaf_cycle3_antarctic_tracers
01deg_jra55v140_iaf_cycle3_HF
01deg_jra55v140_iaf_cycle4
01deg_jra55v140_iaf_cycle4_jra55v150_extension
01deg_jra55v140_iaf_cycle4_MWpert
01deg_jra55v140_iaf_cycle4_OLD
01deg_jra55v140_iaf_cycle4_rerun_from_1980
01deg_jra55v140_iaf_cycle4-test
01deg_jra55v140_iaf_KvJ09
01deg_jra55v150_iaf_cycle1
basal_melt_outputs
All the data is contained in netCDF files, of which there are many: at the time of writing, there were 48118 netCDF files in the above directories.
GOAL: access data by specifying an experiment and a variable
COSIMA Cookbook solution¶
In order to achieve the above goal, the COSIMA Cookbook provides tools to search directories for netCDF data files, read metadata about the data they contain, and save this metadata to an SQL database.
The Cookbook also provides an API to query the database and retrieve data by experiment and variable name.
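Conceptually this is an index-then-query workflow. The following stdlib-only sketch illustrates the pattern with an in-memory SQLite database; the table layout, file paths, and the files_for helper are invented for illustration and are far simpler than the Cookbook's actual schema:

```python
import sqlite3

# Build a tiny in-memory index mapping (experiment, variable) -> file path.
# The real Cookbook stores far richer metadata (frequency, time ranges, attributes).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ncfiles (experiment TEXT, variable TEXT, path TEXT)")
records = [
    ("01deg_jra55v140_iaf", "u", "output000/ocean/ocean.nc"),
    ("01deg_jra55v140_iaf", "surface_salt", "output000/ocean/ocean-2d.nc"),
    ("025deg_jra55_ryf9091_gadi_noGM", "u", "output000/ocean/ocean.nc"),
]
conn.executemany("INSERT INTO ncfiles VALUES (?, ?, ?)", records)

def files_for(experiment, variable):
    """Return the files that contain `variable` for `experiment`."""
    rows = conn.execute(
        "SELECT path FROM ncfiles WHERE experiment = ? AND variable = ?",
        (experiment, variable),
    )
    return [r[0] for r in rows]

print(files_for("01deg_jra55v140_iaf", "u"))  # → ['output000/ocean/ocean.nc']
```

Once such an index exists, "find my data" becomes a cheap database query rather than a crawl over tens of thousands of files.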
[3]:
import cosima_cookbook as cc
[4]:
session = cc.database.create_session()
[5]:
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='u', session=session, n=1)
[5]:
<xarray.DataArray 'u' (time: 3, st_ocean: 75, yu_ocean: 2700, xu_ocean: 3600)> dask.array<open_dataset-bd3ff7e4cd0aac77292c3fb238623039u, shape=(3, 75, 2700, 3600), dtype=float32, chunksize=(1, 19, 135, 180), chunktype=numpy.ndarray> Coordinates: * xu_ocean (xu_ocean) float64 -279.9 -279.8 -279.7 -279.6 ... 79.8 79.9 80.0 * yu_ocean (yu_ocean) float64 -81.09 -81.05 -81.0 -80.96 ... 89.92 89.96 90.0 * st_ocean (st_ocean) float64 0.5413 1.681 2.94 ... 5.511e+03 5.709e+03 * time (time) datetime64[ns] 1958-01-16T12:00:00 ... 1958-03-16T12:00:00 Attributes: (12/13) long_name: i-current units: m/sec valid_range: [-10. 10.] cell_methods: time: mean time_avg_info: average_T1,average_T2,average_DT coordinates: geolon_c geolat_c ... ... ncfiles: ['/g/data/cj50/access-om2/raw-output/access-om2-01/01deg_... contact: Andrew Kiss email: andrew.kiss@anu.edu.au created: 2020-06-09 description: 0.1 degree ACCESS-OM2 global model configuration under in... notes: Source code: https://github.com/COSIMA/access-om2 License...
The question then becomes: how do I find out which experiment to use, and which variables are available? Currently the API provides get_experiments, which returns a list of experiments, and get_variables, which returns a list of variables for a given experiment.
[6]:
cc.querying.get_experiments(session, all=True)
[6]:
| | experiment | contact | email | created | description | notes | url | root_dir | ncfiles |
|---|---|---|---|---|---|---|---|---|---|
| 0 | woa18 | Ocean Climate Laboratory, National Centers for... | NCEI.info@noaa.gov | 2019-07-29 | Climatological mean state for the global ocean... | These data are openly available to the public.... | http://www.ncei.noaa.gov | /g/data/ik11/observations/woa18 | 24 |
| 1 | eac-zstar-v1 | None | None | None | None | None | None | /g/data/ik11/outputs/mom6-eac/eac-zstar-v1 | 29 |
| 2 | eac-zstar-v2 | None | None | None | None | None | None | /g/data/ik11/outputs/mom6-eac/eac-zstar-v2 | 76 |
| 3 | 025deg_jra55_ryf9091_gadi_norediGM | Ryan Holmes | ryan.holmes@unsw.edu.au | 2020-04-01 | 0.25 degree ACCESS-OM2 global model configurat... | None | None | /g/data/ik11/outputs/access-om2-025/025deg_jra... | 312 |
| 4 | 025deg_jra55_ryf9091_gadi_noGM | Ryan Holmes | ryan.holmes@unsw.edu.au | 2020-04-01 | 0.25 degree ACCESS-OM2 global model configurat... | None | None | /g/data/ik11/outputs/access-om2-025/025deg_jra... | 316 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 171 | 1deg_jra55_iaf_omip2spunup_cycle50 | Pat Wongpan | pat.wongpan@utas.edu.au | 2023-01-20 | Continuation of omip2spunup up to 50 cycles (i... | None | None | /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... | 998 |
| 172 | 1deg_jra55_iaf_omip2spunup_cycle46 | Pat Wongpan | pat.wongpan@utas.edu.au | 2023-01-20 | Continuation of omip2spunup up to 50 cycles (i... | None | None | /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... | 1014 |
| 173 | 1deg_jra55_iaf_omip2spunup_cycle49 | Pat Wongpan | pat.wongpan@utas.edu.au | 2023-01-20 | Continuation of omip2spunup up to 50 cycles (i... | None | None | /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... | 1014 |
| 174 | 1deg_jra55_iaf_omip2spunup_cycle48 | Pat Wongpan | pat.wongpan@utas.edu.au | 2023-01-20 | Continuation of omip2spunup up to 50 cycles (i... | None | None | /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... | 1014 |
| 175 | 1deg_jra55_iaf_omip2spunup_cycle47 | Pat Wongpan | pat.wongpan@utas.edu.au | 2023-01-20 | Continuation of omip2spunup up to 50 cycles (i... | None | None | /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... | 1014 |

176 rows × 9 columns
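Since get_experiments returns an ordinary pandas DataFrame, the usual pandas filtering applies to its output. Below is a toy stand-in DataFrame (with only two of the nine columns, and invented row values) illustrating, for example, keeping only the 0.1-degree experiments:

```python
import pandas as pd

# A toy stand-in for the DataFrame returned by cc.querying.get_experiments;
# the real one has the nine columns shown above.
experiments = pd.DataFrame(
    {
        "experiment": [
            "01deg_jra55v140_iaf",
            "025deg_jra55_ryf9091_gadi_noGM",
            "1deg_jra55_iaf_omip2spunup_cycle50",
        ],
        "ncfiles": [1968, 316, 998],
    }
)

# Keep only experiments whose name starts with the 0.1-degree prefix.
tenth_degree = experiments[experiments.experiment.str.startswith("01deg")]
print(tenth_degree.experiment.tolist())  # → ['01deg_jra55v140_iaf']
```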
[7]:
variables = cc.querying.get_variables(session, experiment='01deg_jra55v140_iaf')
variables
[7]:
| | name | long_name | units | frequency | ncfile | cell_methods | # ncfiles | time_start | time_end |
|---|---|---|---|---|---|---|---|---|---|
| 0 | pfmice_i | None | None | None | output028/ocean/o2i.nc | None | 244 | None | None |
| 1 | sslx_i | None | None | None | output028/ocean/o2i.nc | None | 244 | None | None |
| 2 | ssly_i | None | None | None | output028/ocean/o2i.nc | None | 244 | None | None |
| 3 | sss_i | None | None | None | output028/ocean/o2i.nc | None | 244 | None | None |
| 4 | sst_i | None | None | None | output028/ocean/o2i.nc | None | 244 | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 266 | time | time | days since 1900-01-01 00:00:00 | static | output243/ocean/ocean-2d-drag_coeff.nc | None | 3660 | 1900-01-01 00:00:00 | 2019-01-01 00:00:00 |
| 267 | xt_ocean | tcell longitude | degrees_E | static | output126/ocean/ocean-2d-ht.nc | None | 1708 | 1900-01-01 00:00:00 | 1900-01-01 00:00:00 |
| 268 | xu_ocean | ucell longitude | degrees_E | static | output243/ocean/ocean-2d-drag_coeff.nc | None | 1952 | 1900-01-01 00:00:00 | 2019-01-01 00:00:00 |
| 269 | yt_ocean | tcell latitude | degrees_N | static | output126/ocean/ocean-2d-ht.nc | None | 1708 | 1900-01-01 00:00:00 | 1900-01-01 00:00:00 |
| 270 | yu_ocean | ucell latitude | degrees_N | static | output243/ocean/ocean-2d-drag_coeff.nc | None | 1952 | 1900-01-01 00:00:00 | 2019-01-01 00:00:00 |

271 rows × 9 columns
But there are sometimes duplicate variables with different frequencies:
[8]:
variables[variables.name == 'surface_salt']
[8]:
| | name | long_name | units | frequency | ncfile | cell_methods | # ncfiles | time_start | time_end |
|---|---|---|---|---|---|---|---|---|---|
| 54 | surface_salt | Practical Salinity | psu | 1 daily | output243/ocean/ocean-2d-surface_salt-1-daily-... | time: mean | 244 | 1958-01-01 00:00:00 | 2019-01-01 00:00:00 |
| 201 | surface_salt | Practical Salinity | psu | 1 monthly | output243/ocean/ocean-2d-surface_salt-1-monthl... | time: mean | 244 | 1958-01-01 00:00:00 | 2019-01-01 00:00:00 |
If you just try to load this variable without disambiguating, a QueryWarning is raised, because you would be loading data from files with different temporal frequencies:
[9]:
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt', session=session)
---------------------------------------------------------------------------
QueryWarning Traceback (most recent call last)
Cell In[9], line 1
----> 1 cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt', session=session)
File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/lib/python3.9/site-packages/cosima_cookbook/querying.py:334, in getvar(expt, variable, session, ncfile, start_time, end_time, n, frequency, attrs, attrs_unique, return_dataset, **kwargs)
331 if attrs_unique is None:
332 attrs_unique = {"cell_methods": "time: mean"}
--> 334 ncfiles = _ncfiles_for_variable(
335 expt,
336 variable,
337 session,
338 ncfile,
339 start_time,
340 end_time,
341 n,
342 frequency,
343 attrs,
344 attrs_unique,
345 )
347 variables = [variable]
348 if return_dataset:
349 # we know at least one variable was returned, so we can index ncfiles
350 # ask for the extra variables associated with cell_methods, etc.
File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/lib/python3.9/site-packages/cosima_cookbook/querying.py:529, in _ncfiles_for_variable(expt, variable, session, ncfile, start_time, end_time, n, frequency, attrs, attrs_unique)
527 unique_freqs = set(f.NCFile.frequency for f in ncfiles)
528 if len(unique_freqs) > 1:
--> 529 warnings.warn(
530 f"Your query returns files with differing frequencies: {unique_freqs}. "
531 "This could lead to unexpected behaviour! Disambiguate by passing "
532 "frequency= to getvar, specifying the desired frequency.",
533 QueryWarning,
534 )
536 return ncfiles
QueryWarning: Your query returns files with differing frequencies: {'1 daily', '1 monthly'}. This could lead to unexpected behaviour! Disambiguate by passing frequency= to getvar, specifying the desired frequency.
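As the warning says, the fix is to disambiguate by passing frequency= to getvar (e.g. frequency='1 monthly'). The check that triggers the warning, visible in the traceback above, reduces to comparing the set of frequencies across the matched files. Here is a simplified, self-contained mimic of that check; the dict-based file records are invented for illustration:

```python
import warnings

def check_frequencies(ncfiles):
    """Warn if a query matches files with more than one temporal frequency,
    mirroring the check in cosima_cookbook.querying._ncfiles_for_variable."""
    unique_freqs = {f["frequency"] for f in ncfiles}
    if len(unique_freqs) > 1:
        warnings.warn(
            f"Your query returns files with differing frequencies: {unique_freqs}. "
            "Disambiguate by passing frequency= to getvar."
        )
    return ncfiles

files = [
    {"path": "ocean-2d-surface_salt-1-daily.nc", "frequency": "1 daily"},
    {"path": "ocean-2d-surface_salt-1-monthly.nc", "frequency": "1 monthly"},
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_frequencies(files)

print(len(caught))  # → 1: two frequencies matched, so one warning was issued
```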
Exploring a Cookbook Database¶
The COSIMA Cookbook explore submodule seeks to solve the problem of finding relevant experiments and variables within a Cookbook database, and to simplify the process of loading the data.
It does this by providing GUI elements that users can embed in their Jupyter notebooks to filter and query the database.
Requirements: The explore submodule requires the cosima-cookbook version found in the conda/analysis3-20.07 (or later) kernel on NCI (or your own up-to-date cookbook installation).
[10]:
from cosima_cookbook import explore
Database Explorer¶
The first component is DatabaseExplorer, which is used to find relevant experiments. Re-use an existing session, or omit the session argument and it will start with the default database.
Filtering can be applied to narrow down the number of experiments. Select one or more keywords to reduce the listed experiments to those that contain all the selected keywords. To show only those experiments that contain a given variable, select the variable from the list of available variables in the database and push the ‘>>’ button to move it to the right-hand box. Now, when ‘Filter’ is pushed, only experiments that contain the variables in the right-hand box will be shown. Variables can be removed from the filtering box by selecting them and pushing ‘<<’. Note that the list of available variables contains all variables in the database; filtering by keyword does not change the available variables. Both filtering methods are applied when finding the list of matching experiments, but the two methods are independent in all other respects.
Note also that the list of available variables is pre-filtered: variables from restart files, and variables that can be unambiguously identified as coordinate variables, are not listed. This pre-filtering can be removed by deselecting the checkboxes underneath the variable list.
By default, all variables from all model components are shown in the selection box. To display only variables from one model component, select the required component from the dropdown menu, which defaults to “All models”.
The search box can be used to further narrow the list of available variables. When text is entered into the search box, only variables that contain that text in their variable name or their long_name attribute are displayed in the selection box.
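This search behaviour amounts to a case-insensitive substring match against both fields. A minimal sketch of that logic (not the widget's actual implementation; the variable records here are invented for illustration):

```python
def search_variables(variables, text):
    """Return variables whose name or long_name contains `text` (case-insensitive)."""
    text = text.lower()
    return [
        v
        for v in variables
        if text in v["name"].lower() or text in (v["long_name"] or "").lower()
    ]

variables = [
    {"name": "surface_salt", "long_name": "Practical Salinity"},
    {"name": "u", "long_name": "i-current"},
    {"name": "pfmice_i", "long_name": None},
]

# "salin" matches surface_salt via its long_name, "Practical Salinity".
print([v["name"] for v in search_variables(variables, "salin")])  # → ['surface_salt']
```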
When a variable is selected, the long_name is displayed below the variable selector box. In some cases when filtering and/or searching, a variable will be automatically selected but may show as highlighted in the selector box. This is undesirable, but currently unavoidable.
When an experiment is selected and the ‘Load Experiment’ button is pushed, an Experiment Explorer GUI element opens below the Database Explorer. A detailed explanation of the Experiment Explorer is in the next section.
(Note: The widgets have been exported to be viewable in an HTML page, but they will ONLY function properly if loaded as a Jupyter notebook)
[11]:
%%time
from cosima_cookbook import explore
dbx = explore.DatabaseExplorer(session=session)
dbx
CPU times: user 58.2 s, sys: 8.46 s, total: 1min 6s
Wall time: 1min 59s
Experiment Explorer¶
The ExperimentExplorer can be used independently of the DatabaseExplorer if you already know the experiment you wish to load.
You can re-use an existing database session, or omit that argument and a new session will be created automatically with the default database. If you pass an experiment name, that experiment will be loaded by default; this is not necessary, as any experiment present in the database can be selected from the drop-down menu at the top.
The box showing the available variables is the same as the one in the filtering element of DatabaseExplorer, with exactly the same functionality to show only variables from selected models, search by variable name and long name, and filter out coordinates and restarts.
When a variable is selected, the long name is displayed below the box as before, and the frequency drop-down and date range slider to the right are also populated. Identical variables can be present in a dataset with different temporal frequencies; in this case it is necessary to choose a frequency, as those variables cannot be loaded into the same xarray.DataArray. When a frequency is selected, the date range slider may change the range of available dates if they differ between the two frequencies.
It is advisable to reduce the date range you load if you know you only need the data for a limited time range: loading is much quicker, since fewer files need to be opened and their metadata checked.
Once you have selected a variable and confirmed the frequency and date range are correct, push the “Load” button and the data will be loaded into an xarray.DataArray object. When this is done, the metadata of the loaded data will be displayed at the end of the cell output.
The relevant command used to load the data is displayed, so that it can be copied, reused, and/or modified.
The loaded data is available as the .data attribute of the ExperimentExplorer object. At any time a different variable, from the same or a different experiment, can be loaded, and the .data attribute will be updated to reflect the new data.
[12]:
ee = explore.ExperimentExplorer(session=session, experiment='01deg_jra55v140_iaf')
ee
[13]:
ee.data
[13]:
<xarray.DataArray 'surface_salt' (time: 1734, yt_ocean: 2700, xt_ocean: 3600)> dask.array<concatenate, shape=(1734, 2700, 3600), dtype=float32, chunksize=(1, 540, 720), chunktype=numpy.ndarray> Coordinates: * xt_ocean (xt_ocean) float64 -279.9 -279.8 -279.7 ... 79.75 79.85 79.95 * yt_ocean (yt_ocean) float64 -81.11 -81.07 -81.02 ... 89.89 89.94 89.98 * time (time) datetime64[ns] 1990-01-01T12:00:00 ... 1994-09-30T12:00:00 Attributes: (12/13) long_name: Practical Salinity units: psu valid_range: [-10. 100.] cell_methods: time: mean time_avg_info: average_T1,average_T2,average_DT coordinates: geolon_t geolat_t ... ... ncfiles: ['/g/data/cj50/access-om2/raw-output/access-om2-01/01deg_... contact: Andrew Kiss email: andrew.kiss@anu.edu.au created: 2020-06-09 description: 0.1 degree ACCESS-OM2 global model configuration under in... notes: Source code: https://github.com/COSIMA/access-om2 License...