Loading the ECCOv4 state estimate fields on the native model grid

Objectives

Introduce several methods for loading the ECCOv4 state estimate files on the native model grid.

Introduction

ECCOv4 native-grid state estimate fields are packaged together as NetCDF files. We have been improving how these files are created, what fields they contain, and how they are distributed in directories. As of this writing, Sep 2019, the latest version of these files can be found here:

https://ecco.jpl.nasa.gov/drive/files/Version4/Release3_alt (po.daac drive, recommended)

https://web.corral.tacc.utexas.edu/OceanProjects/ECCO/ECCOv4/Release3_alt/ (mirror at U. Texas, Austin)

This tutorial document is current with the files in the above directories as of September 2019.

NetCDF File Format

ECCOv4 state estimate fields are provided as NetCDF files with one variable and one year per file. State estimate fields are provided as monthly means, daily means and monthly snapshots. The file directories for NetCDF files should look something like (although exact naming may vary):

  • /nctiles_monthly/VARIABLE_NAME/VARIABLE_NAME_YYYY.nc
  • /nctiles_monthly_snapshots/VARIABLE_NAME/VARIABLE_NAME_YYYY.nc
  • /nctiles_daily/VARIABLE_NAME/VARIABLE_NAME_YYYY.nc

While the model grid files are provided here:

  • /nctiles_grid/ECCOv4r3_grid.nc
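
With these patterns, a file's path can be assembled from a variable name and year. A minimal sketch (the nc_path helper is hypothetical; exact directory and file names may vary by release):

# hypothetical helper illustrating the directory layout above
def nc_path(root, freq_dir, var, year):
    return f'{root}/{freq_dir}/{var}/{var}_{year}.nc'

print(nc_path('/home/username/Release3_alt', 'nctiles_monthly', 'SSH', 2010))
# /home/username/Release3_alt/nctiles_monthly/SSH/SSH_2010.nc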

Typical file sizes are:

  • 3D monthly-mean and monthly-snapshot fields: ~265 MB per file (50 levels x 12 months x 13 tiles)
  • 2D monthly-mean fields: ~6 MB per file (1 level x 12 months x 13 tiles)
  • 2D daily-mean fields: ~150 MB per file (1 level x 365 days x 13 tiles)
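
These sizes follow directly from the grid dimensions. A quick back-of-envelope check for the 3D monthly files (float32 values on thirteen 90x90 tiles, as seen in the outputs below; the remainder is presumably coordinates and metadata):

# 50 levels x 12 months x 13 tiles x 90 x 90 points x 4 bytes (float32)
nbytes = 50 * 12 * 13 * 90 * 90 * 4
print(round(nbytes / 2**20), 'MB')   # 241 MB, close to the ~265 MB on disk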

One advantage of aggregating one year of data into a single NetCDF file is that it reduces the I/O time per data element. Another nice feature of the xarray and Dask libraries is that we do not have to load a file's entire contents into RAM to work with it.
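
For example, xarray can open a file lazily so that values are only read when a computation actually needs them. A minimal sketch (the file path is illustrative; chunks is a standard xarray argument that returns Dask-backed arrays):

import xarray as xr
# only metadata is read here, no data values
ds = xr.open_dataset('SSH_2010.nc', chunks={'time': 1})
ssh_mean = ds.SSH.mean()    # still lazy: this just builds a task graph
print(ssh_mean.compute())   # values are read and reduced only now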

Two methods to load one ECCOv4 NetCDF file

In ECCO NetCDF files, all 13 tiles for a given year are aggregated into a single file. Therefore, we can use the open_dataset routine from xarray to open a single NetCDF variable file.

Alternatively, the routine load_ecco_var_from_years_nc lets you specify a subset of vertical levels or tiles to load via optional arguments.

We’ll show both methods: first open_dataset, then load_ecco_var_from_years_nc.

First, set up the environment and load the model grid parameters.

[1]:
import numpy as np
import xarray as xr
import sys
import matplotlib.pyplot as plt
%matplotlib inline
[2]:
## Import the ecco_v4_py library into Python
## =========================================

## -- If ecco_v4_py is not installed in your local Python library,
##    tell Python where to find it.  For example, if your ecco_v4_py
##    files are in /home/ifenty/ECCOv4-py/ecco_v4_py, then use:
sys.path.append('/home/ifenty/ECCOv4-py')

import ecco_v4_py as ecco
[3]:
## Set top-level file directory for the ECCO NetCDF files
## =================================================================
# base_dir = '/home/username/'
base_dir = '/home/ifenty/ECCOv4-release/'

## define a high-level directory for ECCO fields
ECCO_dir = base_dir + '/Release3_alt/'

## define the directory with the model grid
grid_dir = ECCO_dir + 'nctiles_grid/'

## load the grid
grid = ecco.load_ecco_grid_nc(grid_dir, 'ECCOv4r3_grid.nc')

Loading a single ECCOv4 variable NetCDF file using open_dataset

[4]:
SSH_dir = ECCO_dir + '/nctiles_monthly/SSH/'
SSH_dataset = xr.open_dataset(SSH_dir + '/SSH_2010.nc')
SSH_dataset.SSH
[4]:
<xarray.DataArray 'SSH' (time: 12, tile: 13, j: 90, i: 90)>
[1263600 values with dtype=float32]
Coordinates:
  * j        (j) int32 0 1 2 3 4 5 6 7 8 9 10 ... 80 81 82 83 84 85 86 87 88 89
  * i        (i) int32 0 1 2 3 4 5 6 7 8 9 10 ... 80 81 82 83 84 85 86 87 88 89
    XC       (tile, j, i) float32 ...
    YC       (tile, j, i) float32 ...
    rA       (tile, j, i) float32 ...
  * tile     (tile) int32 0 1 2 3 4 5 6 7 8 9 10 11 12
    iter     (time) int32 ...
  * time     (time) datetime64[ns] 2010-01-16T12:00:00 ... 2010-12-16T12:00:00
Attributes:
    units:          m
    long_name:      Surface Height Anomaly adjusted with global steric height...
    standard_name:  sea_surface_height

SSH_dataset.SSH contains 12 months of data across 13 tiles.

Let’s plot the first time record of this file.

[5]:
SSH  = SSH_dataset.SSH.isel(time=0)
# mask to nan where hFacC(k=0) = 0
SSH  = SSH.where(grid.hFacC.isel(k=0))

fig  = plt.figure(figsize=(16,7))
ecco.plot_proj_to_latlon_grid(grid.XC, grid.YC, SSH,
                              show_colorbar=True, cmin=-1.5, cmax=1.5)
plt.title('SSH [m]');
[Figure: SSH [m] for the first time record of 2010, plotted on a global lat-lon projection]

Loading a single ECCOv4 variable NetCDF file using load_ecco_var_from_years_nc

We’ll now load the same 2010 SSH file using load_ecco_var_from_years_nc. This time we specify the directory containing the NetCDF file, the variable that we want to load, and the year of interest.

[6]:
# single year:  2010
SSH_dataset_2010 = ecco.load_ecco_var_from_years_nc(SSH_dir, \
                           'SSH', years_to_load = [2010]).load()
SSH_dataset_2010.attrs = []
SSH_dataset_2010
[6]:
<xarray.Dataset>
Dimensions:    (i: 90, j: 90, nv: 2, tile: 13, time: 12)
Coordinates:
  * j          (j) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * i          (i) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
    XC         (tile, j, i) float32 -111.60647 -111.303 ... -111.86579
    YC         (tile, j, i) float32 -88.24259 -88.382515 ... -88.07871 -88.10267
    rA         (tile, j, i) float32 362256450.0 363300960.0 ... 361119100.0
  * tile       (tile) int32 0 1 2 3 4 5 6 7 8 9 10 11 12
    time_bnds  (time, nv) datetime64[ns] 2010-01-01 2010-02-01 ... 2011-01-01
    iter       (time) int32 158532 159204 159948 160668 ... 165084 165804 166548
  * time       (time) datetime64[ns] 2010-01-16T12:00:00 ... 2010-12-16T12:00:00
Dimensions without coordinates: nv
Data variables:
    SSH        (time, tile, j, i) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

Loading a subset of a single ECCOv4 variable NetCDF file using load_ecco_var_from_years_nc

One benefit of load_ecco_var_from_years_nc over open_dataset is that you can load a subset of vertical levels or tiles via the tiles_to_load and k_subset optional arguments.

By default

  • tiles_to_load = [0, 1, … 12]
  • k_subset = []

To load a subset of tiles, specify the desired tile indices in tiles_to_load. For example, to load tiles 3, 4, and 5: tiles_to_load = [3, 4, 5]

To load a subset of depth levels, specify the desired depth level indices in k_subset. For example, to load the top 5 levels: k_subset = [0, 1, 2, 3, 4]

In the following example we load THETA for tiles 7, 8, and 9 and depth levels 0-4.

[7]:
theta_dir= ECCO_dir + '/nctiles_monthly/THETA/'
theta_subset = ecco.load_ecco_var_from_years_nc(theta_dir, \
                                                'THETA', years_to_load = [2010], \
                                                tiles_to_load = [ 7,8,9], \
                                                k_subset = [0,1,2,3,4]).load()
theta_subset.attrs = []
theta_subset
[7]:
<xarray.Dataset>
Dimensions:    (i: 90, j: 90, k: 5, nv: 2, tile: 3, time: 12)
Coordinates:
  * j          (j) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * i          (i) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * k          (k) int32 0 1 2 3 4
    Z          (k) float32 -5.0 -15.0 -25.0 -35.0 -45.0
    PHrefC     (k) float32 49.05 147.15 245.25 343.35 441.45
    drF        (k) float32 10.0 10.0 10.0 10.0 10.0
    XC         (tile, j, i) float32 142.16208 142.22801 ... -115.5476 -115.18083
    YC         (tile, j, i) float32 67.47211 67.33552 ... -80.43542 -80.43992
    rA         (tile, j, i) float32 212633870.0 351016450.0 ... 47093870.0
    hFacC      (tile, k, j, i) float32 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
  * tile       (tile) int32 7 8 9
    time_bnds  (time, nv) datetime64[ns] 2010-01-01 2010-02-01 ... 2011-01-01
    iter       (time) int32 158532 159204 159948 160668 ... 165084 165804 166548
  * time       (time) datetime64[ns] 2010-01-16T12:00:00 ... 2010-12-16T12:00:00
Dimensions without coordinates: nv
Data variables:
    THETA      (time, tile, k, j, i) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

As expected, theta_subset has 3 tiles and 5 vertical levels.
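
We can confirm this directly from the Dataset dimensions and tile coordinate:

print(theta_subset.dims)         # includes tile: 3 and k: 5
print(theta_subset.tile.values)  # [7 8 9]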

Loading multiple years of single ECCOv4 variable using load_ecco_var_from_years_nc

Another benefit of load_ecco_var_from_years_nc is that you can load more than one year of output. First we’ll load two years, then all of the years of output available in the file directory.

[8]:
# two years:  2010 and 2011
SSH_2010_2011 = ecco.load_ecco_var_from_years_nc(SSH_dir, \
                                                'SSH',
                                                 years_to_load = [2010, 2011]).load()
SSH_2010_2011.attrs = []
SSH_2010_2011
[8]:
<xarray.Dataset>
Dimensions:    (i: 90, j: 90, nv: 2, tile: 13, time: 24)
Coordinates:
    rA         (tile, j, i) float32 362256450.0 363300960.0 ... 361119100.0
    XC         (tile, j, i) float32 -111.60647 -111.303 ... -111.86579
    YC         (tile, j, i) float32 -88.24259 -88.382515 ... -88.07871 -88.10267
  * i          (i) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * j          (j) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * tile       (tile) int32 0 1 2 3 4 5 6 7 8 9 10 11 12
    time_bnds  (time, nv) datetime64[ns] 2010-01-01 2010-02-01 ... 2012-01-01
    iter       (time) int32 158532 159204 159948 160668 ... 173844 174564 175308
  * time       (time) datetime64[ns] 2010-01-16T12:00:00 ... 2011-12-16T12:00:00
Dimensions without coordinates: nv
Data variables:
    SSH        (time, tile, j, i) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

Notice that SSH_2010_2011 has 24 time records.

[9]:
# all years
SSH_all = ecco.load_ecco_var_from_years_nc(SSH_dir, \
                                           'SSH',
                                           years_to_load = 'all').load()
SSH_all.attrs = []
SSH_all
[9]:
<xarray.Dataset>
Dimensions:    (i: 90, j: 90, nv: 2, tile: 13, time: 288)
Coordinates:
    rA         (tile, j, i) float32 362256450.0 363300960.0 ... 361119100.0
    XC         (tile, j, i) float32 -111.60647 -111.303 ... -111.86579
    YC         (tile, j, i) float32 -88.24259 -88.382515 ... -88.07871 -88.10267
  * i          (i) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * j          (j) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * tile       (tile) int32 0 1 2 3 4 5 6 7 8 9 10 11 12
    time_bnds  (time, nv) datetime64[ns] 1992-01-01 1992-02-01 ... 2015-12-31
    iter       (time) int32 732 1428 2172 2892 ... 208164 208908 209628 210360
  * time       (time) datetime64[ns] 1992-01-16T12:00:00 ... 2015-12-16T12:00:00
Dimensions without coordinates: nv
Data variables:
    SSH        (time, tile, j, i) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

Now we have all 288 monthly records of SSH.

Loading one or more years of more than one variable using recursive_load_ecco_var_from_years_nc

With recursive_load_ecco_var_from_years_nc one can specify one or more variables to load, one or more years to load, while also requesting only a subset of tiles and vertical levels.

Let’s demonstrate by first loading monthly-mean SSH and THETA fields for the years 2010, 2011, and 2012. We’ll then load SSH and OBP for all of the years on tile 6.

Loading SSH and THETA for 2010-2012

(As an aside, these next two cells sketch how a filename token can be tested to see whether it is a year; this is how files belonging to the requested years can be identified.)

[10]:
def is_year_tryexcept(s):
    # True if the string s parses as an integer >= 1900 (a plausible year)
    try:
        return int(s) >= 1900
    except ValueError:
        return False
[11]:
# test on the '_'-separated tokens of a typical ECCO filename
x='ADVr_SLT_1992'
y=x.split('_')
for yy in y:
    print(is_year_tryexcept(yy))
False
False
True
[12]:
nctiles_monthly_dir = ECCO_dir + 'nctiles_monthly/'
SSH_THETA_2010_2012 = \
        ecco.recursive_load_ecco_var_from_years_nc(nctiles_monthly_dir, \
                                                   vars_to_load=['SSH','THETA'], \
                                                   years_to_load = [2010, 2011, 2012], less_output=True).load()
loading files of  THETA
loading files of  SSH
[13]:
SSH_THETA_2010_2012
[13]:
<xarray.Dataset>
Dimensions:    (i: 90, j: 90, k: 50, nv: 2, tile: 13, time: 36)
Coordinates:
    rA         (tile, j, i) float32 362256450.0 363300960.0 ... 361119100.0
    drF        (k) float32 10.0 10.0 10.0 10.0 10.0 ... 387.5 410.5 433.5 456.5
    YC         (tile, j, i) float32 -88.24259 -88.382515 ... -88.07871 -88.10267
    PHrefC     (k) float32 49.05 147.15 245.25 ... 49435.043 53574.863 57940.312
    Z          (k) float32 -5.0 -15.0 -25.0 -35.0 ... -5039.25 -5461.25 -5906.25
    hFacC      (tile, k, j, i) float32 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    XC         (tile, j, i) float32 -111.60647 -111.303 ... -111.86579
  * i          (i) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * j          (j) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * k          (k) int32 0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
  * tile       (tile) int32 0 1 2 3 4 5 6 7 8 9 10 11 12
    time_bnds  (time, nv) datetime64[ns] 2010-01-01 2010-02-01 ... 2013-01-01
    iter       (time) int32 158532 159204 159948 160668 ... 182628 183348 184092
  * time       (time) datetime64[ns] 2010-01-16T12:00:00 ... 2012-12-16T12:00:00
Dimensions without coordinates: nv
Data variables:
    THETA      (time, tile, k, j, i) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    SSH        (time, tile, j, i) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Attributes:
    Conventions:                  CF-1.6
    Insitution:                   JPL
    Metadata_Conventions:         CF-1.6, Unidata Dataset Discovery v1.0, GDS...
    Project:                      Estimating the Circulation and Climate of t...
    author:                       Ian Fenty and Ou Wang
    cdm_data_type:                Grid
    date_created:                 Mon May 13 18:32:24 2019
    geospatial_lat_max:           89.739395
    geospatial_lat_min:           -89.873055
    geospatial_lat_units:         degrees_north
    geospatial_lon_max:           179.98691
    geospatial_lon_min:           -179.98895
    geospatial_lon_units:         degrees_east
    geospatial_vertical_max:      -5.0
    geospatial_vertical_min:      -5906.25
    geospatial_vertical_units:    meter
    no_data:                      NaNf
    nx:                           90
    ny:                           90
    nz:                           50
    product_time_coverage_end:    2015-12-31T12:00:00
    product_time_coverage_start:  1992-01-01T12:00:00
    product_version:              ECCO Version 4 Release 3 (ECCOv4r3) 1992-2015
    time_coverage_end:            2013-01-01T00:00:00
    time_coverage_start:          2010-01-01T00:00:00
    time_units:                   days since 1992-01-01 00:00:00

We see three years (36 months) of data, 13 tiles, and 50 vertical levels.

Loading SSH and OBP for all years and tile 6

Now let’s demonstrate how the recursive_load_ecco_var_from_years_nc routine enables us to load all of the years of output for multiple variables. The trick is to specify the directory that contains all of the variables; the routine will recursively search its subdirectories for the requested fields. Note that this only works if the subdirectories all contain fields of the same temporal frequency (monthly mean, daily mean, or snapshot).

[14]:
SSH_OBP_all_years_tile_6 = \
        ecco.recursive_load_ecco_var_from_years_nc(nctiles_monthly_dir, \
                                                   vars_to_load=['SSH', 'OBP'], \
                                                   years_to_load = 'all', \
                                                   tiles_to_load = 6).load()
loading files of  SSH
loading files of  OBP
[15]:
SSH_OBP_all_years_tile_6
[15]:
<xarray.Dataset>
Dimensions:    (i: 90, j: 90, nv: 2, tile: 1, time: 288)
Coordinates:
    rA         (tile, j, i) float32 246414940.0 412417600.0 ... 246414940.0
    XC         (tile, j, i) float32 52.0 52.331654 ... -127.66834 -128.0
    YC         (tile, j, i) float32 67.57341 67.67698 ... 67.67698 67.57341
  * i          (i) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * j          (j) int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * tile       (tile) int32 6
    time_bnds  (time, nv) datetime64[ns] 1992-01-01 1992-02-01 ... 2015-12-31
    iter       (time) int32 732 1428 2172 2892 ... 208164 208908 209628 210360
  * time       (time) datetime64[ns] 1992-01-16T12:00:00 ... 2015-12-16T12:00:00
Dimensions without coordinates: nv
Data variables:
    SSH        (time, tile, j, i) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    OBP        (time, tile, j, i) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Attributes:
    Conventions:                  CF-1.6
    Insitution:                   JPL
    Metadata_Conventions:         CF-1.6, Unidata Dataset Discovery v1.0, GDS...
    Project:                      Estimating the Circulation and Climate of t...
    author:                       Ian Fenty and Ou Wang
    cdm_data_type:                Grid
    date_created:                 Tue May 14 10:19:12 2019
    geospatial_lat_max:           89.739395
    geospatial_lat_min:           67.57341
    geospatial_lat_units:         degrees_north
    geospatial_lon_max:           179.9739
    geospatial_lon_min:           -179.98895
    geospatial_lon_units:         degrees_east
    geospatial_vertical_max:      0
    geospatial_vertical_min:      0
    geospatial_vertical_units:    meter
    no_data:                      NaNf
    nx:                           90
    ny:                           90
    nz:                           1
    product_time_coverage_end:    2015-12-31T12:00:00
    product_time_coverage_start:  1992-01-01T12:00:00
    product_version:              ECCO Version 4 Release 3 (ECCOv4r3) 1992-2015
    time_coverage_end:            2015-12-31T00:00:00
    time_coverage_start:          1992-01-01T00:00:00
    time_units:                   days since 1992-01-01 00:00:00

Loading the entire ECCOv4 solution (or large fractions of it) using Dask

We can load the entire ECCOv4 solution into our workspace and perform calculations with it thanks to the Dask library, as implemented in the xarray package. Why is this useful? Because the machine you are working on is unlikely to have enough RAM to hold the entire ECCOv4 solution at one time. With Dask we can load all of the ECCO fields virtually, and even perform calculations with them, without the data ever being stored in memory all at once.

Here is some more information about these features from the Dask website, https://docs.dask.org/

  1. Larger-than-memory: lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.
  2. Blocked algorithms: perform large computations by performing many smaller computations.

Finally, in cluster environments Dask can distribute computations across many cores to speed up large calculations.
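
The idea of blocked computation can be illustrated with dask.array alone, independent of ECCO (a minimal sketch):

import dask.array as da
# a 10000 x 10000 array split into 100 chunks; never held in RAM all at once
x = da.ones((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())   # computed block by block; prints 1.0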

An in-depth description of Dask is outside the scope of this tutorial. For the moment, let us simply compare loading ECCO fields with and without Dask. Without Dask, the fields are read fully into memory; with Dask, only a minimum is loaded: the Dataset structure, its dimensions, and its coordinates.

To demonstrate some of the advantages of using Dask to load and analyze ECCO fields, let’s load several monthly-mean variables for two years, with and without Dask. Then we’ll load the same monthly-mean fields for five years using Dask. At the end we’ll compare the load times and memory footprints.

Example 1a: Load two years of monthly-mean ECCO fields into memory without Dask

The .load() method appended to these commands forces the fields to be fully loaded into memory. Without it, the fields are only loaded virtually.

[16]:
mon_mean_dir = ECCO_dir + 'nctiles_monthly/'
import time
t_0 = time.time()

large_subset_no_dask = \
    ecco.recursive_load_ecco_var_from_years_nc(mon_mean_dir, \
                                               vars_to_load=['THETA','SSH','UVEL','VVEL',\
                                                             'ADVr_SLT','ADVr_TH',\
                                                             'ADVx_SLT','ADVy_SLT',\
                                                             'ADVxHEFF','ADVxSNOW', \
                                                             'OBP','SIarea','SIheff'],
                                               years_to_load=[2010, 2011],\
                                               less_output=True).load()

delta_t_2_yrs_no_dask =  time.time() - t_0
print(delta_t_2_yrs_no_dask)
loading files of  THETA
loading files of  SSH
loading files of  UVEL
loading files of  VVEL
loading files of  ADVr_SLT
loading files of  ADVr_TH
loading files of  ADVxHEFF
loading files of  ADVxSNOW
loading files of  ADVx_SLT
loading files of  SIarea
loading files of  OBP
loading files of  SIheff
11.277666807174683

Example 1b: Load two years of monthly-mean ECCO fields using Dask

This time we will omit the .load() suffix and use Dask.

[17]:
t_0 = time.time()
large_subset_with_dask = ecco.recursive_load_ecco_var_from_years_nc(mon_mean_dir, \
                                           vars_to_load=['THETA','SSH','UVEL','VVEL',\
                                                         'ADVr_SLT','ADVr_TH',\
                                                         'ADVx_SLT','ADVy_SLT',\
                                                         'ADVxHEFF','ADVxSNOW', \
                                                         'OBP','SIarea','SIheff'],
                                           years_to_load=[2010, 2011])
delta_t_2_yrs_with_dask =  time.time() - t_0
print(delta_t_2_yrs_with_dask)
loading files of  THETA
loading files of  SSH
loading files of  UVEL
loading files of  VVEL
loading files of  ADVr_SLT
loading files of  ADVr_TH
loading files of  ADVxHEFF
loading files of  ADVxSNOW
loading files of  ADVx_SLT
loading files of  SIarea
loading files of  OBP
loading files of  SIheff
3.1377756595611572
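
We can verify that Example 1b loaded the fields virtually. For a dask-backed DataArray the .chunks attribute reports the chunk structure, while a fully loaded one reports None (a quick check, assuming the loader returned dask-backed arrays):

print(large_subset_with_dask.THETA.chunks)   # chunk structure: dask-backed
print(large_subset_no_dask.THETA.chunks)     # None: fully loaded into memory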

Example 2: Load 5 years of monthly-mean fields with Dask

[18]:
t_0 = time.time()
all_fields_with_dask = ecco.recursive_load_ecco_var_from_years_nc(mon_mean_dir, \
                                           vars_to_load=['THETA','SSH','UVEL','VVEL',\
                                                         'ADVr_SLT','ADVr_TH',\
                                                         'ADVx_SLT','ADVy_SLT',\
                                                         'ADVxHEFF','ADVxSNOW', \
                                                         'OBP','SIarea','SIheff'], \
                                           years_to_load=[2008, 2009, 2010, 2011, 2012], \
                                           dask_chunk=True)
delta_t_5_yrs_with_dask =  time.time() - t_0
print (delta_t_5_yrs_with_dask)
loading files of  THETA
loading files of  SSH
loading files of  UVEL
loading files of  VVEL
loading files of  ADVr_SLT
loading files of  ADVr_TH
loading files of  ADVxHEFF
loading files of  ADVxSNOW
loading files of  ADVx_SLT
loading files of  SIarea
loading files of  OBP
loading files of  SIheff
6.462316274642944

Results

Now we examine the time it took to load these fields and their comparative memory footprints.

[19]:
print ('loaded 2 years without dask in ', np.round(delta_t_2_yrs_no_dask), 'sec')
print ('loaded 2 years with_dask in    ', np.round(delta_t_2_yrs_with_dask), 'sec')
print ('loaded 5 years with_dask in    ', np.round(delta_t_5_yrs_with_dask), 'sec')
loaded 2 years without dask in  11.0 sec
loaded 2 years with_dask in     3.0 sec
loaded 5 years with_dask in     6.0 sec

The real advantage of using Dask becomes apparent when we examine the size of these objects in memory. The pympler package lets you measure the memory footprint of Python objects.

https://pythonhosted.org/Pympler/index.html

To proceed, install the pympler library into your Python environment (e.g., pip install pympler).

[20]:
# Estimate the memory footprint of each Dataset in MB
from pympler import asizeof

def dataset_size_mb(ds):
    # sum the footprint of every variable (data and coordinates) in the Dataset
    s = 0
    for v in ds.variables.keys():
        s += asizeof.asizeof(ds[v])
    return np.round(s/2**20)

print('large_subset_no_dask   : ', dataset_size_mb(large_subset_no_dask), 'mb')
print('large_subset_with_dask : ', dataset_size_mb(large_subset_with_dask), 'mb')
print('all_fields_with_dask   : ', dataset_size_mb(all_fields_with_dask), 'mb')
large_subset_no_dask   :  3120.0 mb
large_subset_with_dask :  171.0 mb
all_fields_with_dask   :  172.0 mb

Using Dask, we were able to load 5 years of data faster than 2 years without Dask.

In terms of memory, the 5 years of data in all_fields_with_dask take a fraction of the memory of the 2 years of data in large_subset_no_dask. With much less memory reserved for holding all of the fields, we have more memory available for calculations on the parts of the fields that we care about.
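
For example, a reduction over the dask-backed fields reads only the chunks it needs at computation time (a sketch; the printed value depends on the data):

# lazily build the mean of the first SSH record, then compute it
ssh0_mean = all_fields_with_dask.SSH.isel(time=0).mean()   # lazy task graph
print(ssh0_mean.compute().values)                          # data read only now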

Go ahead and experiment with using Dask to load the daily-averaged fields. Because all of the daily-averaged fields in the standard ECCO product are 2D, loading them with Dask takes very little time!
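
A starting point for that experiment, following the same pattern as above (a sketch; assumes the /nctiles_daily directory listed earlier is present):

daily_dir = ECCO_dir + 'nctiles_daily/'
SSH_daily = ecco.recursive_load_ecco_var_from_years_nc(daily_dir,
                                                       vars_to_load=['SSH'],
                                                       years_to_load='all')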

Summary

You now know several efficient ways to load ECCOv4 NetCDF files, whether one variable and one year at a time or many variables and many years at once.

Using Dask, we showed how one can prepare a work environment in which ECCO model variables are accessible for calculations without fully loading the fields into memory.