Combining multiple Datasets

Objectives:

Show how to combine multiple ECCO v4 state estimate Datasets after loading.

Opening multiple Datasets centered on different coordinates

In previous tutorials we’ve loaded single lat-lon-cap NetCDF tile files (granules) for ECCO state estimate variables and model grid parameters. Here we will demonstrate how to merge multiple Datasets into one. Benefits of merging include a tidier workspace and simpler subsetting operations (e.g., using xarray.isel or xarray.sel, as shown in the previous tutorial).

First, we’ll load three ECCOv4 NetCDF state estimate variables (each centered on different coordinates) as well as the model grid file. For this, you will need to download two datasets of monthly mean fields for the year 2010. The ShortNames of the two datasets are:

  • ECCO_L4_SSH_LLC0090GRID_MONTHLY_V4R4

  • ECCO_L4_OCEAN_3D_TEMPERATURE_FLUX_LLC0090GRID_MONTHLY_V4R4

If you completed the previous tutorial, you already have the SSH files.

Once you have the required ECCOv4 output downloaded, let’s define our environment.

[1]:
import numpy as np
import xarray as xr
import sys
import matplotlib.pyplot as plt
import json


# indicate whether you are working in a cloud instance (True if yes, False otherwise)
incloud_access = False
[2]:
## Import the ecco_v4_py library into Python
## =========================================
#import ecco_v4_py as ecco

## -- If ecco_v4_py is not installed in your local Python library,
##    tell Python where to find it.  For example, if your ecco_v4_py
##    files are in /Users/ifenty/ECCOv4-py/ecco_v4_py, then use:

from os.path import join,expanduser
user_home_dir = expanduser('~')

sys.path.append(join(user_home_dir,'ECCOv4-py'))

import ecco_v4_py as ecco
[3]:
## Set top-level file directory for the ECCO NetCDF files
## =================================================================

## currently set to ~/Downloads/ECCO_V4r4_PODAAC,
## the default if ecco_podaac_download was used to download dataset granules
ECCO_dir = join(user_home_dir,'Downloads','ECCO_V4r4_PODAAC')
[4]:
## if working in the AWS cloud, access datasets needed for this tutorial

ShortNames_list = ["ECCO_L4_GEOMETRY_LLC0090GRID_V4R4",\
                   "ECCO_L4_SSH_LLC0090GRID_MONTHLY_V4R4",\
                   "ECCO_L4_OCEAN_3D_TEMPERATURE_FLUX_LLC0090GRID_MONTHLY_V4R4"]
if incloud_access == True:
    from ecco_s3_retrieve import ecco_podaac_s3_get_diskaware
    files_dict = ecco_podaac_s3_get_diskaware(ShortNames=ShortNames_list,\
                                              StartDate='2010-01',EndDate='2010-12',\
                                              max_avail_frac=0.5,\
                                              download_root_dir=ECCO_dir)

Open a c-point variable: SSH

[5]:
# load dataset containing monthly SSH in 2010
if incloud_access == True:
    # use list comprehension to list file path(s)
    file_paths = [filepath for filepath in files_dict[ShortNames_list[1]] if '_2010-' in filepath]
    ecco_dataset_A = xr.open_mfdataset(file_paths)
else:
    ecco_dataset_A = xr.open_mfdataset(join(ECCO_dir,'*SSH*MONTHLY*','*_2010-??_*.nc'))

To see the data variables in a Dataset, use .data_vars:

[6]:
ecco_dataset_A.data_vars
[6]:
Data variables:
    SSH       (time, tile, j, i) float32 5MB dask.array<chunksize=(1, 13, 90, 90), meta=np.ndarray>
    SSHIBC    (time, tile, j, i) float32 5MB dask.array<chunksize=(1, 13, 90, 90), meta=np.ndarray>
    SSHNOIBC  (time, tile, j, i) float32 5MB dask.array<chunksize=(1, 13, 90, 90), meta=np.ndarray>
    ETAN      (time, tile, j, i) float32 5MB dask.array<chunksize=(1, 13, 90, 90), meta=np.ndarray>

ecco_dataset_A has four data variables, all with dimensions i, j, tile, and time, which means they are centered with respect to the model grid cells. The coordinates associated with the SSH variable are:

[7]:
ecco_dataset_A.SSH.coords
[7]:
Coordinates:
  * i        (i) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * j        (j) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * tile     (tile) int32 52B 0 1 2 3 4 5 6 7 8 9 10 11 12
  * time     (time) datetime64[ns] 96B 2010-01-16T12:00:00 ... 2010-12-16T12:...
    XC       (tile, j, i) float32 421kB dask.array<chunksize=(13, 90, 90), meta=np.ndarray>
    YC       (tile, j, i) float32 421kB dask.array<chunksize=(13, 90, 90), meta=np.ndarray>

You can see that the coordinates that are also dimensions (dimensional coordinates) are marked with an asterisk. The non-dimensional coordinates XC and YC are not used for indexing, but they are very important: they give the longitude and latitude, respectively, associated with each dimensional coordinate location.
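
For example (a minimal sketch using the Dataset we just loaded; the tile and cell indices are arbitrary), selecting a cell by its dimensional coordinates carries XC and YC along, telling us where that cell sits on the globe:

# select one grid cell by its integer (dimensional) coordinates;
# the non-dimensional coordinates XC and YC come along for the ride
ssh_cell = ecco_dataset_A.SSH.isel(time=0, tile=2, j=45, i=45)
print(float(ssh_cell.XC), float(ssh_cell.YC))   # longitude, latitude of this cell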

Open u- and v-point variables: ADVx_TH and ADVy_TH

Now let’s open the ECCOv4 output files containing the horizontal advective fluxes of potential temperature, ADVx_TH and ADVy_TH.

[8]:
# load dataset containing monthly mean 3D temperature fluxes in 2010
if incloud_access == True:
    file_paths = [filepath for filepath in files_dict[ShortNames_list[2]] if '_2010-' in filepath]
    ecco_dataset_B = xr.open_mfdataset(file_paths)
else:
    ecco_dataset_B = xr.open_mfdataset(join(ECCO_dir,'*3D_TEMPERATURE_FLUX_LLC0090GRID_MONTHLY*','*_2010-??_*.nc'))

ecco_dataset_B.data_vars
[8]:
Data variables:
    ADVx_TH  (time, k, tile, j, i_g) float32 253MB dask.array<chunksize=(1, 25, 7, 45, 45), meta=np.ndarray>
    DFxE_TH  (time, k, tile, j, i_g) float32 253MB dask.array<chunksize=(1, 25, 7, 45, 45), meta=np.ndarray>
    ADVy_TH  (time, k, tile, j_g, i) float32 253MB dask.array<chunksize=(1, 25, 7, 45, 45), meta=np.ndarray>
    DFyE_TH  (time, k, tile, j_g, i) float32 253MB dask.array<chunksize=(1, 25, 7, 45, 45), meta=np.ndarray>
    ADVr_TH  (time, k_l, tile, j, i) float32 253MB dask.array<chunksize=(1, 25, 7, 45, 45), meta=np.ndarray>
    DFrE_TH  (time, k_l, tile, j, i) float32 253MB dask.array<chunksize=(1, 25, 7, 45, 45), meta=np.ndarray>
    DFrI_TH  (time, k_l, tile, j, i) float32 253MB dask.array<chunksize=(1, 25, 7, 45, 45), meta=np.ndarray>

ecco_dataset_B has seven data variables! These include three variables (names starting with ADV) that quantify 3D advective fluxes and four variables (names starting with DF) that quantify 3D diffusive fluxes, one of which (DFrI_TH) is the implicit flux associated with the model’s vertical mixing parameterization. In this tutorial we will focus on the two variables that quantify horizontal advective fluxes.

Let’s look at one of these variables, ADVx_TH:

[9]:
ecco_dataset_B.ADVx_TH
[9]:
<xarray.DataArray 'ADVx_TH' (time: 12, k: 50, tile: 13, j: 90, i_g: 90)> Size: 253MB
dask.array<concatenate, shape=(12, 50, 13, 90, 90), dtype=float32, chunksize=(1, 25, 7, 45, 45), chunktype=numpy.ndarray>
Coordinates:
  * i_g      (i_g) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * j        (j) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * k        (k) int32 200B 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
  * tile     (tile) int32 52B 0 1 2 3 4 5 6 7 8 9 10 11 12
  * time     (time) datetime64[ns] 96B 2010-01-16T12:00:00 ... 2010-12-16T12:...
    Z        (k) float32 200B dask.array<chunksize=(50,), meta=np.ndarray>
Attributes:
    long_name:              Lateral advective flux of potential temperature i...
    units:                  degree_C m3 s-1
    mate:                   ADVy_TH
    coverage_content_type:  modelResult
    direction:              >0 increases potential temperature (THETA)
    comment:                Lateral advective flux of potential temperature (...
    valid_min:              -28231902.0
    valid_max:              36523468.0

The long_name and the comment tell us that this variable is the advective flux of potential temperature in the model +x direction. It has dimensional coordinates i_g, j, k, tile, and time. The k dimension indexes depth (which was not a part of the SSH dataset), and the i_g dimension has replaced the i dimension that SSH had.

Since ADVx_TH has an i_g dimensional coordinate instead of i, we know that the flux is quantified not at the center of each grid cell but offset in the model x direction: specifically, on the left or “west” face of each cell. In this context “west” may not actually refer to geographical west, but rather to the left side of a grid cell on an axis where i and i_g increase to the right.
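
To see why this staggering matters, here is a minimal sketch of an x-flux convergence within a single tile (tile 10 is an arbitrary choice; interior cells only, ignoring exchanges across tile edges):

# sketch: within one tile, the x-flux convergence into tracer cell i is the
# flux through its 'west' face at i_g=i minus the flux through its 'east'
# face at i_g=i+1; .diff computes a[i+1]-a[i], so we negate it
advx_tile = ecco_dataset_B.ADVx_TH.isel(tile=10)
conv_x = -advx_tile.diff('i_g')   # flux in minus flux out; degree_C m3 s-1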

Now consider ADVy_TH:

[10]:
ecco_dataset_B.ADVy_TH
[10]:
<xarray.DataArray 'ADVy_TH' (time: 12, k: 50, tile: 13, j_g: 90, i: 90)> Size: 253MB
dask.array<concatenate, shape=(12, 50, 13, 90, 90), dtype=float32, chunksize=(1, 25, 7, 45, 45), chunktype=numpy.ndarray>
Coordinates:
  * i        (i) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * j_g      (j_g) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * k        (k) int32 200B 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
  * tile     (tile) int32 52B 0 1 2 3 4 5 6 7 8 9 10 11 12
  * time     (time) datetime64[ns] 96B 2010-01-16T12:00:00 ... 2010-12-16T12:...
    Z        (k) float32 200B dask.array<chunksize=(50,), meta=np.ndarray>
Attributes:
    long_name:              Lateral advective flux of potential temperature i...
    units:                  degree_C m3 s-1
    mate:                   ADVx_TH
    coverage_content_type:  modelResult
    direction:              >0 increases potential temperature (THETA)
    comment:                Lateral advective flux of potential temperature (...
    valid_min:              -31236064.0
    valid_max:              43466144.0

ADVy_TH is the horizontal advective flux of potential temperature in each tile’s y direction. The dimensional coordinates are i, j_g, k, tile, and time. In this case we have the centered x coordinate i but the off-center (shifted) y coordinate j_g, which indicates that these fluxes are located on the lower or “south” face of each grid cell—again with the caveat that this does not always correspond to geographical south.

Examining the dimensions and coordinates of these Datasets

Each of the three variables we have discussed is an xarray DataArray, and each DataArray has a different set of horizontal dimension labels:

  • i and j for SSH

  • i_g and j for ADVx_TH

  • i and j_g for ADVy_TH

[11]:
# print just the first line of each DataArray's information
print((str(ecco_dataset_A.SSH)).split('\n')[0])
print((str(ecco_dataset_B.ADVx_TH)).split('\n')[0])
print((str(ecco_dataset_B.ADVy_TH)).split('\n')[0])
<xarray.DataArray 'SSH' (time: 12, tile: 13, j: 90, i: 90)> Size: 5MB
<xarray.DataArray 'ADVx_TH' (time: 12, k: 50, tile: 13, j: 90, i_g: 90)> Size: 253MB
<xarray.DataArray 'ADVy_TH' (time: 12, k: 50, tile: 13, j_g: 90, i: 90)> Size: 253MB

Merging multiple Datasets from state estimate variables

Using xarray’s merge function, we can create a single Dataset holding multiple DataArrays.

[12]:
# merge together
ecco_dataset_AB = xr.merge([ecco_dataset_A['SSH'], ecco_dataset_B[['ADVx_TH','ADVy_TH']]]).compute()

Note: There are multiple syntaxes for selecting individual variables/DataArrays from an xarray Dataset. In the case of SSH above, we used both ecco_dataset_A.SSH and ecco_dataset_A['SSH']; they refer to the same DataArray. The bracket syntax has the advantage of letting you create a subset Dataset with multiple variables by passing a list of the variable names, e.g., ecco_dataset_B[['ADVx_TH','ADVy_TH']].
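
A quick check (sketch) confirms both points:

# dot and bracket access refer to the same underlying DataArray
print(ecco_dataset_A.SSH.equals(ecco_dataset_A['SSH']))    # True
# a list of names inside the brackets returns a (sub-)Dataset
print(type(ecco_dataset_B[['ADVx_TH','ADVy_TH']]))         # xarray.Dataset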

Examining the merged Dataset

As before, let’s look at the contents of the new merged Dataset:

[13]:
ecco_dataset_AB
[13]:
<xarray.Dataset> Size: 512MB
Dimensions:  (i: 90, j: 90, tile: 13, time: 12, k: 50, i_g: 90, j_g: 90)
Coordinates:
  * i        (i) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * j        (j) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * tile     (tile) int32 52B 0 1 2 3 4 5 6 7 8 9 10 11 12
  * time     (time) datetime64[ns] 96B 2010-01-16T12:00:00 ... 2010-12-16T12:...
    XC       (tile, j, i) float32 421kB -111.6 -111.3 -110.9 ... -105.6 -111.9
    YC       (tile, j, i) float32 421kB -88.24 -88.38 -88.52 ... -88.08 -88.1
  * i_g      (i_g) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * j_g      (j_g) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * k        (k) int32 200B 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
    XG       (tile, j_g, i_g) float32 421kB -115.0 -115.0 ... -102.9 -109.0
    YG       (tile, j_g, i_g) float32 421kB -88.18 -88.32 ... -87.99 -88.02
    Z        (k) float32 200B -5.0 -15.0 -25.0 ... -5.461e+03 -5.906e+03
Data variables:
    SSH      (time, tile, j, i) float32 5MB nan nan nan nan ... nan nan nan nan
    ADVx_TH  (time, k, tile, j, i_g) float32 253MB nan nan nan ... nan nan nan
    ADVy_TH  (time, k, tile, j_g, i) float32 253MB nan nan nan ... nan nan nan
Attributes:
    long_name:              Dynamic sea surface height anomaly
    units:                  m
    coverage_content_type:  modelResult
    standard_name:          sea_surface_height_above_geoid
    comment:                Dynamic sea surface height anomaly above the geoi...
    valid_min:              -1.8805772066116333
    valid_max:              1.4207719564437866

1. Dimensions

Dimensions:       (i: 90, j: 90, tile: 13, time: 12, i_g: 90, k: 50, j_g: 90)

ecco_dataset_AB is a container of DataArrays, and as such it lists all of the unique dimensions of its DataArrays. In other words, Dimensions shows all of the dimensions used by its variables.
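
A quick sketch confirms this: the Dataset’s dimensions are exactly the union of its data variables’ dimensions.

# the merged Dataset's dimensions are the union over its data variables
dims_union = set().union(*(ecco_dataset_AB[v].dims for v in ecco_dataset_AB.data_vars))
print(dims_union == set(ecco_dataset_AB.dims))   # True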

2. Dimension Coordinates

Coordinates:
  * i        (i)      int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * j        (j)      int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * tile     (tile)   int32 0 1 2 3 4 5 6 7 8 9 10 11 12
  * time     (time)   datetime64[ns] 2010-01-16T12:00:00 ... 2010-12-16T12:00:00
  * i_g      (i_g)    int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89
  * k        (k)      int32 0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
  * j_g      (j_g)    int32 0 1 2 3 4 5 6 7 8 9 ... 80 81 82 83 84 85 86 87 88 89

Notice that the tile and time coordinates are unchanged. merge recognizes identical coordinates and keeps a single copy of each.
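
You can verify this (a sketch; expected to print True on both lines):

# coordinates shared by the inputs appear once in the merged result
print(ecco_dataset_AB.time.identical(ecco_dataset_A.time))   # expected True
print(ecco_dataset_AB.tile.identical(ecco_dataset_B.tile))   # expected True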

3. Non-Dimension Coordinates

    XC       (tile, j, i)       float32 -111.60647 -111.303 -110.94285 ... nan nan

    YC       (tile, j, i)       float32 -88.24259 -88.382515 -88.52242 ... nan nan

    XG       (tile, j_g, i_g)   float32 -115.0 -115.0 ... -102.9 -109.0

    YG       (tile, j_g, i_g)   float32 -88.18 -88.32 ... -87.99 -88.02

    Z        (k)                float32 -5.0 -15.0 -25.0 -35.0 ... -5039.25 -5461.25 -5906.25

The list of non-dimension coordinates includes the horizontal and vertical coordinates needed to locate each grid cell in geographical space. Like Dimensions, the non-dimension coordinates of the merged Dataset are the union of the non-dimension coordinates of the source Datasets. Notice that the horizontal coordinates of the grid cell corners (XG, YG) are included as well: although SSH does not use them, they came along with the flux Dataset, whose variables are indexed by i_g and j_g.

4. Attributes

Note that the top-level attributes of the new Dataset are just the attributes of the first object in the merge (the SSH DataArray). The attributes of the data variables remain intact:

[14]:
# (this json command makes Python dictionaries easier to read)
print(json.dumps(ecco_dataset_AB.ADVx_TH.attrs, indent=2,sort_keys=True))
{
  "comment": "Lateral advective flux of potential temperature (THETA) in the +x direction through the 'u' face of the tracer cell on the native model grid. Note: in the Arakawa-C grid, horizontal flux quantities are staggered relative to the tracer cells with indexing such that +ADVx_TH(i_g,j,k) corresponds to +x fluxes through the 'u' face of the tracer cell at (i,j,k). Also, the model +x direction does not necessarily correspond to the geographical east-west direction because the x and y axes of the model's lat-lon-cap (llc) curvilinear lat-lon-cap (llc) grid have arbitrary orientations which vary within and across tiles.",
  "coverage_content_type": "modelResult",
  "direction": ">0 increases potential temperature (THETA)",
  "long_name": "Lateral advective flux of potential temperature in the model +x direction",
  "mate": "ADVy_TH",
  "units": "degree_C m3 s-1",
  "valid_max": 36523468.0,
  "valid_min": -28231902.0
}
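
If you want different handling of the Dataset-level attributes, xr.merge accepts a combine_attrs argument; the default, 'override', keeps the attributes of the first object, which matches what we saw above. A sketch with the 'drop' option:

# drop the Dataset-level attributes entirely during the merge (sketch);
# the per-variable attributes are unaffected
merged_no_attrs = xr.merge([ecco_dataset_A['SSH'],
                            ecco_dataset_B[['ADVx_TH','ADVy_TH']]],
                           combine_attrs='drop')
print(merged_no_attrs.attrs)   # {}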

Adding the model grid Dataset

Let’s use the merge routine to combine a Dataset of the model grid parameters with ecco_dataset_AB.

Load the model grid parameters

[15]:
# Load the llc90 grid parameters
if incloud_access == True:
    grid_dataset = xr.open_dataset(files_dict[ShortNames_list[0]])
else:
    import glob
    grid_dataset = xr.open_dataset(glob.glob(join(ECCO_dir,'*GEOMETRY*','*.nc'))[0])
grid_dataset.coords
[15]:
Coordinates:
  * i        (i) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * i_g      (i_g) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * j        (j) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * j_g      (j_g) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * k        (k) int32 200B 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
  * k_u      (k_u) int32 200B 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
  * k_l      (k_l) int32 200B 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
  * k_p1     (k_p1) int32 204B 0 1 2 3 4 5 6 7 8 ... 42 43 44 45 46 47 48 49 50
  * tile     (tile) int32 52B 0 1 2 3 4 5 6 7 8 9 10 11 12
    XC       (tile, j, i) float32 421kB ...
    YC       (tile, j, i) float32 421kB ...
    XG       (tile, j_g, i_g) float32 421kB ...
    YG       (tile, j_g, i_g) float32 421kB ...
    Z        (k) float32 200B ...
    Zp1      (k_p1) float32 204B ...
    Zu       (k_u) float32 200B ...
    Zl       (k_l) float32 200B ...
    XC_bnds  (tile, j, i, nb) float32 2MB ...
    YC_bnds  (tile, j, i, nb) float32 2MB ...
    Z_bnds   (k, nv) float32 400B ...

Merge grid_dataset with ecco_dataset_AB

[16]:
ecco_dataset_ABG = xr.merge([ecco_dataset_AB, grid_dataset])
ecco_dataset_ABG
[16]:
<xarray.Dataset> Size: 599MB
Dimensions:  (i: 90, j: 90, tile: 13, time: 12, k: 50, i_g: 90, j_g: 90,
              k_u: 50, k_l: 50, k_p1: 51, nb: 4, nv: 2)
Coordinates: (12/21)
  * i        (i) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * j        (j) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * tile     (tile) int32 52B 0 1 2 3 4 5 6 7 8 9 10 11 12
  * time     (time) datetime64[ns] 96B 2010-01-16T12:00:00 ... 2010-12-16T12:...
    XC       (tile, j, i) float32 421kB -111.6 -111.3 -110.9 ... -105.6 -111.9
    YC       (tile, j, i) float32 421kB -88.24 -88.38 -88.52 ... -88.08 -88.1
    ...       ...
    Zp1      (k_p1) float32 204B ...
    Zu       (k_u) float32 200B ...
    Zl       (k_l) float32 200B ...
    XC_bnds  (tile, j, i, nb) float32 2MB ...
    YC_bnds  (tile, j, i, nb) float32 2MB ...
    Z_bnds   (k, nv) float32 400B ...
Dimensions without coordinates: nb, nv
Data variables: (12/24)
    SSH      (time, tile, j, i) float32 5MB nan nan nan nan ... nan nan nan nan
    ADVx_TH  (time, k, tile, j, i_g) float32 253MB nan nan nan ... nan nan nan
    ADVy_TH  (time, k, tile, j_g, i) float32 253MB nan nan nan ... nan nan nan
    CS       (tile, j, i) float32 421kB ...
    SN       (tile, j, i) float32 421kB ...
    rA       (tile, j, i) float32 421kB ...
    ...       ...
    hFacC    (k, tile, j, i) float32 21MB ...
    hFacW    (k, tile, j, i_g) float32 21MB ...
    hFacS    (k, tile, j_g, i) float32 21MB ...
    maskC    (k, tile, j, i) bool 5MB ...
    maskW    (k, tile, j, i_g) bool 5MB ...
    maskS    (k, tile, j_g, i) bool 5MB ...
Attributes:
    long_name:              Dynamic sea surface height anomaly
    units:                  m
    coverage_content_type:  modelResult
    standard_name:          sea_surface_height_above_geoid
    comment:                Dynamic sea surface height anomaly above the geoi...
    valid_min:              -1.8805772066116333
    valid_max:              1.4207719564437866

Examining the merged Dataset

The result of this last merge is a single Dataset with three data variables and a complete set of model grid parameters (geographical coordinates, distances, areas), including the coordinates of the grid cell corners, XG and YG.
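
As an example of what the merged grid parameters enable, here is a sketch of an area-weighted mean of SSH using the grid cell areas rA and the surface level of the wet-point mask maskC (the weighting details may differ for your application):

# area-weighted mean SSH, masking land with the surface wet-point mask (sketch)
weights = ecco_dataset_ABG.rA.where(ecco_dataset_ABG.maskC.isel(k=0), 0)
ssh_mean = ecco_dataset_ABG.SSH.weighted(weights).mean(dim=['tile','j','i'])
print(ssh_mean.values)   # one value per month of 2010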

Merging and memory

Merging Datasets does not copy the underlying data in memory; a merged Dataset is just a reorganized collection of pointers to the original arrays. You may want to delete the original variables to tidy your namespace, but it is not necessary.
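
One way to see this (a sketch; the identity check below relies on xarray not reindexing during the merge, which may vary across versions):

# a lazy merge typically wraps the same underlying dask arrays as its inputs,
# so no data is duplicated (an assumption worth verifying on your setup)
merged_lazy = xr.merge([ecco_dataset_A['SSH'], ecco_dataset_B[['ADVx_TH','ADVy_TH']]])
print(merged_lazy.SSH.data is ecco_dataset_A.SSH.data)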

Summary

Now you know how to merge multiple Datasets using the merge function. We demonstrated merging Datasets constructed from three different variable types (c-, u-, and v-point) as well as the model grid parameters.