Operating on Numpy arrays

Objectives

Introduce numpy’s pass-by-reference approach handling numpy arrays and methods for avoiding pitfalls when operating on numpy arrays.

Introduction

From: http://span.ece.utah.edu/common-mistakes-in-moving-from-matlab-to-python:

“Whenever you reference an array in Python, the computer will provide the memory address for the thing you are accessing, not the actual value. This is called pass-by-reference. This saves memory and makes your programs faster, but it is also harder to keep straight.”

From: https://docs.python.org/2/library/copy.html

“Assignment statements in Python do not copy objects, they create bindings [pointers] between a target and an object.” “… a copy is sometimes needed so one can change one copy without changing the other. The ‘copy’ module provides generic … copy operations.”

If you are not familiar with the pass-by-reference aspect of Python then I strongly suggest you read this short, informative essay on “Python Names and Values”: https://nedbatchelder.com/text/names.html

We’ve briefly touched on this important subject in earlier tutorials. Now we’ll go into a bit more detail.

Note: For this tutorial you will need the 2010 monthly SSH fields (ShortName ECCO_L4_SSH_LLC0090GRID_MONTHLY_V4R4) and the grid file. The ecco_access library used in the notebook will handle download or retrieval of the necessary data, if you have set up the library in your Python path.

Variable assignments

Unlike some other languages, creating a new variable with an assignment statement in Python such as x = some_numpy_array

does not make a copy of some_numpy_array. Instead, the assignment statement makes x and some_numpy_array both point to the same numpy array in memory. Because x and some_numpy_array are both refer (or pointer) to the same numpy array in memory, the numpy array can be changed by operations on either x or some_numpy_array. If you aren’t aware of this behavior then you may run into very difficult to identify bugs in your calculations!

A simple demonstration

Let’s demonstrate this issue with a very simple numpy array

[1]:
import numpy as np
import xarray as xr
import sys
import matplotlib.pyplot as plt
%matplotlib inline
import json
import glob
from copy import deepcopy
import warnings
warnings.filterwarnings('ignore')

import ecco_access as ea

# indicate mode of access
# options are:
# 'download': direct download from internet to your local machine
# 'download_ifspace': like download, but only proceeds
#                     if your machine have sufficient storage
# 's3_open': access datasets in-cloud from an AWS instance
# 's3_open_fsspec': use jsons generated with fsspec and
#                   kerchunk libraries to speed up in-cloud access
# 's3_get': direct download from S3 in-cloud to an AWS instance
# 's3_get_ifspace': like s3_get, but only proceeds if your instance
#                   has sufficient storage
access_mode = 'download_ifspace'
[2]:
## Import the ecco_v4_py library into Python
## =========================================
##    If ecco_v4_py is not installed in your local Python library,
##    tell Python where to find it.  The example below adds
##    ecco_v4_py to the user's path if it is stored in the folder
##    ECCOv4-py under the user's home directory

from os.path import join,expanduser
user_home_dir = expanduser('~')

sys.path.append(join(user_home_dir,'ECCOv4-py'))

import ecco_v4_py as ecco
[3]:
## Set top-level file directory for the ECCO NetCDF files
## =================================================================

## currently set to /Downloads/ECCO_V4r4_PODAAC
ECCO_dir = join(user_home_dir,'Downloads','ECCO_V4r4_PODAAC')

# # for access_mode = 's3_open_fsspec', need to specify the root directory
# # containing the jsons
# jsons_root_dir = join('/efs_ecco','mzz-jsons')

Create a simple numpy array

[4]:
a=np.array([1, 2, 3, 4, 5])

# Assign 'b' to point to the same numpy array
b=a

# Test to see if b and a point to the same thing
b is a
[4]:
True

Now change the fourth element of b and print both a and b

[5]:
b[3] = 10
print (a)
print (b)
[ 1  2  3 10  5]
[ 1  2  3 10  5]

A fancier demonstration

Let’s now demonstrate with a numpy array that stores SSH output.

[6]:
## load ECCO datasets needed for the tutorial

ShortNames_list = ["ECCO_L4_GEOMETRY_LLC0090GRID_V4R4",\
                   "ECCO_L4_SSH_LLC0090GRID_MONTHLY_V4R4"]

ds_dict = ea.ecco_podaac_to_xrdataset(ShortNames_list,\
                                              StartDate='2010-01',EndDate='2010-12',\
                                              mode=access_mode,\
                                              download_root_dir=ECCO_dir,\
                                              max_avail_frac=0.5)

ecco_grid = ds_dict[ShortNames_list[0]].compute()
ssh_dataset = ds_dict[ShortNames_list[1]].compute()

## Merge SSH and grid
output_all = xr.merge((ssh_dataset, ecco_grid))

Recall the dimensions of our SSH DataArray:

[7]:
output_all.SSH.dims
[7]:
('time', 'tile', 'j', 'i')

Show the first four SSH values in j and i for the fifth month (May 2010) and second tile:

[8]:
output_all.SSH[4,1,0:4,0:4].values
[8]:
array([[-1.3647507, -1.3654912, -1.3614875, -1.3682245],
       [-1.3252122, -1.3242471, -1.3242686, -1.3410169],
       [-1.2957976, -1.2947818, -1.3008444, -1.3268987],
       [-1.285025 , -1.2845778, -1.2949846, -1.325418 ]], dtype=float32)

Assign the variable ssh_tmp to this subset of the numpy array that SSH points to:

[9]:
ssh_tmp = output_all.SSH[4,1,0:2,0:2].values
ssh_tmp
[9]:
array([[-1.3647507, -1.3654912],
       [-1.3252122, -1.3242471]], dtype=float32)

Now change the values of all elements of ssh_tmp to 10

[10]:
ssh_tmp[:] = 10
ssh_tmp
[10]:
array([[10., 10.],
       [10., 10.]], dtype=float32)

And see that yes, in fact, this change is reflected in our SSH DataArray:

[11]:
output_all.SSH[4,1,0:4,0:4].values
[11]:
array([[10.       , 10.       , -1.3614875, -1.3682245],
       [10.       , 10.       , -1.3242686, -1.3410169],
       [-1.2957976, -1.2947818, -1.3008444, -1.3268987],
       [-1.285025 , -1.2845778, -1.2949846, -1.325418 ]], dtype=float32)

Dealing with pass-by-reference: right hand side operations

One way to have a new variable assignment not point to the original variable is to perform an operation on the right hand side of the assignment statement.

“Python evaluates expressions from left to right. Notice that while evaluating an assignment, the right-hand side is evaluated before the left-hand side.” https://docs.python.org/2/reference/expressions.html#evaluation-order

Performing an operation on the right hand side creates new values in memory. The new variable assignment will then point to these new values, leaving the original untouched.

Simple demonstration 1

Operate on a by adding 1 before the assigment statement

[12]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])

b = a + 1

print (a)
print (b)
[1 2 3 4 5]
[2 3 4 5 6]

Now change the fourth element of b and print both a and b

[13]:
b[3] = 10
print (a)
print (b)
[1 2 3 4 5]
[ 2  3  4 10  6]

a and b do indeed point to different values in memory.

Simple demonstration 2

Operate on a by adding 0 before the assigment statement. This is a kind of dummy operation.

[14]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])

# Add 0 to `a`:
b = a + 0

print (a)
print (b)
[1 2 3 4 5]
[1 2 3 4 5]
[15]:
# Test to see if b and a point to the same thing
b is a
[15]:
False

Now change the fourth element of b and print both a and b

[16]:
b[3] = 10
print (a)
print (b)
[1 2 3 4 5]
[ 1  2  3 10  5]

Once again we see that a and b do indeed point to different values in memory.

A fancier demonstration

Let’s now demonstrate with a numpy array that stores SSH output.

[17]:
output_all.SSH[4,1,5:9,5:9].values
[17]:
array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
       [-1.3569907, -1.3593892, -1.3547356, -1.3458692],
       [-1.3178127, -1.316035 , -1.3086342, -1.2973909],
       [-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)
[18]:
ssh_tmp = output_all.SSH[4,1,5:9,5:9].values * output_all.rA[1,5:9,5:9].values
ssh_tmp[:] = 10
ssh_tmp
[18]:
array([[10., 10., 10., 10.],
       [10., 10., 10., 10.],
       [10., 10., 10., 10.],
       [10., 10., 10., 10.]], dtype=float32)
[19]:
output_all.SSH[4,1,5:9,5:9].values
[19]:
array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
       [-1.3569907, -1.3593892, -1.3547356, -1.3458692],
       [-1.3178127, -1.316035 , -1.3086342, -1.2973909],
       [-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)

Operating on the right hand side of the assignment does indeed new arrays in memory leaving the original SSH numpy array untouched.

Dealing with pass-by-reference: copy and deepcopy

A second way to have a new variable assignment not point to the original variable is to use the copy or deepcopy command.

Simple demonstration

Use the numpy command.

[20]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])
b=np.copy(a)

print (a)
print (b)
[1 2 3 4 5]
[1 2 3 4 5]

Now change the fourth element of b and print both a and b

[21]:
b[3] = 10
print (a)
print (b)
[1 2 3 4 5]
[ 1  2  3 10  5]
[22]:
output_all.SSH
[22]:
<xarray.DataArray 'SSH' (time: 12, tile: 13, j: 90, i: 90)> Size: 5MB
array([[[[        nan,         nan,         nan, ...,         nan,
                  nan,         nan],
         [        nan,         nan,         nan, ...,         nan,
                  nan,         nan],
         [        nan,         nan,         nan, ...,         nan,
                  nan,         nan],
         ...,
         [-1.4194752 , -1.4261409 , -1.4262563 , ..., -1.335799  ,
          -1.3286246 , -1.3223515 ],
         [-1.3915291 , -1.3982822 , -1.3976623 , ..., -1.3265072 ,
          -1.3189714 , -1.3117282 ],
         [-1.3617151 , -1.3670161 , -1.3660299 , ..., -1.3197623 ,
          -1.3118473 , -1.303166  ]],

        [[-1.3301036 , -1.3331454 , -1.3329679 , ..., -1.3132814 ,
          -1.3051906 , -1.2948027 ],
         [-1.2997499 , -1.3010498 , -1.303525  , ..., -1.3048117 ,
          -1.2970599 , -1.2852311 ],
         [-1.2769612 , -1.2781848 , -1.2847781 , ..., -1.2925622 ,
          -1.2858012 , -1.2733294 ],
...
         [ 0.14350861,  0.14915593,  0.16067028, ..., -1.2250897 ,
          -1.2384485 , -1.2555339 ],
         [ 0.14452162,  0.15083845,  0.1632475 , ..., -1.245257  ,
          -1.2634263 , -1.2841328 ],
         [ 0.14675961,  0.15251625,  0.16487278, ..., -1.2580327 ,
          -1.2814457 , -1.3074644 ]],

        [[-0.73443437, -0.8210122 , -0.907068  , ...,         nan,
                  nan,         nan],
         [-0.7157703 , -0.8134533 , -0.9050429 , ...,         nan,
                  nan,         nan],
         [-0.71003354, -0.81543124, -0.90919775, ...,         nan,
                  nan,         nan],
         ...,
         [-1.2795744 , -1.3132012 , -1.3549699 , ...,         nan,
                  nan,         nan],
         [-1.3077799 , -1.3369125 , -1.3719907 , ...,         nan,
                  nan,         nan],
         [-1.3335364 , -1.3611741 , -1.3917223 , ...,         nan,
                  nan,         nan]]]], dtype=float32)
Coordinates:
  * i        (i) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * j        (j) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89
  * tile     (tile) int32 52B 0 1 2 3 4 5 6 7 8 9 10 11 12
  * time     (time) datetime64[ns] 96B 2010-01-16T12:00:00 ... 2010-12-16T12:...
    XC       (tile, j, i) float32 421kB -111.6 -111.3 -110.9 ... -105.6 -111.9
    YC       (tile, j, i) float32 421kB -88.24 -88.38 -88.52 ... -88.08 -88.1
Attributes:
    long_name:              Dynamic sea surface height anomaly
    units:                  m
    coverage_content_type:  modelResult
    standard_name:          sea_surface_height_above_geoid
    comment:                Dynamic sea surface height anomaly above the geoi...
    valid_min:              -1.8805772066116333
    valid_max:              1.4207719564437866

Fancier demonstration

Dataset and DataArray objects are too complicated for numpy’s copy command. For complex objects such as these use the deepcopy command.

[23]:
ssh_tmp = deepcopy(output_all.SSH)
ssh_tmp[:] = 10
ssh_tmp[4,1,5:9,5:9].values
[23]:
array([[10., 10., 10., 10.],
       [10., 10., 10., 10.],
       [10., 10., 10., 10.],
       [10., 10., 10., 10.]], dtype=float32)
[24]:
output_all.SSH[4,1,5:9,5:9].values
[24]:
array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
       [-1.3569907, -1.3593892, -1.3547356, -1.3458692],
       [-1.3178127, -1.316035 , -1.3086342, -1.2973909],
       [-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)

Using deepcopy gives us an entirely new array in memory. Operations on ssh_tmp do not affect the original fields that we found in the output_all_SSH DataArray.

alternative to deepcopy

xarray give us another way to deepcopy DataArrays and Datasets:

ssh_tmp = output_all.copy(deep=True)

Conclusion

You now know about the possible pitfalls for dealing with Python’s pass-by-reference way of handling assignment statements and different methods for making copies of numpy arrays and Datasets and DataArrays.