Operating on Numpy arrays¶
Objectives¶
Introduce numpy’s pass-by-reference approach handling numpy arrays and methods for avoiding pitfalls when operating on numpy arrays.
Introduction¶
From: http://span.ece.utah.edu/common-mistakes-in-moving-from-matlab-to-python:
“Whenever you reference an array in Python, the computer will provide the memory address for the thing you are accessing, not the actual value. This is called pass-by-reference. This saves memory and makes your programs faster, but it is also harder to keep straight.”
From: https://docs.python.org/2/library/copy.html
“Assignment statements in Python do not copy objects, they create bindings [pointers] between a target and an object.” “… a copy is sometimes needed so one can change one copy without changing the other. The ‘copy’ module provides generic … copy operations.”
If you are not familiar with the pass-by-reference aspect of Python then I strongly suggest you read this short, informative essay on “Python Names and Values”: https://nedbatchelder.com/text/names.html
We’ve briefly touched on this important subject in earlier tutorials. Now we’ll go into a bit more detail.
Note: For this tutorial you will need the 2010 monthly SSH fields (ShortName ECCO_L4_SSH_LLC0090GRID_MONTHLY_V4R4) and the grid file. The
ecco_access
library used in the notebook will handle download or retrieval of the necessary data, if you have set up the library in your Python path.
Variable assignments¶
Unlike some other languages, creating a new variable with an assignment statement in Python such as x = some_numpy_array
does not make a copy of some_numpy_array
. Instead, the assignment statement makes x
and some_numpy_array
both point to the same numpy
array in memory. Because x
and some_numpy_array
are both refer (or pointer) to the same numpy
array in memory, the numpy
array can be changed by operations on either x
or some_numpy_array
. If you aren’t aware of this behavior then you may run into very difficult to identify bugs in your calculations!
A simple demonstration¶
Let’s demonstrate this issue with a very simple numpy
array
[1]:
import numpy as np
import xarray as xr
import sys
import matplotlib.pyplot as plt
%matplotlib inline
import json
import glob
from copy import deepcopy
import warnings
warnings.filterwarnings('ignore')
import ecco_access as ea
# indicate mode of access
# options are:
# 'download': direct download from internet to your local machine
# 'download_ifspace': like download, but only proceeds
# if your machine have sufficient storage
# 's3_open': access datasets in-cloud from an AWS instance
# 's3_open_fsspec': use jsons generated with fsspec and
# kerchunk libraries to speed up in-cloud access
# 's3_get': direct download from S3 in-cloud to an AWS instance
# 's3_get_ifspace': like s3_get, but only proceeds if your instance
# has sufficient storage
access_mode = 'download_ifspace'
[2]:
## Import the ecco_v4_py library into Python
## =========================================
## If ecco_v4_py is not installed in your local Python library,
## tell Python where to find it. The example below adds
## ecco_v4_py to the user's path if it is stored in the folder
## ECCOv4-py under the user's home directory
from os.path import join,expanduser
user_home_dir = expanduser('~')
sys.path.append(join(user_home_dir,'ECCOv4-py'))
import ecco_v4_py as ecco
[3]:
## Set top-level file directory for the ECCO NetCDF files
## =================================================================
## currently set to /Downloads/ECCO_V4r4_PODAAC
ECCO_dir = join(user_home_dir,'Downloads','ECCO_V4r4_PODAAC')
# # for access_mode = 's3_open_fsspec', need to specify the root directory
# # containing the jsons
# jsons_root_dir = join('/efs_ecco','mzz-jsons')
Create a simple numpy array
[4]:
a=np.array([1, 2, 3, 4, 5])
# Assign 'b' to point to the same numpy array
b=a
# Test to see if b and a point to the same thing
b is a
[4]:
True
Now change the fourth element of b
and print both a
and b
[5]:
b[3] = 10
print (a)
print (b)
[ 1 2 3 10 5]
[ 1 2 3 10 5]
A fancier demonstration¶
Let’s now demonstrate with a numpy
array that stores SSH
output.
[6]:
## load ECCO datasets needed for the tutorial
ShortNames_list = ["ECCO_L4_GEOMETRY_LLC0090GRID_V4R4",\
"ECCO_L4_SSH_LLC0090GRID_MONTHLY_V4R4"]
ds_dict = ea.ecco_podaac_to_xrdataset(ShortNames_list,\
StartDate='2010-01',EndDate='2010-12',\
mode=access_mode,\
download_root_dir=ECCO_dir,\
max_avail_frac=0.5)
ecco_grid = ds_dict[ShortNames_list[0]].compute()
ssh_dataset = ds_dict[ShortNames_list[1]].compute()
## Merge SSH and grid
output_all = xr.merge((ssh_dataset, ecco_grid))
Recall the dimensions of our SSH
DataArray
:
[7]:
output_all.SSH.dims
[7]:
('time', 'tile', 'j', 'i')
Show the first four SSH values in j and i for the fifth month (May 2010) and second tile:
[8]:
output_all.SSH[4,1,0:4,0:4].values
[8]:
array([[-1.3647507, -1.3654912, -1.3614875, -1.3682245],
[-1.3252122, -1.3242471, -1.3242686, -1.3410169],
[-1.2957976, -1.2947818, -1.3008444, -1.3268987],
[-1.285025 , -1.2845778, -1.2949846, -1.325418 ]], dtype=float32)
Assign the variable ssh_tmp
to this subset of the numpy
array that SSH
points to:
[9]:
ssh_tmp = output_all.SSH[4,1,0:2,0:2].values
ssh_tmp
[9]:
array([[-1.3647507, -1.3654912],
[-1.3252122, -1.3242471]], dtype=float32)
Now change the values of all elements of ssh_tmp
to 10
[10]:
ssh_tmp[:] = 10
ssh_tmp
[10]:
array([[10., 10.],
[10., 10.]], dtype=float32)
And see that yes, in fact, this change is reflected in our SSH
DataArray
:
[11]:
output_all.SSH[4,1,0:4,0:4].values
[11]:
array([[10. , 10. , -1.3614875, -1.3682245],
[10. , 10. , -1.3242686, -1.3410169],
[-1.2957976, -1.2947818, -1.3008444, -1.3268987],
[-1.285025 , -1.2845778, -1.2949846, -1.325418 ]], dtype=float32)
Dealing with pass-by-reference: right hand side operations¶
One way to have a new variable assignment not point to the original variable is to perform an operation on the right hand side of the assignment statement.
“Python evaluates expressions from left to right. Notice that while evaluating an assignment, the right-hand side is evaluated before the left-hand side.” https://docs.python.org/2/reference/expressions.html#evaluation-order
Performing an operation on the right hand side creates new values in memory. The new variable assignment will then point to these new values, leaving the original untouched.
Simple demonstration 1¶
Operate on a
by adding 1 before the assigment statement
[12]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])
b = a + 1
print (a)
print (b)
[1 2 3 4 5]
[2 3 4 5 6]
Now change the fourth element of b
and print both a
and b
[13]:
b[3] = 10
print (a)
print (b)
[1 2 3 4 5]
[ 2 3 4 10 6]
a
and b
do indeed point to different values in memory.
Simple demonstration 2¶
Operate on a
by adding 0 before the assigment statement. This is a kind of dummy operation.
[14]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])
# Add 0 to `a`:
b = a + 0
print (a)
print (b)
[1 2 3 4 5]
[1 2 3 4 5]
[15]:
# Test to see if b and a point to the same thing
b is a
[15]:
False
Now change the fourth element of b
and print both a
and b
[16]:
b[3] = 10
print (a)
print (b)
[1 2 3 4 5]
[ 1 2 3 10 5]
Once again we see that a
and b
do indeed point to different values in memory.
A fancier demonstration¶
Let’s now demonstrate with a numpy
array that stores SSH
output.
[17]:
output_all.SSH[4,1,5:9,5:9].values
[17]:
array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
[-1.3569907, -1.3593892, -1.3547356, -1.3458692],
[-1.3178127, -1.316035 , -1.3086342, -1.2973909],
[-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)
[18]:
ssh_tmp = output_all.SSH[4,1,5:9,5:9].values * output_all.rA[1,5:9,5:9].values
ssh_tmp[:] = 10
ssh_tmp
[18]:
array([[10., 10., 10., 10.],
[10., 10., 10., 10.],
[10., 10., 10., 10.],
[10., 10., 10., 10.]], dtype=float32)
[19]:
output_all.SSH[4,1,5:9,5:9].values
[19]:
array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
[-1.3569907, -1.3593892, -1.3547356, -1.3458692],
[-1.3178127, -1.316035 , -1.3086342, -1.2973909],
[-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)
Operating on the right hand side of the assignment does indeed new arrays in memory leaving the original SSH numpy
array untouched.
Dealing with pass-by-reference: copy and deepcopy¶
A second way to have a new variable assignment not point to the original variable is to use the copy or deepcopy command.
Simple demonstration¶
Use the numpy
command.
[20]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])
b=np.copy(a)
print (a)
print (b)
[1 2 3 4 5]
[1 2 3 4 5]
Now change the fourth element of b
and print both a
and b
[21]:
b[3] = 10
print (a)
print (b)
[1 2 3 4 5]
[ 1 2 3 10 5]
[22]:
output_all.SSH
[22]:
<xarray.DataArray 'SSH' (time: 12, tile: 13, j: 90, i: 90)> Size: 5MB array([[[[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [-1.4194752 , -1.4261409 , -1.4262563 , ..., -1.335799 , -1.3286246 , -1.3223515 ], [-1.3915291 , -1.3982822 , -1.3976623 , ..., -1.3265072 , -1.3189714 , -1.3117282 ], [-1.3617151 , -1.3670161 , -1.3660299 , ..., -1.3197623 , -1.3118473 , -1.303166 ]], [[-1.3301036 , -1.3331454 , -1.3329679 , ..., -1.3132814 , -1.3051906 , -1.2948027 ], [-1.2997499 , -1.3010498 , -1.303525 , ..., -1.3048117 , -1.2970599 , -1.2852311 ], [-1.2769612 , -1.2781848 , -1.2847781 , ..., -1.2925622 , -1.2858012 , -1.2733294 ], ... [ 0.14350861, 0.14915593, 0.16067028, ..., -1.2250897 , -1.2384485 , -1.2555339 ], [ 0.14452162, 0.15083845, 0.1632475 , ..., -1.245257 , -1.2634263 , -1.2841328 ], [ 0.14675961, 0.15251625, 0.16487278, ..., -1.2580327 , -1.2814457 , -1.3074644 ]], [[-0.73443437, -0.8210122 , -0.907068 , ..., nan, nan, nan], [-0.7157703 , -0.8134533 , -0.9050429 , ..., nan, nan, nan], [-0.71003354, -0.81543124, -0.90919775, ..., nan, nan, nan], ..., [-1.2795744 , -1.3132012 , -1.3549699 , ..., nan, nan, nan], [-1.3077799 , -1.3369125 , -1.3719907 , ..., nan, nan, nan], [-1.3335364 , -1.3611741 , -1.3917223 , ..., nan, nan, nan]]]], dtype=float32) Coordinates: * i (i) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89 * j (j) int32 360B 0 1 2 3 4 5 6 7 8 9 ... 81 82 83 84 85 86 87 88 89 * tile (tile) int32 52B 0 1 2 3 4 5 6 7 8 9 10 11 12 * time (time) datetime64[ns] 96B 2010-01-16T12:00:00 ... 2010-12-16T12:... XC (tile, j, i) float32 421kB -111.6 -111.3 -110.9 ... -105.6 -111.9 YC (tile, j, i) float32 421kB -88.24 -88.38 -88.52 ... -88.08 -88.1 Attributes: long_name: Dynamic sea surface height anomaly units: m coverage_content_type: modelResult standard_name: sea_surface_height_above_geoid comment: Dynamic sea surface height anomaly above the geoi... valid_min: -1.8805772066116333 valid_max: 1.4207719564437866
Fancier demonstration¶
Dataset
and DataArray
objects are too complicated for numpy
’s copy
command. For complex objects such as these use the deepcopy
command.
[23]:
ssh_tmp = deepcopy(output_all.SSH)
ssh_tmp[:] = 10
ssh_tmp[4,1,5:9,5:9].values
[23]:
array([[10., 10., 10., 10.],
[10., 10., 10., 10.],
[10., 10., 10., 10.],
[10., 10., 10., 10.]], dtype=float32)
[24]:
output_all.SSH[4,1,5:9,5:9].values
[24]:
array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
[-1.3569907, -1.3593892, -1.3547356, -1.3458692],
[-1.3178127, -1.316035 , -1.3086342, -1.2973909],
[-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)
Using deepcopy
gives us an entirely new array in memory. Operations on ssh_tmp
do not affect the original fields that we found in the output_all_SSH
DataArray
.
alternative to deepcopy
¶
xarray
give us another way to deepcopy DataArrays
and Datasets
:
ssh_tmp = output_all.copy(deep=True)
Conclusion¶
You now know about the possible pitfalls for dealing with Python’s pass-by-reference way of handling assignment statements and different methods for making copies of numpy
arrays and Datasets
and DataArrays
.