Operating on Numpy arrays¶
Objectives¶
Introduce numpy’s pass-by-reference approach handling numpy arrays and methods for avoiding pitfalls when operating on numpy arrays.
Introduction¶
From: http://span.ece.utah.edu/common-mistakes-in-moving-from-matlab-to-python:
“Whenever you reference an array in Python, the computer will provide the memory address for the thing you are accessing, not the actual value. This is called pass-by-reference. This saves memory and makes your programs faster, but it is also harder to keep straight.”
From: https://docs.python.org/2/library/copy.html
“Assignment statements in Python do not copy objects, they create bindings [pointers] between a target and an object.” “… a copy is sometimes needed so one can change one copy without changing the other. The ‘copy’ module provides generic … copy operations.”
If you are not familiar with the pass-by-reference aspect of Python then I strongly suggest you read this short, informative essay on “Python Names and Values”: https://nedbatchelder.com/text/names.html
We’ve briefly touched on this important subject in earlier tutorials. Now we’ll go into a bit more detail.
Note: For this tutorial you will need the 2010 monthly SSH fields (ShortName ECCO_L4_SSH_LLC0090GRID_MONTHLY_V4R4) and the grid file downloaded, if you don’t have them already.
Variable assignments¶
Unlike some other languages, creating a new variable with an assignment statement in Python such as x = some_numpy_array
does not make a copy of some_numpy_array
. Instead, the assignment statement makes x
and some_numpy_array
both point to the same numpy
array in memory. Because x
and some_numpy_array
are both refer (or pointer) to the same numpy
array in memory, the numpy
array can be changed by operations on either x
or some_numpy_array
. If you aren’t aware of this behavior then you may run into very difficult to identify bugs in your calculations!
A simple demonstration¶
Let’s demonstrate this issue with a very simple numpy
array
[1]:
import numpy as np
import xarray as xr
import sys
import matplotlib.pyplot as plt
%matplotlib inline
import json
import glob
from copy import deepcopy
import warnings
warnings.filterwarnings('ignore')
[2]:
## Import the ecco_v4_py library into Python
## =========================================
## If ecco_v4_py is not installed in your local Python library,
## tell Python where to find it. The example below adds
## ecco_v4_py to the user's path if it is stored in the folder
## ECCOv4-py under the user's home directory
from os.path import join,expanduser
user_home_dir = expanduser('~')
sys.path.append(join(user_home_dir,'ECCOv4-py'))
import ecco_v4_py as ecco
[3]:
## Set top-level file directory for the ECCO NetCDF files
## =================================================================
## currently set to ~/Downloads/ECCO_V4r4_PODAAC,
## the default if ecco_podaac_download was used to download dataset granules
ECCO_dir = join(user_home_dir,'Downloads','ECCO_V4r4_PODAAC')
Create a simple numpy array
[4]:
a=np.array([1, 2, 3, 4, 5])
# Assign 'b' to point to the same numpy array
b=a
# Test to see if b and a point to the same thing
b is a
[4]:
True
Now change the fourth element of b
and print both a
and b
[5]:
b[3] = 10
print (a)
print (b)
[ 1 2 3 10 5]
[ 1 2 3 10 5]
A fancier demonstration¶
Let’s now demonstrate with a numpy
array that stores SSH
output.
[6]:
## LOAD NETCDF SSH FILE
ssh_dataset = xr.open_mfdataset(join(ECCO_dir,'*SSH*MONTHLY*','*_2010-*.nc')).load()
## Load the model grid
ecco_grid = xr.open_dataset(glob.glob(join(ECCO_dir,'*GEOMETRY*','*.nc'))[0])
## Merge SSH and grid
output_all = xr.merge((ssh_dataset, ecco_grid))
Recall the dimensions of our SSH
DataArray
:
[7]:
output_all.SSH.dims
[7]:
('time', 'tile', 'j', 'i')
Show the first four SSH values in j and i for the fifth month (May 2010) and second tile:
[8]:
output_all.SSH[4,1,0:4,0:4].values
[8]:
array([[-1.3647507, -1.3654912, -1.3614875, -1.3682245],
[-1.3252122, -1.3242471, -1.3242686, -1.3410169],
[-1.2957976, -1.2947818, -1.3008444, -1.3268987],
[-1.285025 , -1.2845778, -1.2949846, -1.325418 ]], dtype=float32)
Assign the variable ssh_tmp
to this subset of the numpy
array that SSH
points to:
[9]:
ssh_tmp = output_all.SSH[4,1,0:2,0:2].values
ssh_tmp
[9]:
array([[-1.3647507, -1.3654912],
[-1.3252122, -1.3242471]], dtype=float32)
Now change the values of all elements of ssh_tmp
to 10
[10]:
ssh_tmp[:] = 10
ssh_tmp
[10]:
array([[10., 10.],
[10., 10.]], dtype=float32)
And see that yes, in fact, this change is reflected in our SSH
DataArray
:
[11]:
output_all.SSH[4,1,0:4,0:4].values
[11]:
array([[10. , 10. , -1.3614875, -1.3682245],
[10. , 10. , -1.3242686, -1.3410169],
[-1.2957976, -1.2947818, -1.3008444, -1.3268987],
[-1.285025 , -1.2845778, -1.2949846, -1.325418 ]], dtype=float32)
Dealing with pass-by-reference: right hand side operations¶
One way to have a new variable assignment not point to the original variable is to perform an operation on the right hand side of the assignment statement.
“Python evaluates expressions from left to right. Notice that while evaluating an assignment, the right-hand side is evaluated before the left-hand side.” https://docs.python.org/2/reference/expressions.html#evaluation-order
Performing an operation on the right hand side creates new values in memory. The new variable assignment will then point to these new values, leaving the original untouched.
Simple demonstration 1¶
Operate on a
by adding 1 before the assigment statement
[12]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])
b = a + 1
print (a)
print (b)
[1 2 3 4 5]
[2 3 4 5 6]
Now change the fourth element of b
and print both a
and b
[13]:
b[3] = 10
print (a)
print (b)
[1 2 3 4 5]
[ 2 3 4 10 6]
a
and b
do indeed point to different values in memory.
Simple demonstration 2¶
Operate on a
by adding 0 before the assigment statement. This is a kind of dummy operation.
[14]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])
# Add 0 to `a`:
b = a + 0
print (a)
print (b)
[1 2 3 4 5]
[1 2 3 4 5]
[15]:
# Test to see if b and a point to the same thing
b is a
[15]:
False
Now change the fourth element of b
and print both a
and b
[16]:
b[3] = 10
print (a)
print (b)
[1 2 3 4 5]
[ 1 2 3 10 5]
Once again we see that a
and b
do indeed point to different values in memory.
A fancier demonstration¶
Let’s now demonstrate with a numpy
array that stores SSH
output.
[17]:
output_all.SSH[4,1,5:9,5:9].values
[17]:
array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
[-1.3569907, -1.3593892, -1.3547356, -1.3458692],
[-1.3178127, -1.316035 , -1.3086342, -1.2973909],
[-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)
[18]:
ssh_tmp = output_all.SSH[4,1,5:9,5:9].values * output_all.rA[1,5:9,5:9].values
ssh_tmp[:] = 10
ssh_tmp
[18]:
array([[10., 10., 10., 10.],
[10., 10., 10., 10.],
[10., 10., 10., 10.],
[10., 10., 10., 10.]], dtype=float32)
[19]:
output_all.SSH[4,1,5:9,5:9].values
[19]:
array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
[-1.3569907, -1.3593892, -1.3547356, -1.3458692],
[-1.3178127, -1.316035 , -1.3086342, -1.2973909],
[-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)
Operating on the right hand side of the assignment does indeed new arrays in memory leaving the original SSH numpy
array untouched.
Dealing with pass-by-reference: copy and deepcopy¶
A second way to have a new variable assignment not point to the original variable is to use the copy or deepcopy command.
Simple demonstration¶
Use the numpy
command.
[20]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])
b=np.copy(a)
print (a)
print (b)
[1 2 3 4 5]
[1 2 3 4 5]
Now change the fourth element of b
and print both a
and b
[21]:
b[3] = 10
print (a)
print (b)
[1 2 3 4 5]
[ 1 2 3 10 5]
[22]:
output_all.SSH
[22]:
<xarray.DataArray 'SSH' (time: 12, tile: 13, j: 90, i: 90)> array([[[[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [-1.4194752 , -1.4261409 , -1.4262563 , ..., -1.335799 , -1.3286246 , -1.3223515 ], [-1.3915291 , -1.3982822 , -1.3976623 , ..., -1.3265072 , -1.3189714 , -1.3117282 ], [-1.3617151 , -1.3670161 , -1.3660299 , ..., -1.3197623 , -1.3118473 , -1.303166 ]], [[-1.3301036 , -1.3331454 , -1.3329679 , ..., -1.3132814 , -1.3051906 , -1.2948027 ], [-1.2997499 , -1.3010498 , -1.303525 , ..., -1.3048117 , -1.2970599 , -1.2852311 ], [-1.2769612 , -1.2781848 , -1.2847781 , ..., -1.2925622 , -1.2858012 , -1.2733294 ], ... [ 0.14350861, 0.14915593, 0.16067028, ..., -1.2250897 , -1.2384485 , -1.2555339 ], [ 0.14452162, 0.15083845, 0.1632475 , ..., -1.245257 , -1.2634263 , -1.2841328 ], [ 0.14675961, 0.15251625, 0.16487278, ..., -1.2580327 , -1.2814457 , -1.3074644 ]], [[-0.73443437, -0.8210122 , -0.907068 , ..., nan, nan, nan], [-0.7157703 , -0.8134533 , -0.9050429 , ..., nan, nan, nan], [-0.71003354, -0.81543124, -0.90919775, ..., nan, nan, nan], ..., [-1.2795744 , -1.3132012 , -1.3549699 , ..., nan, nan, nan], [-1.3077799 , -1.3369125 , -1.3719907 , ..., nan, nan, nan], [-1.3335364 , -1.3611741 , -1.3917223 , ..., nan, nan, nan]]]], dtype=float32) Coordinates: * i (i) int32 0 1 2 3 4 5 6 7 8 9 10 ... 80 81 82 83 84 85 86 87 88 89 * j (j) int32 0 1 2 3 4 5 6 7 8 9 10 ... 80 81 82 83 84 85 86 87 88 89 * tile (tile) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 * time (time) datetime64[ns] 2010-01-16T12:00:00 ... 2010-12-16T12:00:00 XC (tile, j, i) float32 -111.6 -111.3 -110.9 ... -99.42 -105.6 -111.9 YC (tile, j, i) float32 -88.24 -88.38 -88.52 ... -88.03 -88.08 -88.1 Attributes: long_name: Dynamic sea surface height anomaly units: m coverage_content_type: modelResult standard_name: sea_surface_height_above_geoid comment: Dynamic sea surface height anomaly above the geoi... valid_min: -1.8805772066116333 valid_max: 1.4207719564437866
- time: 12
- tile: 13
- j: 90
- i: 90
- nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan
array([[[[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [-1.4194752 , -1.4261409 , -1.4262563 , ..., -1.335799 , -1.3286246 , -1.3223515 ], [-1.3915291 , -1.3982822 , -1.3976623 , ..., -1.3265072 , -1.3189714 , -1.3117282 ], [-1.3617151 , -1.3670161 , -1.3660299 , ..., -1.3197623 , -1.3118473 , -1.303166 ]], [[-1.3301036 , -1.3331454 , -1.3329679 , ..., -1.3132814 , -1.3051906 , -1.2948027 ], [-1.2997499 , -1.3010498 , -1.303525 , ..., -1.3048117 , -1.2970599 , -1.2852311 ], [-1.2769612 , -1.2781848 , -1.2847781 , ..., -1.2925622 , -1.2858012 , -1.2733294 ], ... [ 0.14350861, 0.14915593, 0.16067028, ..., -1.2250897 , -1.2384485 , -1.2555339 ], [ 0.14452162, 0.15083845, 0.1632475 , ..., -1.245257 , -1.2634263 , -1.2841328 ], [ 0.14675961, 0.15251625, 0.16487278, ..., -1.2580327 , -1.2814457 , -1.3074644 ]], [[-0.73443437, -0.8210122 , -0.907068 , ..., nan, nan, nan], [-0.7157703 , -0.8134533 , -0.9050429 , ..., nan, nan, nan], [-0.71003354, -0.81543124, -0.90919775, ..., nan, nan, nan], ..., [-1.2795744 , -1.3132012 , -1.3549699 , ..., nan, nan, nan], [-1.3077799 , -1.3369125 , -1.3719907 , ..., nan, nan, nan], [-1.3335364 , -1.3611741 , -1.3917223 , ..., nan, nan, nan]]]], dtype=float32)
- i(i)int320 1 2 3 4 5 6 ... 84 85 86 87 88 89
- axis :
- X
- long_name :
- grid index in x for variables at tracer and 'v' locations
- swap_dim :
- XC
- comment :
- In the Arakawa C-grid system, tracer (e.g., THETA) and 'v' variables (e.g., VVEL) have the same x coordinate on the model grid.
- coverage_content_type :
- coordinate
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89])
- j(j)int320 1 2 3 4 5 6 ... 84 85 86 87 88 89
- axis :
- Y
- long_name :
- grid index in y for variables at tracer and 'u' locations
- swap_dim :
- YC
- comment :
- In the Arakawa C-grid system, tracer (e.g., THETA) and 'u' variables (e.g., UVEL) have the same y coordinate on the model grid.
- coverage_content_type :
- coordinate
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89])
- tile(tile)int320 1 2 3 4 5 6 7 8 9 10 11 12
- long_name :
- lat-lon-cap tile index
- comment :
- The ECCO V4 horizontal model grid is divided into 13 tiles of 90x90 cells for convenience.
- coverage_content_type :
- coordinate
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
- time(time)datetime64[ns]2010-01-16T12:00:00 ... 2010-12-...
- long_name :
- center time of averaging period
- axis :
- T
- bounds :
- time_bnds
- coverage_content_type :
- coordinate
- standard_name :
- time
array(['2010-01-16T12:00:00.000000000', '2010-02-15T00:00:00.000000000', '2010-03-16T12:00:00.000000000', '2010-04-16T00:00:00.000000000', '2010-05-16T12:00:00.000000000', '2010-06-16T00:00:00.000000000', '2010-07-16T12:00:00.000000000', '2010-08-16T12:00:00.000000000', '2010-09-16T00:00:00.000000000', '2010-10-16T12:00:00.000000000', '2010-11-16T00:00:00.000000000', '2010-12-16T12:00:00.000000000'], dtype='datetime64[ns]')
- XC(tile, j, i)float32-111.6 -111.3 ... -105.6 -111.9
- long_name :
- longitude of tracer grid cell center
- units :
- degrees_east
- coordinate :
- YC XC
- bounds :
- XC_bnds
- comment :
- nonuniform grid spacing
- coverage_content_type :
- coordinate
- standard_name :
- longitude
array([[[-111.60647 , -111.303 , -110.94285 , ..., 64.791115, 64.80521 , 64.81917 ], [-104.8196 , -103.928444, -102.87706 , ..., 64.36745 , 64.41012 , 64.4524 ], [ -98.198784, -96.788055, -95.14185 , ..., 63.936497, 64.008224, 64.0793 ], ..., [ -37.5 , -36.5 , -35.5 , ..., 49.5 , 50.5 , 51.5 ], [ -37.5 , -36.5 , -35.5 , ..., 49.5 , 50.5 , 51.5 ], [ -37.5 , -36.5 , -35.5 , ..., 49.5 , 50.5 , 51.5 ]], [[ -37.5 , -36.5 , -35.5 , ..., 49.5 , 50.5 , 51.5 ], [ -37.5 , -36.5 , -35.5 , ..., 49.5 , 50.5 , 51.5 ], [ -37.5 , -36.5 , -35.5 , ..., 49.5 , 50.5 , 51.5 ], ... [ -40.5 , -40.5 , -40.5 , ..., -40.5 , -40.5 , -40.5 ], [ -39.5 , -39.5 , -39.5 , ..., -39.5 , -39.5 , -39.5 ], [ -38.5 , -38.5 , -38.5 , ..., -38.5 , -38.5 , -38.5 ]], [[-127.5 , -127.5 , -127.5 , ..., -115.850204, -115.50567 , -115.166985], [-126.5 , -126.5 , -126.5 , ..., -115.78025 , -115.464066, -115.153244], [-125.5 , -125.5 , -125.5 , ..., -115.71079 , -115.42275 , -115.139595], ..., [ -40.5 , -40.5 , -40.5 , ..., -101.42989 , -106.83081 , -112.28605 ], [ -39.5 , -39.5 , -39.5 , ..., -100.48844 , -106.24874 , -112.090065], [ -38.5 , -38.5 , -38.5 , ..., -99.42048 , -105.58465 , -111.86579 ]]], dtype=float32)
- YC(tile, j, i)float32-88.24 -88.38 ... -88.08 -88.1
- long_name :
- latitude of tracer grid cell center
- units :
- degrees_north
- coordinate :
- YC XC
- bounds :
- YC_bnds
- comment :
- nonuniform grid spacing
- coverage_content_type :
- coordinate
- standard_name :
- latitude
array([[[-88.24259 , -88.382515 , -88.52242 , ..., -80.57383 , -80.50492 , -80.43992 ], [-88.21676 , -88.354485 , -88.49178 , ..., -80.56925 , -80.50038 , -80.43542 ], [-88.164764 , -88.29827 , -88.43065 , ..., -80.55984 , -80.49105 , -80.42617 ], ..., [-58.321964 , -58.321964 , -58.321964 , ..., -58.321964 , -58.321964 , -58.321964 ], [-57.79962 , -57.79962 , -57.79962 , ..., -57.79962 , -57.79962 , -57.79962 ], [-57.271408 , -57.271408 , -57.271408 , ..., -57.271408 , -57.271408 , -57.271408 ]], [[-56.73891 , -56.73891 , -56.73891 , ..., -56.73891 , -56.73891 , -56.73891 ], [-56.2021 , -56.2021 , -56.2021 , ..., -56.2021 , -56.2021 , -56.2021 ], [-55.65936 , -55.65936 , -55.65936 , ..., -55.65936 , -55.65936 , -55.65936 ], ... [ 9.482398 , 8.516253 , 7.5699615, ..., -55.65936 , -56.2021 , -56.73891 ], [ 9.482398 , 8.516253 , 7.5699615, ..., -55.65936 , -56.2021 , -56.73891 ], [ 9.482398 , 8.516253 , 7.5699615, ..., -55.65936 , -56.2021 , -56.73891 ]], [[-57.271408 , -57.79962 , -58.321964 , ..., -80.36532 , -80.3745 , -80.37896 ], [-57.271408 , -57.79962 , -58.321964 , ..., -80.30862 , -80.317726 , -80.32216 ], [-57.271408 , -57.79962 , -58.321964 , ..., -80.25618 , -80.26523 , -80.26963 ], ..., [-57.271408 , -57.79962 , -58.321964 , ..., -87.7596 , -87.80198 , -87.822876 ], [-57.271408 , -57.79962 , -58.321964 , ..., -87.89526 , -87.94044 , -87.96276 ], [-57.271408 , -57.79962 , -58.321964 , ..., -88.030365 , -88.07871 , -88.10267 ]]], dtype=float32)
- long_name :
- Dynamic sea surface height anomaly
- units :
- m
- coverage_content_type :
- modelResult
- standard_name :
- sea_surface_height_above_geoid
- comment :
- Dynamic sea surface height anomaly above the geoid, suitable for comparisons with altimetry sea surface height data products that apply the inverse barometer (IB) correction. Note: SSH is calculated by correcting model sea level anomaly ETAN for three effects: a) global mean steric sea level changes related to density changes in the Boussinesq volume-conserving model (Greatbatch correction, see sterGloH), b) the inverted barometer (IB) effect (see SSHIBC) and c) sea level displacement due to sea-ice and snow pressure loading (see sIceLoad). SSH can be compared with the similarly-named SSH variable in previous ECCO products that did not include atmospheric pressure loading (e.g., Version 4 Release 3). Use SSHNOIBC for comparisons with altimetry data products that do NOT apply the IB correction.
- valid_min :
- -1.8805772066116333
- valid_max :
- 1.4207719564437866
Fancier demonstration¶
Dataset
and DataArray
objects are too complicated for numpy
’s copy
command. For complex objects such as these use the deepcopy
command.
[23]:
ssh_tmp = deepcopy(output_all.SSH)
ssh_tmp[:] = 10
ssh_tmp[4,1,5:9,5:9].values
[23]:
array([[10., 10., 10., 10.],
[10., 10., 10., 10.],
[10., 10., 10., 10.],
[10., 10., 10., 10.]], dtype=float32)
[24]:
output_all.SSH[4,1,5:9,5:9].values
[24]:
array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
[-1.3569907, -1.3593892, -1.3547356, -1.3458692],
[-1.3178127, -1.316035 , -1.3086342, -1.2973909],
[-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)
Using deepcopy
gives us an entirely new array in memory. Operations on ssh_tmp
do not affect the original fields that we found in the output_all_SSH
DataArray
.
alternative to deepcopy
¶
xarray
give us another way to deepcopy DataArrays
and Datasets
:
ssh_tmp = output_all.copy(deep=True)
Conclusion¶
You now know about the possible pitfalls for dealing with Python’s pass-by-reference way of handling assignment statements and different methods for making copies of numpy
arrays and Datasets
and DataArrays
.