CHAPTER 2
Many of my colleagues, when they first started using Python, were surprised to learn that the language does not have a built-in array data structure. The Python list data structure is versatile enough to handle many programming scenarios, but for scientific and numeric programming, arrays and matrices are needed. The most fundamental object in the SciPy and NumPy libraries is the array data structure. The following screenshot shows you where this chapter is headed.

Figure 21: NumPy Array Demo
In section 2.1, you'll learn the most common ways to create and initialize NumPy arrays, and learn about the various numeric data types supported by NumPy.
In section 2.2, you'll learn how to search an array for a target value using the where() function, using the in keyword, and by using a program-defined function.
In section 2.3, you'll learn how to sort a NumPy array using the three different built-in sorting algorithms (quicksort, merge sort, and heap sort). You'll also learn about the NumPy array reference mechanism.
In section 2.4, you'll learn how to randomize an array using the NumPy shuffle() function and how to randomize an array using a program-defined function and the Fisher-Yates algorithm.
The most fundamental object in the NumPy library is the array data structure. A NumPy array is similar to arrays in other programming languages. Arrays have a fixed size and each cell must hold the same type. The NumPy library has several functions that allow you to create an array.
Take a look at the demo program in Code Listing 4. After two preliminary print statements, program execution begins by creating an array using hard-coded numeric values:
arr = np.array([1., 3., 5., 7., 9.])
dt = np.dtype(arr[0])
print "Cell element type is " + str(dt.name) # displays 'float64'
The NumPy array() function accepts a Python list (as indicated by the square brackets) and returns an array containing the list values. Notice the decimal points. These tell the interpreter to cast the cell values as float64, the default floating-point data type for arrays. Without the decimals, the interpreter would cast the values to int32, the default integer type for arrays.
Code Listing 4: Initializing Numeric Arrays
# arrays.py # Python 2.7 import numpy as np def my_print(arr, cols, dec): n = len(arr) # print array using indexing fmt = "%." + str(dec) + "f" # like %.4f for i in xrange(n): # alt: for x in arr if i > 0 and i % cols == 0: print "" print fmt % arr[i], print "\n" # ===== print "\nBegin array demo \n" print "Creating array arr using np.array() and list with hard-coded values " arr = np.array([1., 3., 5., 7., 9.]) # float64 dt = np.dtype(arr[0]) print "Cell element type is " + str(dt.name) print "" print "Printing array arr using built-in print() " print arr print "" print "Creating int array arr using np.arange(9) " arr = np.arange(9) # [0, 1, . . 8] # int32 print "Printing array arr using built-in print() " print arr print "" cols = 4; dec = 0 print "Printing array arr using my_print() with cols=" + str(cols), print "and dec=" + str(dec) my_print(arr, cols, dec) print "Creating array arr using np.zeros(5) " arr = np.zeros(5) print "Printing array arr using built-in print() " print arr print "" print "Creating array arr using np.linspace(2., 5., 6)" arr = np.linspace(2., 5., 6) # 6 values from [2.0 to 5.0] inc. print "Printing array arr using built-in print() " print arr print "" print "\nEnd demo \n" |
C:\SciPy\Ch2> python arrays.py Begin array demo Creating array arr using np.array() and list with hard-coded values Cell element type is float64 Printing array arr using built-in print() [ 1. 3. 5. 7. 9.] Creating int array arr using np.arange(9) Printing array arr using built-in print() [0 1 2 3 4 5 6 7 8] Printing array arr using my_print() with cols=4 and dec=0 0 1 2 3 4 5 6 7 8 Creating array arr using np.zeros(5) Printing array arr using built-in print() [ 0. 0. 0. 0. 0.] Creating array arr using np.linspace(2., 5., 6) Printing array arr using built-in print() [ 2. 2.6 3.2 3.8 4.4 5. ] End demo |
If you are creating an array and neither float64 nor int32 is appropriate, you can make the data type explicit. For example:
arr = np.array([2.0, 4.0, 6.0], dtype=np.float16)
NumPy has four floating-point data types: float_, float16, float32, and float64. The default floating-point type is float64: a signed value with an 11-bit exponent and a 52-bit mantissa. NumPy also supports complex numbers.
NumPy has 11 integer data types, including int32, int64, and uint64. The default integer data type is int32 (that is, a signed 32-bit integer with possible values between -2,147,483,648 and +2,147,483,647).
You can also create arrays with string values and with Boolean values. For example:
arr_b = np.array([True, False, True])
arr_s = np.array(["ant", "bat", "cow"])
After creating the array, the demo displays the array values using the built-in print statement:
print "Printing array arr using built-in print() "
print arr
The Python 2.7 print statement is simple and effective for displaying NumPy arrays in most situations. If you need to customize the output format, you can use the NumPy set_printoptions() function or write a program-defined display function.
Next, the demo program creates and initializes an array using the NumPy arange() function:
arr = np.arange(9)
print "Printing array arr using built-in print() "
print arr # displays [0 1 2 3 4 5 6 7 8]
A call to arange(n) returns an int32 array with sequential values 0, 1, 2,… (n-1). Note that the NumPy arange() function (the name stands for array-range, and is not a misspelling of the word arrange) is quite different from the Python range() function, which returns a list of integer values, and the Python xrange() function, which returns an iterator object that can be used to traverse a list or an array.
Next, the demo program displays the array generated by the arange() function, using a program-defined function named my_print():
cols = 4; dec = 0
print "Printing array arr using my_print() with cols=" + str(cols),
print "and dec=" + str(dec)
my_print(arr, cols, dec)
The custom function displays an array in a specified number of columns (4 here), using a specified number of decimal places (0 here because the values are integers).
If you are new to Python, you might be puzzled by the trailing comma character after the first print statement. This syntax is used to print without a newline character and is similar to the C# Console.Write() method (as opposed to the WriteLine() method) or the Java System.out.print() method (as opposed to the println() method).
Program-defined function my_print() is implemented as:
def my_print(arr, cols, dec):
n = len(arr)
fmt = "%." + str(dec) + "f" # like %.4f
for i in xrange(n):
if i > 0 and i % cols == 0:
print ""
print fmt % arr[i],
print "\n"
The function first finds the number of cells in the array using the Python len() function. An alternative is to use the more efficient NumPy size property:
n = arr.size
Note that size has no parentheses after it because it's a property, not a function. The my_print() function iterates through the array using traditional array indexing:
for i in xrange(n):
Using this technique, a cell value in array arr is accessed as arr[i]. An alternative is to iterate over the array like so:
for x in arr
Here, x is a cell value. This technique is similar to using a "for-each" loop in other languages such as C#. In most situations, I prefer using array indexing to "for-eaching" but most of my colleagues prefer the "for x in arr" syntax.
Next, the demo program creates an array using the NumPy zeros() function:
arr = np.zeros(5)
print "Printing array arr using built-in print() "
print arr
Based on my experience, using the zeros() function is perhaps the most common way to create a NumPy array. As the name suggests, a call to zeros(n) creates an array with n cells and initializes each cell to a 0.0 value. The default element value is float64, so if you want an integer array initialized to 0 values, you'd have to supply the dtype parameter value to zeros() like so:
arr = np.zeros(5, dtype=np.int32)
A closely related NumPy function is ones(), which initializes an array to all 1.0 (or integer 1) values.
The demo concludes by creating and initializing an array using the NumPy linspace() function:
arr = np.linspace(2., 5., 6)
print "Printing array arr using built-in print() "
print arr
A call to linspace(start, stop, num) returns an array that has num cells with values evenly spaced between start and stop, inclusive. The demo call np.linspace(2., 5., 6) returns an array of six float64 values starting with 2.0 and ending with 5.0 (2.0, 2.6, 3.2, 3.8, 4.4, and 5.0).
Note that almost all Python and NumPy functions that accept start and stop parameters return values in [start, stop), that is, between start inclusive and stop exclusive. The NumPy linspace() function is an exception.
There are many other NumPy functions that can create arrays, but the array() and zeros() functions can handle most programming scenarios. And you can always create specialized arrays using a program-defined function. For example, suppose you needed to create an array of the first n odd integers. You could define:
def my_odds(n):
result = np.zeros(n, dtype=np.int32)
v = 1
for i in xrange(n):
result[i] = v
v += 2
return result
And then you could create an array holding the first four odd integers with a call:
arr = my_odds(4)
A task that is closely related to creating NumPy arrays is copying an existing array. The NumPy copy() function can do this, and is described in detail later in this e-book.
Resources
For additional details on NumPy numeric data types, see
http://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html.
For additional information about NumPy numeric array initialization functions, see
http://docs.scipy.org/doc/numpy-1.10.1/user/basics.creation.html.
For additional details on NumPy array iteration, see
http://docs.scipy.org/doc/numpy-dev/reference/arrays.nditer.html#arrays-nditer
Searching a numeric array for some target value is a common task. There are three basic ways to search a NumPy array. You can use the NumPy where() function, you can use the Python in keyword, or you can use a program-defined search function. Interestingly, there is no NumPy index() function like those found in several programming languages, including C# and Java.
Code Listing 5: Array Searching Demo
# searching.py # Python 2.7 import numpy as np def my_search(a, x, eps): for i in xrange(len(a)): if (np.isclose(a[i], x, eps)): return i return -1 # ===== print "\nBegin array search demo \n" arr = np.array([7.0, 9.0, 5.0, 1.0, 5.0, 8.0]) print "Array arr is " print arr print "" target = 5.0 print "Target value is " print target print "" found = target in arr print "Search result using 'target in arr' syntax = " + str(found) print "" result = np.where(arr == target) print "Search result using np.where(arr == target) is " print result print "" idx = my_search(arr, target, 1.0e-6) print "Search result using my_search() = " print idx print "" print "\nEnd demo \n" |
C:\SciPy\Ch2> python searching.py Begin array search demo Array arr is [ 7. 9. 5. 1. 5. 8.] Target value is 5.0 Search result using 'target in arr' syntax = True Search result using np.where(arr == target) is (array([2, 4], dtype=int64),) Search result using my_search() = 2 End demo |
The demo program begins with creating an array and a target value to search for:
arr = np.array([7.0, 9.0, 5.0, 1.0, 5.0, 8.0])
print "Array arr is "
print arr
target = 5.0
print "Target value is "
print target
Next, the demo program searches the array for the target value using the Python in keyword:
found = target in arr
print "Search result using 'target in arr' syntax = " + str(found) # 'True'
The return result from a call to target in arr is Boolean, either True or False. Nice and simple. However, using this syntax for searching an array of floating-point values is not really a good idea. The problem is that comparing two floating-point values for exact equality is very tricky. For example:
>>> x = 0.15 + 0.15
>>> y = 0.20 + 0.10
>>> 'yes' if x == y else 'no'
'no'
>>> # what the heck?!
When comparing two floating-point values for equality, you should usually not compare for exact equality; instead, you should check if the two values are very, very close to each other.
Floating-point values stored in memory are sometimes just close approximations to their true values, so comparing two floating-point values for exact equality can give unexpected results. The target in arr syntax doesn't give you any direct way to control how the target value is compared to the values in the array. Note that this problem with checking for exact equality doesn't exist for integer arrays (or string arrays or Boolean arrays), so the target in arr syntax is fine for those.
The target in arr syntax does work properly in the demo program, returning a correct result of True. Next, the demo program searches using the NumPy where() function:
result = np.where(arr == target)
print "Search result using np.where(arr == target) is "
print result
The somewhat tricky return result is:
(array([2, 4], dtype=int64,)
The where() function returns a tuple (as indicated by the parentheses) containing an array. The array holds the indices in the searched array where the target value occurs, cells 2 and 4 in this case. If you search for a target value that is not in the array, the return result is a tuple with an array with length 0:
(array([], dtype=int64),)
Therefore, if you just want to know if a target value is in an array, you can check the return value along the line of:
if len(result[0]) == 0:
print "target not found in array"
else:
print "target is in array"
As is the case with searching using the in keyword, searching an array of floating-point values using the where() function is not recommended because you cannot control how the cell values are compared to the target value. But using the where() function with integer, string, and Boolean arrays is safe and effective.
Next, the demo searches the array using a program-defined function:
idx = my_search(arr, target, 1.0e-6)
print "Search result using my_search() = "
print idx
The program-defined my_search() function returns -1 if the target value is not found in the array, or the cell index of the first occurrence of the target if the target is in the array. In this case the return value is 2 because the target value, 5.0, is in cells [2] and [4] of the array. The third argument, 1.0e-6, is the tolerance defining how close two floating-point values must be in order to be considered equal.
Function my_search() is defined:
def my_search(a, x, eps):
for i in xrange(len(a)):
if np.isclose(a[i], x, eps):
return i
return -1
The NumPy isclose() function compares two values and returns True if the values are within eps (this stands for epsilon, the Greek letter often used in mathematics to represent a small value) of each other.
Instead of using the isclose() function, you can compare directly using either the Python built-in abs() function or the NumPy fabs() function like so:
if abs(a[i] - x) < eps:
return i
if np.fabs(a[i] - x) < eps:
return i
To summarize, to search an array of floating-point values, I recommend using a program-defined function, which allows you to control the comparison of two values for equality. For integer, string, and Boolean arrays, you can use the in keyword, the where() function, or a program-defined function.
In some situations, you may want to find the location of the last occurrence of a target value in an array. Using the where() function with integer, string, or Boolean arrays, you could write code like:
result = np.where(arr == target)
if len(result[0]) == 0:
print "-1" # not found
else:
print result[0][len(result[0])-1] # last idx
To find the last occurrence of a target value in a program-defined function, you could traverse the array from back to front with a for i in xrange(len(a)-1, -1, -1): loop.
Resources
For additional information about the NumPy where() function, see
http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.where.html.
For technical details about how NumPy stores arrays in memory, see
http://docs.scipy.org/doc/numpy-1.10.0/reference/internals.html.
For a list of Python built-in functions such as the absolute value, see
https://docs.python.org/2/library/functions.html.
Putting values in an array in order is a common and fundamental programming task. The NumPy library has a sort() function that implements three different sorting algorithms: the quicksort algorithm, the mergesort algorithm, and the heapsort algorithm.
Interestingly, unlike the sorting functions in most other languages, a call to sort(arr) returns a sorted array, leaving the original array arr unchanged. The sort functions in many programming languages sort their array argument in place, and do not return a new sorted array. However, you can sort a NumPy array arr in place if you wish by using the call arr.sort().
Code Listing 6: Array Sorting Demo
# sorting.py # Python 2.7 import numpy as np import time def my_qsort(a): quick_sorter(a, 0, len(a)-1) def quick_sorter(a, lo, hi): if lo < hi: p = partition(a, lo, hi) quick_sorter(a, lo, p-1) quick_sorter(a, p+1, hi) def partition(a, lo, hi): piv = a[hi] i = lo for j in xrange(lo, hi): if a[j] <= piv: a[i], a[j] = a[j], a[i] i += 1 a[i], a[hi] = a[hi], a[i] return i # ===== print "\nBegin array sorting demo \n" arr = np.array([4.0, 3.0, 0.0, 2.0, 1.0, 9.0, 7.0, 6.0, 5.0]) print "Original array is " print arr print "" s_arr = np.sort(arr, kind='quicksort') print "Return result of sorting using np.sort(arr, 'quicksort') is " print s_arr print "" print "Original array after calling np.sort() is " print arr print "" print "Calling my_qsort(arr) " start_time = time.clock() # record starting time my_qsort(arr) end_time = time.clock() elapsed_time = end_time - start_time print "Elapsed time = " print str(elapsed_time) + " seconds" print "" print "Original array after calling my_qsort(arr) is " print arr print "" print "\nEnd demo \n" |
C:\SciPy\Ch2> python sorting.py Begin array sorting demo Original array is [ 4. 3. 0. 2. 1. 9. 7. 6. 5.] Return result of sorting using np.sort(arr, 'quicksort') is [ 0. 1. 2. 3. 4. 5. 6. 7. 9.] Original array after calling np.sort() is [ 4. 3. 0. 2. 1. 9. 7. 6. 5.] Calling my_qsort(arr) Elapsed time = 3.6342481559e-05 seconds Original array after calling my_qsort(arr) is [ 0. 1. 2. 3. 4. 5. 6. 7. 9.] End demo |
The demo program begins by creating an array:
arr = np.array([4.0, 3.0, 0.0, 2.0, 1.0, 9.0, 7.0, 6.0, 5.0])
print "Original array is "
print arr
Next, the demo program sorts the array using the NumPy sort() function:
s_arr = np.sort(arr, kind='quicksort')
The sort() function returns a new array with values in order, leaving the original array unchanged:
print "Return result of sorting using np.sort(arr, 'quicksort') is "
print s_arr # in order
print "Original array after calling np.sort() is "
print arr # unchanged
It is possible to sort an array in place using either a slightly different syntax or calling pattern:
arr.sort(kind='quicksort')
print arr # arr is sorted
arr = np.sort(arr, kind='quicksort')
print arr # arr is sorted
The quicksort algorithm is the default, so the call to sort() could have been written:
s_arr = np.sort(arr)
Supplying an explicit kind argument to sort() is useful as a form of documentation, especially in situations where other people will be using your code.
The other two sorting algorithms could have been called like so:
s_arr = np.sort(arr, kind='mergesort')
s_arr = np.sort(arr, kind='heapsort')
arr.sort(kind='mergesort')
arr.sort(kind='heapsort')
By default, the sort() function orders array elements from low value to high value. If you want to sort an array from high value to low, you can't do so directly, but you can use Python slicing syntax to reverse after sorting (there is no explicit reverse() function for arrays):
arr = np.array([4.0, 8.0, 6.0, 5.0])
s_arr = np.sort(arr, kind='quicksort') # s_arr = [4.0 5.0 6.0 8.0]
r_arr = s_arr[::-1] # r_arr = [8.0 6.0 5.0 4.0]
arr = np.array([4.0, 8.0, 6.0, 5.0])
orig = np.copy(arr) # make a copy of original
arr[::-1].sort(kind='quicksort') # reverse sort arr in-place
r_arr = np.copy(arr) # copy reversed to r_arr if you wish
arr = np.copy(orig) # restore arr to original if you wish
Note that the sort() function has an optional order parameter. However, this parameter controls the order in which fields are compared when an array has cells holding an object with multiple fields. So order does not control ascending versus descending sort behavior.
The demo program concludes by sorting an array using a program-defined my_qsort() function:
start_time = time.clock()
my_qsort(arr)
end_time = time.clock()
elapsed_time = end_time - start_time
print "Elapsed time = "
print str(elapsed_time) + " seconds"
print "Original array after callng my_qsort(arr) is "
print arr # arr is sorted
The program-defined my_qsort() function sorts its array argument in place. The demo measures the approximate amount of time used by my_qsort() by wrapping its call with time.clock() function calls. Notice the demo program has an import time statement at the top of the source code to bring the clock() function into scope.
The whole point of using a library like NumPy is that you can use built-in functions like sort() and so you don't have to write program-defined functions. However, there are some scenarios where writing a custom version of a NumPy function is useful. In particular, you can customize the behavior of a program-defined function, usually at the expense of extra time (to write the function) and performance.
The heart of the quicksort algorithm is the partition() function. A detailed explanation of how quicksort and partitioning work is outside the scope of this e-book, but the behavior of any quicksort implementation depends on how the so-called pivot value is selected. The key line of code in the custom partition() function is:
piv = a[hi]
The pivot value is selected as the last cell value in the current sub-array being processed. Alternatives are to select the first cell value (piv = a[lo]), the middle cell value, or a randomly selected cell value between a[lo] and a[hi].
Resources
For additional information about the NumPy sort() function, see
http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.sort.html.
The program-defined quicksort function in this section is based on the Wikipedia article at
https://en.wikipedia.org/wiki/Quicksort.
For additional information about working with the Python time module, see
https://docs.python.org/2/library/time.html.
For additional information about NumPy array slicing, see
http://docs.scipy.org/doc/numpy-1.10.0/reference/arrays.indexing.html.
A surprisingly common task in data science programming is shuffling an array. Shuffling is rearranging the cell values in an array into a random order. You can think of shuffling as somewhat the opposite of sorting. You can shuffle an array using the built-in NumPy random.shuffle() function or by writing a program-defined shuffle function.
Code Listing 7: Array Shuffling Demo
# shuffling.py # Python 2.7 import numpy as np def my_shuffle(a, seed): # shuffle array a in place using Fisher-Yates algorithm np.random.seed(seed) n = len(a) for i in xrange(n): r = np.random.randint(i, n) a[r], a[i] = a[i], a[r] return # ===== arr = np.arange(10, dtype=np.int64) # [0, 1, 2, .. 9] orig = np.copy(arr) print "Array arr is " print arr print "" np.random.shuffle(arr) print "Array arr after a call to np.random.shuffle(arr) is " print arr print "" print "Resetting array arr" arr = np.copy(orig) print "Array arr is " print arr print "" my_shuffle(arr, seed=0) print "Array arr after call to my_shuffle(arr, seed=0) is " print arr print ""
print "\nEnd demo \n" |
C:\SciPy\Ch2> python shuffling.py Array arr is [0 1 2 3 4 5 6 7 8 9] Array arr after a call to np.random.shuffle(arr) is [1 9 8 3 0 2 7 5 6 4] Resetting array arr Array arr is [0 1 2 3 4 5 6 7 8 9] Array arr after call to my_shuffle(arr, seed=0) is [5 1 0 6 7 3 9 8 4 2] End demo |
The demo program begins by creating an ordered integer array with 10 values (0 through 9) using the arange() function, and makes a copy of that array using the copy() function:
arr = np.arange(10, dtype=np.int64)
orig = np.copy(arr)
print "Array arr is "
print arr
Next, the demo shuffles the contents of the array using the NumPy random.shuffle() function:
np.random.shuffle(arr)
print "Array arr after a call to np.random.shuffle(arr) is "
print arr
The random.shuffle() function reorders the contents of its array argument in place to a random order. In this example, the seed value for the underlying random number generator was not set, so if you ran the program again, you'd almost certainly get a different ordering of the array. If you want to make your program results reproducible, which is usually the case, you can explicitly set the underlying seed value like so:
np.random.seed(0)
np.random.shuffle(arr)
Here the seed was arbitrarily set to 0. Next, the demo program resets the array to its original values using the copy:
print "Resetting array arr"
arr = np.copy(orig)
print "Array arr is "
print arr
It would have been a mistake to use the assignment operator instead of the copy() function in an effort to make a copy of the original array. For example, suppose you had written this code:
arr = np.arange(10, dtype=np.int64)
orig = arr # assignment is probably not what you intended
print "Array arr is "
print arr
Because array assignment works by reference rather than by value, orig and arr are essentially pointers that both point to the same array in memory. Any change made to arr, such as a call to random.shuffle(arr), implicitly affects orig, too. Therefore, an attempt to reset arr after a call to random.shuffle() would have no effect.
Another important consequence of NumPy arrays being reference objects is that a function with an array parameter can modify the array in place. You can also create a reference to an array using the view() function, for example arr_v = arr.view() creates a reference copy of arr.
The demo program concludes by using a program-defined function my_shuffle() to shuffle the array:
my_shuffle(arr, seed=0)
print "Array arr after call to my_shuffle(arr, seed=0) is "
print arr
Function my_shuffle() is defined as:
def my_shuffle(a, seed):
np.random.seed(seed)
n = len(a)
for i in xrange(n):
r = np.random.randint(i, n)
a[r], a[i] = a[i], a[r]
return
Shuffling an array into a random order is surprisingly tricky and it's very easy to write faulty code. The function my_shuffle() uses what is called the Fisher-Yates algorithm, which is the best approach in most situations. Notice the function uses the very handy a,b = b,a Python idiom to swap two values. An alternative is to use the standard tmp=a; a=b; b=tmp idiom that's required in other programming languages.
Resources
For additional details about the NumPy random.shuffle() function, see
http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.shuffle.html.
For additional information about setting the random seed, see
http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.seed.html.
For a really interesting explanation of the Fisher-Yates shuffle algorithm, see
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle.