Motivation
A one-dimensional array is a fundamental data structure in programming that represents a collection of elements stored in a linear sequence. Each element in the array is identified by an index, starting from 0 for the first element. This sounds a lot like a list (and it is).
Let's start by creating a simple list of int values.
data_list = [0, 1, 2, 3, 4, 5]
print(data_list)
print(type(data_list))
[0, 1, 2, 3, 4, 5]
<class 'list'>
We can create something very similar in NumPy called an array.
Note
Other programming languages use the term "array" or "vector" for the data structure equivalent to a Python list. Whenever we are talking about Python, we will often say "array" to mean specifically a NumPy array. If you are ever unsure, ask whether we mean a NumPy array.
Okay, back to creating our NumPy array.
import numpy as np
data_array = np.array([0, 1, 2, 3, 4, 5])
print(data_array)
print(type(data_array))
[0 1 2 3 4 5]
<class 'numpy.ndarray'>
data_array contains the same information (ints from 0 to 5) with some major usability differences.
First, we notice that there are no commas between the elements when we print the array; this is mainly just for aesthetic purposes, and you could tell NumPy to print them.
We also check that, indeed, the data type is not a list, but a numpy.ndarray.
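As a minimal sketch of printing the commas, one option is np.array2string with its separator argument. The variable name with_commas is only illustrative.

```python
import numpy as np

data_array = np.array([0, 1, 2, 3, 4, 5])

# array2string formats an array as a string; separator sets the text
# placed between elements.
with_commas = np.array2string(data_array, separator=", ")
print(with_commas)
```

This changes only how the array is displayed; the data stored in data_array is untouched.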
Let's do some numerical operations with our two data structures and see the differences.
First, let's just add 2 to each element.
For a list, we need to create a new list and then loop over each element, add 2, and then append it to list_added.
list_added = []
for num in data_list:
    list_added.append(num + 2)
print(f"List: {list_added}")
List: [2, 3, 4, 5, 6, 7]
Another way you could do this is with a list comprehension. This is essentially a shorthand for the loop above.
list_added = [i + 2 for i in data_list]
print(f"List: {list_added}")
List: [2, 3, 4, 5, 6, 7]
For a NumPy array, we do this.
array_added = data_array + 2
print(f"Array: {array_added}")
Array: [2 3 4 5 6 7]
Wow, that was easy. And indeed, NumPy is the de facto standard library for numerical operations in Python.
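The same pattern works for other arithmetic operators. Here is a small sketch, reusing the data from above, that also shows what happens if we try the same expression on a list. The names array_doubled, array_squared, and list_add_failed are just for illustration.

```python
import numpy as np

data_list = [0, 1, 2, 3, 4, 5]
data_array = np.array(data_list)

# Arithmetic operators act on every element of an array.
array_doubled = data_array * 2
array_squared = data_array ** 2
print(array_doubled)
print(array_squared)

# The same expression on a list fails, because + between a list and an
# int is not defined.
try:
    data_list + 2
    list_add_failed = False
except TypeError:
    list_add_failed = True
print(list_add_failed)
```

Be careful with data_list * 2 as well: for a list it repeats the elements rather than doubling them.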
What about a sum?
list_sum = sum(data_list)
print(f"List: {list_sum}")
List: 15
Okay, not too bad.
array_sum = np.sum(data_array)
print(f"Array: {array_sum}")
Array: 15
There was not much of a difference there.
What about computing the mean?
list_mean = sum(data_list) / len(data_list)
print(f"List: {list_mean}")
array_mean = np.mean(data_array)
print(f"Array: {array_mean}")
List: 2.5
Array: 2.5
Alright, that does not seem like too big of a difference. What gives?
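The real difference is speed: NumPy runs mathematical routines in compiled code over the whole array at once, which is much faster than looping over a list in Python.
For example, we can use timeit to measure how long small snippets of code take to run.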
Let's see how much faster NumPy is for simply computing $x^{2}$.
First, we will create an array and list of random numbers from 0 to 1000.
random_array = np.random.uniform(low=0, high=1000, size=10000)
print(random_array)
random_list = random_array.tolist()
[840.02203123 390.43466132 943.92266283 ... 915.78585778 895.96633058 646.51616033]
Now, let's time the calculation.
import timeit
# Time the operation of squaring elements for NumPy array
numpy_time = timeit.timeit(lambda: np.square(random_array), number=10000)
# Time the operation of squaring elements for Python list
list_time = timeit.timeit(lambda: [x**2 for x in random_list], number=10000)
print(f"NumPy array time: {numpy_time:.3f} s")
print(f"Python list time: {list_time:.3f} s")
print(f"\nNumPy array is {list_time/numpy_time:.2f} times faster than Python list!")
NumPy array time: 0.025 s
Python list time: 3.787 s

NumPy array is 151.52 times faster than Python list!
List of lists
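Two follow-up checks can be useful here. The sketch below first confirms that both approaches produce the same squares, and then converts the total time into an average time per call. It assumes a seeded generator (the seed 0 and the n_runs value are arbitrary choices, not part of the example above), and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

# Seeded generator so the data are reproducible; the seed is arbitrary.
rng = np.random.default_rng(0)
random_array = rng.uniform(low=0, high=1000, size=10000)
random_list = random_array.tolist()

# Both approaches should give the same squares.
same_result = np.allclose(np.square(random_array), [x**2 for x in random_list])
print(same_result)

# timeit.timeit returns the total time for all runs, so we divide by the
# number of runs to get an average per call.
n_runs = 100
numpy_per_call = timeit.timeit(lambda: np.square(random_array), number=n_runs) / n_runs
list_per_call = timeit.timeit(lambda: [x**2 for x in random_list], number=n_runs) / n_runs
print(f"NumPy array: {numpy_per_call * 1e6:.1f} microseconds per call")
print(f"Python list: {list_per_call * 1e6:.1f} microseconds per call")
```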
The real selling point for NumPy is the concept of multi-dimensional arrays, which are important in computational biology and pretty much everything else.
For our example, suppose we have a data set of different patient vital signs like pulse, temperature, and spo2.
We could potentially have thousands, but let's stick with only three patients.
While we have not covered it yet, you can actually store a list inside of a list!
patient_data_list = [[57, 99.0, 0.98], [68, 101.2, 0.92], [60, 98.3, 1.00]]
print("List")
print(patient_data_list)
List
[[57, 99.0, 0.98], [68, 101.2, 0.92], [60, 98.3, 1.0]]
patient_data_array = np.array([[57, 99.0, 0.98], [68, 101.2, 0.92], [60, 98.3, 1.00]])
print("Array")
print(patient_data_array)
Array
[[ 57.    99.     0.98]
 [ 68.   101.2    0.92]
 [ 60.    98.3    1.  ]]
We can also use patient_data_list to create the array!
patient_data_array = np.array(patient_data_list)
print("Array")
print(patient_data_array)
Array
[[ 57.    99.     0.98]
 [ 68.   101.2    0.92]
 [ 60.    98.3    1.  ]]
Okay, now let's compute the mean of each data category (i.e., column).
# First, we create a list to store our means.
n_patients = len(patient_data_list)  # total number of rows/patients we have
print(n_patients)
# We need one zero per data category (column), not per patient; the two
# counts only happen to be equal in this example.
n_categories = len(patient_data_list[0])
patient_data_mean_list = [0.0] * n_categories
print(patient_data_mean_list)
3
[0.0, 0.0, 0.0]
Now we have our collection where we can store intermediate values while computing the mean.
To compute the mean, we first need to compute the sum of the values at each index i across all of the patient lists.
Looping over each patient_data in patient_data_list gets us the data for each patient.
Then, we can iterate over the pulse, temperature, and spo2 values and add each one to the matching entry in patient_data_mean_list.
After this, we need to divide each sum by the total number of patients.
# Repeat this step to ensure we always start from zero in this cell.
# We need one zero per data category (column).
patient_data_mean_list = [0.0] * len(patient_data_list[0])
# Loops through each patient's data
for patient_data in patient_data_list:
    # Generates an index for each value and adds that value to the current
    # sum in patient_data_mean_list
    for i in range(len(patient_data)):
        patient_data_mean_list[i] = patient_data_mean_list[i] + patient_data[i]
# Generates an index to loop over every sum in patient_data_mean_list
for i in range(len(patient_data_mean_list)):
    # Divides each sum by the number of patients to get the mean
    patient_data_mean_list[i] = patient_data_mean_list[i] / n_patients
print(patient_data_mean_list)
[61.666666666666664, 99.5, 0.9666666666666667]
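As an aside, the nested loops can be written more compactly in plain Python. This is only a sketch of one alternative, using zip to regroup the rows into columns; the name column_means is illustrative.

```python
patient_data_list = [[57, 99.0, 0.98], [68, 101.2, 0.92], [60, 98.3, 1.00]]

# zip(*rows) regroups the rows into columns: all pulses, then all
# temperatures, then all spo2 values.
column_means = [sum(column) / len(column) for column in zip(*patient_data_list)]
print(column_means)
```

It is shorter than the loops, but we still have to spell out the column-wise logic ourselves.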
patient_data_mean_array = np.mean(
    patient_data_array, axis=0
)  # axis must equal zero here.
print(f"Array: {patient_data_mean_array}")
Array: [61.66666667 99.5 0.96666667]
That was much easier!
But what was that axis parameter?
Great question! To answer it, we first need to explain some fundamental NumPy array concepts.