Motivation

A one-dimensional array is a fundamental data structure in programming that represents a collection of elements stored in a linear sequence. Each element in the array is identified by an index, starting from 0 for the first element. This sounds a lot like a list (and it is).

Let's start by creating a simple list of int values.

In [1]:

data_list = [0, 1, 2, 3, 4, 5]
print(data_list)
print(type(data_list))

[0, 1, 2, 3, 4, 5]
<class 'list'>

We can create something very similar in NumPy called an array.

Note

Other programming languages use the term "array" or "vector" for the equivalent of a Python list. Whenever we are talking about Python, we will usually say "array" to mean specifically a NumPy array. If you are ever unsure, ask whether we mean a NumPy array.

Okay, back to creating our NumPy array.

In [2]:

import numpy as np

data_array = np.array([0, 1, 2, 3, 4, 5])
print(data_array)
print(type(data_array))

[0 1 2 3 4 5]
<class 'numpy.ndarray'>

data_array contains the same information (ints from 0 to 5), but with some major usability differences. First, notice that there are no commas between the elements when we print the array; this is purely cosmetic, and you can tell NumPy to print them. We also confirm that the data type is not a list but a numpy.ndarray.
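As a small sketch of the point about commas, `np.array2string` accepts a `separator` argument that controls what is printed between elements. The array is re-created here so the snippet runs on its own, and the name `with_commas` is only for illustration.

```python
import numpy as np

data_array = np.array([0, 1, 2, 3, 4, 5])

# Ask NumPy for a string representation with comma-separated elements
with_commas = np.array2string(data_array, separator=", ")
print(with_commas)
```

This only changes how the array is displayed; the data and its type are untouched.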

Let's do some numerical operations with our two data structures and see the differences. First, let's just add 2 to each element.

For a list, we need to create a new list, loop over each element, add 2, and append the result to list_added.

In [3]:

list_added = []
for num in data_list:
    list_added.append(num + 2)
print(f"List:  {list_added}")

List:  [2, 3, 4, 5, 6, 7]

Another way you could do this is with a list comprehension. This is essentially a shortcut for the loop in the cell above.

In [4]:

list_added = [i + 2 for i in data_list]
print(f"List:  {list_added}")

List:  [2, 3, 4, 5, 6, 7]

For a NumPy array, we do this.

In [5]:

array_added = data_array + 2
print(f"Array: {array_added}")

Array: [2 3 4 5 6 7]

Wow, that was easy. And indeed, NumPy is the de facto standard library for numerical operations in Python.
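One way to see why the distinction matters is that the same operators mean different things for the two types. The sketch below re-creates the list and array from earlier and relies only on built-in Python and NumPy behavior; the variable names are chosen for illustration.

```python
import numpy as np

data_list = [0, 1, 2, 3, 4, 5]
data_array = np.array(data_list)

# Multiplying a list repeats it; multiplying an array scales each element
list_doubled = data_list * 2
array_doubled = data_array * 2
print(f"List:  {list_doubled}")
print(f"Array: {array_doubled}")

# Adding a number to a list is not defined at all
try:
    data_list + 2
except TypeError as error:
    list_add_error = str(error)
    print(f"List + 2 raised TypeError: {list_add_error}")
```

With a list, arithmetic operators act on the container; with an array, they act on every element.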

What about a sum?

In [6]:

list_sum = sum(data_list)
print(f"List:  {list_sum}")

List:  15

Okay, not too bad.

In [7]:

array_sum = np.sum(data_array)
print(f"Array: {array_sum}")

Array: 15

There was not much of a difference there.

What about computing the mean?

In [8]:

list_mean = sum(data_list) / len(data_list)
print(f"List:  {list_mean}")

array_mean = np.mean(data_array)
print(f"Array: {array_mean}")

List:  2.5
Array: 2.5
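The same reductions are also available as methods on the array itself. A minimal sketch, re-creating `data_array` so it runs on its own (the names `method_sum` and `method_mean` are only for illustration):

```python
import numpy as np

data_array = np.array([0, 1, 2, 3, 4, 5])

# ndarray exposes sum() and mean() as methods, equivalent to np.sum / np.mean
method_sum = data_array.sum()
method_mean = data_array.mean()
print(f"Array sum:  {method_sum}")
print(f"Array mean: {method_mean}")
```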

Alright, that does not seem like too big of a difference. What gives?

NumPy accelerates mathematical routines compared to lists. For example, we can use timeit to measure how long small snippets of code take to run; timeit.timeit returns the total time for the requested number of executions. Let's see how much faster NumPy is for simply computing $x^{2}$.

First, we will create an array and list of random numbers from 0 to 1000.

In [9]:

random_array = np.random.uniform(low=0, high=1000, size=10000)
print(random_array)
random_list = random_array.tolist()

[840.02203123 390.43466132 943.92266283 ... 915.78585778 895.96633058
 646.51616033]

Now, let's time the calculation.

In [10]:

import timeit

# Time the operation of squaring elements for NumPy array
numpy_time = timeit.timeit(lambda: np.square(random_array), number=10000)

# Time the operation of squaring elements for Python list
list_time = timeit.timeit(lambda: [x**2 for x in random_list], number=10000)

print(f"NumPy array time: {numpy_time:.3f} s")
print(f"Python list time: {list_time:.3f} s")
print(f"\nNumPy array is {list_time/numpy_time:.2f} times faster than Python list!")

NumPy array time: 0.025 s
Python list time: 3.787 s

NumPy array is 151.52 times faster than Python list!
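The exact speedup depends on your machine, NumPy version, and whatever else is running, so expect different numbers when you try this yourself. The sketch below is one possible variation, not the measurement used above: it assumes a seeded generator (the seed value is arbitrary) so the data is reproducible, checks that both approaches give the same values, and uses `timeit.repeat` to keep the fastest of several runs.

```python
import timeit

import numpy as np

# Seeded generator so the data is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=0)
random_array = rng.uniform(low=0, high=1000, size=10000)
random_list = random_array.tolist()

# Both approaches should produce the same squared values
same_values = np.allclose(np.square(random_array), [x**2 for x in random_list])
print(f"Same results: {same_values}")

# Repeat each measurement a few times and keep the fastest run
numpy_best = min(timeit.repeat(lambda: np.square(random_array), number=100, repeat=3))
list_best = min(timeit.repeat(lambda: [x**2 for x in random_list], number=100, repeat=3))
print(f"NumPy array best of 3: {numpy_best:.4f} s per 100 runs")
print(f"Python list best of 3: {list_best:.4f} s per 100 runs")
```

Taking the minimum of several repeats reduces the influence of background activity on the measurement.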

List of lists

The real selling point for NumPy is the concept of multi-dimensional arrays, which are important in ~~computational biology~~ everything.

For our example, suppose we have a data set of different patient vital signs like pulse, temperature, and spo2. We could potentially have thousands of patients, but let's stick with only three.

While we have not covered it yet, you can actually store a list inside of a list!

In [11]:

patient_data_list = [[57, 99.0, 0.98], [68, 101.2, 0.92], [60, 98.3, 1.00]]
print("List")
print(patient_data_list)

List
[[57, 99.0, 0.98], [68, 101.2, 0.92], [60, 98.3, 1.0]]

In [12]:

patient_data_array = np.array([[57, 99.0, 0.98], [68, 101.2, 0.92], [60, 98.3, 1.00]])
print("Array")
print(patient_data_array)

Array
[[ 57.    99.     0.98]
 [ 68.   101.2    0.92]
 [ 60.    98.3    1.  ]]

We can also use patient_data_list to create the array!

In [13]:

patient_data_array = np.array(patient_data_list)
print("Array")
print(patient_data_array)

Array
[[ 57.    99.     0.98]
 [ 68.   101.2    0.92]
 [ 60.    98.3    1.  ]]
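Notice that the pulse values, which were ints in the list, print as floats in the array. A NumPy array stores a single data type for all of its elements, so a mix of ints and floats is converted to floats. The sketch below inspects a few standard array attributes to show this; it re-creates the data so it runs on its own.

```python
import numpy as np

patient_data_list = [[57, 99.0, 0.98], [68, 101.2, 0.92], [60, 98.3, 1.00]]
patient_data_array = np.array(patient_data_list)

print(patient_data_array.ndim)   # number of dimensions
print(patient_data_array.shape)  # (rows, columns) = (patients, vital signs)
print(patient_data_array.dtype)  # single data type shared by every element

# The list still stores the pulse as an int; the array stores it as a float
print(type(patient_data_list[0][0]))
print(type(patient_data_array[0, 0]))
```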

Okay, now let's compute the mean of each data category (i.e., column).

In [14]:

# First, we create a list to store our means.
n_patients = len(patient_data_list)  # total number of rows/patients we have
print(n_patients)
# We need one entry per column (vital sign), not per patient. The two counts
# are both 3 here, but only by coincidence.
n_vitals = len(patient_data_list[0])
patient_data_mean_list = [0.0] * n_vitals  # Creates list with zeros for each mean
print(patient_data_mean_list)

3
[0.0, 0.0, 0.0]

Now we have our collection where we can store intermediate values while computing the mean.

To compute the mean, we first need the sum of each column, that is, the sum of element i across every patient's list. Looping over each patient_data in patient_data_list gets us the data for each patient. Then, we can iterate over the pulse, temperature, and spo2 values and add each one to the matching entry of patient_data_mean_list.

After this, we need to normalize the sum by the total number of samples.

In [15]:

# Repeat this line to ensure we always start from zero in this cell.
# There is one entry per column (vital sign).
patient_data_mean_list = [0.0] * len(patient_data_list[0])

# Loops through each patient's data
for patient_data in patient_data_list:
    # Generates an index for each vital sign and adds this patient's value to
    # the running sum in patient_data_mean_list
    for i in range(len(patient_data)):
        patient_data_mean_list[i] = patient_data_mean_list[i] + patient_data[i]

# Generates an index to loop over every sum in patient_data_mean_list
for i in range(len(patient_data_mean_list)):
    # Divides each sum by the number of patients to get the mean
    patient_data_mean_list[i] = patient_data_mean_list[i] / n_patients

print(patient_data_mean_list)

[61.666666666666664, 99.5, 0.9666666666666667]
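For comparison, the same column means can be written more compactly in plain Python by regrouping the rows into columns with `zip(*...)`. This is a sketch of an alternative, not the approach used in the cell above, and the name `column_means` is made up for illustration. The last printed digits may differ slightly between Python versions because of floating-point summation details.

```python
patient_data_list = [[57, 99.0, 0.98], [68, 101.2, 0.92], [60, 98.3, 1.00]]

# zip(*rows) regroups the values column by column:
# (57, 68, 60), (99.0, 101.2, 98.3), (0.98, 0.92, 1.0)
column_means = [sum(column) / len(column) for column in zip(*patient_data_list)]
print(column_means)
```

Even in this shorter form, we still have to spell out the sum and the division ourselves.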

In [16]:

patient_data_mean_array = np.mean(
    patient_data_array, axis=0
)  # axis must equal zero here.
print(f"Array: {patient_data_mean_array}")

Array: [61.66666667 99.5         0.96666667]

That was much easier! But what was that axis parameter?

Great question! To answer it, however, we first need to explain some fundamental NumPy array concepts.