Data Structures, Algorithms, and Libraries¶

# if you're using colab, then install the required modules
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install algorithms

Python comes with a standard library.

This includes:

Built-in functions (e.g., len)
Data types/structures
- Numerics (e.g., int, float)
- Sequences (e.g., list, str)
- Mappings (e.g., dict)
Modules
- Numeric and Mathematical (e.g., math)

And lots more.

They are optimised in C (statically typed and compiled). Hence, they’re often faster than reimplementing them yourself.

They provide standardized solutions for many problems that occur in everyday programming.

Built-in functions ¶

`len`¶

Return the length (the number of items) of an object. The argument may be a sequence (e.g., list) or a collection (e.g., dictionary).

Tip

Use len rather than counting the items in an object in a loop.

nums = [num for num in range(1_000_000)]

%%timeit
count = 0
for num in nums: # time O(n)
    count += 1

33.4 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
len(nums) # time O(1)

57 ns ± 0.0741 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Data types/structures ¶

Lists ¶

A sequence of objects.

Tip

Append to lists, rather than concatenating.

Lists are allocated twice the memory required, so appending fills this up in O(1) (long-term average), while concatenating creates a new list each time in O(n).

%%timeit
my_list = []
for num in range(1_000):
    my_list += [num] # time O(n)

59.9 µs ± 23 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
my_list = []
for num in range(1_000):
    my_list.append(num) # time O(1)

47.8 µs ± 23 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Dictionaries ¶

A set of key: value pairs, where the keys are unique and immutable indicies.

Tip

Use dictionaries as look-ups, as they’re fast to search, O(1).

Example from Luciano Ramalho, Fluent Python, Clear, Concise, and Effective Programming, 2015. O’Reilly Media, Inc.

import numpy as np

haystack_list = np.random.uniform(low=0, high=100, size=(1_000_000))
haystack_dict = {key: value for key, value in enumerate(haystack_list)}
needles = [0.1, 50.1, 99.1]

%%timeit
needles_found = 0
for needle in needles:
    if needle in haystack_list: # time O(n) within list
        needles_found += 1

1.24 ms ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
needles_found = 0
for needle in needles:
    if needle in haystack_dict: # time O(1) within dict
        needles_found += 1

206 ns ± 1.24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Tip

Reduce repeated calculations with caching.

For example, use caching with the Fibonacci sequence (each number is the sum of the two preceding ones starting from 0 and 1 e.g. 0, 1, 1, 2, 3, 5, 8, 13, 21, 34).

def fibonacci(n): # time O(2^n) as 2 calls to the function n times (a balanced tree of repeated calls)
    if n == 0 or n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)

%timeit fibonacci(20)

1.57 ms ± 873 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

def fibonacci_with_caching(n, cache={0: 0, 1: 0, 2: 1}): # time O(n) as 1 call per n
    if n in cache:
        return cache[n]
    else:
        cache[n] = fibonacci_with_caching(n - 1, cache) + fibonacci_with_caching(n - 2, cache)
        return cache[n]

%timeit fibonacci_with_caching(20, cache={0: 0, 1: 0, 2: 1})

5.82 µs ± 13.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Question

Which of the following uses less memory and how can you check?

np.float16
np.float32
np.float64

Hint

Use the nbytes attribute.

Solution

Single precision floats have 32 bits (i.e., np.single or np.float32).

This is equal to 4 bytes, as 1 byte is 8 bits.

You can check this using:

np.float32(1.0).nbytes
4

Half precision floats have (you guessed it) half of this (i.e., np.half or np.float16):

np.float16(1.0).nbytes
2

And double precision floats (i.e., np.double or np.float64):

np.float64(1.0).nbytes
8

In some situations, you may be able to significantly reduce your memory usage by using half or single precision arrays.

Modules¶

`math`¶

This module provides access to the mathematical functions.

So, you could create your own function to calculate the hypotenuse of a triangle:

def hypotenuse(x, y):
    x = abs(x)
    y = abs(y)
    t = min(x, y)
    x = max(x, y)
    t = t / x
    return x * (1 + t*t) ** 0.5

%timeit hypotenuse(3.0, 4.0)

469 ns ± 4.47 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Though, math already has this implemented and optimised:

import math

%timeit math.hypot(3.0, 4.0)

86.8 ns ± 0.0594 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Algorithms¶

An algortihm is the instructions (/ recipe) to solve the problem.

Many existing libraries are already optimised (computationally and algorithmically).

For example, the algorithm library has minimal examples of data structures and algorithms in Python e.g., breadth first search, depth first search, linked lists, etc.

Sorting ¶

unsorted_array = np.random.rand(1_000)

Selection sort ¶

Time O(n^2), space O(1)

Have two arrays: one unsorted (original) and one sorted (can do in place to avoid extra memory).
Find the smallest number in the unsorted array and add it to the sorted array.
Repeat step 2 until the final index is the largest number (i.e. sorted array).

from algorithms.sort import selection_sort

%timeit selection_sort(unsorted_array)

101 ms ± 84.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Merge sort ¶

Time O(nlgn), space O(n or nlgn, depending on implementation)

Divide the array in half.
Then recusively apply:
a. Step 1 to both halves, until hit the base case where both halves are length 1.
b. Merge the two length 1 arrays into a sorted array of length 2.
c. Repeat b, all the way up for this half.

from algorithms.sort import merge_sort

%timeit merge_sort(unsorted_array)

4.38 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timsort ¶

Time O(nlgn = worst case, n = best case), space O(n)

Timsort is the default implementation of sorting in Python.
It combines merge sort with insertion sort (where each element is inserted into a new sorted list).
It takes advantage of runs of consecutive ordered elements to reduce comparisons (relative to merge sort).
It merges when runs match a specified criterion.
The runs have a minimum size (attained by insertion sort, if needed).

sorted(my_iterable)

Creates a new sorted list.
Works for any iterable.

%timeit sorted(unsorted_array)

63.5 µs ± 61.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

my_list.sort()

In-place (only for lists).

%timeit unsorted_array.sort()

6.18 µs ± 3.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Exercises¶

Exercise 1

What data structure would be suitable for finding or removing duplicate values?

a. List
b. Dictionary
c. Queue
d. Set

Test out your answer on the following array:

array = np.random.choice(100, 80)

Are there any other ways of doing it?

Solution

A suitable data structure for finding or removing duplicate values is:

d. Set

This is because sets are unordered collections of distinct hashable objects.

You can call it via set(array).

Another approach would be to use the NumPy features. For example, when creating a random array, you can specific whether the random choice gets replaced or not via:

array = np.random.choice(100, 80, replace=False)

Exercise 2

In the exercise from the profiling lesson, we saw an example of two_sum i.e., finding two numbers from an array of unique integers that add up to a target sum.

What would be a good approach for generalising this sum of two numbers to three, four, or n numbers?

Solution

Many algorithms are already implemented for you.

For example, the algorithm library has an n_sum method.

This generalises the function to any value of n.

For example, here is a “six_sum”:

from algorithms.arrays import n_sum
array = np.random.choice(100, 80, replace=False)
target = 23
n_sum(6, array, target)
[[0, 1, 3, 4, 6, 9],
[0, 1, 3, 4, 7, 8],
[0, 1, 3, 5, 6, 8],
[0, 1, 4, 5, 6, 7]]

Try it for yourself or any one of the other many examples.

Key Points¶

Important

Make use of the built-in functions e.g, use len rather than counting the items in an object in a loop.
Use appropriate data structures e.g., append to lists rather than concatenating, use dictionaries as fast to search look-ups, cache results in dictionaries to reduce repeated calculations.
Make use of the standard library (optimised in C) e.g., the math module.
See whether there is an algorithm or library that already optimally solves your problem e.g., faster sorting algorithms.

Further information¶

Other options¶

Generators save memory by yielding only the next iteration.
For NetCDFs, using engine='h5netcdf' with xarray can be faster, over than the default engine='netcdf4'.
Compression
- If arrays are mostly 0, then can save memory by using sparse arrays (example using them with Dask and xarray).
- Try lz4, snappy, and Z-Standard over gzip and bz2 for performance.
Chunking
- If need all data, then can load/process in chunks to reduce amount in memory: Zarr for arrays, Pandas.
Indexing
- If need a subset of the data, then can index (multi-index) to reduce memory and increase speed for queries: Pandas, SQLite.
Suitable data types for parallel computing
- Parquet creates efficient tabular data (e.g. dataframes), useful for parallel computing.
  - Accessed via pyarrow or fastparquet.
- Zarr creates compressed, chunked, N-dimensional arrays, designed for use in parallel computing.

Resources¶

MIT 6.006 ‘Introduction to Algorithms’, 2011: video lectures
LeetCode, AlgoExpert

SWD6: High Performance Python

Data Structures, Algorithms, and Libraries

Contents

Data Structures, Algorithms, and Libraries¶

Built-in functions ¶

`len`¶

Data types/structures ¶

Lists ¶

Dictionaries ¶

Modules¶

`math`¶

Algorithms¶

Sorting ¶

Selection sort ¶

Merge sort ¶

Timsort ¶

Exercises¶

Key Points¶

Further information¶

Other options¶

Resources¶