CPython is the main Python distribution.
(Not to be confused with Cython, which we’ll touch on later).
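CPython itself is built with an Ahead-Of-Time (AOT) compiler, i.e., its code is compiled in advance.
(A compiler translates program source code into machine-readable instructions.)
It ships as an interpreter together with an assortment of statically compiled C extensions; your own Python code is compiled to bytecode and then interpreted at run time.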
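As a small illustration of what "compiling" means for your own code: CPython turns each function into bytecode, which the interpreter then executes. The sketch below uses the standard-library dis module to list that bytecode. The function add_one is hypothetical and only there for illustration, and the exact instruction names vary between Python versions.

```python
import dis


def add_one(value):
    """Hypothetical example function, used only to inspect its bytecode."""
    return value + 1


# The bytecode was produced when the `def` statement ran; here we only read it.
instructions = [instruction.opname for instruction in dis.get_instructions(add_one)]
raw_bytecode = add_one.__code__.co_code

print(instructions)
print(len(raw_bytecode), "bytes of bytecode")
```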
CPython is a general purpose interpreter, allowing it to work on a variety of problems.
It is dynamically typed, so types can change as you go.
# assign x to an integer
x = 5
print(x)

# then assign x to a string
x = 'Gary'
print(x)
Numba uses a JIT (Just-In-Time) compiler on functions, i.e., it compiles the function at execution time.
It uses LLVM to convert the function to fast machine code.
Numba works with the default CPython.
It works by adding decorators around functions.
Numba is helpful when you want to speed up numerical operations in specific functions.
There are two main modes:
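Object mode
Works by adding the @jit decorator around the function.
This then compiles code that handles all values as Python objects and uses CPython to work on those objects.
In Numba releases before 0.59, @jit first tries to use nopython mode (covered next), and if that fails it falls back to object mode. From 0.59 onwards, @jit defaults to nopython mode and object mode has to be requested explicitly with forceobj=True.
The main speed-up over plain CPython comes from loops.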
nopython mode
Works by adding the @jit(nopython=True) decorator (aliased as @njit) around the function.
This then compiles code that does not access CPython.
This has higher performance than object mode.
nopython mode requires specific types (mainly numbers); otherwise it raises a TypingError.
import numpy as np
from numba import njit
First, let's profile an example numerical function without Numba:
nums = np.arange(1_000_000)
def slow_function(nums):
    trace = 0.0
    for num in nums:  # loop
        trace += np.cos(num)  # numpy
    return nums + trace  # broadcasting
1.37 s ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now, let's add the Numba njit decorator to the same function:
@njit
def fast_function(nums):
    trace = 0.0
    for num in nums:  # loop
        trace += np.cos(num)  # numpy
    return nums + trace  # broadcasting
The first call to the Numba function has an overhead to compile the function.
%%timeit -n 1 -r 1
# -n 1 means execute the statement once, -r 1 means for one repetition
fast_function(nums)
384 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Then, all subsequent calls use this compiled version, which is much faster.
%%timeit -n 1 -r 1
fast_function(nums)
21.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
For the function below (fast_add):
@njit
def fast_add(x, y):
    return x + y
What will happen when it is called with an integer and a tuple?
A TypingError is raised.
This is because Numba is trying to compile a function that adds an integer to a tuple.
Take care with types.
Ensure that the function works first, before adding Numba to make it faster.
The signature of the Numba function can limit it to specific input and output types, among other things.
Specifying the types up front saves Numba the time of inferring them, and is also useful when we use GPUs later.
These are added as arguments to the Numba decorator.
from numba import int32, float32
Here, the output type is wrapped around the input types.
@njit(float32(int32, int32))
def fast_add(x, y):
    return x + y
Numba also simplifies the creation of a NumPy ufunc using the @vectorize decorator.
They can be targeted to different hardware in the signature.
The default target is for a single CPU case (which has the least overhead).
This is suitable for smaller data sizes (<1 KB) and low compute intensities.
from numba import vectorize
Don’t worry about what this function does, just focus on the vectorisation bit.
You’ll notice that this is the same example as in the previous lesson on vectorisation, except that we’re now adding Numba’s @vectorize decorator.
import math

SQRT_2PI = np.float32((2.0 * math.pi)**0.5)
x = np.random.uniform(-3.0, 3.0, size=1_000_000)
mean = 0.0
sigma = 1.0

@vectorize  # I'm new
def my_function(x, mean, sigma):
    '''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''
    return math.exp(-0.5 * ((x - mean) / sigma)**2.0) / (sigma * SQRT_2PI)
So, the first call to the function compiles it:
%%timeit -n 1 -r 1
my_function(x, 0.0, 1.0)
73.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Then, all subsequent calls use the fast compiled version:
%%timeit -n 1 -r 1
my_function(x, 0.0, 1.0)
11 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
During our last lesson on vectorisation we also touched on generalised ufuncs (gufuncs).
These extend vectorize to work on many input elements.
Numba has a nice implementation of these using guvectorize.
The signature also requires the types to be specified first in a list.
[(int64[:], int64, int64[:])] means an n-element one-dimensional array of int64, a scalar of int64, and another n-element one-dimensional array of int64.
Then the signature includes the input(s) and output(s) dimensions in symbolic form.
'(n),()->(n)' means input an n-element one-dimensional array ((n)) and a scalar (()), and output an n-element one-dimensional array ((n)).
from numba import guvectorize, int64
@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, result):
    # x.shape is a tuple, so take its first entry for the loop length
    for index in range(x.shape[0]):
        result[index] = x[index] + y
First, let's try the gufunc with a 1D array and an integer:
x = np.arange(5)
x
array([0, 1, 2, 3, 4])
g(x, 5)
array([5, 6, 7, 8, 9])
Okay. So, now how about a 2D array and an integer:
x = np.arange(6).reshape(2, 3)
x
array([[0, 1, 2],
       [3, 4, 5]])
g(x, 10)
array([[10, 11, 12],
       [13, 14, 15]])
And, what about a 2D array and a 1D array:
g(x, np.array([10, 20]))
array([[10, 11, 12], [23, 24, 25]])
The next lesson covers parallelisation in detail. However, before that, let’s touch on a nice feature within Numba.
Numba can target different hardware in the signature.
Just now, we saw a Numba function for a single CPU, which is suitable for small data sizes.
The next target is for a multi-core CPU.
This has small additional overheads for threading.
This is suitable for medium data sizes (1 KB - 1 MB).
If code contains operations that are parallelisable (and supported) Numba can compile a version that will run in parallel on multiple threads.
This parallelisation is performed automatically and is enabled by simply adding the keyword argument parallel=True.
For example, let’s first use the function in serial (i.e., with parallel=False, which is also the default):
x = np.arange(1.e7, dtype=np.int64)
@njit
def my_serial_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2
280 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Okay, so let’s now change that to run in parallel:
@njit(parallel=True)
def my_parallel_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2
The timing of this parallel function depends on how many CPU cores your machine has and how busy they are.
143 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
What is the default Python distribution?
Which Numba compilation mode has higher performance?
How do I compile a function in Numba using nopython mode?
Wrap the function in the @njit decorator.
What is the keyword argument that enables Numba compiled functions to run over multiple CPUs?
Create your own Numba vectorised function that calculates the cube root of an array over all elements.
Have a look at the similar exercise from the Vectorisation lesson.
Using the Numba @vectorize decorator and the math module:
from numba import vectorize # I'm the only change from the NumPy ufunc exercise
return math.pow(array, 1/3)
27.7 ms ± 643 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba is a nice way to get performant numerical code.
More information and considerations
Factor out the performance-critical part of the code for compilation in Numba.
Consider what data precision is required i.e., is 64-bit needed?
Numba can also target CUDA GPUs, which we’ll cover in the final lesson.
PyPy also uses a JIT compiler (though it is written in Python).
PyPy enables optimisations at run time, especially for numerical tasks with repetition and loops.
It completely replaces CPython.
Caution: it may not be compatible with the libraries you use.
It is generally fast, though there are overheads for start-up and memory.
PyPy is helpful when you want to speed up numerical operations in all of the code.
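Because PyPy replaces the interpreter rather than being imported like Numba, it can be useful to check which implementation a script is actually running on. A minimal sketch using only the standard library:

```python
import platform
import sys

# Name of the running implementation, e.g. "CPython" or "PyPy".
implementation = platform.python_implementation()
running_on_pypy = implementation == "PyPy"

print("implementation:", implementation)
print("version:", sys.version.split()[0])
print("running on PyPy:", running_on_pypy)
```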