Compilers¶
CPython¶
CPython is the main Python distribution.
(Not to be confused with Cython, which we’ll touch on later).
CPython uses an Ahead-of-Time (AOT) compiler, i.e., the code is compiled in advance.
(A compiler translates program source code into machine-readable instructions.)
It is distributed as an assortment of statically compiled C extensions.
CPython is a general-purpose interpreter, allowing it to work on a variety of problems.
It is dynamically typed, so types can change as you go.
For example:
# assign x to an integer
x = 5
print(x)
# then assign x to a string
x = 'Gary'
print(x)
5
Gary
Numba¶
Numba uses a JIT (Just-in-Time) compiler on functions, i.e., it compiles the function at execution time.
This converts the function to fast machine code (LLVM).
Numba works with the default CPython.
It works by adding decorators around functions.
Numba is helpful when you want to speed up numerical operations in specific functions.
There are two main modes: object and nopython.
object mode (@jit)¶
Works by adding the @jit decorator around the function.
This then compiles code that handles all values as Python objects and uses CPython to work on those objects.
@jit first tries to use nopython mode (covered next), and if that fails it falls back to object mode.
The main improvement over CPython is for loops.
nopython mode (@njit)¶
Works by adding the @jit(nopython=True) decorator (aliased as @njit) around the function.
This then compiles code that does not access CPython.
This has higher performance than object mode.
The nopython mode requires specific types (mainly numbers); otherwise it raises a TypingError.
For example:
import numpy as np
from numba import njit
First, let's profile an example numerical function without Numba:
nums = np.arange(1_000_000)
def slow_function(nums):
    trace = 0.0
    for num in nums:  # loop
        trace += np.cos(num)  # numpy
    return nums + trace  # broadcasting
%%timeit
slow_function(nums)
1.37 s ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now, let's add the Numba njit decorator to the same function:
@njit
def fast_function(nums):
    trace = 0.0
    for num in nums:  # loop
        trace += np.cos(num)  # numpy
    return nums + trace  # broadcasting
The first call to the Numba function has an overhead to compile the function.
%%timeit -n 1 -r 1  # -n 1 means execute the statement once, -r 1 means for one repetition
fast_function(nums)
384 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Then, all subsequent calls use this compiled version, which is much faster.
%%timeit -n 1 -r 1
fast_function(nums)
21.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Question
For the function below (fast_add):
@njit
def fast_add(x, y):
    return x + y
What will happen when it is called with:
fast_add(1, (2,))
Solution
A TypingError is raised.
This is because Numba is trying to compile a function that adds an integer to a tuple.
Take care with types.
Ensure that the function works first, before adding Numba to make it faster.
Signatures¶
The signature of the Numba function can limit it to specific input and output types, among other things.
This can save time for Numba to infer the types, and is also useful for when we use GPUs later.
These are added as arguments to the Numba decorator.
For example:
from numba import int32, float32
Here, the output type is wrapped around the input types.
@njit(float32(int32, int32))
def fast_add(x, y):
    return x + y
fast_add(2, 2)
4.0
@vectorize¶
Numba also simplifies the creation of a NumPy ufunc using the @vectorize decorator.
They can be targeted to different hardware in the signature.
The default target is for a single CPU case (which has the least overhead).
This is suitable for smaller data sizes (<1 KB) and low compute intensities.
For example:
from numba import vectorize
Don’t worry about what this function does, just focus on the vectorisation bit.
You'll notice that this is the same example as from the previous lesson on vectorisation, apart from the fact that we're now adding Numba's @vectorize decorator.
import math
SQRT_2PI = np.float32((2.0 * math.pi)**0.5)
x = np.random.uniform(-3.0, 3.0, size=1_000_000)
mean = 0.0
sigma = 1.0
@vectorize  # I'm new
def my_function(x, mean, sigma):
    '''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''
    return math.exp(-0.5 * ((x - mean) / sigma)**2.0) / (sigma * SQRT_2PI)
So, the first call to the function compiles it:
%%timeit -n 1 -r 1
my_function(x, 0.0, 1.0)
73.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Then, all subsequent calls use the fast compiled version:
%%timeit -n 1 -r 1
my_function(x, 0.0, 1.0)
11 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
@guvectorize¶
During our last lesson on vectorisation we also touched on generalised ufuncs (gufuncs).
These extend vectorize to work on many input elements.
Numba has a nice implementation of these using guvectorize.
The signature also requires the types to be specified first in a list.
For example:
[(int64[:], int64, int64[:])]
means an n-element one-dimensional array of int64, a scalar of int64, and another n-element one-dimensional array of int64.
Then the signature includes the input(s) and output(s) dimensions in symbolic form.
For example:
'(n),()->(n)'
means input an n-element one-dimensional array ((n)) and a scalar (()), and output an n-element one-dimensional array ((n)).
from numba import guvectorize, int64
@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, result):
    for index in range(x.shape[0]):
        result[index] = x[index] + y
First, let's try the gufunc with a 1D array and an integer:
x = np.arange(5)
x
array([0, 1, 2, 3, 4])
g(x, 5)
array([5, 6, 7, 8, 9])
Okay. So, now how about a 2D array and an integer:
x = np.arange(6).reshape(2, 3)
x
array([[0, 1, 2],
[3, 4, 5]])
g(x, 10)
array([[10, 11, 12],
[13, 14, 15]])
And, what about a 2D array and a 1D array:
g(x, np.array([10, 20]))
array([[10, 11, 12],
[23, 24, 25]])
parallel=True¶
The next lesson covers parallelisation in detail. However, before that, let’s touch on a nice feature within Numba.
Numba can target different hardware in the signature.
Just now, we saw a Numba function for a single CPU, which is suitable for small data sizes.
The next target is a multi-core CPU.
This has small additional overheads for threading.
This is suitable for medium data sizes (1 KB to 1 MB).
If code contains operations that are parallelisable (and supported), Numba can compile a version that will run in parallel on multiple threads.
This parallelisation is performed automatically and is enabled by simply adding the keyword argument parallel=True to @njit.
For example, let's first use the function in serial (i.e., with parallel=False, which is also the default):
x = np.arange(1.e7, dtype=np.int64)
@njit
def my_serial_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2
%%timeit
my_serial_function_for_cpu(x)
280 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Okay, so let’s now change that to run in parallel:
@njit(parallel=True)
def my_parallel_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2
Note
The timing of this parallel function depends on how many CPUs your machine has and how free their resources are.
%%timeit
my_parallel_function_for_cpu(x)
143 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Exercises¶
Exercise 1
What is the default Python distribution?
Cython
PyPy
CPython
Solution
CPython
Exercise 2
Which Numba compilation mode has higher performance?
object
nopython
Solution
nopython
Exercise 3
How do I compile a function in Numba using nopython mode?
Solution
Wrap the function in the @njit (or @jit(nopython=True)) decorator.
Exercise 4
What is the keyword argument that enables Numba compiled functions to run over multiple CPUs?
Solution
parallel=True
Exercise 5
Create your own Numba vectorised function that calculates the cube root of an array over all elements.
Hint
Have a look at the similar exercise from the Vectorisation lesson.
Solution
Using the Numba @vectorize decorator and the math library:
import math
from numba import vectorize # I'm the only change from the NumPy ufunc exercise
@vectorize
def my_cube_root(array):
    return math.pow(array, 1/3)
%timeit my_cube_root(big_array)
27.7 ms ± 643 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba is a nice way to get performant numerical code.
Further information¶
More information and considerations¶
Factor out the performance-critical part of the code for compilation in Numba.
Consider what data precision is required, i.e., is 64-bit needed?
Numba can also target CUDA GPUs, which we’ll cover in the final lesson.
Other options¶
PyPy
Also uses a JIT compiler (though it is written in Python).
PyPy enables optimisations at run time, especially for numerical tasks with repetition and loops.
Completely replaces CPython.
Caution: it may not be compatible with the libraries you use.
Generally fast, though there are overheads for startup and memory.
PyPy is helpful when you want to speed up numerical operations in all of the code.
Resources¶
Why is Python slow?, Anthony Shaw, PyCon 2020.
CPython Internals, Anthony Shaw.