Emulation of complex air quality model (Gaussian process)


Image sources 1, 2, and 3.


Numerical atmospheric models are useful to simulate air quality, weather, and climate. However, these models are slow and have high computational costs. Some compromises to meet these constraints are to:

  • Reduce the model accuracy.

    • By reducing the model complexity e.g., use simpler representations of mechanisms, decrease the (space/time) resolutions, replace analytical solutions with parameterisations.

  • Reduce the model precision.

  • Reduce the number of experiments.

  • Use a bigger computer.

Though, these compromises are not ideal (or sometimes even possible).

Alternative approach

Another approach is to use emulators. Emulators are machine learning models that act as proxies of these numerical atmospheric models. They are trained on data from model simulations and learn statistical associations between inputs and outputs. They are much cheaper to run, enabling many more experiments to be undertaken. These emulators are often designed using Gaussian process regressors, due to their high prediction accuracy on test data. Rasmussen & Williams (2006) do a great job of explaining them. They’ve been used in the atmospheric sciences to explore uncertainties, sensitivities, and for prediction problems.

The general approach is split into simulation and emulation. First, the inputs and outputs of the prediction problem are determined. Then simulations are designed, executed, and evaluated. Then emulators are designed, optimised, evaluated, and used for prediction.


We developed emulators to predict air quality and public health in China. The summary of our approach is:

  1. We simulated air quality using WRFChem.
    a. Our inputs were 5 anthropogenic emission sectors (residential, industrial, land transport, agriculture, and power generation).
    b. Our outputs were fine particulate matter (PM\(_{2.5}\)) and ozone (O\(_3\)) concentrations.
    c. We used a maxi−min Latin hypercube space–filling design to select the scalings to apply to the inputs (pyDOE).
    d. There were 50 years of training simulations and 5 separate years of test simulations (independent Latin hypercube designs).

    e. We performed a control run of the simulator and evaluated it against measurements to ensure it accurately predicted our outputs.

  2. We emulated air quality using Gaussian process regressors (scikit-learn).
    a. Our design included:

    • Preprocessing the inputs with a power transform (Yeo-Johnson) to make them more Gaussian-like (scikit-learn).

    • We used a Matern kernel.

    • We optimised the hyperparameters of the model using genetic programming (TPOT).

    • We developed 1 emulator per grid cell from the simulator (30,556 in total).

    b. The emulators were trained on the 50 years of simulator data.

    c. The emulators evaluated to a R\(^2\) of 0.999 for both outputs on the unseen test data.

  3. Prediction
    a. The emulators were used to predict a wide range of possible emission scenarios (32,768 in total).
    b. I’ll add the results here soon.

Further information

  • Short-term air quality prediction.

    • Code.

    • Paper.

      • Conibear, L. Reddington, C. L., Silver, B. J., Chen, Y., Knote, C., Arnold, S. R., Spracklen, D. V. (2021). Statistical emulation of winter ambient fine particulate matter concentrations from emission changes in China, GeoHealth.

  • Long-term air quality and public health prediction

    • Code.

    • Paper, in preparation.

      • Conibear, L. Reddington, C. L., Silver, B. J., Chen, Y., Knote, C., Arnold, S. R., Spracklen, D. V. (2021). The impacts of sectoral emission changes on air quality and public health in China from 2010−2020 using statistical emulation.

    • Interactive plot, in preparation.