Summary

In this workshop, we covered:

1. Understand the fundamentals of machine learning and deep learning.

  • Machine learning and deep learning cover a range of prediction methods that learn associations from training data.

  • The objective is for the models to generalise to new data.

  • They mainly use tensors (multi-dimensional arrays) as inputs.

  • Problems are mainly either supervised (if you provide labels) or unsupervised (if you don’t provide labels).

  • Problems are either classification (if you’re trying to predict a discrete category) or regression (if you’re trying to predict a continuous number).

  • Data is split into training, validation, and test sets (a minimal splitting sketch follows this list).

  • The models only learn from the training data.

  • The test set is used only once.

  • Hyperparameters are set before model training.

  • Parameters (i.e., the weights and biases) are learnt during model training.

  • The aim is to minimise the loss function.

  • The model underfits when it has high bias.

  • The model overfits when it has high variance.
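
As a concrete illustration of the splitting described above, here is a minimal sketch using scikit-learn's train_test_split; the array names, dataset size, and split fractions are illustrative assumptions rather than values from the workshop.

```python
# Minimal sketch of a train/validation/test split (illustrative values).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=42)
features = rng.normal(size=(1000, 8))   # 1,000 samples, 8 features
labels = rng.integers(0, 2, size=1000)  # discrete labels -> a classification problem

# Hold out 20% as the test set, used only once for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Carve a validation set out of the remaining data for model selection;
# the model itself learns only from X_train/y_train.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)  # (600, 8) (200, 8) (200, 8)
```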

2. Know how to use key tools, including:

  • scikit-learn

    • scikit-learn is great for classic machine learning problems.

  • TensorFlow and Keras

    • TensorFlow is great for deep learning problems.

    • Keras (a high-level API for TensorFlow) provides many high-level objects to help you create deep learning models (a minimal Keras sketch follows this list).

  • PyTorch and PyTorch Lightning

    • PyTorch is great for deep learning problems.

    • PyTorch Lightning (a high-level API for PyTorch) provides many high-level objects to help you create deep learning models.

  • You can use low-level APIs for any custom objects.

  • Explore your data before using it.

  • Check your model before fitting it to the training data.

  • Evaluate your model and analyse the errors it makes.
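
To make the tooling concrete, here is a minimal Keras sketch that reuses the X_train/X_val/X_test arrays from the splitting example above; the layer sizes, optimiser, and epoch count are illustrative assumptions.

```python
# Minimal sketch of a Keras binary classifier (illustrative settings).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),                      # 8 input features
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary classification output
])

# Check the model before fitting it to the training data.
model.summary()

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",  # the loss function to minimise
    metrics=["accuracy"],
)

# Parameters (weights and biases) are learnt from the training data only;
# the validation set monitors how well the model generalises.
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5, batch_size=32)

# The test set is used only once, for the final evaluation.
model.evaluate(X_test, y_test)
```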

3. Be aware of good practices for data, such as pipelines and modules.

  • Always split the data into train and test subsets first, before any pre-processing.

  • Never fit to the test data.

  • Use a data pipeline.

  • Use a random seed and any available deterministic functionalities for reproducibility.

    • Try to reproduce your own work to check that it is reproducible.

  • Consider optimising the data pipeline (a tf.data sketch follows this list) with:

    • Shuffling.

    • Batching.

    • Caching.

    • Prefetching.

    • Parallel data extraction.

    • Data augmentation.

    • Parallel data transformation.

    • Vectorised mapping.

    • Mixed precision.
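
As a sketch of what an optimised input pipeline can look like, here is a tf.data example that reuses the X_train/y_train arrays from the splitting example above; the buffer size, batch size, and mapped function are illustrative assumptions.

```python
# Minimal sketch of an optimised tf.data input pipeline (illustrative values).
import tensorflow as tf

tf.random.set_seed(42)  # seed for reproducibility

train_ds = (
    tf.data.Dataset.from_tensor_slices((X_train, y_train))
    .map(lambda x, y: (tf.cast(x, tf.float32), y),
         num_parallel_calls=tf.data.AUTOTUNE)   # parallel data transformation
    .cache()                                    # caching (after the expensive map)
    .shuffle(buffer_size=1_000, seed=42)        # shuffling (re-shuffled each epoch)
    .batch(32)                                  # batching
    .prefetch(tf.data.AUTOTUNE)                 # prefetching
)

# The pipeline can then be passed straight to model.fit(train_ds, ...).
```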

4. Be aware of good practices for models, such as hyperparameter tuning, transfer learning, and callbacks.

  • Tune hyperparameters for the best model fit.

  • Use transfer learning to save computation on similar problems.

  • Consider using callbacks to help with model training (a sketch follows this list), such as:

    • Checkpoints.

    • Fault tolerance.

    • Logging.

    • Profiling.

    • Early stopping.

    • Learning rate decay.
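
For illustration, here is a minimal sketch of Keras callbacks covering several of the items above; the file paths, patience values, and decay factor are illustrative assumptions.

```python
# Minimal sketch of common Keras callbacks (illustrative settings).
import tensorflow as tf

callbacks = [
    # Checkpoints / fault tolerance: save the best model seen so far.
    tf.keras.callbacks.ModelCheckpoint(filepath="checkpoints/best.keras",
                                       save_best_only=True),
    # Logging (and optional profiling) with TensorBoard.
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
    # Early stopping: halt training when the validation loss stops improving.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True),
    # Learning rate decay: reduce the learning rate when progress stalls.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=2),
]

# Passed to training via model.fit(..., callbacks=callbacks).
```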

5. Be able to undertake distributed training.

  • Ensure that you really need to use distributed devices.

  • First check that everything works on a single device.

  • Ensure that the data pipeline can efficiently use multiple devices.

  • Use data parallelism to split the data over multiple devices (a MirroredStrategy sketch follows this list).

  • Take care when setting the global batch size.

  • Check the efficiency of your jobs to ensure they utilise the requested resources (for both single-device and multi-device jobs).

  • When moving from Jupyter to HPC:

    • Clean non-essential code.

    • Refactor Jupyter Notebook code into functions.

    • Create a Python script.

    • Create a submission script.

    • Create tests.
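
As a sketch of single-node data parallelism, here is a minimal tf.distribute.MirroredStrategy example; the per-replica batch size and model are illustrative assumptions.

```python
# Minimal sketch of data parallelism with MirroredStrategy (illustrative values).
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print("Number of replicas:", strategy.num_replicas_in_sync)

# The global batch size is split across replicas, so scale it with the
# number of devices rather than hard-coding a single-device value.
per_replica_batch_size = 32
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

with strategy.scope():
    # Build and compile the model inside the strategy scope so that its
    # variables are mirrored across the devices.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

# Training then consumes a dataset batched with global_batch_size,
# e.g. model.fit(train_ds, epochs=5) where train_ds uses that batch size.
```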

Next steps

For things that you’re interested in:

  • Try things out yourself (e.g., play around with the examples).

  • Check out the Online Courses.