
Machine Learning workflows in a Pixi workspace

Authors
Affiliations
University of Wisconsin-Madison
prefix.dev GmbH
NVIDIA

Now that we know how to build a CUDA-enabled Pixi environment in our example workspace and how to use NVIDIA GPUs on Brev, let’s run an example machine learning workflow with our Pixi workspace.

Training the machine learning model

Let’s work through a standard tutorial example: training a deep neural network on the MNIST dataset with PyTorch, and then running it on the GPUs on Brev.

The neural network code

We’ll download Python code that uses a convolutional neural network written in PyTorch to learn to identify the handwritten digits in the MNIST dataset, and place it under a src/ directory. This is a modified version of an example from the PyTorch documentation (mnist/main.py), which is licensed under the BSD 3-Clause license.

curl -sLO https://raw.githubusercontent.com/matthewfeickert/nvidia-gpu-ml-library-test/36c725360b1b1db648d6955c27bd3885b29a3273/torch_MNIST.py
mkdir -p src
mv torch_MNIST.py src/
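To see how a convolutional network turns a 28x28 MNIST image into the features fed to its fully connected layers, it helps to work through the layer shapes by hand. The sketch below assumes the modified script keeps the upstream mnist/main.py layout (two 3x3 convolutions with 32 and 64 output channels, then a 2x2 max pool), which is what produces the 9216 inputs of that example's first linear layer. The helper names are hypothetical.

```python
def conv2d_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size: int = 28) -> int:
    # Assumed layer layout, mirroring the upstream PyTorch mnist/main.py example:
    #   conv1: 1 -> 32 channels, 3x3 kernel, stride 1
    #   conv2: 32 -> 64 channels, 3x3 kernel, stride 1
    #   max pool: 2x2 window, stride 2
    size = conv2d_out(image_size, kernel=3)        # 28 -> 26
    size = conv2d_out(size, kernel=3)              # 26 -> 24
    size = conv2d_out(size, kernel=2, stride=2)    # 24 -> 12
    channels = 64
    return channels * size * size                  # 64 * 12 * 12


print(flattened_features())  # 9216
```

The same formula, floor((n + 2p - k) / s) + 1, is the one the PyTorch documentation gives for `torch.nn.Conv2d` with unit dilation.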

The Pixi environment

Now let’s think about what we need to run this code. Looking at the imports in src/torch_MNIST.py, we can see that torch and torchvision are the only imported libraries that aren’t part of the Python standard library, so we need to depend on PyTorch and torchvision. We also want to use CUDA-accelerated code, so we’ll need the CUDA libraries and builds of PyTorch that support CUDA.

Now that we have the environments solved, let’s compare training in the CPU environment with training in the GPU environment.

To check that the CPU code works, let’s do a short training run of only 2 epochs in the cpu environment.

pixi run --environment cpu python src/torch_MNIST.py --epochs 2 --save-model --data-dir data
100.0%
100.0%
100.0%
100.0%
Train Epoch: 1 [0/60000 (0%)]	Loss: 2.329474
Train Epoch: 1 [640/60000 (1%)]	Loss: 1.425185
Train Epoch: 1 [1280/60000 (2%)]	Loss: 0.826808
Train Epoch: 1 [1920/60000 (3%)]	Loss: 0.556883
Train Epoch: 1 [2560/60000 (4%)]	Loss: 0.483756
...
Train Epoch: 2 [57600/60000 (96%)]	Loss: 0.146226
Train Epoch: 2 [58240/60000 (97%)]	Loss: 0.016065
Train Epoch: 2 [58880/60000 (98%)]	Loss: 0.003342
Train Epoch: 2 [59520/60000 (99%)]	Loss: 0.001542

Test set: Average loss: 0.0351, Accuracy: 9874/10000 (99%)
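If you want to track the loss over a run (for example to plot it or to compare environments), the progress lines follow a fixed pattern that can be parsed. A minimal sketch, assuming the script keeps printing exactly the format shown above; `parse_train_log` is a hypothetical helper.

```python
import re

# Matches lines like "Train Epoch: 1 [640/60000 (1%)]\tLoss: 1.425185".
# \s+ accepts either a tab or spaces before "Loss".
TRAIN_LINE = re.compile(
    r"Train Epoch: (?P<epoch>\d+) \[(?P<seen>\d+)/\d+ \(\d+%\)\]\s+Loss: (?P<loss>[\d.]+)"
)


def parse_train_log(lines):
    """Yield (epoch, samples_seen, loss) tuples, skipping non-matching lines."""
    for line in lines:
        match = TRAIN_LINE.search(line)
        if match:
            yield int(match["epoch"]), int(match["seen"]), float(match["loss"])


print(list(parse_train_log(["100.0%", "Train Epoch: 1 [640/60000 (1%)]\tLoss: 1.425185"])))
# [(1, 640, 1.425185)]
```

To use it on a real run, save the output with something like `pixi run --environment cpu python src/torch_MNIST.py --epochs 2 --data-dir data | tee train.log` and then call `list(parse_train_log(open("train.log")))`.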

That took some time and we only got 2 epochs into training, but it ran!
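To put a number on "that took some time", you can time each run from Python. A minimal sketch using only the standard library; `time_command` and `training_command` are hypothetical helpers, and a 1-epoch comparison is an illustration, not a benchmark from this tutorial.

```python
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> float:
    """Run `cmd`, raise if it exits non-zero, and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def training_command(environment: str, epochs: int) -> list[str]:
    """Build the `pixi run` invocation used in this section for one environment."""
    return [
        "pixi", "run", "--environment", environment,
        "python", "src/torch_MNIST.py", "--epochs", str(epochs), "--data-dir", "data",
    ]


# Sanity check on a trivial command: the current interpreter doing nothing.
overhead = time_command([sys.executable, "-c", "pass"])
```

For example, `time_command(training_command("cpu", 1))` and `time_command(training_command("gpu", 1))` time one epoch in each environment. The measured time also includes Pixi and Python startup, and the first run includes downloading MNIST into data/, so run each once before comparing.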

Let’s speed things up by training in the gpu environment, this time for 14 epochs.

pixi run --environment gpu python src/torch_MNIST.py --epochs 14 --save-model --data-dir data
Train Epoch: 1 [0/60000 (0%)]	Loss: 2.281690
Train Epoch: 1 [640/60000 (1%)]	Loss: 1.459567
Train Epoch: 1 [1280/60000 (2%)]	Loss: 0.927929
Train Epoch: 1 [1920/60000 (3%)]	Loss: 0.632228
Train Epoch: 1 [2560/60000 (4%)]	Loss: 0.384857
...
Train Epoch: 14 [56960/60000 (95%)]	Loss: 0.009351
Train Epoch: 14 [57600/60000 (96%)]	Loss: 0.001419
Train Epoch: 14 [58240/60000 (97%)]	Loss: 0.024142
Train Epoch: 14 [58880/60000 (98%)]	Loss: 0.004241
Train Epoch: 14 [59520/60000 (99%)]	Loss: 0.003314

Test set: Average loss: 0.0268, Accuracy: 9919/10000 (99%)

Performing inference with the trained model

Now that we’ve trained our model, we’d like to use it to perform machine learning inference (model prediction). However, we might want to perform inference in a different software environment than the one we used for model training.

We’ll download Python code into the src/ directory that uses the same PyTorch convolutional neural network architecture as torch_MNIST.py to load the trained model and an image, and to predict which digit the image contains. This code is licensed under the MIT license.

curl -sLO https://raw.githubusercontent.com/matthewfeickert/nvidia-gpu-ml-library-test/36c725360b1b1db648d6955c27bd3885b29a3273/torch_MNIST_inference.py
mkdir -p src
mv torch_MNIST_inference.py src/
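For predictions to be meaningful, an image passed to the model has to be preprocessed the same way as the training data. A minimal sketch of that step in plain Python, assuming the modified scripts keep the upstream mnist/main.py transforms (`transforms.ToTensor()` followed by `transforms.Normalize((0.1307,), (0.3081,))`) and a 28x28 grayscale input; `normalize_pixels` is a hypothetical helper.

```python
# Assumed constants: the MNIST pixel mean and standard deviation used by the
# upstream PyTorch mnist/main.py example.
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081


def normalize_pixels(pixels):
    """Map 8-bit grayscale values (0-255) as ToTensor() + Normalize() would."""
    # ToTensor() scales to [0, 1]; Normalize() then standardizes each value.
    return [((p / 255.0) - MNIST_MEAN) / MNIST_STD for p in pixels]


print([round(v, 4) for v in normalize_pixels([0, 255])])  # [-0.4242, 2.8215]
```

If inference used different constants than training, the model would see inputs on a different scale and its predictions could degrade even though the weights are unchanged.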

Now that we’ve added an inference environment to the workspace, let’s use it to predict the digit in the default image.

pixi run --environment inference python src/torch_MNIST_inference.py --model-path ./mnist_cnn.pt --image-path ./test_image.png
Label: 4, Prediction: 4

Because we didn’t have an image of our own to run on yet, the script loaded one from the MNIST training set, and since those images are labeled, it also printed the true label. If we instead load an image from disk with no known label, we get only the prediction.

pixi run --environment inference python src/torch_MNIST_inference.py --model-path ./mnist_cnn.pt --image-path ./test_image.png
Prediction: 4
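The upstream example’s network ends with `log_softmax` over the ten digit classes, so the predicted digit is the index of the largest score (in PyTorch, `output.argmax(dim=1)`). A sketch of that last step and of the two output formats shown above, assuming the inference script follows the same pattern; `predict` and `format_result` are hypothetical helpers, not functions from torch_MNIST_inference.py, and the scores are made up.

```python
def predict(scores):
    """Return the index of the highest score, i.e. the predicted digit (argmax)."""
    return max(range(len(scores)), key=scores.__getitem__)


def format_result(scores, label=None):
    """Include the true label only when it is known (e.g. an MNIST sample)."""
    prediction = predict(scores)
    if label is None:
        return f"Prediction: {prediction}"
    return f"Label: {label}, Prediction: {prediction}"


# Made-up log-probabilities for digits 0 to 9, highest at index 4.
scores = [-9.1, -8.7, -7.9, -8.8, -0.01, -6.5, -7.2, -5.9, -8.3, -4.6]
print(format_result(scores, label=4))  # Label: 4, Prediction: 4
print(format_result(scores))           # Prediction: 4
```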