Chapter 1.1 - Run a simple ML experiment with Jupyter Notebook
Introduction
As a recent addition to the ML team, your objective is to contribute to the development of a model capable of visually identifying planets or moons within our solar system from images.
The data scientists of your team have been actively collaborating on a Jupyter Notebook, which they have readily shared with you. The dataset they have gathered comprises approximately 1,650 images capturing 11 distinct planets and moons. Each celestial body is represented by around 150 images, each taken from a unique angle.
The training process is as follows:
- Preprocess the dataset
- Split the celestial bodies into training/testing datasets
- Train a model to classify the celestial bodies using the training dataset
- Evaluate the model's performance using metrics, the training history, a predictions preview, and a confusion matrix
Your primary objective is to enhance the team's workflow by implementing MLOps tools, documenting the procedures, tracking changes, and ensuring the model is accessible to others.
In this chapter, you will learn how to:
- Set up the project directory
- Acquire the notebook
- Obtain the dataset
- Create a Python environment to run the experiment
- Launch the experiment locally for the first time
The following diagram illustrates the control flow of the experiment at the end of this chapter:
```mermaid
flowchart
    subgraph workspaceGraph[WORKSPACE]
        data[data/raw] <--> notebook[notebook.ipynb]
    end
```
Let's get started!
Steps
Set up the project directory
As a new team member, set up a project directory on your computer for this groundbreaking ML experiment. This directory will serve as your working directory for this first chapter:
Execute the following command(s) in a terminal:
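Creating and entering the working directory can be sketched as follows (the directory name is a hypothetical choice; use any name you like):

```shell
# Create the working directory for the experiment and move into it.
# The directory name "a-guide-to-mlops" is a hypothetical choice.
mkdir -p a-guide-to-mlops
cd a-guide-to-mlops
```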
Download the notebook
Your colleague provided you the following URL to download an archive containing the Jupyter Notebook for this machine learning experiment:
Execute the following command(s) in a terminal:
Unzip the Jupyter Notebook into your working directory:
Execute the following command(s) in a terminal:
Download and set up the dataset
Your colleague provided you the following URL to download an archive containing the dataset for this machine learning experiment:
Execute the following command(s) in a terminal:
This archive must be decompressed and its contents moved into the data directory in the working directory of the experiment:
Execute the following command(s) in a terminal:
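The layout this step should produce can be sketched as follows, using a placeholder extracted directory to stand in for the decompressed archive (all file and directory names here are hypothetical, except data/raw, which the notebook expects):

```shell
# Stand-in for the decompressed archive contents (hypothetical names).
mkdir -p extracted
touch extracted/mars_0001.png

# Move the extracted images under data/raw, where the notebook expects them.
mkdir -p data
rm -rf data/raw
mv extracted data/raw
```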
Explore the notebook and dataset
Examine the notebook and the dataset to get a better understanding of their contents.
Your working directory should now look like this:
- The data directory, and all its sub-directories, are new.
- The raw directory includes the unprocessed dataset images.
Create the virtual environment
Create the virtual environment and install necessary dependencies in your working directory:
Execute the following command(s) in a terminal:
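A typical Python setup can be sketched as follows; the .venv directory name and the presence of a requirements.txt file are assumptions, as the team's actual dependency file may differ:

```shell
# Create and activate a virtual environment
# (the ".venv" directory name is an assumption).
python3 -m venv .venv
. .venv/bin/activate

# Install the dependencies, assuming the project ships a requirements.txt.
if [ -f requirements.txt ]; then
    pip install -r requirements.txt
fi
```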
Run the experiment
Awesome! You now have everything you need to run the experiment: the notebook and the dataset are in place, and the virtual environment is set up. You're ready to run the experiment for the first time.
Launch the notebook:
A browser window should open with the Jupyter Notebook at http://localhost:8888/lab/tree/notebook.ipynb.
You may notice that the previous outputs from the notebook are still present. This is because the notebook was not cleared before being shared with you. This can be useful for seeing the results of previous runs.
In most cases, however, it is also a source of confusion. This is one of the limitations of Jupyter Notebooks, which makes them not always easy to share with others.
For the time being, execute each step of the notebook to train the model and evaluate its performance. Previous outputs will be overwritten.
Ensure the experiment runs without errors. Once done, you can close the browser window. Shut down the Jupyter server by pressing Ctrl+C in the terminal, followed by Y and Enter.
Exit the virtual environment with the following command:
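For a venv-based environment like the one created above, the command is deactivate. A self-contained sketch (creating a throwaway environment just to illustrate; if you are already inside an activated environment, deactivate alone is enough):

```shell
# Create and enter a throwaway virtual environment, then leave it.
python3 -m venv .venv
. .venv/bin/activate

# Exit the virtual environment.
deactivate
```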
The Jupyter Notebook is a valuable tool for consolidating an entire experiment into a single file, visualizing data, and presenting results. However, it has severe limitations: it is challenging to share with others due to a lack of versioning, the experiment is difficult to reproduce, and previous outputs can lead to data leaks and confusion.
In the next chapter, you will see how to address these issues.
Summary
Congratulations! You have successfully reproduced the experiment on your machine.
In this chapter, you have:
- Created the working directory
- Acquired the codebase
- Obtained the dataset
- Set up a Python environment to run the experiment
- Executed the experiment locally for the first time
However, you may have identified the following areas for improvement:
- Notebook still needs manual download
- Dataset still needs manual download and placement
- Steps to run the experiment were not documented
In the next chapters, you will enhance the workflow to fix those issues.
State of the MLOps process
- Notebook can be run but is not adequate for production
- Codebase and dataset are not versioned
- Model steps rely on verbal communication and may be undocumented
- Changes to model are not easily visualized
- Codebase requires manual download and setup
- Dataset requires manual download and placement
- Experiment may not be reproducible on other machines
- CI/CD pipeline does not report the results of the experiment
- Changes to model are not thoroughly reviewed and discussed before integration
- Model may have required artifacts that are forgotten or omitted in saved/loaded state
- Model cannot be easily used from outside of the experiment context
- Model requires manual publication to the artifact registry
- Model is not accessible on the Internet and cannot be used anywhere
- Model requires manual deployment on the cluster
- Model cannot be trained on hardware other than the local machine
- Model cannot be trained on custom hardware for specific use-cases
You will address these issues in the next chapters for improved efficiency and collaboration. Continue the guide to learn how.
Sources
Highly inspired by:
- Planets and Moons Dataset - AI in Space - kaggle.com community prediction competition.