Chapter 1.1 - Run a simple ML experiment with Jupyter Notebook
Introduction
As a recent addition to the ML team, your objective is to contribute to the development of a model capable of visually identifying planets or moons within our solar system from images.
The data scientists of your team have been actively collaborating on a Jupyter Notebook, which they have readily shared with you. The dataset they have gathered comprises approximately 1,650 images capturing 11 distinct planets and moons. Each celestial body is represented by around 150 images, each taken from a unique angle.
The training process is as follows:
- Preprocess the dataset
- Split the celestial bodies into training/testing datasets
- Train a model to classify the celestial bodies using the training dataset
- Evaluate the model's performance using metrics, the training history, a predictions preview, and a confusion matrix
Your primary objective is to enhance the team's workflow by implementing MLOps tools, documenting the procedures, tracking changes, and ensuring the model is accessible to others.
In this chapter, you will learn how to:
- Set up the project directory
- Acquire the notebook
- Obtain the dataset
- Create a Python environment to run the experiment
- Launch the experiment locally for the first time
The following diagram illustrates the control flow of the experiment at the end of this chapter:
```mermaid
flowchart
    subgraph workspaceGraph[WORKSPACE]
        data[data/raw] <--> notebook[notebook.ipynb]
    end
```
Let's get started!
Steps
Set up the project directory
As a new team member, set up a project directory on your computer for this groundbreaking ML experiment. This directory will serve as your working directory for this first chapter:
Execute the following command(s) in a terminal:
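Creating and entering the working directory can be sketched as follows (the directory name is a hypothetical choice; use any name you like):

```shell
# Create the working directory for the experiment and move into it.
# The directory name "a-guide-to-mlops" is a hypothetical choice.
mkdir -p a-guide-to-mlops
cd a-guide-to-mlops
```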
Download the notebook
Your colleague provided you the following URL to download an archive containing the Jupyter Notebook for this machine learning experiment:
Execute the following command(s) in a terminal:
Unzip the Jupyter Notebook into your working directory:
Execute the following command(s) in a terminal:
Download and set up the dataset
Your colleague provided you the following URL to download an archive containing the dataset for this machine learning experiment:
Execute the following command(s) in a terminal:
This archive must be decompressed and its contents moved into the data directory in the working directory of the experiment:
Execute the following command(s) in a terminal:
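The layout this step should produce can be sketched as follows, using a placeholder extracted directory to stand in for the decompressed archive (all file and directory names here are hypothetical, except data/raw, which the notebook expects):

```shell
# Stand-in for the decompressed archive contents (hypothetical names).
mkdir -p extracted
touch extracted/mars_0001.png

# Move the extracted images under data/raw, where the notebook expects them.
mkdir -p data
rm -rf data/raw
mv extracted data/raw
```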
Explore the notebook and dataset
Examine the notebook and the dataset to get a better understanding of their contents.
Your working directory should now look like this:
- The data directory, and all its sub-directories, are new.
- The raw directory includes the unprocessed dataset images.
Create the virtual environment
Create the virtual environment and install necessary dependencies in your working directory:
Execute the following command(s) in a terminal:
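A typical Python setup can be sketched as follows; the .venv directory name and the presence of a requirements.txt file are assumptions, as the team's actual dependency file may differ:

```shell
# Create and activate a virtual environment
# (the ".venv" directory name is an assumption).
python3 -m venv .venv
. .venv/bin/activate

# Install the dependencies, assuming the project ships a requirements.txt.
if [ -f requirements.txt ]; then
    pip install -r requirements.txt
fi
```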
Run the experiment
Awesome! You now have everything you need to run the experiment: the notebook and the dataset are in place, and the virtual environment is set up. You're ready to run the experiment for the first time.
Launch the notebook:
A browser window should open with the Jupyter Notebook at http://localhost:8888/lab/tree/notebook.ipynb.
You may notice that the previous outputs from the notebook are still present. This is because the notebook was not cleared before being shared with you. This can be useful for seeing the results of previous runs.
In most cases, however, it is also a source of confusion. This is one of the limitations of Jupyter Notebooks, which makes them not always easy to share with others.
For the time being, execute each step of the notebook to train the model and evaluate its performance. Previous outputs will be overwritten.
Ensure the experiment runs without errors. Once done, you can close the browser window. Shut down the Jupyter server by pressing Ctrl+C in the terminal, followed by Y and Enter.
Exit the virtual environment with the following command:
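For a venv-based environment like the one created above, the command is deactivate. A self-contained sketch (creating a throwaway environment just to illustrate; if you are already inside an activated environment, deactivate alone is enough):

```shell
# Create and enter a throwaway virtual environment, then leave it.
python3 -m venv .venv
. .venv/bin/activate

# Exit the virtual environment.
deactivate
```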
The Jupyter Notebook is a valuable tool for consolidating an entire experiment into a single file, visualizing data, and presenting results. However, it has severe limitations: it is challenging to share with others due to a lack of versioning, the experiment is difficult to reproduce, and previous outputs can lead to data leaks and confusion.
In the next chapter, you will see how to address these issues.
Summary
Congratulations! You have successfully reproduced the experiment on your machine.
In this chapter, you have:
- Created the working directory
- Acquired the codebase
- Obtained the dataset
- Set up a Python environment to run the experiment
- Executed the experiment locally for the first time
However, you may have identified the following areas for improvement:
- Notebook still needs manual download
- Dataset still needs manual download and placement
- Steps to run the experiment were not documented
In the next chapters, you will enhance the workflow to fix those issues.
State of the MLOps process
- Notebook can be run but is not adequate for production
- Codebase and dataset are not versioned
- Model steps rely on verbal communication and may be undocumented
- Changes to model are not easily visualized
- Codebase requires manual download and setup
- Dataset requires manual download and placement
- Experiment may not be reproducible on other machines
- CI/CD pipeline does not report the results of the experiment
- Changes to model are not thoroughly reviewed and discussed before integration
- Model may have required artifacts that are forgotten or omitted in saved/loaded state
- Model cannot be easily used from outside of the experiment context
- Model requires manual publication to the artifact registry
- Model is not accessible on the Internet and cannot be used anywhere
- Model requires manual deployment on the cluster
- Model cannot be trained on hardware other than the local machine
- Model cannot be trained on custom hardware for specific use-cases
You will address these issues in the next chapters for improved efficiency and collaboration. Continue the guide to learn how.
Sources
Highly inspired by:
- Planets and Moons Dataset - AI in Space - kaggle.com community prediction competition.