# Chapter 1.2 - Adapt and move the Jupyter Notebook to Python scripts
## Introduction
Jupyter Notebooks provide an interactive environment where code can be executed and results can be visualized. They combine code, text explanations, visualizations, and media in a single document, making them a flexible tool for documenting an ML experiment.
However, they have severe limitations, such as challenges with reproducibility, scalability, experiment tracking, and standardization. Converting a Jupyter Notebook into Python scripts suitable for running ML experiments in a more modular and reproducible manner helps address these issues and improves the overall ML development process.
pip is the standard package manager for Python. It is used to install and manage dependencies in a Python environment.
In this chapter, you will learn how to:
- Set up a Python environment using pip
- Adapt the content of the Jupyter Notebook into Python scripts
- Launch the experiment locally
The following diagram illustrates the control flow of the experiment at the end of this chapter:
```mermaid
flowchart LR
    subgraph workspaceGraph[WORKSPACE]
        prepare[prepare.py] --> train
        train[train.py] --> evaluate[evaluate.py]
        params[params.yaml] -.- prepare
        params -.- train
        data[data/raw] --> prepare
    end
    style data opacity:0.4,color:#7f7f7f80
```
Let's get started!
## Steps

### Set up a new project directory
For the rest of the guide, you will work in a new directory. This will allow you to use the Jupyter Notebook directory as a reference.
Start by ensuring you have left the virtual environment created in the previous chapter:
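A minimal sketch, assuming the previous chapter's environment is still active in your shell:

```sh
# Leave the virtual environment from the previous chapter
deactivate
```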
Next, exit from the current directory and create a new one:
Execute the following command(s) in a terminal:
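A sketch of the commands; the directory name `a-guide-to-mlops` is a hypothetical placeholder, pick any name you like:

```sh
# Move out of the previous chapter's directory
cd ..

# Create and enter the new project directory (hypothetical name)
mkdir a-guide-to-mlops
cd a-guide-to-mlops
```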
### Set up the dataset

You will use the same dataset as in the previous chapter. Copy the `data` folder from the previous chapter to your new directory:
Execute the following command(s) in a terminal:
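A sketch, assuming the previous chapter's directory is a sibling named `a-guide-to-mlops-notebook` (a hypothetical name; adjust the path to your own layout):

```sh
# Copy the dataset from the previous chapter's directory (hypothetical path)
cp -r ../a-guide-to-mlops-notebook/data .
```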
### Set up a Python environment

First, create the virtual environment:
Not familiar with virtual environments? Read this!
What are virtual environments?
Python virtual environments are essential tools for managing dependencies and isolating project environments. They allow developers to create separate, self-contained environments for different projects, ensuring that each project has its own set of dependencies without interfering with one another.
This is particularly important when working on multiple projects with different versions of libraries or packages.
How do virtual environments work?
Virtual environments work by creating a local directory that contains a Python interpreter and a copy of the desired Python packages. When activated, the virtual environment modifies the system's PATH variable to prioritize the interpreter and packages within the local directory.
This ensures that when running Python commands, the system uses the specific interpreter and packages from the virtual environment, effectively isolating the project from the global Python installation.
How to manage virtual environments?

- Create a virtual environment: `python3.11 -m venv .venv`
- Activate the virtual environment: `source .venv/bin/activate`
- Deactivate the virtual environment: `deactivate`
Conclusion
Virtual environments are essential for dependency management and environment isolation. They ensure stability, reproducibility, and clean project separation. By using virtual environments, you achieve smoother collaboration, easier debugging, and reliable deployment.
Execute the following command(s) in a terminal:
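Using the commands shown in the note above:

```sh
# Create the virtual environment
python3.11 -m venv .venv

# Activate the virtual environment
source .venv/bin/activate
```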
Create a `requirements.txt` file to list the dependencies:
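The exact packages and versions depend on your notebook; as an illustration, a scikit-learn-based experiment might list something like:

```
joblib==1.3.2
matplotlib==3.8.2
pandas==2.1.4
PyYAML==6.0.1
scikit-learn==1.3.2
```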
Install the dependencies:
Execute the following command(s) in a terminal:
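A minimal sketch:

```sh
# Install the dependencies listed in requirements.txt
pip install --requirement requirements.txt
```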
Create a freeze file that pins every dependency, including transitive dependencies, to an exact version. This will help with reproducibility:
Not familiar with freezing dependencies? Read this!
When working on Python projects, managing dependencies is crucial for maintaining a stable and reproducible development environment.
Understanding requirements.txt
The requirements.txt file is a commonly used approach to specify project dependencies. It lists all the high-level dependencies required for your project, including their specific versions. Each line in the file typically follows the format `package_name==version`.
Freezing dependencies
Freezing dependencies refers to fixing the versions of all transitive dependencies, ensuring that the same versions are installed consistently across different environments. This is crucial for reproducibility, as it guarantees that everyone working on the project has the exact same dependencies.
Separating high-level and transitive dependencies
To better control and manage your project's dependencies, it's beneficial to separate high-level dependencies from transitive dependencies. This approach allows for clearer identification of the core functionality packages and their required versions, ensuring a more focused and stable development environment.
- `requirements.txt`: This file contains the high-level dependencies explicitly required by your project. It should include packages necessary for your project's core functionality while excluding packages that are indirectly required by other dependencies. By isolating the high-level dependencies, you maintain a clear distinction between the essential packages and the ones brought in transitively.
- `requirements-freeze.txt`: This file includes all the transitive dependencies required by the high-level dependencies. It ensures that all the packages needed for the project, including their versions, are recorded in a separate file. This separation allows for a more flexible and controlled approach when updating transitive dependencies while maintaining the reproducibility of your project.
How to update dependencies
When updating dependencies, it is essential to primarily modify the high-level `requirements.txt` file with the desired versions or new packages. Then, generate an updated `requirements-freeze.txt` file to capture the updated transitive dependencies accurately.
Conclusion
Prioritizing stability and reproducibility in your project's dependency management is crucial for minimizing compatibility issues, avoiding unexpected bugs, and ensuring a smooth and reliable development process.
By using separate requirements files for high-level and transitive dependencies, you gain better visibility and control over the dependencies required by your project. This approach promotes a stable and reproducible development environment while allowing you to update specific packages and their versions when needed. By following these practices, you can ensure the long-term success of your Python projects.
Execute the following command(s) in a terminal:
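Given the flags described below, the command presumably looks like this:

```sh
# Freeze the local packages, including pip/setuptools/wheel,
# into a dedicated file for reproducibility
pip freeze --local --all > requirements-freeze.txt
```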
- The `--local` flag ensures that if a virtualenv has global access, it will not output globally-installed packages.
- The `--all` flag ensures that it does not skip these packages in the output: `setuptools`, `wheel`, `pip`, `distribute`.
### Split the Jupyter Notebook into scripts

You will split the Jupyter Notebook into a codebase made of separate Python scripts, each with a well-defined role. These scripts can be invoked from the command line, making them ideal for automation tasks.
The following table describes the files that you will create in this codebase:
| File | Description | Input | Output |
|---|---|---|---|
| `params.yaml` | The parameters to run the ML experiment | - | - |
| `src/prepare.py` | Prepare the dataset to run the ML experiment | The dataset to prepare in the `data/raw` directory | The prepared data in the `data/prepared` directory |
| `src/train.py` | Train the ML model | The prepared dataset | The model trained with the dataset |
| `src/evaluate.py` | Evaluate the ML model using scikit-learn | The model to evaluate | The results of the model evaluation in the `evaluation` directory |
| `src/utils/seed.py` | Utility function to fix the seed | - | - |
#### Move the parameters to their own file

Let's move the parameters used to run the ML experiment into a distinct file:
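A minimal sketch of `params.yaml`; the parameter names and values here are hypothetical and should come from your notebook:

```yaml
prepare:
  split: 0.2 # hypothetical: fraction of data reserved for the test set
  seed: 77   # hypothetical: seed for the train/test split

train:
  seed: 77   # hypothetical: seed for model training
```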
#### Move the preparation step to its own file

The `src/prepare.py` script will prepare the dataset. Let's take this opportunity to refactor the code to make it more modular and explicit using functions:
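The real preparation logic comes from your notebook. As an illustration only, here is a minimal sketch that assumes a tabular dataset stored as `data/raw/dataset.csv` (a hypothetical file name) and the `prepare` parameters sketched in `params.yaml` above:

```python
import sys
from pathlib import Path

import pandas as pd
import yaml
from sklearn.model_selection import train_test_split


def load_params(path: str = "params.yaml") -> dict:
    """Load the experiment parameters from the YAML file."""
    with open(path) as f:
        return yaml.safe_load(f)


def main() -> None:
    # Input and output directories are passed on the command line,
    # e.g. python3.11 src/prepare.py data/raw data/prepared
    raw_dir = Path(sys.argv[1])
    prepared_dir = Path(sys.argv[2])
    prepared_dir.mkdir(parents=True, exist_ok=True)

    params = load_params()["prepare"]

    # Hypothetical: assumes the raw data is a single CSV file
    df = pd.read_csv(raw_dir / "dataset.csv")

    # Split the dataset into a training set and a test set
    train_df, test_df = train_test_split(
        df, test_size=params["split"], random_state=params["seed"]
    )
    train_df.to_csv(prepared_dir / "train.csv", index=False)
    test_df.to_csv(prepared_dir / "test.csv", index=False)


if __name__ == "__main__":
    main()
```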
#### Move the train step to its own file

The `src/train.py` script will train the ML model. Let's take this opportunity to refactor the code to make it more modular and explicit using functions:
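Again a sketch rather than the actual script: it assumes the CSV layout produced by the `prepare.py` sketch above, a hypothetical `label` column, and the `fix_seed` helper created later in this chapter; the model type is an arbitrary example:

```python
import sys
from pathlib import Path

import joblib
import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier

from utils.seed import fix_seed


def main() -> None:
    # e.g. python3.11 src/train.py data/prepared model
    prepared_dir = Path(sys.argv[1])
    model_dir = Path(sys.argv[2])
    model_dir.mkdir(parents=True, exist_ok=True)

    with open("params.yaml") as f:
        params = yaml.safe_load(f)["train"]

    # Fix the seed so the training is reproducible
    fix_seed(params["seed"])

    train_df = pd.read_csv(prepared_dir / "train.csv")
    X = train_df.drop(columns=["label"])  # hypothetical label column
    y = train_df["label"]

    model = RandomForestClassifier(random_state=params["seed"])
    model.fit(X, y)

    # Save the trained model so evaluate.py can load it
    joblib.dump(model, model_dir / "model.joblib")


if __name__ == "__main__":
    main()
```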
#### Move the evaluate step to its own file

The `src/evaluate.py` script will evaluate the ML model using scikit-learn. Let's take this opportunity to refactor the code to make it more modular and explicit using functions:
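The original script generates several plots and metrics; as a much smaller illustration, here is a sketch that computes a single metric and one plot, following the conventions of the sketches above:

```python
import json
import sys
from pathlib import Path

import joblib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score


def main() -> None:
    # e.g. python3.11 src/evaluate.py model data/prepared
    model_dir = Path(sys.argv[1])
    prepared_dir = Path(sys.argv[2])
    evaluation_dir = Path("evaluation")
    evaluation_dir.mkdir(parents=True, exist_ok=True)

    # Load the model trained by train.py
    model = joblib.load(model_dir / "model.joblib")

    test_df = pd.read_csv(prepared_dir / "test.csv")
    X = test_df.drop(columns=["label"])  # hypothetical label column
    y = test_df["label"]
    predictions = model.predict(X)

    # Write the metrics as JSON so they can be compared between runs
    metrics = {"accuracy": accuracy_score(y, predictions)}
    with open(evaluation_dir / "metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)

    # Plot the confusion matrix
    ConfusionMatrixDisplay.from_predictions(y, predictions)
    plt.savefig(evaluation_dir / "confusion_matrix.png")


if __name__ == "__main__":
    main()
```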
#### Create the seed helper function

Finally, add a module for utils:
Execute the following command(s) in a terminal:
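Presumably something like:

```sh
# Create the utils module
mkdir -p src/utils
touch src/utils/__init__.py
```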
In this module, include `src/utils/seed.py` to handle fixing the seed parameters. This ensures the results are reproducible:
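A minimal sketch of `src/utils/seed.py`; if your experiment uses other libraries (for example TensorFlow or PyTorch), their generators would need to be seeded here as well:

```python
import random

import numpy as np


def fix_seed(seed: int) -> None:
    """Fix the seeds of the random number generators for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
```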
#### Create a README.md file

Finally, create a `README.md` file at the root of the project to describe the repository. Feel free to use the following template. As you progress through this guide, you can add your notes in the `## Notes` section:
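A possible starting point; the title and description are placeholders to adapt to your project:

```md
# My MLOps project

This project trains and evaluates an ML model using a set of Python scripts.

## Usage

Set up the environment and run the scripts as described in this guide.

## Notes

<!-- Add your notes here as you progress through the guide -->
```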
### Check the results

Your working directory should now look like this:
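The tree below is an illustrative sketch, not an exact listing:

```
.
├── .venv/
├── data/
│   └── raw/
├── src/
│   ├── evaluate.py
│   ├── prepare.py
│   ├── train.py
│   └── utils/
│       ├── __init__.py
│       └── seed.py
├── params.yaml
├── README.md
├── requirements-freeze.txt
└── requirements.txt
```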
Everything except the `data` folder you copied earlier is new: the `.venv` and `src` directories (with all their contents), `params.yaml`, `README.md`, `requirements.txt`, and `requirements-freeze.txt`.
### Run the experiment
Awesome! You now have everything you need to run the experiment: the codebase and the dataset are in place, the new virtual environment is set up, and you are ready to run the experiment using scripts for the first time.
You can now follow these steps to reproduce the experiment:
Execute the following command(s) in a terminal:
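Assuming the hypothetical argument conventions used in the sketches above:

```sh
# Prepare the dataset
python3.11 src/prepare.py data/raw data/prepared

# Train the model
python3.11 src/train.py data/prepared model

# Evaluate the model
python3.11 src/evaluate.py model data/prepared
```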
The experiment will take some time to run. Once it is done, you will find the results in the `data/prepared`, `model`, and `evaluation` directories.
### Check the results

The `data/prepared`, `model`, and `evaluation` directories (and everything they contain) are new additions to your working directory.
Here, the following should be noted:
- the `prepare.py` script created the `data/prepared` directory and divided the dataset into a training set and a test set
- the `train.py` script created the `model` directory and trained the model with the prepared data
- the `evaluate.py` script created the `evaluation` directory and generated some plots and metrics to evaluate the model
Take some time to get familiar with the scripts and the results.
## Summary
Congratulations! You have successfully reproduced the experiment on your machine, this time using a modular approach that can be put into production.
In this chapter, you have:
- Set up a Python environment using `pip` and a virtual environment
- Adapted the content of the Jupyter Notebook into Python scripts
- Launched the experiment locally
However, you may have identified the following areas for improvement:
- Codebase is not versioned
- Dataset still needs manual download and placement
- Steps to run the experiment were not documented
- Codebase is not easily sharable
- Dataset is not easily sharable
In the next chapters, you will enhance the workflow to fix those issues.
You can now safely continue to the next chapter.
## State of the MLOps process
- Notebook has been transformed into scripts for production
- Codebase and dataset are not versioned
- Model steps rely on verbal communication and may be undocumented
- Changes to model are not easily visualized
- Codebase requires manual download and setup
- Dataset requires manual download and placement
- Experiment may not be reproducible on other machines
- CI/CD pipeline does not report the results of the experiment
- Changes to model are not thoroughly reviewed and discussed before integration
- Model may have required artifacts that are forgotten or omitted in saved/loaded state
- Model cannot be easily used from outside of the experiment context
- Model requires manual publication to the artifact registry
- Model is not accessible on the Internet and cannot be used anywhere
- Model requires manual deployment on the cluster
- Model cannot be trained on hardware other than the local machine
- Model cannot be trained on custom hardware for specific use-cases
You will address these issues in the next chapters for improved efficiency and collaboration. Continue the guide to learn how.