Chapter 2.3 - Reproduce the ML experiment in a CI/CD pipeline
Introduction
At this point, your code, your data and your execution process should be shared with Git and DVC.
Now, it's time to enhance your workflow further by incorporating a CI/CD (Continuous Integration/Continuous Deployment) pipeline. This addition will enable you to execute your ML experiments remotely and reproduce them, ensuring that changes made to the project do not inadvertently break the experiment. This helps eliminate the notorious "but it works on my machine" effect, where code behaves differently across environments.
In this chapter, you will learn how to:
- Grant access to the S3 bucket on the cloud provider
- Store the cloud provider credentials in the CI/CD configuration
- Create the CI/CD pipeline configuration file
- Push the CI/CD pipeline configuration file to Git
- Visualize the execution of the CI/CD pipeline
The following diagram illustrates the control flow of the experiment at the end of this chapter:
flowchart TB
dot_dvc[(.dvc)] <-->|dvc push
dvc pull| s3_storage[(S3 Storage)]
dot_git[(.git)] <-->|git push
git pull| gitGraph[Git Remote]
workspaceGraph <-....-> dot_git
data[data/raw] <-.-> dot_dvc
subgraph remoteGraph[REMOTE]
s3_storage
subgraph gitGraph[Git Remote]
direction TB
repository[(Repository)] --> action[Action]
action -->|dvc pull| action_data[data/raw]
action_data -->|dvc repro| action_out[metrics & plots]
end
end
subgraph cacheGraph[CACHE]
dot_dvc
dot_git
end
subgraph workspaceGraph[WORKSPACE]
prepare[prepare.py] <-.-> dot_dvc
train[train.py] <-.-> dot_dvc
evaluate[evaluate.py] <-.-> dot_dvc
data --> prepare
subgraph dvcGraph["dvc.yaml (dvc repro)"]
prepare --> train
train --> evaluate
end
params[params.yaml] -.- prepare
params -.- train
params <-.-> dot_dvc
end
style cacheGraph opacity:0.4,color:#7f7f7f80
style workspaceGraph opacity:0.4,color:#7f7f7f80
style dot_git opacity:0.4,color:#7f7f7f80
style dot_dvc opacity:0.4,color:#7f7f7f80
style data opacity:0.4,color:#7f7f7f80
style prepare opacity:0.4,color:#7f7f7f80
style params opacity:0.4,color:#7f7f7f80
style train opacity:0.4,color:#7f7f7f80
style evaluate opacity:0.4,color:#7f7f7f80
style dvcGraph opacity:0.4,color:#7f7f7f80
style s3_storage opacity:0.4,color:#7f7f7f80
style repository opacity:0.4,color:#7f7f7f80
linkStyle 0 opacity:0.4,color:#7f7f7f80
linkStyle 1 opacity:0.4,color:#7f7f7f80
linkStyle 2 opacity:0.4,color:#7f7f7f80
linkStyle 3 opacity:0.4,color:#7f7f7f80
linkStyle 4 opacity:0.4,color:#7f7f7f80
linkStyle 7 opacity:0.4,color:#7f7f7f80
linkStyle 8 opacity:0.4,color:#7f7f7f80
linkStyle 9 opacity:0.4,color:#7f7f7f80
linkStyle 10 opacity:0.4,color:#7f7f7f80
linkStyle 11 opacity:0.4,color:#7f7f7f80
linkStyle 12 opacity:0.4,color:#7f7f7f80
linkStyle 13 opacity:0.4,color:#7f7f7f80
linkStyle 14 opacity:0.4,color:#7f7f7f80
linkStyle 15 opacity:0.4,color:#7f7f7f80
Steps
Set up access to the S3 bucket of the cloud provider
DVC needs to authenticate to the S3 bucket of the cloud provider to download the data inside the CI/CD pipeline.
Google Cloud allows the creation of a "Service Account", so you don't have to store/share your own credentials. A Service Account can be deleted, hence revoking all the access it had.
Create the Google Service Account and its associated Google Service Account Key to access Google Cloud without your own credentials.
The key will be stored in your ~/.config/gcloud directory under the name google-service-account-key.json.
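The commands below sketch one possible way to create the service account and its key with the gcloud CLI. The project ID, the service account name, and the roles/storage.objectViewer role are illustrative assumptions, not values prescribed by this guide. The run helper is a hypothetical wrapper that only prints each command unless DRY_RUN=0 is set, so you can review the commands before executing them.

```shell
# Hypothetical values: replace them with your own project ID and names.
GCP_PROJECT_ID="my-project-id"
SA_NAME="mlops-service-account"
SA_EMAIL="${SA_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"
KEY_FILE="${HOME}/.config/gcloud/google-service-account-key.json"

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the service account.
run gcloud iam service-accounts create "$SA_NAME" \
  --display-name="MLOps Service Account"

# Grant it read access to storage objects (assumed role, adjust to your needs).
run gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectViewer"

# Create the key and store it outside of the Git repository.
run mkdir -p "$(dirname "$KEY_FILE")"
run gcloud iam service-accounts keys create "$KEY_FILE" \
  --iam-account="$SA_EMAIL"
```

Run the script once as-is to review the printed commands, then run it again with DRY_RUN=0 to execute them.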
Danger
You must never add and commit this file to your repository. It contains sensitive data that you must keep safe.
Info
The path ~/.config/gcloud should be created when installing gcloud. If it does not exist, you can create it by running mkdir -p ~/.config/gcloud.
This guide has been written with Google Cloud in mind. We are open to contributions to add support for other cloud providers such as Amazon Web Services, Exoscale, Microsoft Azure or Self-hosted Kubernetes but we might not officially support them.
If you want to contribute, please open an issue or a pull request on the GitHub repository. Your help is greatly appreciated!
Store the cloud provider credentials in the CI/CD configuration
Now that the credentials are created, you need to store them in the CI/CD configuration. The process differs depending on the CI/CD platform you are using.
Display the Google Service Account key
The service account key is stored on your computer as a JSON file. You need to display it and store it as a CI/CD variable in a text format.
On GitHub, display the Google Service Account key that you have downloaded from Google Cloud.
On GitLab, encode the Google Service Account key that you have downloaded from Google Cloud as base64 and display it. This encoding allows GitLab to mask the secret in CI logs as a security measure.
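As a minimal sketch, both variants can be written as follows. A throwaway dummy file stands in for the real key so that the example is safe to run; in practice you would point key_file at ~/.config/gcloud/google-service-account-key.json. The variable names are illustrative.

```shell
# Assumption: a dummy key file replaces the real one for this demonstration.
key_file="$(mktemp)"
printf '{"type": "service_account", "project_id": "dummy"}\n' > "$key_file"

# GitHub variant: display the JSON key as-is.
cat "$key_file"

# GitLab variant: encode the key as base64 on a single line, then display it.
encoded_key="$(base64 < "$key_file" | tr -d '\n')"
printf '%s\n' "$encoded_key"

# Sanity check: decoding the value gives back the original key.
decoded_key="$(printf '%s' "$encoded_key" | base64 --decode)"
```

Removing the line breaks with tr keeps the encoded value on a single line, which is easier to paste into a CI/CD variable field.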
Store the Google Service Account key as a CI/CD variable
Store the output as a CI/CD variable by going to the Settings section from the top header of your GitHub repository.
Select Secrets and variables > Actions and select New repository secret.
Create a new variable named GOOGLE_SERVICE_ACCOUNT_KEY with the output value of the Google Service Account key file as its value. Save the variable by selecting Add secret.
Store the output as a CI/CD Variable by going to Settings > CI/CD from the left sidebar of your GitLab project.
Select Variables and select Add variable.
Create a new variable named GOOGLE_SERVICE_ACCOUNT_KEY with the Google Service Account key file encoded in base64 as its value.
- Protect variable: Unchecked
- Mask variable: Checked
- Expand variable reference: Unchecked
Save the variable by clicking Add variable.
Create the CI/CD pipeline configuration file
At the root level of your Git repository, create a GitHub Workflow configuration file .github/workflows/mlops.yaml. Take some time to understand the train job and its steps.
At the root level of your Git repository, create a GitLab CI configuration file .gitlab-ci.yml. Explore this file to understand the train stage and its steps.
Tip
Instead of running dvc pull and dvc repro separately, you can run them together with dvc repro --pull.
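The shell steps that such a job might run can be sketched as follows. This is not the pipeline configuration file itself, only an illustration of its logic under stated assumptions: GOOGLE_SERVICE_ACCOUNT_KEY holds the base64-encoded key (the GitLab variant; with the raw JSON key stored on GitHub, the decoding step is not needed), and the reproduce_experiment function name is hypothetical. A dummy key is used when the variable is not set so that the sketch can run locally.

```shell
# Assumption: fall back to a dummy base64-encoded key when the variable is unset.
: "${GOOGLE_SERVICE_ACCOUNT_KEY:=$(printf '{"type": "service_account"}' | base64 | tr -d '\n')}"

# Decode the key into a temporary file readable only by the current user.
key_file="$(mktemp)"
chmod 600 "$key_file"
printf '%s' "$GOOGLE_SERVICE_ACCOUNT_KEY" | base64 --decode > "$key_file"

# Google client libraries read the key location from this variable.
export GOOGLE_APPLICATION_CREDENTIALS="$key_file"

# Hypothetical helper: pull the data and reproduce the experiment in one command.
reproduce_experiment() {
  dvc repro --pull
}

# Only run inside a DVC project, on a machine where DVC is installed.
if command -v dvc >/dev/null 2>&1 && [ -f dvc.yaml ]; then
  reproduce_experiment
fi
```

Writing the key to a temporary file instead of the repository directory reduces the risk of committing it by mistake.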
Push the CI/CD pipeline configuration file to Git
Add, commit and push the CI/CD pipeline configuration file to Git.
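The sequence boils down to git add, git commit and git push. The sketch below runs it against throwaway local repositories so that it is safe to try; the temporary directories, the placeholder file content, the demo identity and the commit message are all assumptions. In your project, only the three commands at the end are needed, run from the repository root with the file that matches your platform.

```shell
# Assumption: throwaway repositories stand in for your project and its remote.
demo_dir="$(mktemp -d)"
remote_repo="$demo_dir/remote.git"
work_repo="$demo_dir/work"
git init --quiet --bare "$remote_repo"
git init --quiet "$work_repo"
git -C "$work_repo" remote add origin "$remote_repo"

# Placeholder content: the real file holds your pipeline configuration.
mkdir -p "$work_repo/.github/workflows"
printf 'name: MLOps\n' > "$work_repo/.github/workflows/mlops.yaml"

# Add, commit and push the configuration file.
git -C "$work_repo" add .github/workflows/mlops.yaml
git -C "$work_repo" -c user.name="Demo" -c user.email="demo@example.com" \
  commit --quiet -m "Add CI/CD pipeline configuration"
git -C "$work_repo" push --quiet origin HEAD
```

On GitLab, the file to add would be .gitlab-ci.yml instead.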
Check the results
On GitHub, you can see the pipeline running on the Actions page.
On GitLab, you can see the pipeline running on the CI/CD > Pipelines page.
You should see a newly created pipeline. The pipeline should log into Google Cloud, pull the data from DVC and reproduce the experiment. If you encounter cache errors, verify that you have pushed all data to DVC with dvc push.
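One way to check for missing data before pushing is to compare the local cache with the remote storage. The sync_dvc_remote function name below is hypothetical; it only wraps the two DVC commands and is meant to be called from the root of your repository.

```shell
# Hypothetical helper: report what differs from the remote, then upload it.
sync_dvc_remote() {
  # Compare the local cache with the remote storage.
  dvc status --cloud
  # Upload any data that is missing from the remote storage.
  dvc push
}
```

After calling sync_dvc_remote, rerun the pipeline to confirm that the cache errors are gone.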
You may have noticed that DVC skipped all stages because its cache is up to date. This confirms that the experiment can be run (all data and metadata are up to date) and that it can be reproduced (the results are the same).
This chapter is done; you can check the summary.
Summary
Congratulations! You now have a CI/CD pipeline that will run the experiment on each commit.
In this chapter, you have successfully:
- Granted access to the S3 bucket on the cloud provider
- Stored the cloud provider credentials in the CI/CD configuration
- Created the CI/CD pipeline configuration file
- Pushed the CI/CD pipeline configuration file to Git
- Visualized the execution of the CI/CD pipeline
You fixed some of the previous issues:
- The experiment can be executed on a clean machine with the help of a CI/CD pipeline
Your CI/CD pipeline ensures that, over time, the whole experiment can still be reproduced from the data and the commands tracked by DVC.
You can now safely continue to the next chapter.
State of the MLOps process
- Notebook has been transformed into scripts for production
- Codebase and dataset are versioned
- Steps used to create the model are documented and can be re-executed
- Changes done to a model can be visualized with parameters, metrics and plots to identify differences between iterations
- Codebase can be shared and improved by multiple developers
- Dataset can be shared among the developers and is placed in the right directory in order to run the experiment
- Experiment can be executed on a clean machine with the help of a CI/CD pipeline
- CI/CD pipeline does not report the results of the experiment
- Changes to model are not thoroughly reviewed and discussed before integration
- Model may have required artifacts that are forgotten or omitted in saved/loaded state
- Model cannot be easily used from outside of the experiment context
- Model requires manual publication to the artifact registry
- Model is not accessible on the Internet and cannot be used anywhere
- Model requires manual deployment on the cluster
- Model cannot be trained on hardware other than the local machine
- Model cannot be trained on custom hardware for specific use-cases
You will address these issues in the next chapters for improved efficiency and collaboration. Continue the guide to learn how.
Sources
Highly inspired by:
- Creating and managing service accounts - cloud.google.com
- Create and manage service account keys - cloud.google.com
- IAM basic and predefined roles reference - cloud.google.com
- Using service accounts - dvc.org
- Creating encrypted secrets for a repository - docs.github.com
- Add a CI/CD variable to a project - docs.gitlab.com
- Triggering a workflow - docs.github.com