Chapter 2.2 - Move the ML experiment data to the cloud
Introduction
At this point, the codebase is made available to team members using Git, but the experiment data itself is not.
Similar to other version control systems, DVC allows storing the dataset in a remote storage, typically a cloud storage provider, ensuring effective tracking of modifications and a smooth maintenance workflow.
This guide will demonstrate the use of a remote Storage Bucket for storing the dataset.
In this chapter, you will learn how to:
- Create a project on the cloud provider
- Create a storage bucket on the cloud provider
- Configure DVC for remote storage
- Push the data files to DVC
- Commit the metadata files to Git
The following diagram illustrates the control flow of the experiment at the end of this chapter:
```mermaid
flowchart TB
dot_dvc[(.dvc)] <-->|dvc push
dvc pull| s3_storage[(S3 Storage)]
dot_git[(.git)] <-->|git push
git pull| gitGraph[Git Remote]
workspaceGraph <-....-> dot_git
data[data/raw] <-.-> dot_dvc
subgraph remoteGraph[REMOTE]
s3_storage
subgraph gitGraph[Git Remote]
repository[(Repository)]
end
end
subgraph cacheGraph[CACHE]
dot_dvc
dot_git
end
subgraph workspaceGraph[WORKSPACE]
prepare[prepare.py] <-.-> dot_dvc
train[train.py] <-.-> dot_dvc
evaluate[evaluate.py] <-.-> dot_dvc
data --> prepare
subgraph dvcGraph["dvc.yaml (dvc repro)"]
prepare --> train
train --> evaluate
end
params[params.yaml] -.- prepare
params -.- train
params <-.-> dot_dvc
end
style gitGraph opacity:0.4,color:#7f7f7f80
style repository opacity:0.4,color:#7f7f7f80
style workspaceGraph opacity:0.4,color:#7f7f7f80
style dvcGraph opacity:0.4,color:#7f7f7f80
style cacheGraph opacity:0.4,color:#7f7f7f80
style dot_git opacity:0.4,color:#7f7f7f80
style data opacity:0.4,color:#7f7f7f80
style prepare opacity:0.4,color:#7f7f7f80
style train opacity:0.4,color:#7f7f7f80
style evaluate opacity:0.4,color:#7f7f7f80
style params opacity:0.4,color:#7f7f7f80
linkStyle 1 opacity:0.4,color:#7f7f7f80
linkStyle 2 opacity:0.4,color:#7f7f7f80
linkStyle 3 opacity:0.4,color:#7f7f7f80
linkStyle 4 opacity:0.4,color:#7f7f7f80
linkStyle 5 opacity:0.4,color:#7f7f7f80
linkStyle 6 opacity:0.4,color:#7f7f7f80
linkStyle 7 opacity:0.4,color:#7f7f7f80
linkStyle 8 opacity:0.4,color:#7f7f7f80
linkStyle 9 opacity:0.4,color:#7f7f7f80
linkStyle 10 opacity:0.4,color:#7f7f7f80
linkStyle 11 opacity:0.4,color:#7f7f7f80
linkStyle 12 opacity:0.4,color:#7f7f7f80
```
Let's get started!
Steps
Install and configure the cloud provider CLI
Install and configure the cloud provider CLI tool to manage the cloud resources:
To install gcloud, follow the official documentation: Install the Google Cloud CLI - cloud.google.com
Initialize and configure the Google Cloud CLI
Authenticate to Google Cloud using the Google Cloud CLI with the command below. It should open a browser window to authenticate to Google Cloud. You might need to follow the instructions in the terminal to authenticate:
Warning
If gcloud asks you to pick a project or create a project, exit the process by pressing Ctrl+C in the terminal and follow the next steps to create a project.
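The exact command was not preserved here; a plausible invocation, assuming the standard Google Cloud CLI workflow, is:

```sh
# Initialize the Google Cloud CLI and authenticate in the browser
gcloud init
```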
This guide has been written with Google Cloud in mind. We are open to contributions to add support for other cloud providers such as Amazon Web Services, Exoscale, Microsoft Azure or Self-hosted Kubernetes but we might not officially support them.
If you want to contribute, please open an issue or a pull request on the GitHub repository. Your help is greatly appreciated!
Create a project on a cloud provider
This step will create a project on a cloud provider to host the data.
Warning
Do not create a new project through the web interface. The following commands will create a new project and link it to a billing account for you, without navigating through the web interface.
Export a Google Cloud Project ID with the following command. Replace <my_project_id> with a project ID of your choice. It has to be lowercase, with words separated by hyphens.
Warning
The project ID must be unique across all Google Cloud projects and users. For example, use mlops-<surname>-project, where surname is based on your name. Change the project ID if the command fails.
Execute the following command(s) in a terminal:
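A minimal sketch; the GCP_PROJECT_ID variable name is illustrative, not prescribed by the guide:

```sh
# Export the project ID as an environment variable
export GCP_PROJECT_ID=<my_project_id>
```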
Create a Google Cloud Project with the following commands:
Execute the following command(s) in a terminal:
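A sketch of the likely commands, reusing the GCP_PROJECT_ID variable assumed above:

```sh
# Create the project and make it the active configuration
gcloud projects create $GCP_PROJECT_ID
gcloud config set project $GCP_PROJECT_ID
```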
Then run the following command to authenticate to Google Cloud with the Application Default Credentials. It will create a credentials file in ~/.config/gcloud/application_default_credentials.json. This file must not be shared and will be used by DVC to authenticate to Google Cloud Storage.
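This is the standard Google Cloud CLI command for that purpose:

```sh
# Create Application Default Credentials for client tools such as DVC
gcloud auth application-default login
```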
Link a billing account to the project
Link a billing account to the project to be able to create cloud resources:
List the billing accounts with the following command:
Execute the following command(s) in a terminal:
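Assuming the gcloud billing command group, the listing command is likely:

```sh
# List the billing accounts linked to your Google account
gcloud billing accounts list
```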
If no billing account is available, you can add a new one from the Google Cloud Console and then link it to the project.
Export the billing account ID with the following command. Replace <my_billing_account_id> with your own billing account ID:
Execute the following command(s) in a terminal:
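The variable name is illustrative:

```sh
# Export the billing account ID as an environment variable
export GCP_BILLING_ACCOUNT_ID=<my_billing_account_id>
```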
Link a billing account to the project with the following command:
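A plausible command, reusing the variables assumed above:

```sh
# Link the billing account to the project
gcloud billing projects link $GCP_PROJECT_ID \
    --billing-account $GCP_BILLING_ACCOUNT_ID
```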
Create the Storage Bucket on the cloud provider
Create the Storage Bucket to store the data with the cloud provider CLI:
Info
On most cloud providers, the project must be linked to an active billing account to be able to create the bucket. You must set up a valid billing account for the cloud provider.
Create the Google Storage Bucket to store the data with the Google Cloud CLI.
Export the bucket name as an environment variable. Replace <my_bucket_name> with a bucket name of your choice. It has to be lowercase, with words separated by hyphens.
Warning
The bucket name must be unique across all Google Cloud projects and users. For example, use mlops-<surname>-bucket, where surname is based on your name. Change the bucket name if the command fails.
Execute the following command(s) in a terminal:
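The variable name is illustrative:

```sh
# Export the bucket name as an environment variable
export GCP_BUCKET_NAME=<my_bucket_name>
```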
Export the bucket location as an environment variable. You can view the available locations at Cloud locations. You should ideally select a location close to where most of the expected traffic will come from. Replace <my_bucket_location> with your own location. For example, use europe-west6 for Switzerland (Zurich):
Execute the following command(s) in a terminal:
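The variable name is illustrative:

```sh
# Export the bucket location as an environment variable
export GCP_BUCKET_LOCATION=<my_bucket_location>
```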
Create the bucket:
Execute the following command(s) in a terminal:
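A minimal sketch using the gcloud storage command group; your setup may require additional flags (for example, uniform bucket-level access):

```sh
# Create the bucket in the chosen location
gcloud storage buckets create gs://$GCP_BUCKET_NAME \
    --location=$GCP_BUCKET_LOCATION
```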
You now have everything you need for DVC.
Install the DVC Storage plugin
Install the DVC Storage plugin for the cloud provider:
Here, the dvc[gs] package enables support for Google Cloud Storage. Update the requirements.txt file:
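A sketch of the relevant requirements.txt entry; replace <version> with the DVC version your project already pins:

```txt
dvc[gs]==<version>
```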
Check the differences with Git to validate the changes:
Execute the following command(s) in a terminal:
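This is the standard Git command for that:

```sh
# Show the unstaged changes to the requirements file
git diff requirements.txt
```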
The output should be similar to this:
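An illustrative reconstruction, assuming only the dvc entry changed:

```diff
-dvc==<version>
+dvc[gs]==<version>
```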
Install the dependencies and update the freeze file:
Warning
Prior to running any pip commands, it is crucial to ensure the virtual environment is activated to avoid potential conflicts with system-wide Python packages.
To check its status, simply run pip -V. If the virtual environment is active, the output will show the path to the virtual environment's Python executable. If it is not, you can activate it with source .venv/bin/activate.
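A sketch of the install-and-freeze step; the requirements-freeze.txt file name is an assumption:

```sh
# Install the dependencies, then record the exact installed versions
pip install --requirement requirements.txt
pip freeze --local --all > requirements-freeze.txt
```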
Configure DVC to use the Storage Bucket
Configure DVC to use the Storage Bucket on the cloud provider:
Configure DVC to use a Google Storage remote bucket. The dvcstore is a user-defined path on the bucket. You can change it if needed:
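A plausible configuration; the remote name data and the -d (default) flag are assumptions, not prescribed by the guide:

```sh
# Add a default DVC remote pointing at the dvcstore path on the bucket
dvc remote add -d data gs://$GCP_BUCKET_NAME/dvcstore
```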
Check the changes
Check the changes with Git to ensure that all the expected files are there:
Execute the following command(s) in a terminal:
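This is the standard Git command for that:

```sh
# Review the current state of the repository
git status
```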
The output of the git status command should be similar to this:
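An illustrative reconstruction, assuming the DVC remote configuration and the requirements files were modified:

```text
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
        modified:   .dvc/config
        modified:   requirements-freeze.txt
        modified:   requirements.txt
```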
Push the data files to DVC
DVC works like Git. Once you want to share the data, you can use dvc push to upload the data and its cache to the storage provider:
Execute the following command(s) in a terminal:
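This is the standard DVC command for that:

```sh
# Upload the tracked data to the remote storage
dvc push
```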
Commit the changes to Git
You can now push the changes to Git so all team members can get the data from DVC as well.
Execute the following command(s) in a terminal:
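A typical sequence; the file list and the commit message are illustrative (requirements-freeze.txt follows the assumption made earlier):

```sh
# Stage and commit the DVC configuration and dependency changes, then push
git add .dvc/config requirements.txt requirements-freeze.txt
git commit -m "Configure DVC remote storage"
git push
```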
Check the results
Open the Bucket Storage on the cloud provider and check that the files were hashed and have been uploaded.
Open Cloud Storage in the Google Cloud web interface and click on your bucket to access the details.
Summary
Congratulations! You now have a dataset that can be used and shared among the team.
In this chapter, you have successfully:
- Created a new project on a cloud provider
- Installed and configured the cloud provider CLI
- Created the Storage Bucket on the cloud provider
- Installed the DVC Storage plugin
- Configured DVC to use the Storage Bucket
- Updated the .gitignore file and added the experiment data to DVC
- Pushed the data files to DVC
- Committed the changes to Git
You fixed some of the previous issues:
- Data no longer needs manual download and is placed in the right directory.
Other members of the team can easily get a copy of the experiment data from DVC with the following command:
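This is the standard DVC command for downloading tracked data:

```sh
# Download the data tracked by DVC from the remote storage
dvc pull
```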
With the help of DVC, they can also easily reproduce your experiment and, thanks to caching, only the required steps will be executed:
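This is the standard DVC command for reproducing the pipeline:

```sh
# Re-run the pipeline; cached stages are skipped
dvc repro
```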
You can now safely continue to the next chapter.
State of the MLOps process
- Notebook has been transformed into scripts for production
- Codebase and dataset are versioned
- Steps used to create the model are documented and can be re-executed
- Changes made to a model can be visualized with parameters, metrics and plots to identify differences between iterations
- Codebase can be shared and improved by multiple developers
- Dataset can be shared among the developers and is placed in the right directory in order to run the experiment
- Experiment may not be reproducible on other machines
- CI/CD pipeline does not report the results of the experiment
- Changes to model are not thoroughly reviewed and discussed before integration
- Model may have required artifacts that are forgotten or omitted in saved/loaded state
- Model cannot be easily used from outside of the experiment context
- Model requires manual publication to the artifact registry
- Model is not accessible on the Internet and cannot be used anywhere
- Model requires manual deployment on the cluster
- Model cannot be trained on hardware other than the local machine
- Model cannot be trained on custom hardware for specific use-cases
You will address these issues in the next chapters for improved efficiency and collaboration. Continue the guide to learn how.