
Chapter 2.2 - Move the ML experiment data to the cloud

Introduction

At this point, the codebase is made available to team members using Git, but the experiment data itself is not.

Similarly to other version control systems, DVC allows storing the dataset in remote storage, typically a cloud storage provider, ensuring effective tracking of modifications and a smooth maintenance workflow.

This guide will demonstrate the use of a remote Storage Bucket for storing the dataset.

In this chapter, you will learn how to:

  1. Create a project on the cloud provider
  2. Create a storage bucket on the cloud provider
  3. Configure DVC for remote storage
  4. Push the data files to DVC
  5. Commit the metadata files to Git

The following diagram illustrates the control flow of the experiment at the end of this chapter:

flowchart TB
    dot_dvc[(.dvc)] <-->|dvc push
                         dvc pull| s3_storage[(S3 Storage)]
    dot_git[(.git)] <-->|git push
                         git pull| gitGraph[Git Remote]
    workspaceGraph <-....-> dot_git
    data[data/raw] <-.-> dot_dvc
    subgraph remoteGraph[REMOTE]
    s3_storage
        subgraph gitGraph[Git Remote]
            repository[(Repository)]
        end
    end
    subgraph cacheGraph[CACHE]
        dot_dvc
        dot_git
    end
    subgraph workspaceGraph[WORKSPACE]
    prepare[prepare.py] <-.-> dot_dvc
    train[train.py] <-.-> dot_dvc
    evaluate[evaluate.py] <-.-> dot_dvc
    data --> prepare
    subgraph dvcGraph["dvc.yaml (dvc repro)"]
    prepare --> train
    train --> evaluate
    end
    params[params.yaml] -.- prepare
    params -.- train
    params <-.-> dot_dvc
    end
    style gitGraph opacity:0.4,color:#7f7f7f80
    style repository opacity:0.4,color:#7f7f7f80
    style workspaceGraph opacity:0.4,color:#7f7f7f80
    style dvcGraph opacity:0.4,color:#7f7f7f80
    style cacheGraph opacity:0.4,color:#7f7f7f80
    style dot_git opacity:0.4,color:#7f7f7f80
    style data opacity:0.4,color:#7f7f7f80
    style prepare opacity:0.4,color:#7f7f7f80
    style train opacity:0.4,color:#7f7f7f80
    style evaluate opacity:0.4,color:#7f7f7f80
    style params opacity:0.4,color:#7f7f7f80
    linkStyle 1 opacity:0.4,color:#7f7f7f80
    linkStyle 2 opacity:0.4,color:#7f7f7f80
    linkStyle 3 opacity:0.4,color:#7f7f7f80
    linkStyle 4 opacity:0.4,color:#7f7f7f80
    linkStyle 5 opacity:0.4,color:#7f7f7f80
    linkStyle 6 opacity:0.4,color:#7f7f7f80
    linkStyle 7 opacity:0.4,color:#7f7f7f80
    linkStyle 8 opacity:0.4,color:#7f7f7f80
    linkStyle 9 opacity:0.4,color:#7f7f7f80
    linkStyle 10 opacity:0.4,color:#7f7f7f80
    linkStyle 11 opacity:0.4,color:#7f7f7f80
    linkStyle 12 opacity:0.4,color:#7f7f7f80

Let's get started!

Steps

Install and configure the cloud provider CLI

Install and configure the cloud provider CLI tool to manage the cloud resources:

To install gcloud, follow the official documentation: Install the Google Cloud CLI - cloud.google.com

Initialize and configure the Google Cloud CLI

Authenticate to Google Cloud using the Google Cloud CLI with the following command. It should open a browser window to authenticate to Google Cloud. You might need to follow the instructions in the terminal to complete the authentication:

Warning

If gcloud asks you to pick a project or create a project, exit the process by pressing Ctrl+C in the terminal and follow the next steps to create a project.

Execute the following command(s) in a terminal
# Initialize and login to Google Cloud
gcloud init

This guide has been written with Google Cloud in mind. We are open to contributions to add support for other cloud providers such as Amazon Web Services, Exoscale, Microsoft Azure or Self-hosted Kubernetes but we might not officially support them.

If you want to contribute, please open an issue or a pull request on the GitHub repository. Your help is greatly appreciated!

Create a project on a cloud provider

This step will create a project on a cloud provider to host the data.

Warning

Do not create a new project through the web interface. The following commands will create a new project and link it to a billing account for you, without navigating through the web interface.

Export a Google Cloud Project ID with the following command. Replace <my_project_id> with a project ID of your choice. It must be lowercase, with words separated by hyphens.

Warning

The project ID must be unique across all Google Cloud projects and users. For example, use mlops-<surname>-project, where surname is based on your name. Change the project ID if the command fails.

Execute the following command(s) in a terminal
# Export the project ID
export GCP_PROJECT_ID=<my_project_id>

Create a Google Cloud Project with the following commands:

Execute the following command(s) in a terminal
# Create a new project
gcloud projects create $GCP_PROJECT_ID

# Select your Google Cloud project
gcloud config set project $GCP_PROJECT_ID

Then run the following command to authenticate to Google Cloud with Application Default Credentials. It will create a credentials file in ~/.config/gcloud/application_default_credentials.json. This file must not be shared and will be used by DVC to authenticate to Google Cloud Storage.

Execute the following command(s) in a terminal
# Set authentication for our ML experiment
# https://dvc.org/doc/user-guide/data-management/remote-storage/google-cloud-storage
# https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login
gcloud auth application-default login
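
To verify that the Application Default Credentials were created, you can check that the credentials file exists (a quick sanity check, not required by the guide):

# Check that the credentials file was created
ls ~/.config/gcloud/application_default_credentials.json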

Link a billing account to the project to be able to create cloud resources:

List the billing accounts with the following command:

Execute the following command(s) in a terminal
# List the billing accounts
gcloud billing accounts list

If no billing account is available, you can add a new one from the Google Cloud Console and then link it to the project.

Export the billing account ID with the following command. Replace <my_billing_account_id> with your own billing account ID:

Execute the following command(s) in a terminal
# Export the billing account ID
export GCP_BILLING_ACCOUNT_ID=<my_billing_account_id>

Link a billing account to the project with the following command:

Execute the following command(s) in a terminal
# Link the billing account to the project
gcloud billing projects link $GCP_PROJECT_ID \
    --billing-account $GCP_BILLING_ACCOUNT_ID

Create the Storage Bucket on the cloud provider

Create the Storage Bucket to store the data with the cloud provider CLI:

Info

On most cloud providers, the project must be linked to an active billing account to be able to create the bucket. You must set up a valid billing account for the cloud provider.

Create the Google Storage Bucket to store the data with the Google Cloud CLI.

Export the bucket name as an environment variable. Replace <my_bucket_name> with a bucket name of your choice. It must be lowercase, with words separated by hyphens.

Warning

The bucket name must be unique across all Google Cloud projects and users. For example, use mlops-<surname>-bucket, where surname is based on your name. Change the bucket name if the command fails.

Execute the following command(s) in a terminal
# Export the bucket name
export GCP_BUCKET_NAME=<my_bucket_name>

Export the bucket location as an environment variable. You can view the available locations at Cloud locations. You should ideally select a location close to where most of the expected traffic will come from. Replace <my_bucket_location> with your own location. For example, use europe-west6 for Switzerland (Zurich):

Execute the following command(s) in a terminal
# Export the bucket location
export GCP_BUCKET_LOCATION=<my_bucket_location>

Create the bucket:

Execute the following command(s) in a terminal
# Create the Google Storage Bucket
gcloud storage buckets create gs://$GCP_BUCKET_NAME \
    --location=$GCP_BUCKET_LOCATION \
    --uniform-bucket-level-access \
    --public-access-prevention
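
If you want to double-check that the bucket was created, you can list the buckets of the project (an optional verification step, assuming the gcloud CLI is still authenticated):

# List the buckets of the project
gcloud storage ls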

You now have everything you need for DVC.

Install the DVC Storage plugin

Install the DVC Storage plugin for the cloud provider:

Here, the dvc[gs] package enables support for Google Cloud Storage. Update the requirements.txt file:

requirements.txt
tensorflow==2.17.0
matplotlib==3.9.2
pyyaml==6.0.2
dvc[gs]==3.53.2

Check the differences with Git to validate the changes:

Execute the following command(s) in a terminal
# Show the differences with Git
git diff requirements.txt

The output should be similar to this:

diff --git a/requirements.txt b/requirements.txt
index 0b88f4a..4b8d3d9 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,4 @@
 tensorflow==2.17.0
 matplotlib==3.9.2
 pyyaml==6.0.2
-dvc==3.53.2
+dvc[gs]==3.53.2

Install the dependencies and update the freeze file:

Warning

Prior to running any pip commands, it is crucial to ensure the virtual environment is activated to avoid potential conflicts with system-wide Python packages.

To check its status, simply run pip -V. If the virtual environment is active, the output will show the path to the virtual environment's Python executable. If it is not, you can activate it with source .venv/bin/activate.
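
For example:

# Check that the virtual environment is active
# (the path should point to the .venv directory)
pip -V

# If it is not active, activate it
source .venv/bin/activate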

Execute the following command(s) in a terminal
# Install the dependencies
pip install --requirement requirements.txt

# Freeze the dependencies
pip freeze --local --all > requirements-freeze.txt

Configure DVC to use the Storage Bucket

Configure DVC to use the Storage Bucket on the cloud provider:

Configure DVC to use a Google Storage remote bucket. The dvcstore is a user-defined path on the bucket. You can change it if needed:

Execute the following command(s) in a terminal
# Add the Google Storage remote bucket
dvc remote add -d data gs://$GCP_BUCKET_NAME/dvcstore
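
After this command, DVC stores the remote configuration in .dvc/config. As a rough sketch (the URL will contain your own bucket name and path), the file should now contain something along these lines:

.dvc/config
[core]
    remote = data
['remote "data"']
    url = gs://<my_bucket_name>/dvcstore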

Check the changes

Check the changes with Git to ensure that all the wanted files are there:

Execute the following command(s) in a terminal
# Add all the available files
git add .

# Check the changes
git status

The output of the git status command should be similar to this:

On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   .dvc/config
    modified:   requirements-freeze.txt
    modified:   requirements.txt

Push the data files to DVC

DVC works like Git: once you want to share the data, you can use dvc push to upload the data and its cache to the storage provider:

Execute the following command(s) in a terminal
# Upload the experiment data and cache to the remote bucket
dvc push
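
If you want to verify what was uploaded, dvc status with the --cloud flag compares the local cache against the remote storage (an optional check, not required by the guide):

# Compare the local cache with the remote storage
dvc status --cloud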

Commit the changes to Git

You can now push the changes to Git so all team members can get the data from DVC as well.

Execute the following command(s) in a terminal
# Commit the changes
git commit -m "My ML experiment data is shared with DVC"

# Push the changes
git push

Check the results

Open the Storage Bucket on the cloud provider and check that the files were hashed and uploaded.

Open Cloud Storage in the Google Cloud Console and click on your bucket to access the details.
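
Alternatively, you can list the bucket contents from the terminal (assuming the GCP_BUCKET_NAME variable is still exported in your shell):

# List the files pushed by DVC to the bucket
gcloud storage ls --recursive gs://$GCP_BUCKET_NAME/dvcstore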

Summary

Congratulations! You now have a dataset that can be used and shared among the team.

In this chapter, you have successfully:

  1. Created a new project on a cloud provider
  2. Installed and configured the cloud provider CLI
  3. Created the Storage Bucket on the cloud provider
  4. Installed the DVC Storage plugin
  5. Configured DVC to use the Storage Bucket
  6. Updated the .gitignore file and added the experiment data to DVC
  7. Pushed the data files to DVC
  8. Committed the changes to Git

You fixed some of the previous issues:

  • Data no longer needs manual download and is placed in the right directory.

Other team members can easily get a copy of the experiment data from DVC with the following command:

Execute the following command(s) in a terminal
# Download experiment data from DVC
dvc pull

With the help of DVC, they can also easily reproduce your experiment and, thanks to caching, only the required steps will be executed:

Execute the following command(s) in a terminal
# Execute the pipeline
dvc repro
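
Putting it all together, a new team member could get a working copy of the experiment with a sequence along these lines (the repository URL and the virtual environment setup are illustrative and depend on your own setup):

# Hypothetical onboarding sequence for a new team member
git clone <repository_url>
cd <repository_name>
python3 -m venv .venv
source .venv/bin/activate
pip install --requirement requirements-freeze.txt
dvc pull
dvc repro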

You can now safely continue to the next chapter.

State of the MLOps process

  • Notebook has been transformed into scripts for production
  • Codebase and dataset are versioned
  • Steps used to create the model are documented and can be re-executed
  • Changes done to a model can be visualized with parameters, metrics and plots to identify differences between iterations
  • Codebase can be shared and improved by multiple developers
  • Dataset can be shared among the developers and is placed in the right directory in order to run the experiment
  • Experiment may not be reproducible on other machines
  • CI/CD pipeline does not report the results of the experiment
  • Changes to model are not thoroughly reviewed and discussed before integration
  • Model may have required artifacts that are forgotten or omitted in saved/loaded state
  • Model cannot be easily used from outside of the experiment context
  • Model requires manual publication to the artifact registry
  • Model is not accessible on the Internet and cannot be used anywhere
  • Model requires manual deployment on the cluster
  • Model cannot be trained on hardware other than the local machine
  • Model cannot be trained on custom hardware for specific use-cases

You will address these issues in the next chapters for improved efficiency and collaboration. Continue the guide to learn how.

Sources

Highly inspired by: