Chapter 2.1 - Move the ML experiment code to the cloud
Introduction
Now that you have configured DVC and can reproduce the experiment, let's set up a remote repository for sharing the code with the team.
By linking your local project to a remote repository on platforms like GitHub or GitLab, you can easily push, pull, and synchronize changes with your team.
The following diagram illustrates the control flow of the experiment at the end of this chapter:
flowchart TB
dot_dvc[(.dvc)]
dot_git[(.git)] <-->|git push
git pull| gitGraph[Git Remote]
workspaceGraph <-....-> dot_git
data[data/raw] <-.-> dot_dvc
subgraph remoteGraph[REMOTE]
subgraph gitGraph[Git Remote]
repository[(Repository)]
end
end
subgraph cacheGraph[CACHE]
dot_dvc
dot_git
end
subgraph workspaceGraph[WORKSPACE]
prepare[prepare.py] <-.-> dot_dvc
train[train.py] <-.-> dot_dvc
evaluate[evaluate.py] <-.-> dot_dvc
data --> prepare
subgraph dvcGraph["dvc.yaml (dvc repro)"]
prepare --> train
train --> evaluate
end
params[params.yaml] -.- prepare
params -.- train
params <-.-> dot_dvc
end
style workspaceGraph opacity:0.4,color:#7f7f7f80
style dvcGraph opacity:0.4,color:#7f7f7f80
style cacheGraph opacity:0.4,color:#7f7f7f80
style dot_dvc opacity:0.4,color:#7f7f7f80
style data opacity:0.4,color:#7f7f7f80
style prepare opacity:0.4,color:#7f7f7f80
style train opacity:0.4,color:#7f7f7f80
style evaluate opacity:0.4,color:#7f7f7f80
style params opacity:0.4,color:#7f7f7f80
linkStyle 1 opacity:0.4,color:#7f7f7f80
linkStyle 2 opacity:0.4,color:#7f7f7f80
linkStyle 3 opacity:0.4,color:#7f7f7f80
linkStyle 4 opacity:0.4,color:#7f7f7f80
linkStyle 5 opacity:0.4,color:#7f7f7f80
linkStyle 6 opacity:0.4,color:#7f7f7f80
linkStyle 7 opacity:0.4,color:#7f7f7f80
linkStyle 8 opacity:0.4,color:#7f7f7f80
linkStyle 9 opacity:0.4,color:#7f7f7f80
linkStyle 10 opacity:0.4,color:#7f7f7f80
linkStyle 11 opacity:0.4,color:#7f7f7f80
Create a remote Git repository
Create a Git repository on your preferred service to collaborate with peers. For example, choose mlops-guide as the repository name.
GitHub
Important
Configure the repository as you wish, but do not check any of the boxes "Add a README file", "Add .gitignore", or "Choose a license". The remote must start empty so that your first push is not rejected.
Create a new GitHub repository for this chapter by accessing https://github.com/new.
GitLab
Important
Configure the repository as you wish, but do not check the box "Initialize repository with a README". The remote must start empty so that your first push is not rejected.
Create a new GitLab blank project for this chapter by accessing https://gitlab.com/projects/new.
Configure Git for the remote branch
Add the remote origin to your local repository. Replace <my_git_repository_url> with the URL of your Git repository. Your Git service should provide these instructions as well.
Execute the following command in a terminal:
git remote add origin <my_git_repository_url>
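To see the effect of this step in isolation, here is a minimal sketch. It runs in a throwaway repository created with mktemp, and the URL is a hypothetical placeholder (an assumption, not a real repository). In your own project, only the git remote add line is needed, with your actual URL.

```shell
set -eu

# Throwaway repository so the sketch is safe to run anywhere (assumption:
# in your real project you are already inside the repository).
demo_dir="$(mktemp -d)"
git init -q "$demo_dir"
cd "$demo_dir"

# Hypothetical placeholder URL: replace it with your repository's URL.
my_git_repository_url="https://example.com/your-username/mlops-guide.git"

# Register the remote under the name "origin".
git remote add origin "$my_git_repository_url"

# List the configured remotes to confirm that the URL was recorded.
git remote -v
```

Adding a remote only records the URL in the local configuration; nothing is contacted or uploaded until you push.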
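Push the changes to Git
Set the remote branch as the upstream of your local branch and push the changes to Git.
Execute the following command in a terminal, replacing <branch> with the name of your current branch (for example, main):
git push --set-upstream origin <branch>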
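The push with upstream tracking can be tried end to end without a hosted service. The sketch below makes several assumptions: a local bare repository stands in for the Git remote, the branch is named main, and the commit identity is a demo value. In your own project, only the git push line is needed.

```shell
set -eu

# Stand-ins (assumptions): a local bare repository plays the role of the
# hosted remote, and a scratch project holds a single demo commit.
demo_dir="$(mktemp -d)"
git init -q --bare "$demo_dir/remote.git"
git init -q "$demo_dir/project"
cd "$demo_dir/project"
echo "demo" > README.md
git add README.md
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Initial commit"
git branch -M main   # assumed branch name
git remote add origin "$demo_dir/remote.git"

# The step itself: push the branch and record origin/main as its upstream.
git push -q --set-upstream origin main

# Show the upstream that a plain `git push` or `git pull` will now use.
# Prints: origin/main
git rev-parse --abbrev-ref --symbolic-full-name "@{upstream}"
```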
After setting the upstream branch, you can simply use git push and git pull without additional arguments to interact with the remote branch.
Check the results
Open your repository on your Git service's website. The files you pushed should now be listed there.
This chapter is now complete. Please review the summary for a recap of the key points.
Summary
Congratulations! You now have a codebase that can be used and shared among the team.
In this chapter, you have successfully:
- Set up a remote Git repository
- Added the remote to your local Git repository
- Pushed your changes to the remote Git repository
You fixed some of the previous issues:
- Codebase no longer needs manual download and is versioned
Another member of your team can now clone the experiment by executing the following command in a terminal:
git clone <my_git_repository_url>
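As a rough illustration of what a teammate gets from cloning, the sketch below clones from a local bare repository that stands in for the hosted remote (an assumption made so the example runs without network access). The seed commit and the demo identity are likewise placeholders. With a real remote, pass your repository URL to git clone instead of the local path.

```shell
set -eu

# Stand-in remote (assumption): a local bare repository with one commit.
demo_dir="$(mktemp -d)"
git init -q "$demo_dir/seed"
cd "$demo_dir/seed"
echo "demo" > README.md
git add README.md
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Initial commit"
git clone -q --bare "$demo_dir/seed" "$demo_dir/remote.git"

# The step itself: a teammate clones the repository into mlops-guide.
cd "$demo_dir"
git clone -q "$demo_dir/remote.git" mlops-guide

# The clone contains the committed files and the full history.
ls mlops-guide
git -C mlops-guide log --oneline
```

A clone already has origin configured and its default branch tracking the remote, so git push and git pull work without extra arguments.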
You can now safely continue to the next chapter.
State of the MLOps process
Resolved so far:
- Notebook has been transformed into scripts for production
- Codebase and dataset are versioned
- Steps used to create the model are documented and can be re-executed
- Changes done to a model can be visualized with parameters, metrics and plots to identify differences between iterations
- Codebase can be shared and improved by multiple developers
Remaining issues:
- Dataset requires manual download and placement
- Experiment may not be reproducible on other machines
- CI/CD pipeline does not report the results of the experiment
- Changes to model are not thoroughly reviewed and discussed before integration
- Model may have required artifacts that are forgotten or omitted in saved/loaded state
- Model cannot be easily used from outside of the experiment context
- Model requires manual publication to the artifact registry
- Model is not accessible on the Internet and cannot be used anywhere
- Model requires manual deployment on the cluster
- Model cannot be trained on hardware other than the local machine
- Model cannot be trained on custom hardware for specific use-cases
You will address these issues in the next chapters for improved efficiency and collaboration. Continue the guide to learn how.
Sources
Highly inspired by: