Chapter 1.3 - Initialize Git and DVC for local training
Introduction
Now that you have a good understanding of the experiment, it's time to improve the code and data sharing process. To share the codebase, you will create a Git repository.
However, when it comes to managing large files, Git has some limitations. Although Git LFS is an option for handling large files in Git repositories, it may not be the most efficient solution.
This is why you will use DVC, a version control system designed for sharing data that integrates well with Git. DVC keeps large files in its own cache, outside the Git history, and tracks their changes through small metadata files that you commit to Git.
In this chapter, you will learn how to:
- Set up a new Git repository
- Initialize Git in your project directory
- Verify Git tracking for your files
- Exclude experiment results, data, models and Python environment files from Git commits
- Commit your changes to the Git repository
- Install DVC
- Initialize and configure DVC
- Update the gitignore file and add the experiment data to DVC
- Push the data files to DVC
- Commit the metadata files to Git
The following diagram illustrates the control flow of the experiment at the end of this chapter:
flowchart TB
dot_dvc[(.dvc)]
dot_git[(.git)]
data[data/raw] <-.-> dot_dvc
workspaceGraph <-....-> dot_git
subgraph cacheGraph[CACHE]
dot_dvc
dot_git
end
subgraph workspaceGraph[WORKSPACE]
data --> prepare
prepare[prepare.py] --> train
train[train.py] --> evaluate[evaluate.py]
params[params.yaml] -.- prepare
params -.- train
end
style workspaceGraph opacity:0.4,color:#7f7f7f80
style prepare opacity:0.4,color:#7f7f7f80
style train opacity:0.4,color:#7f7f7f80
style evaluate opacity:0.4,color:#7f7f7f80
style params opacity:0.4,color:#7f7f7f80
linkStyle 2 opacity:0.4,color:#7f7f7f80
linkStyle 3 opacity:0.4,color:#7f7f7f80
linkStyle 4 opacity:0.4,color:#7f7f7f80
linkStyle 5 opacity:0.4,color:#7f7f7f80
linkStyle 6 opacity:0.4,color:#7f7f7f80
In future chapters, you will improve the code sharing process by setting up remote Git and DVC repositories to enable easy collaboration with the rest of the team.
Steps
Create a new Git repository
Initialize Git in your working directory
Use the following command to set up a local Git repository in your working directory:
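A minimal sketch of this step, assuming your terminal is already in the project's working directory:

```shell
# Create an empty Git repository (a .git directory) in the current directory.
git init
```

Running `git init` again later is harmless: Git reports that it reinitialized the existing repository and leaves your files untouched.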
Check if Git tracks your files
Now that Git is initialized in your working directory, list the files available for committing with this command:
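The check can be sketched as follows, assuming you run it from the project root. The first line is only a safeguard in case the repository from the previous step is missing:

```shell
# Create the repository only if the current directory is not inside one yet.
git rev-parse --git-dir >/dev/null 2>&1 || git init --quiet

# List untracked, modified, and staged files.
git status
```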
The output should be similar to this:
As you can see, no files have been added to Git yet.
Create a .gitignore file
Create a .gitignore file to exclude the data, the models, and the Python environment, which keeps the repository small and quick to clone. The data and models will be managed by DVC in the next chapters. Keep the model's evaluation: it takes little space and gives you a history of the improvements made to your model:
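One possible version of this file, written from the terminal. The `data/` and `.venv/` entries follow the paths used in this guide; `model/` and `__pycache__/` are assumed names for the model directory and Python's bytecode cache, so adjust them to your project:

```shell
# Write a fresh .gitignore. Note that this overwrites any existing file.
cat > .gitignore <<'EOF'
# Data used to train the models
data/

# Trained models (assumed directory name)
model/

# Python virtual environment and bytecode caches
.venv/
__pycache__/
EOF
```

A quoted `'EOF'` delimiter keeps the shell from expanding anything inside the block, so the patterns are written exactly as shown.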
Info
If using macOS, you might want to ignore .DS_Store files as well to avoid pushing Apple's metadata files to your repository.
Check the changes
Check the changes with Git to ensure all wanted files are here with the following commands:
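A sketch of these commands, assuming you are in the project root. The first line is a safeguard that only acts if no repository exists yet:

```shell
# Create the repository only if the current directory is not inside one yet.
git rev-parse --git-dir >/dev/null 2>&1 || git init --quiet

# Stage every file that .gitignore does not exclude.
git add .

# Review what is staged before committing.
git status
```

Staging first makes `git status` list each file individually, which makes it easier to spot anything that should have been ignored.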
The output of the git status command should be similar to this:
Commit the changes
Commit the changes to Git:
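A possible form of this step, assuming you are in the project root. The commit message is only an example, and Git also needs an author identity (`user.name` and `user.email`) to be configured before it accepts a commit:

```shell
# Create the repository only if the current directory is not inside one yet.
git rev-parse --git-dir >/dev/null 2>&1 || git init --quiet

# Stage every file that .gitignore does not exclude.
git add .

# Commit only when something is staged; the message is an example.
git diff --cached --quiet || git commit -m "Version the experiment code with Git"
```

The `git diff --cached --quiet` check exits successfully when nothing is staged, so the commit is skipped instead of failing on an empty change set.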
Create a DVC repository
Install DVC
Add the main dvc dependency to the requirements.txt file:
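As a sketch, the dependency can be appended from the terminal. The entry is left unpinned here because the version used by this guide is not stated; pin it (for example with `dvc==<version>`) if you need reproducible installs:

```shell
# Append the DVC dependency to requirements.txt.
# Assumes the file, if it already exists, ends with a newline.
echo "dvc" >> requirements.txt
```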
Check the differences with Git to validate the changes:
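One way to run this check, assuming you are in the project root of the Git repository:

```shell
# Create the repository only if the current directory is not inside one yet.
git rev-parse --git-dir >/dev/null 2>&1 || git init --quiet

# Show unstaged changes to requirements.txt compared with the last staged or
# committed version. The "--" separates the path from any revision arguments.
git diff -- requirements.txt
```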
The output should be similar to this:
Install the dependencies and update the freeze file:
Warning
Prior to running any pip commands, it is crucial to ensure the virtual environment is activated to avoid potential conflicts with system-wide Python packages.
To check its status, simply run pip -V. If the virtual environment is active, the output will show a path inside the virtual environment's directory. If it is not, you can activate it with source .venv/bin/activate.
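A sketch of both commands, assuming the virtual environment is active. The name `requirements-freeze.txt` is an assumption for the freeze file; reuse whatever name your project already has:

```shell
# Install everything listed in requirements.txt into the active environment.
pip install --requirement requirements.txt

# Record the exact versions that were installed.
# --local skips globally installed packages; --all also lists pip and setuptools.
pip freeze --local --all > requirements-freeze.txt
```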
Initialize DVC
Initialize DVC in the current project.
The dvc init command creates a .dvc directory in the working directory, which serves as the configuration directory for DVC.
Update the .gitignore file and add the experiment data to DVC
With DVC now set up, you can begin adding files to it.
Try to add the experiment data. Spoiler, it will fail:
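A sketch of the attempt, assuming the data lives in `data/raw` and `.gitignore` still contains the `data/` rule. The fallback message is an illustrative addition so that the failure is easy to recognize:

```shell
# Expected to fail for now: the data/ rule in .gitignore also hides the
# metadata file that DVC tries to write inside the data directory.
dvc add data/raw || echo "dvc add failed, as expected at this stage"
```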
When executing this command, the following output occurs:
You will have to update the .gitignore file so that DVC can create files in the data directory. However, you still don't want the directories data/raw and data/prepared to be added to Git.
Update the .gitignore file by changing data/ to data/raw/ and data/prepared/:
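The updated file could look like this when rewritten from the terminal. Only the two `data/...` lines come from the step above; `model/` and `__pycache__/` are assumed names, so keep whatever other entries your own file already has:

```shell
# Rewrite .gitignore with the narrower data rules (overwrites the file).
cat > .gitignore <<'EOF'
# Data used to train the models
data/raw/
data/prepared/

# Trained models (assumed directory name)
model/

# Python virtual environment and bytecode caches
.venv/
__pycache__/
EOF
```

With these rules Git still ignores the contents of both data directories, while other files that DVC writes inside `data` remain visible to Git.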
Info
If using macOS, you might want to ignore .DS_Store files as well to avoid pushing Apple's metadata files to your repository.
Check the differences with Git to validate the changes:
The output should be similar to this:
You can now add the experiment data to DVC without complaint:
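A sketch of the command, assuming the data lives in `data/raw`, DVC is initialized, and `.gitignore` no longer contains the broad `data/` rule:

```shell
# Track the raw data directory with DVC. DVC copies or links the files into
# its cache and writes a small metadata file for Git to version.
dvc add data/raw
```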
The output should be similar to this. You can safely ignore the message:
The effect of the dvc add command is to create a data/raw.dvc file and a data/.gitignore file. The .dvc file contains the metadata that DVC uses to download the files and check their integrity. The .gitignore file is created so that Git ignores the files in data/raw. The .dvc files must be added to Git.
Various DVC commands will automatically try to update the .gitignore files. If a .gitignore file is already present, it will be updated to include the newly ignored files. You might need to update existing .gitignore files accordingly.
Check the changes
Check the changes with Git to ensure all wanted files are here.
The output of the git status command should be similar to this:
Commit the changes to Git
You can now commit the changes to Git so the data from DVC is tracked alongside code changes as well.
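A possible form of this commit, assuming you are in the project root and your Git author identity is configured. The commit message is only an example:

```shell
# Create the repository only if the current directory is not inside one yet.
git rev-parse --git-dir >/dev/null 2>&1 || git init --quiet

# Stage the DVC metadata and the updated .gitignore files; the data itself
# stays out of Git because it is ignored.
git add .

# Commit only when something is staged; the message is an example.
git diff --cached --quiet || git commit -m "Version the experiment data with DVC"
```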
This chapter is done; you can check the summary.
Summary
Congratulations! You now have a codebase and a dataset that is versioned with Git and DVC. At the moment, these tools are only used locally. In the next chapters, you will learn how to share the codebase and the dataset with the rest of the team.
In this chapter, you have successfully:
- Set up a new Git repository
- Initialized Git in your project directory
- Verified Git tracking for your files
- Excluded experiment results, data, models and Python environment files from Git commits
- Committed your changes to the Git repository
- Installed DVC
- Initialized DVC
- Updated the gitignore file and added the experiment data to DVC
- Committed the data files to DVC
- Committed your changes to the Git repository
You fixed some of the previous issues:
- Data no longer needs manual download and is placed in the right directory.
- Codebase is versioned
You can now safely continue to the next chapter.
State of the MLOps process
- Notebook has been transformed into scripts for production
- Codebase and dataset are versioned
- Model steps rely on verbal communication and may be undocumented
- Changes to model are not easily visualized
- Codebase requires manual download and setup
- Dataset requires manual download and placement
- Experiment may not be reproducible on other machines
- CI/CD pipeline does not report the results of the experiment
- Changes to model are not thoroughly reviewed and discussed before integration
- Model may have required artifacts that are forgotten or omitted in saved/loaded state
- Model cannot be easily used from outside of the experiment context
- Model requires manual publication to the artifact registry
- Model is not accessible on the Internet and cannot be used anywhere
- Model requires manual deployment on the cluster
- Model cannot be trained on hardware other than the local machine
- Model cannot be trained on custom hardware for specific use-cases
You will address these issues in the next chapters for improved efficiency and collaboration. Continue the guide to learn how.