Chapter 3.8 - Train the model on a Kubernetes pod¶
Introduction¶
Warning
This chapter is a work in progress. It focuses for now solely on GitHub. Please check back later for updates specific to using GitLab.
Thank you!
You can now train your model on the cluster. However, some experiments may require specific hardware to run. For instance, training a deep learning model might require a GPU. This GPU could be shared among multiple teams for different purposes, so it is important to avoid monopolizing its use.
In such situations, you can use a specialized Kubernetes pod for on-demand model training.
In this chapter, you will learn how to:
- Adjust the self-hosted runner to create a specialized on-demand pod within the Kubernetes cluster
- Start the model training from your CI/CD pipeline using the specialized pod in the Kubernetes cluster
The following diagram illustrates the control flow of the experiment at the end of this chapter:
flowchart TB
dot_dvc[(.dvc)] <-->|dvc pull
dvc push| s3_storage[(S3 Storage)]
dot_git[(.git)] <-->|git pull
git push| repository[(Repository)]
workspaceGraph <-....-> dot_git
data[data/raw]
subgraph cacheGraph[CACHE]
dot_dvc
dot_git
end
subgraph workspaceGraph[WORKSPACE]
data --> code[*.py]
subgraph dvcGraph["dvc.yaml"]
code
end
params[params.yaml] -.- code
code <--> bento_model[classifier.bentomodel]
subgraph bentoGraph[bentofile.yaml]
bento_model
serve[serve.py] <--> bento_model
end
bento_model <-.-> dot_dvc
end
subgraph remoteGraph[REMOTE]
s3_storage
subgraph gitGraph[Git Remote]
repository[(Repository)] <--> action[Action]
end
action --> |dvc pull
dvc repro
bentoml build
bentoml containerize
docker push|registry
s3_storage -.- |...|repository
subgraph clusterGraph[Kubernetes]
subgraph clusterPodGraph[Kubernetes Pod]
pod_train[Train model] <-.-> k8s_gpu[GPUs]
end
pod_runner[Runner] --> |setup
cleanup|clusterPodGraph
action -->|dvc pull
dvc repro| pod_train
bento_service_cluster[classifier.bentomodel] --> k8s_fastapi[FastAPI]
end
action --> |self-hosted|pod_runner
pod_train -->|cml publish| action
pod_train -->|dvc push| s3_storage
registry[(Container
registry)] --> bento_service_cluster
action --> |kubectl apply|bento_service_cluster
end
subgraph browserGraph[BROWSER]
k8s_fastapi <--> publicURL["public URL"]
end
style workspaceGraph opacity:0.4,color:#7f7f7f80
style dvcGraph opacity:0.4,color:#7f7f7f80
style cacheGraph opacity:0.4,color:#7f7f7f80
style data opacity:0.4,color:#7f7f7f80
style dot_git opacity:0.4,color:#7f7f7f80
style dot_dvc opacity:0.4,color:#7f7f7f80
style code opacity:0.4,color:#7f7f7f80
style bentoGraph opacity:0.4,color:#7f7f7f80
style serve opacity:0.4,color:#7f7f7f80
style bento_model opacity:0.4,color:#7f7f7f80
style params opacity:0.4,color:#7f7f7f80
style remoteGraph opacity:0.4,color:#7f7f7f80
style gitGraph opacity:0.4,color:#7f7f7f80
style repository opacity:0.4,color:#7f7f7f80
style bento_service_cluster opacity:0.4,color:#7f7f7f80
style registry opacity:0.4,color:#7f7f7f80
style clusterGraph opacity:0.4,color:#7f7f7f80
style k8s_fastapi opacity:0.4,color:#7f7f7f80
style browserGraph opacity:0.4,color:#7f7f7f80
style publicURL opacity:0.4,color:#7f7f7f80
linkStyle 0 opacity:0.4,color:#7f7f7f80
linkStyle 1 opacity:0.4,color:#7f7f7f80
linkStyle 2 opacity:0.4,color:#7f7f7f80
linkStyle 3 opacity:0.4,color:#7f7f7f80
linkStyle 4 opacity:0.4,color:#7f7f7f80
linkStyle 5 opacity:0.4,color:#7f7f7f80
linkStyle 6 opacity:0.4,color:#7f7f7f80
linkStyle 7 opacity:0.4,color:#7f7f7f80
linkStyle 8 opacity:0.4,color:#7f7f7f80
linkStyle 9 opacity:0.4,color:#7f7f7f80
linkStyle 10 opacity:0.0
linkStyle 14 opacity:0.4,color:#7f7f7f80
linkStyle 18 opacity:0.4,color:#7f7f7f80
linkStyle 19 opacity:0.4,color:#7f7f7f80
linkStyle 20 opacity:0.4,color:#7f7f7f80
Steps¶
Identify the specialized node¶
The cluster consists of two nodes. For demonstration purposes, let's assume that one node is equipped with a GPU while the other is not. You will need to identify which node has the specialized hardware required for training the model. This can be achieved by assigning a label to the nodes.
Note
For our small experiment, there is actually no need to have a GPU to train the model. This is done solely for demonstration purposes. In a real-life production setup with a larger machine learning experiment, however, training with a GPU is likely to be a strong requirement due to the increased computational demands and the need for faster processing times.
Display the node names and labels¶
Display the nodes with the following command.
Execute the following command(s) in a terminal:
The output should be similar to this:
As noticed, you have two nodes in your cluster with their labels.
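The listing step can be sketched as follows. This is a minimal sketch that relies on the `kubectl get nodes --show-labels` command mentioned later in this chapter; the function name `show_node_labels` is a hypothetical helper, introduced only so the command can be re-run after labeling.

```shell
# List every node together with its labels.
# show_node_labels is a hypothetical helper name, not part of kubectl.
show_node_labels() {
    kubectl get nodes --show-labels
}

# Usage (requires kubectl to be configured for your cluster):
#   show_node_labels
```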
Export the names of the two nodes as environment variables. Replace the <my_node_1_name> and <my_node_2_name> placeholders with the names of your nodes (gke-mlops-surname-cluster-default-pool-d4f966ea-8rbn and gke-mlops-surname-cluster-default-pool-d4f966ea-p7qm in this example).
Execute the following command(s) in a terminal:
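As an illustration, the export could look like the sketch below. The variable names `K8S_NODE_1_NAME` and `K8S_NODE_2_NAME` are assumptions chosen for this example; the values are the example node names shown above and must be replaced with your own.

```shell
# Hypothetical variable names; the values are the example node names
# from this chapter and must be replaced with the names of your nodes.
export K8S_NODE_1_NAME="gke-mlops-surname-cluster-default-pool-d4f966ea-8rbn"
export K8S_NODE_2_NAME="gke-mlops-surname-cluster-default-pool-d4f966ea-p7qm"

# Print both variables to confirm they are set before using them.
echo "node 1: ${K8S_NODE_1_NAME}"
echo "node 2: ${K8S_NODE_2_NAME}"
```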
Label the nodes¶
You can now label the nodes so that the GPU node can be selected for training the model.
Execute the following command(s) in a terminal:
You can check the labels with the kubectl get nodes --show-labels command. You should see the nodes with the gpu=true / gpu=false labels.
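The labeling step can be sketched as follows. The helper name `label_nodes` is hypothetical, and which of your two nodes plays the GPU role is an assumption you make for the demonstration. `--overwrite` is added so the command can be re-run if a label already exists.

```shell
# label_nodes GPU_NODE CPU_NODE  (hypothetical helper)
# Marks the first node as GPU-capable and the second as not.
label_nodes() {
    gpu_node="$1"
    cpu_node="$2"
    # --overwrite allows re-running the command on an already labeled node.
    kubectl label nodes "$gpu_node" gpu=true --overwrite &&
    kubectl label nodes "$cpu_node" gpu=false --overwrite &&
    # -L adds a column showing the value of the "gpu" label for each node.
    kubectl get nodes -L gpu
}

# Usage, assuming the node names are stored in variables of your choice:
#   label_nodes "$K8S_NODE_1_NAME" "$K8S_NODE_2_NAME"
```

The `-L gpu` listing is a more compact alternative to `--show-labels` when you only care about one label.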
Adjust the self-hosted runner label¶
The existing self-hosted runner will not be used for model training. Instead, it will function as a "base runner," dedicated to monitoring jobs and creating on-demand specialized pods for training the model with GPU support.
To ensure the base runner operates effectively in this role, update its YAML configuration to prevent it from using the GPU-enabled node, as this is not required for its purpose. This change will also help keep the hardware resources available for the training job.
Check the differences with Git to validate the changes:
Execute the following command(s) in a terminal:
The output should be similar to this:
Note the nodeSelector field that will select a node with a gpu=false label.
To update the runner on the Kubernetes cluster, run the following commands:
Execute the following command(s) in a terminal:
The existing pod will be terminated, and a new one will be created with the updated configuration.
Set self-hosted GPU runner¶
You will now create a similar configuration file for the GPU runner. The workflow uses this file exclusively during the train and report step, to create a self-hosted GPU runner dedicated to executing that step.
The runner will use the same custom Docker image that we pushed to the GitHub Container Registry. This image is identified by the label GITHUB_RUNNER_LABEL, which is set to the value gpu-runner.
Create a new file called runner-gpu.yaml in the kubernetes directory with the following content. Replace <my_username> and <my_repository_name> with your own GitHub username and repository name.
Note the nodeSelector field that will select a node with a gpu=true label.
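To illustrate where the nodeSelector field sits, the sketch below prints a minimal Pod manifest. It is not the guide's actual runner-gpu.yaml: the helper name `render_runner_pod`, the pod name, the container name and the image path are hypothetical placeholders. Only the gpu label and the GITHUB_RUNNER_LABEL value gpu-runner come from this chapter. Note that nodeSelector values must be strings, hence the quotes around true.

```shell
# render_runner_pod NAME GPU_VALUE RUNNER_LABEL  (hypothetical helper)
# Prints a minimal Pod manifest pinned to nodes labeled gpu=GPU_VALUE.
# The pod name, container name and image path are illustrative placeholders.
render_runner_pod() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  nodeSelector:
    gpu: "$2"
  containers:
    - name: runner
      image: ghcr.io/<my_username>/<my_repository_name>/github-runner:latest
      env:
        - name: GITHUB_RUNNER_LABEL
          value: "$3"
EOF
}

render_runner_pod github-runner-gpu true gpu-runner
```

Piping the output to `kubectl apply --dry-run=client -f -` is one way to check that the manifest parses without creating anything on the cluster.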
Add Kubeconfig secret¶
To enable the GPU runner to access the cluster, authentication is required. To obtain the credentials for your Google Cloud Kubernetes cluster, you can execute the following command to set up your kubeconfig file (~/.kube/config) with the necessary credentials:
Execute the following command(s) in a terminal:
This updates the kubeconfig file (~/.kube/config) used by kubectl with the necessary information to connect to your Google Cloud Kubernetes cluster.
The relevant section of the kubeconfig file will look something like this:
Info
If using macOS, make sure the users.user.exec.command parameter is set to gke-gcloud-auth-plugin. The kubeconfig file is generated locally and may point to the Homebrew installation path. However, this configuration will be used in a standard Linux environment when accessing the Kubernetes cluster from the CI/CD pipeline.
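The credential retrieval and inspection can be sketched as follows. The helper name `fetch_kubeconfig` and its two arguments (cluster name and location) are assumptions; use the values you chose when creating the cluster. `--minify` restricts the output to the current context, and `--raw` keeps the credential data instead of redacting it.

```shell
# fetch_kubeconfig CLUSTER_NAME LOCATION  (hypothetical helper)
# Writes the cluster credentials into ~/.kube/config, then prints only the
# entries used by the current context so they can be inspected or copied.
fetch_kubeconfig() {
    gcloud container clusters get-credentials "$1" --location "$2" &&
    kubectl config view --minify --raw
}

# Usage, with your own cluster name and zone or region:
#   fetch_kubeconfig <my_cluster_name> <my_cluster_location> > kubeconfig.yaml
```

Treat the resulting file as a secret and avoid committing it to the repository.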
Add Kubernetes CI/CD secrets¶
Add the Kubernetes secrets to access the Kubernetes cluster from the CI/CD pipeline. Depending on the CI/CD platform you are using, the process will be different:
Create the following new variable by going to the Settings section from the top header of your GitHub repository. Select Secrets and variables > Actions and select New repository secret:
- GCP_K8S_KUBECONFIG: The content of the kubeconfig file of the Kubernetes cluster.
Save the variables by selecting Add secret.
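If you prefer the command line to the web interface, the same secret can likely be created with the GitHub CLI. This is an alternative that this guide does not describe; it assumes `gh` is installed and authenticated, and that the command is run from within your repository clone. The helper name `set_kubeconfig_secret` is hypothetical.

```shell
# set_kubeconfig_secret KUBECONFIG_FILE  (hypothetical helper)
# Stores the content of the file as the GCP_K8S_KUBECONFIG repository secret.
# gh reads the secret value from standard input when no --body is given.
set_kubeconfig_secret() {
    gh secret set GCP_K8S_KUBECONFIG < "$1"
}

# Usage, with a file containing the kubeconfig of your cluster:
#   set_kubeconfig_secret kubeconfig.yaml
```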
No additional secret variables are necessary for the GitLab CI/CD.
This guide has been written with Google Cloud in mind. We are open to contributions to add support for other cloud providers such as Amazon Web Services, Exoscale, Microsoft Azure or Self-hosted Kubernetes but we might not officially support them.
If you want to contribute, please open an issue or a pull request on the GitHub repository. Your help is greatly appreciated!
Update the CI/CD configuration file¶
You'll now update the CI/CD configuration file to start a runner on the Kubernetes cluster. Using the labels defined previously, you'll be able to start the training of the model on the node with the GPU.
Update the .github/workflows/mlops.yaml file.
Take some time to understand the new steps:
.github/workflows/mlops.yaml
Here, the following should be noted:
When creating pull requests:
- the setup-runner job creates a self-hosted GPU runner.
- the train-report job runs on the self-hosted GPU runner. It trains the model and pushes the trained model to the remote bucket with DVC.
- the cleanup-runner job destroys the self-hosted GPU runner that was created. It also guarantees that the GPU runner pod is removed, even if the previous step failed or was manually cancelled.
When merging pull requests:
- the publish-and-deploy job runs on the main runner. It retrieves the model with DVC, then containerizes and deploys the model artifact.
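The pull request flow described above can be sketched in plain shell to make the ordering explicit. This is not the workflow file itself: the function names are hypothetical, and only the kubernetes/runner-gpu.yaml path and the dvc pull, dvc repro and dvc push commands come from this chapter. In a GitHub Actions workflow, the "always clean up" behaviour would typically be expressed with a condition such as `if: always()` on the cleanup job, which is an assumption about how the guide implements it.

```shell
# Path taken from this chapter; the function names are illustrative.
RUNNER_MANIFEST="kubernetes/runner-gpu.yaml"

# setup-runner: create the GPU runner pod on the cluster.
setup_gpu_runner() {
    kubectl apply -f "$RUNNER_MANIFEST"
}

# train-report: fetch the data, reproduce the pipeline, push the trained model.
train_and_push() {
    dvc pull && dvc repro && dvc push
}

# cleanup-runner: remove the pod. --ignore-not-found keeps this step
# successful even when setup never managed to create the pod.
cleanup_gpu_runner() {
    kubectl delete -f "$RUNNER_MANIFEST" --ignore-not-found
}

# Run setup and training, always run cleanup, then report the first failure.
run_training() {
    status=0
    setup_gpu_runner && train_and_push || status=$?
    cleanup_gpu_runner
    return "$status"
}
```

The point of the design is that cleanup does not depend on the outcome of training, so a failed or cancelled run does not leave a pod holding the GPU node.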
Check the differences with Git to validate the changes.
Execute the following command(s) in a terminal:
The output should be similar to this:
Take some time to understand the changes made to the file.
Check the changes¶
Check the changes with Git to ensure that all the necessary files are tracked.
Execute the following command(s) in a terminal:
The output should look like this.
Push the CI/CD pipeline configuration file to Git¶
Push the CI/CD pipeline configuration file to Git.
Execute the following command(s) in a terminal:
Try it out one final time¶
Finally, try to update some parameters of your model to test the training on the specialized Kubernetes pod.
Similarly to what you have done in Chapter 2.5: Work efficiently and collaboratively with Git, create an issue named "Demonstrate model training on kubernetes pod" and a new branch for the issue.
On your machine, check out the new branch.
Update your experiment by editing, for example, the params.yaml file with the following parameters:
params.yaml
You can now commit and push the above changes to trigger a change on the remote repository.
This time, do not execute dvc repro locally but let the cluster pod handle the job for you. Push the changes to the remote repository.
Execute the following command(s) in a terminal:
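A minimal sketch of this step is shown below. The helper name `push_experiment`, the branch name and the commit message are hypothetical; only params.yaml is staged because the pipeline is deliberately not reproduced locally.

```shell
# push_experiment BRANCH MESSAGE  (hypothetical helper)
# Commits the parameter change and pushes the branch to the remote.
push_experiment() {
    git add params.yaml &&
    git commit -m "$2" &&
    git push --set-upstream origin "$1"
}

# Usage, with the branch you created for the issue:
#   push_experiment <my_branch_name> "Update the training parameters"
```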
Check the results¶
On GitHub, you can see the pipeline running on the Actions page.
On GitLab, you can see the pipeline running on the CI/CD > Pipelines page.
The pod should be created on the Kubernetes cluster.
On Google Cloud Console, you can see the pod that has been created on the Kubernetes Engine Workloads page. Open the pod and go to the YAML tab to see the configuration of the pod. You should notice that the pod has been created with the node selector gpu=true and that it has been created on the right node.
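The same check can be done from a terminal. The sketch below assumes you know the name of the GPU runner pod (for instance from `kubectl get pods`); the helper name `show_pod_placement` is hypothetical.

```shell
# show_pod_placement POD_NAME  (hypothetical helper)
# Prints the pod's node selector, then the name of the node it runs on.
show_pod_placement() {
    kubectl get pod "$1" \
        -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.nodeName}{"\n"}'
}

# Usage, while the train-report job is running:
#   show_pod_placement <my_gpu_runner_pod_name>
```

The node name on the second line should match the node you labeled with gpu=true.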
Go back to your GitHub repository.
- Create a pull request and visualize the execution of the CI/CD pipeline. The train-report job will run on the self-hosted runner. It trains the model, and DVC pushes the trained model to the remote bucket.
- Merge the pull request/merge request, then switch back to the main branch and pull the latest changes. The publish-and-deploy job will run on the main runner. It retrieves the model with DVC, then containerizes and deploys the model artifact.
This chapter is done; you can check the summary.
Summary¶
Congratulations! You can now train your model on a custom infrastructure with custom hardware for specific use-cases.
In this chapter, you have successfully:
- Set up a specialized on-demand runner on a pod in Kubernetes
- Trained the model on the specialized pod on the Kubernetes cluster
Destroy the Kubernetes cluster¶
When you are done with the chapter, you can destroy the Kubernetes cluster.
Execute the following command(s) in a terminal:
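As a minimal sketch, the deletion on Google Cloud can be done with `gcloud container clusters delete`. The helper name `destroy_cluster` and its two arguments are assumptions; pass the cluster name and location you used when creating the cluster.

```shell
# destroy_cluster CLUSTER_NAME LOCATION  (hypothetical helper)
# Deletes the GKE cluster. gcloud asks for confirmation before deleting.
destroy_cluster() {
    gcloud container clusters delete "$1" --location "$2"
}

# Usage, with your own cluster name and zone or region:
#   destroy_cluster <my_cluster_name> <my_cluster_location>
```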
Tip
If you need to quickly recreate the cluster after destroying it, here are the steps involved:
- Create the Kubernetes cluster.
- Deploy the containerized model on Kubernetes.
- Identify the specialized node.
- Label the nodes.
- Create the Kubernetes secret for the base runner registration.
- Deploy the base runner.
- Retrieve the Kubernetes cluster credentials.
- Update the Kubernetes GCP_K8S_KUBECONFIG CI/CD secret.
Refer to the previous chapters for the specific commands. Additionally, ensure that all necessary environment variables are correctly defined.
State of the MLOps process¶
- Notebook has been transformed into scripts for production
- Codebase and dataset are versioned
- Steps used to create the model are documented and can be re-executed
- Changes done to a model can be visualized with parameters, metrics and plots to identify differences between iterations
- Codebase can be shared and improved by multiple developers
- Dataset can be shared among the developers and is placed in the right directory in order to run the experiment
- Experiment can be executed on a clean machine with the help of a CI/CD pipeline
- CI/CD pipeline is triggered on pull requests and reports the results of the experiment
- Changes to model can be thoroughly reviewed and discussed before integrating them into the codebase
- Model can be saved and loaded with all required artifacts for future usage
- Model can be easily used outside of the experiment context
- Model publication to the artifact registry is automated
- Model can be accessed from a Kubernetes cluster
- Model is continuously deployed with the CI/CD
- Model can be trained on a custom infrastructure
- Model can be trained on a custom infrastructure with custom hardware for specific use-cases
You can now safely continue to the next chapter of this guide, which concludes your journey and suggests the next things you could do with your model.
Sources¶
Highly inspired by: