XGBoost Training (XGBoostJob)

Using XGBoostJob to train a model with XGBoost

This page describes XGBoostJob for training a machine learning model with XGBoost.

XGBoostJob is a Kubernetes custom resource to run XGBoost training jobs on Kubernetes. The Kubeflow implementation of XGBoostJob is in training-operator.

Note: XGBoostJob doesn’t work in a user namespace by default because of Istio automatic sidecar injection. In order to get it running, it needs annotation sidecar.istio.io/inject: "false" to disable it for either PyTorchJob pods or namespace. To view an example of how to add this annotation to your yaml file, see the XGBoostJob documentation.

Creating a XGBoost training job

You can create a training job by defining a XGboostJob config file. See the manifests for the IRIS example. You may change the config file based on your requirements. E.g.: add CleanPodPolicy in Spec to None to retain pods after job termination.

cat xgboostjob.yaml

Deploy the XGBoostJob resource to start training:

kubectl create -f xgboostjob.yaml

You should now be able to see the created pods matching the specified number of replicas.

kubectl get pods -l job-name=xgboost-dist-iris-test-train

Training takes 5-10 minutes on a cpu cluster. Logs can be inspected to see its training progress.

PODNAME=$(kubectl get pods -l job-name=xgboost-dist-iris-test-train,replica-type=master,replica-index=0 -o name)
kubectl logs -f ${PODNAME}

Monitoring a XGBoostJob

kubectl get -o yaml xgboostjobs xgboost-dist-iris-test-train

See the status section to monitor the job status. Here is sample output when the job is successfully completed.

apiVersion: kubeflow.org/v1
kind: XGBoostJob
metadata:
  creationTimestamp: "2021-09-06T18:34:06Z"
  generation: 1
  name: xgboost-dist-iris-test-train
  namespace: default
  resourceVersion: "5844304"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/xgboostjobs/xgboost-dist-iris-test-train
  uid: a1ea6675-3cb5-482b-95dd-68b2c99b8adc
spec:
  runPolicy:
    cleanPodPolicy: None
  xgbReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - args:
                - --job_type=Train
                - --xgboost_parameter=objective:multi:softprob,num_class:3
                - --n_estimators=10
                - --learning_rate=0.1
                - --model_path=/tmp/xgboost-model
                - --model_storage_type=local
              image: docker.io/merlintang/xgboost-dist-iris:1.1
              imagePullPolicy: Always
              name: xgboost
              ports:
                - containerPort: 9991
                  name: xgboostjob-port
                  protocol: TCP
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      template:
        spec:
          containers:
            - args:
                - --job_type=Train
                - --xgboost_parameter="objective:multi:softprob,num_class:3"
                - --n_estimators=10
                - --learning_rate=0.1
              image: docker.io/merlintang/xgboost-dist-iris:1.1
              imagePullPolicy: Always
              name: xgboost
              ports:
                - containerPort: 9991
                  name: xgboostjob-port
                  protocol: TCP
status:
  completionTime: "2021-09-06T18:34:23Z"
  conditions:
    - lastTransitionTime: "2021-09-06T18:34:06Z"
      lastUpdateTime: "2021-09-06T18:34:06Z"
      message: xgboostJob xgboost-dist-iris-test-train is created.
      reason: XGBoostJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2021-09-06T18:34:06Z"
      lastUpdateTime: "2021-09-06T18:34:06Z"
      message: XGBoostJob xgboost-dist-iris-test-train is running.
      reason: XGBoostJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2021-09-06T18:34:23Z"
      lastUpdateTime: "2021-09-06T18:34:23Z"
      message: XGBoostJob xgboost-dist-iris-test-train is successfully completed.
      reason: XGBoostJobSucceeded
      status: "True"
      type: Succeeded
  replicaStatuses:
    Master:
      succeeded: 1
    Worker:
      succeeded: 2

Feedback

Was this page helpful?

Glad to hear it! Please tell us how we can improve.

Sorry to hear that. Please tell us how we can improve.

Last modified April 26, 2024: Training: Reorganized Training Operator Docs (#3719) (8fe2bd2)