Apache Submarine / SUBMARINE-1376

XGBoost experiment pods are deleted after the job finishes, so Submarine cannot get their logs

Details

    • Type: Bug
    • Status: Reopened
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: experiment
    • Labels: None

    Description

      After I submitted an XGBoost experiment with the following JSON, Submarine monitored the status of the task correctly while it was running.
      POST http://127.0.0.1:32080/api/v1/experiment

      {
          "meta": {
              "name": "xgboost-example",
              "tags": [],
              "framework": "Xgboost",
              "cmd": "python /opt/mlkube/main.py --job_type=Train --xgboost_parameter=objective:multi:softprob,num_class:3 --n_estimators=10 --learning_rate=0.1 --model_path=/tmp/xgboost-model --model_storage_type=local",
              "envVars": {}
          },
          "environment": {
              "image": "docker.io/merlintang/xgboost-dist-iris:1.1"
          },
          "spec": {
              "Worker": {
                  "replicas": 2,
                  "resources": "cpu=0.5,nvidia.com/gpu=0,memory=512M"
              },
              "Master": {
                  "replicas": 1,
                  "resources": "cpu=0.5,nvidia.com/gpu=0,memory=512M"
              }
          }
      }
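
For anyone who wants to reproduce this, the request above can be built in Python as follows. This is a minimal sketch, not Submarine's own client: the `Content-Type` header and the helper name `build_experiment_request` are assumptions, and the URL is the local one from this report.

```python
import json
import urllib.request

# Endpoint copied from the report; adjust host and port for your deployment.
SUBMARINE_URL = "http://127.0.0.1:32080/api/v1/experiment"


def build_experiment_request(url: str = SUBMARINE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for the XGBoost experiment."""
    resources = "cpu=0.5,nvidia.com/gpu=0,memory=512M"
    body = {
        "meta": {
            "name": "xgboost-example",
            "tags": [],
            "framework": "Xgboost",
            "cmd": (
                "python /opt/mlkube/main.py --job_type=Train "
                "--xgboost_parameter=objective:multi:softprob,num_class:3 "
                "--n_estimators=10 --learning_rate=0.1 "
                "--model_path=/tmp/xgboost-model --model_storage_type=local"
            ),
            "envVars": {},
        },
        "environment": {"image": "docker.io/merlintang/xgboost-dist-iris:1.1"},
        "spec": {
            "Worker": {"replicas": 2, "resources": resources},
            "Master": {"replicas": 1, "resources": resources},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_experiment_request()
print(request.get_method(), request.full_url)
# To actually submit against a running server:
# urllib.request.urlopen(request)
```

The request is only constructed here, so the snippet runs without a Submarine server.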
      

      However, after the task finished, the training-operator deleted the pods. As a result, Submarine can no longer determine the names of the pods that ran or retrieve the logs of each pod.
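
One setting that may be relevant is the `runPolicy.cleanPodPolicy` field of the Kubeflow job spec, which accepts `All`, `Running`, or `None`. This report does not establish whether Submarine sets that field on the XGBoostJob it generates, or what the default is in training-operator 1.4.0. The sketch below is therefore an assumption for experimentation, not a confirmed fix. The helper `with_clean_pod_policy` is hypothetical, and the manifest is deliberately incomplete (replica specs omitted).

```python
import copy
import json

# Values accepted by the Kubeflow RunPolicy cleanPodPolicy field.
VALID_POLICIES = {"All", "Running", "None"}


def with_clean_pod_policy(job: dict, policy: str = "None") -> dict:
    """Return a copy of a job manifest with spec.runPolicy.cleanPodPolicy set."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown cleanPodPolicy: {policy!r}")
    patched = copy.deepcopy(job)  # leave the caller's manifest untouched
    run_policy = patched.setdefault("spec", {}).setdefault("runPolicy", {})
    run_policy["cleanPodPolicy"] = policy
    return patched


# Partial manifest for illustration only; names are taken from the logs in this report.
job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "XGBoostJob",
    "metadata": {"name": "experiment-1680334381873-0006", "namespace": "submarine"},
    "spec": {},
}

patched = with_clean_pod_policy(job, "None")
print(json.dumps(patched, indent=2))
```

If the hypothesis holds, a job created with `cleanPodPolicy: None` should keep its pods after completion, which would let Submarine read their logs.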

      I checked the training-operator (1.4.0) logs and found the following:

      time="2023-04-01T09:26:31Z" level=info msg="xgboostJob experiment-1680334381873-0006 is created."
      time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: worker-0" job=submarine.experiment-1680334381873-0006 replica-type=worker uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created pod experiment-1680334381873-0006-worker-0" job=.experiment-1680334381873-0006 pod=.experiment-1680334381873-0006-worker-0 uid=
      time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: worker-1" job=submarine.experiment-1680334381873-0006 replica-type=worker uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:26:31.270Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreatePod", "message": "Created pod: experiment-1680334381873-0006-worker-0"}
      time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created pod experiment-1680334381873-0006-worker-1" job=.experiment-1680334381873-0006 pod=.experiment-1680334381873-0006-worker-1 uid=
      time="2023-04-01T09:26:31Z" level=info msg="need to create new service: Worker-0" job=submarine.experiment-1680334381873-0006 replica-type=worker uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:26:31.307Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreatePod", "message": "Created pod: experiment-1680334381873-0006-worker-1"}
      time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created service experiment-1680334381873-0006-worker-0"
      time="2023-04-01T09:26:31Z" level=info msg="need to create new service: Worker-1" job=submarine.experiment-1680334381873-0006 replica-type=worker uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:26:31.344Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreateService", "message": "Created service: experiment-1680334381873-0006-worker-0"}
      time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created service experiment-1680334381873-0006-worker-1"
      time="2023-04-01T09:26:31Z" level=info msg="Need to create new pod: master-0" job=submarine.experiment-1680334381873-0006 replica-type=master uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:26:31.410Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreateService", "message": "Created service: experiment-1680334381873-0006-worker-1"}
      time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created pod experiment-1680334381873-0006-master-0" job=.experiment-1680334381873-0006 pod=.experiment-1680334381873-0006-master-0 uid=
      time="2023-04-01T09:26:31Z" level=info msg="need to create new service: Master-0" job=submarine.experiment-1680334381873-0006 replica-type=master uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:26:31.462Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreatePod", "message": "Created pod: experiment-1680334381873-0006-master-0"}
      time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created service experiment-1680334381873-0006-master-0"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:26:31.487Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40565"}, "reason": "SuccessfulCreateService", "message": "Created service: experiment-1680334381873-0006-master-0"}
      time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:26:31Z" level=error msg="Operation cannot be fulfilled on xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has been modified; please apply your changes to the latest version and try againfailed to update XGBoost Job conditions in the API server" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:26:31.538Z	ERROR	controllers.XGBoostJob	Reconcile XGBoost Job error	{"xgboostjob": "submarine/experiment-1680334381873-0006", "error": "Operation cannot be fulfilled on xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has been modified; please apply your changes to the latest version and try again"}
      2023-04-01T09:26:31.538Z	ERROR	controller-runtime.manager.controller.xgboostjob-controller	Reconciler error	{"name": "experiment-1680334381873-0006", "namespace": "submarine", "error": "Operation cannot be fulfilled on xgboostjobs.kubeflow.org \"experiment-1680334381873-0006\": the object has been modified; please apply your changes to the latest version and try again"}
      time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:26:31Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:31Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=0, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=1, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=1, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=1, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:26:33Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=1, running=1, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
      time="2023-04-01T09:26:33Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is running." job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:27:04Z" level=info msg="Ignoring inactive pod submarine/experiment-1680334381873-0006-master-0 in state Succeeded, deletion time <nil>"
      time="2023-04-01T09:27:04Z" level=info msg="Pod: submarine.experiment-1680334381873-0006-master-0 exited with code 0" job=submarine.experiment-1680334381873-0006 replica-type=master uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:27:04Z" level=info msg="XGBoostJob=experiment-1680334381873-0006, ReplicaType=Master expected=0, running=0, succeeded=1 , failed=0"
      time="2023-04-01T09:27:04Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is successfully completed."
      2023-04-01T09:27:04.010Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40623"}, "reason": "ExitedWithCode", "message": "Pod: submarine.experiment-1680334381873-0006-master-0 exited with code 0"}
      2023-04-01T09:27:04.010Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40623"}, "reason": "XGBoostJobSucceeded", "message": "XGBoostJob experiment-1680334381873-0006 is successfully completed."}
      time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting pod submarine/experiment-1680334381873-0006-worker-1" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:27:04.067Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: experiment-1680334381873-0006-worker-1"}
      time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting service submarine/experiment-1680334381873-0006-worker-1"
      2023-04-01T09:27:04.113Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: experiment-1680334381873-0006-worker-1"}
      time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting pod submarine/experiment-1680334381873-0006-worker-0" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:27:04.145Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: experiment-1680334381873-0006-worker-0"}
      time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting service submarine/experiment-1680334381873-0006-worker-0"
      2023-04-01T09:27:04.162Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: experiment-1680334381873-0006-worker-0"}
      time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting pod submarine/experiment-1680334381873-0006-master-0" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      2023-04-01T09:27:04.175Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: experiment-1680334381873-0006-master-0"}
      time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting service submarine/experiment-1680334381873-0006-master-0"
      2023-04-01T09:27:04.185Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"XGBoostJob","namespace":"submarine","name":"experiment-1680334381873-0006","uid":"20673c7b-e336-4ab0-b584-7453bc6b3234","apiVersion":"kubeflow.org/v1","resourceVersion":"40677"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: experiment-1680334381873-0006-master-0"}
      time="2023-04-01T09:27:04Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:27:04Z" level=info msg="pod submarine/experiment-1680334381873-0006-worker-1 is terminating, skip deleting" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:27:13Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
      time="2023-04-01T09:27:13Z" level=info msg="pod submarine/experiment-1680334381873-0006-worker-1 is terminating, skip deleting" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
      time="2023-04-01T09:27:13Z" level=info msg="Reconciling for job experiment-1680334381873-0006"
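
To make the timeline in these logs easier to read, the pod creation and deletion messages can be extracted with a small script. This is a rough sketch that assumes the `time="..." level=... msg="Controller ... created|deleting pod ..."` line format seen above. The function name `pod_events` is hypothetical, and the sample contains only five lines copied from the excerpt.

```python
import re

# Five lines copied verbatim from the training-operator log excerpt in this report.
LOG_EXCERPT = '''\
time="2023-04-01T09:26:31Z" level=info msg="Controller experiment-1680334381873-0006 created pod experiment-1680334381873-0006-worker-0" job=.experiment-1680334381873-0006 pod=.experiment-1680334381873-0006-worker-0 uid=
time="2023-04-01T09:27:04Z" level=info msg="XGBoostJob experiment-1680334381873-0006 is successfully completed."
time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting pod submarine/experiment-1680334381873-0006-worker-1" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting pod submarine/experiment-1680334381873-0006-worker-0" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
time="2023-04-01T09:27:04Z" level=info msg="Controller experiment-1680334381873-0006 deleting pod submarine/experiment-1680334381873-0006-master-0" job=submarine.experiment-1680334381873-0006 uid=20673c7b-e336-4ab0-b584-7453bc6b3234
'''

# Matches only the "created pod" / "deleting pod" controller messages.
POD_EVENT = re.compile(
    r'^time="(?P<ts>[^"]+)" level=\w+ '
    r'msg="Controller \S+ (?P<action>created|deleting) pod (?P<pod>[^"]+)"',
    re.MULTILINE,
)


def pod_events(log_text: str) -> list[tuple[str, str, str]]:
    """Return (timestamp, action, pod name) for each pod create/delete message."""
    events = []
    for match in POD_EVENT.finditer(log_text):
        # Deletion messages prefix the pod with "namespace/"; keep only the name.
        pod_name = match.group("pod").rsplit("/", 1)[-1]
        events.append((match.group("ts"), match.group("action"), pod_name))
    return events


events = pod_events(LOG_EXCERPT)
for timestamp, action, pod in events:
    print(timestamp, action, pod)
```

On the sample lines, this lists one creation at 09:26:31 and three deletions at 09:27:04, the same second in which the job is reported as successfully completed.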
      

      Attachments

        1. submarine-xgboost-pods.jpg
          85 kB
          cdmikechen

            People

              Assignee: chenxiang cdmikechen
              Reporter: chenxiang cdmikechen