Uploaded image for project: 'Apache YuniKorn'
  1. Apache YuniKorn
  2. YUNIKORN-2682

YuniKorn Gang Scheduling Issue: Executors Failing to Start When Running Multiple Applications

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Workaround
    • 1.3.0
    • None
    • core - scheduler
    • None

    Description

      Description:

      While using YuniKorn's gang scheduling, we encountered a situation where the scheduling process appears to succeed, but in reality, there is a problem. When submitting two applications simultaneously, only the driver pods are successfully running, and the executor pods fail to start due to insufficient resources. The following error is observed in the scheduler logs:

      2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to create placeholder pod {"error": "pods \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed quota: spark-quota: must specify limits.cpu,limits.memory"}
      github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
        github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
      github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
        github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542 

      Environment:

      • YuniKorn version: 1.3.0
      • Kubernetes version: 1.21.3
      • Spark version: 3.2.2

      resource-quota.yaml

      apiVersion: v1
      kind: ResourceQuota
      metadata:
        name: spark-quota
        namespace: spark
      spec:
        hard:
          requests.cpu: "5"
          requests.memory: "5Gi"
          limits.cpu: "5"
          limits.memory: "5Gi" 

      yunikorn-configs.yaml 

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: yunikorn-configs
        namespace: yunikorn
      data:
        log.level: "-1"
        log.admission.level: "-1"
        log.core.config.level: "-1"
        queues.yaml: |
          partitions:
            - name: default
              placementrules:
                - name: tag
                  value: namespace
                  create: true
              queues:
                - name: root
                  submitacl: '*'
                  properties:
                    application.sort.policy: fifo
                    placeholderTimeoutInSeconds: 60
                    schedulingStyle: Hard
                  queues:
                    - name: spark
                      properties:
                        application.sort.policy: fifo
                        placeholderTimeoutInSeconds: 60
                        schedulingStyle: Hard
                      resources:
                        guaranteed:
                          vcore: 5
                          memory: 5Gi
                        max:
                          vcore: 5
                          memory: 5Gi 

      Spark-submit command

      ./bin/spark-submit \
        --master k8s://https://10.10.10.10:6443 \
        --deploy-mode cluster \
        --name spark-pi \
        --conf spark.kubernetes.authenticate.driver.serviceAccountName=sparksa \
        --conf spark.kubernetes.namespace=spark \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.executor.instances=1 \
        --conf spark.executor.cores=1 \
        --conf spark.executor.memory=1500m \
        --conf spark.driver.cores=1 \
        --conf spark.driver.memory=1500m \
        --conf spark.kubernetes.driver.limit.cores=1 \
        --conf spark.kubernetes.driver.limit.memory=2G \
        --conf spark.kubernetes.executor.limit.cores=1 \
        --conf spark.kubernetes.executor.limit.memory=2G \
         --conf spark.kubernetes.driver.label.app=spark \
        --conf spark.kubernetes.executor.label.app=spark \
        --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
        --conf spark.kubernetes.scheduler.name=yunikorn \
        --conf spark.kubernetes.driver.label.queue=root.spark \
        --conf spark.kubernetes.executor.label.queue=root.spark \
        --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
        --conf spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
        --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver \
        --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name": "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector": {"app": "spark"} }, {"name": "spark-executor", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector": {"app": "spark"} }]' \
        --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard' \
        --conf spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor \
        local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
        3000 

       

      scheduler logs

      kubectl  logs yunikorn-scheduler-56f599846b-8fl7d yunikorn-scheduler-k8s -n yunikorn

      2024-06-18T15:15:26.201Z  DEBUG general/general.go:141  pod added {"appType": "general", "Name": "spark-pi-f4f19b902beac663-driver", "Namespace": "spark"}
      2024-06-18T15:15:26.201Z  DEBUG utils/utils.go:305  Unable to retrieve user name from pod labels. Empty user label  {"userLabel": "yunikorn.apache.org/username"}
      2024-06-18T15:15:26.201Z  DEBUG cache/context.go:737  AddApplication  {"Request": {"Metadata":{"ApplicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","QueueName":"root.spark","User":"nobody","Tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }]"},"Groups":null,"TaskGroups":[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}},{"name":"spark-executor","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}}],"OwnerReferences":[{"apiVersion":"v1","kind":"Pod","name":"spark-pi-f4f19b902beac663-driver","uid":"4fb897dd-3af5-4799-a09d-640b5222ba3a","controller":false,"blockOwnerDeletion":true}],"SchedulingPolicyParameters":{},"CreationTime":0}}}
      2024-06-18T15:15:26.201Z  DEBUG cache/context.go:746  app namespace info  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "namespace": "spark"}
      2024-06-18T15:15:26.201Z  INFO  cache/context.go:773  app added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
      2024-06-18T15:15:26.201Z  DEBUG cache/context.go:841  AddTask {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:26.201Z  DEBUG cache/context.go:233  adding pod to cache {"podName": "spark-pi-f4f19b902beac663-driver"}
      2024-06-18T15:15:26.201Z  INFO  cache/context.go:863  task added  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskState": "New"}
      2024-06-18T15:15:26.201Z  INFO  cache/context.go:873  app request originating pod added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "original task": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (AddPod.Pre)  {"nodes": 3, "pods": 55, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":2,"Running":53}}
      2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:411 Putting pod in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (AddPod.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:26.277Z  DEBUG scheduler/scheduler.go:157  inspect outstanding requests
      2024-06-18T15:15:26.924Z  DEBUG cache/application_state.go:508  shim app state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "New", "destination": "Submitted", "event": "SubmitApplication"}
      2024-06-18T15:15:26.924Z  INFO  cache/application.go:424  handle app submission {"app": "applicationID: spark-8e410a4c5ce44da2aa85ba835257a1e9, queue: root.spark, partition: default, totalNumOfTasks: 1, currentState: Submitted", "clusterID": "mycluster"}
      2024-06-18T15:15:26.924Z  DEBUG scheduler/scheduler.go:96 enqueued event  {"eventType": "*rmevent.RMUpdateApplicationEvent", "event": {"Request":{"new":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","queueName":"root.spark","partitionName":"[mycluster]default","ugi":{"user":"nobody"},"tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }]"},"executionTimeoutMilliSeconds":60000,"placeholderAsk":{"resources":{"memory":{"value":4294967296},"pods":{"value":2},"vcore":{"value":2000}}},"gangSchedulingStyle":"Hard"}],"rmID":"mycluster"}}, "currentQueueSize": 0}
      2024-06-18T15:15:26.924Z  DEBUG placement/placement.go:145  Executing rule for placing application  {"ruleName": "tag", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
      2024-06-18T15:15:26.924Z  DEBUG placement/tag_rule.go:106 Tag rule intermediate result  {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": "root.spark"}
      2024-06-18T15:15:26.924Z  INFO  placement/tag_rule.go:115 Tag rule application placed {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": "root.spark"}
      2024-06-18T15:15:26.924Z  DEBUG placement/placement.go:204  Rule result for placing application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queueName": "root.spark"}
      2024-06-18T15:15:26.925Z  INFO  scheduler/context.go:549  Added application to partition  {"applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "partitionName": "[mycluster]default", "requested queue": "root.spark", "placed queue": "root.spark"}
      2024-06-18T15:15:26.925Z  DEBUG rmproxy/rmproxy.go:59 enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": {"RmID":"mycluster","AcceptedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9"}],"RejectedApplications":[],"UpdatedApplications":null}, "currentQueueSize": 0}
      2024-06-18T15:15:26.925Z  DEBUG callback/scheduler_callback.go:108  UpdateApplication callback received {"UpdateApplicationResponse": "accepted:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > "}
      2024-06-18T15:15:26.925Z  DEBUG callback/scheduler_callback.go:114  callback: response to accepted application  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
      2024-06-18T15:15:26.925Z  INFO  callback/scheduler_callback.go:118  Accepting app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
      2024-06-18T15:15:26.925Z  DEBUG cache/application_state.go:508  shim app state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "Submitted", "destination": "Accepted", "event": "AcceptApplication"}
      2024-06-18T15:15:27.277Z  DEBUG scheduler/scheduler.go:157  inspect outstanding requests
      2024-06-18T15:15:27.925Z  DEBUG cache/application.go:516  postAppAccepted on cached app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "numTaskGroups": 2, "numAllocatedTasks": 0}
      2024-06-18T15:15:27.925Z  INFO  cache/application.go:526  app has taskGroups defined, trying to reserve resources for gang members  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
      2024-06-18T15:15:27.925Z  DEBUG cache/application_state.go:508  shim app state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "Accepted", "destination": "Reserving", "event": "TryReserve"}
      2024-06-18T15:15:27.933Z  ERROR cache/placeholder_manager.go:99 failed to create placeholder pod  {"error": "pods \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed quota: spark-quota: must specify limits.cpu,limits.memory"}
      github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
        github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
      github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
        github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542
      2024-06-18T15:15:27.933Z  INFO  cache/placeholder_manager.go:115  start to clean up app placeholders  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
      2024-06-18T15:15:27.933Z  INFO  cache/placeholder_manager.go:128  finished cleaning up app placeholders {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
      2024-06-18T15:15:27.933Z  DEBUG cache/application_state.go:508  shim app state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "Reserving", "destination": "Running", "event": "RunApplication"}
      2024-06-18T15:15:28.278Z  DEBUG scheduler/scheduler.go:157  inspect outstanding requests
      2024-06-18T15:15:28.867Z  DEBUG scheduler/partition_manager.go:83 time consumed for queue cleaner {"duration": "6.41µs"}
      2024-06-18T15:15:28.925Z  INFO  cache/task_state.go:380 Task state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": "spark/spark-pi-f4f19b902beac663-driver", "source": "New", "destination": "Pending", "event": "InitTask"}
      2024-06-18T15:15:28.926Z  INFO  cache/task_state.go:380 Task state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": "spark/spark-pi-f4f19b902beac663-driver", "source": "Pending", "destination": "Scheduling", "event": "SubmitTask"}
      2024-06-18T15:15:28.926Z  DEBUG cache/task.go:275 scheduling pod  {"podName": "spark-pi-f4f19b902beac663-driver"}
      2024-06-18T15:15:28.926Z  DEBUG cache/task.go:294 send update request {"request": "asks:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" resourceAsk:<resources:<key:\"memory\" value:<value:1975517184 > > resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" value:<value:1000 > > > maxAllocations:1 tags:<key:\"kubernetes.io/label/app\" value:\"spark\" > tags:<key:\"kubernetes.io/label/queue\" value:\"root.spark\" > tags:<key:\"kubernetes.io/label/spark-app-name\" value:\"spark-pi\" > tags:<key:\"kubernetes.io/label/spark-app-selector\" value:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > tags:<key:\"kubernetes.io/label/spark-role\" value:\"driver\" > tags:<key:\"kubernetes.io/label/spark-version\" value:\"3.3.2\" > tags:<key:\"kubernetes.io/meta/namespace\" value:\"spark\" > tags:<key:\"kubernetes.io/meta/podName\" value:\"spark-pi-f4f19b902beac663-driver\" > taskGroupName:\"spark-driver\" Originator:true preemptionPolicy:<allowPreemptSelf:true allowPreemptOther:true > > rmID:\"mycluster\" "}
      2024-06-18T15:15:28.926Z  DEBUG scheduler/scheduler.go:96 enqueued event  {"eventType": "*rmevent.RMUpdateAllocationEvent", "event": {"Request":{"asks":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","partitionName":"[mycluster]default","resourceAsk":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"maxAllocations":1,"tags":{"kubernetes.io/label/app":"spark","kubernetes.io/label/queue":"root.spark","kubernetes.io/label/spark-app-name":"spark-pi","kubernetes.io/label/spark-app-selector":"spark-8e410a4c5ce44da2aa85ba835257a1e9","kubernetes.io/label/spark-role":"driver","kubernetes.io/label/spark-version":"3.3.2","kubernetes.io/meta/namespace":"spark","kubernetes.io/meta/podName":"spark-pi-f4f19b902beac663-driver"},"taskGroupName":"spark-driver","Originator":true,"preemptionPolicy":{"allowPreemptSelf":true,"allowPreemptOther":true}}],"rmID":"mycluster"}}, "currentQueueSize": 0}
      2024-06-18T15:15:28.926Z  INFO  objects/application_state.go:133  Application state transition  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "New", "destination": "Accepted", "event": "runApplication"}
      2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Accepted","stateTransitionTimestamp":1718723728926391633,"message":"runApplication"}]}, "currentQueueSize": 0}
      2024-06-18T15:15:28.926Z  INFO  objects/application.go:668  ask added successfully to application {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "ask": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "placeholder": false, "pendingDelta": "map[memory:1975517184 pods:1 vcore:1000]"}
      2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:108  UpdateApplication callback received {"UpdateApplicationResponse": "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" state:\"Accepted\" stateTransitionTimestamp:1718723728926391633 message:\"runApplication\" > "}
      2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:137  status update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "new status": "Accepted"}
      2024-06-18T15:15:28.926Z  DEBUG objects/application.go:339  Application state timer initiated {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "state": "Starting", "timeout": "5m0s"}
      2024-06-18T15:15:28.926Z  INFO  objects/application_state.go:133  Application state transition  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "Accepted", "destination": "Starting", "event": "runApplication"}
      2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Starting","stateTransitionTimestamp":1718723728926635454,"message":"runApplication"}]}, "currentQueueSize": 0}
      2024-06-18T15:15:28.926Z  DEBUG ugm/manager.go:63 Increasing resource usage {"user": "nobody", "queue path": "root.spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"}
      2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue tracker object for queue {"queue": "root"}
      2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource usage {"queue path": "root.spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"}
      2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue tracker object for queue {"queue": "spark"}
      2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource usage {"queue path": "spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"}
      2024-06-18T15:15:28.926Z  DEBUG ugm/manager.go:257  Group tracker does not exist. Creating group tracker object and linking the same with application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue path": "root.spark", "user": "nobody", "group": "nobody"}
      2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue tracker object for queue {"queue": "root"}
      2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource usage {"queue path": "root.spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"}
      2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue tracker object for queue {"queue": "spark"}
      2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource usage {"queue path": "spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"}
      2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:108  UpdateApplication callback received {"UpdateApplicationResponse": "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" state:\"Starting\" stateTransitionTimestamp:1718723728926635454 message:\"runApplication\" > "}
      2024-06-18T15:15:28.926Z  DEBUG objects/queue.go:1239 allocation found on queue {"queueName": "root.spark", "appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "allocation": "applicationID=spark-8e410a4c5ce44da2aa85ba835257a1e9, uuid=3e42c4e0-0fc3-466c-a524-444b2eec700b, allocationKey=4fb897dd-3af5-4799-a09d-640b5222ba3a, Node=10.10.10.66, result=Allocated"}
      2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:137  status update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "new status": "Starting"}
      2024-06-18T15:15:28.926Z  INFO  scheduler/partition.go:888  scheduler allocation processed  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "uuid": "3e42c4e0-0fc3-466c-a524-444b2eec700b", "allocatedResource": "map[memory:1975517184 pods:1 vcore:1000]", "placeholder": false, "targetNode": "10.10.10.66"}
      2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event {"eventType": "*rmevent.RMNewAllocationsEvent", "event": {"RmID":"mycluster","Allocations":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","UUID":"3e42c4e0-0fc3-466c-a524-444b2eec700b","resourcePerAlloc":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"nodeID":"10.10.10.66","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","taskGroupName":"spark-driver"}]}, "currentQueueSize": 0}
      2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:48 UpdateAllocation callback received  {"UpdateAllocationResponse": "new:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" UUID:\"3e42c4e0-0fc3-466c-a524-444b2eec700b\" resourcePerAlloc:<resources:<key:\"memory\" value:<value:1975517184 > > resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" value:<value:1000 > > > nodeID:\"10.10.10.66\" applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" taskGroupName:\"spark-driver\" > "}
      2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:53 callback: response to new allocation  {"allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "UUID": "3e42c4e0-0fc3-466c-a524-444b2eec700b", "applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "nodeID": "10.10.10.66"}
      2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (AssumePod.Pre) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:476 Adding assumed pod to cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "node": "10.10.10.66", "allBound": true}
      2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:411 Putting pod in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (AssumePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:28.926Z  DEBUG scheduler/context.go:853  Successfully synced shim on new allocation. response: no. of allocations: 1
      2024-06-18T15:15:28.926Z  INFO  cache/task_state.go:380 Task state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": "spark/spark-pi-f4f19b902beac663-driver", "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"}
      2024-06-18T15:15:28.926Z  DEBUG cache/task.go:349 bind pod volumes  {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:28.926Z  INFO  cache/context.go:499  Binding Pod Volumes skipped: all volumes already bound  {"podName": "spark-pi-f4f19b902beac663-driver"}
      2024-06-18T15:15:28.926Z  DEBUG cache/task.go:362 bind pod  {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:28.926Z  INFO  client/kubeclient.go:112  bind pod to node  {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "nodeID": "10.10.10.66"}
      2024-06-18T15:15:28.933Z  INFO  cache/task.go:375 successfully bound pod  {"podName": "spark-pi-f4f19b902beac663-driver"}
      2024-06-18T15:15:28.934Z  INFO  cache/task_state.go:380 Task state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": "spark/spark-pi-f4f19b902beac663-driver", "source": "Allocated", "destination": "Bound", "event": "TaskBound"}
      2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:411 Putting pod in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:411 Putting pod in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:29.278Z  DEBUG scheduler/scheduler.go:157  inspect outstanding requests
      2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:411 Putting pod in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}}
      2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:411 Putting pod in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
      2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":2,"Running":54}}
      2024-06-18T15:15:30.279Z  DEBUG scheduler/scheduler.go:157  inspect outstanding requests
      2024-06-18T15:15:31.279Z  DEBUG scheduler/scheduler.go:157  inspect outstanding requests
      2024-06-18T15:15:32.280Z  DEBUG scheduler/scheduler.go:157  inspect outstanding requests
      2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdateNode.Pre)  {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":2,"Running":54}}
      2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:179 Updating node in cache  {"nodeName": "10.10.10.66"}
      2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdateNode.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":2,"Running":54}}
      2024-06-18T15:15:32.395Z  DEBUG cache/node.go:109 set node capacity {"nodeID": "10.10.10.66", "capacity": "resources:<key:\"ephemeral-storage\" value:<value:478652923105 > > resources:<key:\"hugepages-1Gi\" value:<> > resources:<key:\"hugepages-2Mi\" value:<> > resources:<key:\"kubernetes.io/batch-cpu\" value:<value:5167 > > resources:<key:\"kubernetes.io/batch-memory\" value:<value:34437364710 > > resources:<key:\"kubernetes.io/mid-cpu\" value:<value:3156 > > resources:<key:\"kubernetes.io/mid-memory\" value:<value:1442139746 > > resources:<key:\"memory\" value:<value:64713940992 > > resources:<key:\"pods\" value:<value:110 > > resources:<key:\"vcore\" value:<value:14000 > > "}
      2024-06-18T15:15:32.396Z  INFO  cache/nodes.go:179  Node's ready status flag  {"Node name": "10.10.10.66", "ready": true}
      2024-06-18T15:15:32.396Z  INFO  cache/nodes.go:184  report updated nodes to scheduler {"request": {"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}}
      2024-06-18T15:15:32.396Z  DEBUG scheduler/scheduler.go:96 enqueued event  {"eventType": "*rmevent.RMUpdateNodeEvent", "event": {"Request":{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true","si/node-partition":"[mycluster]default"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}}, "currentQueueSize": 0}
      2024-06-18T15:15:32.396Z  INFO  objects/queue.go:1190 updating root queue max resources {"current max": "map[ephemeral-storage:1435958769315 hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19610 kubernetes.io/batch-memory:109317941106 kubernetes.io/mid-cpu:9574 kubernetes.io/mid-memory:5209251314 memory:194141839360 pods:330 vcore:42000]", "new max": "map[ephemeral-storage:1435958769315 hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19309 kubernetes.io/batch-memory:109530403859 kubernetes.io/mid-cpu:10574 kubernetes.io/mid-memory:6651391060 memory:194141839360 pods:330 vcore:42000]"}
      2024-06-18T15:15:32.396Z  DEBUG objects/node.go:182 skip updating occupiedResource, not changed 

      Spark Pod status

      kubectl get pod -n spark|grep Running

      kubectl describe pod spark-pi-xxxxxxxx-driver -n spark

       
       

       

      Attachments

        1. image-2024-06-19-00-03-09-703.png
          388 kB
          huangzhir
        2. image-2024-06-19-00-02-53-178.png
          188 kB
          huangzhir

        Issue Links

          Activity

            People

              ccondit Craig Condit
              huangzhir huangzhir
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: