Apache YuniKorn / YUNIKORN-2 Support Gang Scheduling / YUNIKORN-518

Placeholder manager failed to init during scheduler recovery

    Description

      Name:         yunikorn-scheduler-6577f789d8-vc5cc
      Namespace:    yunikorn
      Priority:     0
      Node:         ip-10-192-153-109.ca-central-1.compute.internal/10.192.153.109
      Start Time:   Tue, 26 Jan 2021 19:17:12 -0800
      Labels:       app=yunikorn
                    component=yunikorn-scheduler
                    pod-template-hash=6577f789d8
                    release=yunikorn
      Annotations:  cni.projectcalico.org/podIP: 100.100.166.78/32
                    cni.projectcalico.org/podIPs: 100.100.166.78/32
                    kubernetes.io/psp: eks.privileged
      Status:       Running
      IP:           100.100.166.78
      IPs:
        IP:           100.100.166.78
      Controlled By:  ReplicaSet/yunikorn-scheduler-6577f789d8
      Containers:
        yunikorn-scheduler-k8s:
          Container ID:   docker://759f2b2f14ba37f46a42cdc59a5c51ed19d442ed717b81ee98d30177b7a184e6
          Image:          <>/cloudera/yunikorn-scheduler:0.10.0-b9
          Image ID:       docker-pullable://<>/cloudera/yunikorn-scheduler@sha256:878300a91cfd3b9d6dc515948afbfab23572a475b0df7006f06480ee06d1aceb
          Port:           9080/TCP
          Host Port:      0/TCP
          State:          Running
            Started:      Tue, 26 Jan 2021 19:18:01 -0800
          Last State:     Terminated
            Reason:       Error
            Exit Code:    1
            Started:      Tue, 26 Jan 2021 19:17:33 -0800
            Finished:     Tue, 26 Jan 2021 19:17:33 -0800
          Ready:          True
          Restart Count:  3
          Limits:
            cpu:     4
            memory:  2Gi
          Requests:
            cpu:     200m
            memory:  1Gi
          Environment:
            NAMESPACE:                                yunikorn (v1:metadata.namespace)
            ADMISSION_CONTROLLER_IMAGE_REGISTRY:      <>/cloudera/yunikorn-admission
            ADMISSION_CONTROLLER_IMAGE_TAG:           0.10.0-b9
            ADMISSION_CONTROLLER_IMAGE_PULL_POLICY:   Always
            ADMISSION_CONTROLLER_IMAGE_PULL_SECRETS:  [dockercreds]
          Mounts:
            /etc/yunikorn/ from config-volume (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from yunikorn-admin-token-dnq4h (ro)
        yunikorn-scheduler-web:
          Container ID:   docker://0b8205bb8292f193765bbc563ea10010106fd316257e523c3446c5685ee0d5bf
          Image:          <>/cloudera/yunikorn-web:0.10.0-b9
          Image ID:       docker-pullable://<>/cloudera/yunikorn-web@sha256:a64b986df2dc737958701838f41f9fae7f2e4a353a497949ba6b9e75b4b44b66
          Port:           9889/TCP
          Host Port:      0/TCP
          State:          Running
            Started:      Tue, 26 Jan 2021 19:17:17 -0800
          Ready:          True
          Restart Count:  0
          Limits:
            cpu:     200m
            memory:  500Mi
          Requests:
            cpu:        100m
            memory:     100Mi
          Environment:  <none>
          Mounts:
            /var/run/secrets/kubernetes.io/serviceaccount from yunikorn-admin-token-dnq4h (ro)
      Conditions:
        Type              Status
        Initialized       True
        Ready             True
        ContainersReady   True
        PodScheduled      True
      Volumes:
        config-volume:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      yunikorn-configs
          Optional:  false
        yunikorn-admin-token-dnq4h:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  yunikorn-admin-token-dnq4h
          Optional:    false
      QoS Class:       Burstable
      Node-Selectors:  role.node.kubernetes.io/liftie-infra=true
      Tolerations:     CriticalAddonsOnly op=Exists
                       node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                       node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                       role.node.kubernetes.io/liftie-infra=true:NoSchedule
      Events:
        Type     Reason               Age                From               Message
        ----     ------               ----               ----               -------
        Normal   Scheduled            61s                default-scheduler  Successfully assigned yunikorn/yunikorn-scheduler-6577f789d8-vc5cc to ip-10-192-153-109.ca-central-1.compute.internal
        Normal   Pulling              57s                kubelet            Pulling image "<>/cloudera/yunikorn-web:0.10.0-b9"
        Normal   Started              56s                kubelet            Started container yunikorn-scheduler-web
        Normal   Created              56s                kubelet            Created container yunikorn-scheduler-web
        Normal   Pulled               56s                kubelet            Successfully pulled image "<>/cloudera/yunikorn-web:0.10.0-b9"
        Warning  FailedPreStopHook    55s (x2 over 58s)  kubelet            Exec lifecycle hook ([/bin/sh /admission_util.sh delete]) for Container "yunikorn-scheduler-k8s" in Pod "yunikorn-scheduler-6577f789d8-vc5cc_yunikorn(082e1cc7-8765-4aa3-baac-48e3b048cfc6)" failed - error: command '/bin/sh /admission_util.sh delete' exited with 126: , message: "cannot exec in a stopped state: unknown\r\n"
        Normal   Killing              55s (x2 over 58s)  kubelet            FailedPostStartHook
        Warning  BackOff              53s (x2 over 54s)  kubelet            Back-off restarting failed container
        Normal   Pulling              41s (x3 over 60s)  kubelet            Pulling image "<>/cloudera/yunikorn-scheduler:0.10.0-b9"
        Warning  FailedPostStartHook  40s (x3 over 58s)  kubelet            Exec lifecycle hook ([/bin/sh /admission_util.sh create]) for Container "yunikorn-scheduler-k8s" in Pod "yunikorn-scheduler-6577f789d8-vc5cc_yunikorn(082e1cc7-8765-4aa3-baac-48e3b048cfc6)" failed - error: command '/bin/sh /admission_util.sh create' exited with 137: , message: ""
        Normal   Started              40s (x3 over 58s)  kubelet            Started container yunikorn-scheduler-k8s
        Normal   Created              40s (x3 over 58s)  kubelet            Created container yunikorn-scheduler-k8s
        Normal   Pulled               40s (x3 over 58s)  kubelet            Successfully pulled image "<>/cloudera/yunikorn-scheduler:0.10.0-b9" 

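      This is not a blocker, but the scheduler container (yunikorn-scheduler-k8s) was restarted 3 times, hence this report. The restarts could be due to an issue in the admission controller start script (/admission_util.sh), whose postStart and preStop hooks both fail in the events above.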
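
      To make the hook failures in the events above easier to read, here is a minimal sketch that pulls the hook command and exit code out of the kubelet event messages and explains the code using the usual POSIX shell conventions (126: command could not be executed; values above 128: terminated by signal number code minus 128). The helper names `hook_failures` and `describe_exit_code` are hypothetical and not part of YuniKorn or kubectl. The interpretation is only a hint: an exit code of 137 is consistent with the hook process being killed (SIGKILL), but it does not by itself identify the root cause.

```python
import re
from typing import List, Tuple

# Matches the kubelet event text shown in the description, e.g.
# "command '/bin/sh /admission_util.sh create' exited with 137"
HOOK_FAILURE = re.compile(r"command '(?P<cmd>[^']+)' exited with (?P<code>\d+)")

# Small, illustrative subset of Linux signal numbers.
SIGNAL_NAMES = {2: "SIGINT", 9: "SIGKILL", 15: "SIGTERM"}


def hook_failures(events: str) -> List[Tuple[str, int]]:
    """Return (command, exit_code) pairs for every hook failure in the text."""
    return [(m["cmd"], int(m["code"])) for m in HOOK_FAILURE.finditer(events)]


def describe_exit_code(code: int) -> str:
    """Explain an exit code using common POSIX shell conventions."""
    if code == 0:
        return "success"
    if code == 126:
        return "command found but could not be executed"
    if code == 127:
        return "command not found"
    if code > 128:
        signum = code - 128
        return "terminated by " + SIGNAL_NAMES.get(signum, f"signal {signum}")
    return "command reported a failure"


if __name__ == "__main__":
    # Shortened copies of the two Warning events from the description.
    sample = (
        "error: command '/bin/sh /admission_util.sh delete' exited with 126: \n"
        "error: command '/bin/sh /admission_util.sh create' exited with 137: \n"
    )
    for cmd, code in hook_failures(sample):
        print(f"{cmd}: exit {code} ({describe_exit_code(code)})")
```

      Applied to the events above, this reports exit code 126 for the preStop hook (`admission_util.sh delete`, which matches the "cannot exec in a stopped state" message) and exit code 137 for the postStart hook (`admission_util.sh create`).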

      Attachments

        1. yk-sc.log
          7 kB
          Ayub Pathan

            People

              wwei Weiwei Yang
              ayubpathan Ayub Pathan

              Dates

                Created:
                Updated:
                Resolved: