Apache YuniKorn / YUNIKORN-229

shim sends the same allocation remove request twice


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Component/s: shim - kubernetes

    Description

      In the logs it looks like the shim asks to remove the same allocation twice, using the same UUID:

      First release request from shim:

      2020-06-10T05:54:24.564Z	DEBUG	cache/cluster_info.go:136	enqueued event	{"eventType": "*cacheevent.RMUpdateRequestEvent", "event": {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
      2020-06-10T05:54:24.565Z	DEBUG	scheduler/scheduler.go:191	enqueued event	{"eventType": "*schedulerevent.SchedulerAllocationUpdatesEvent", "event": {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
      2020-06-10T05:54:24.565Z	DEBUG	cache/cluster_info.go:136	enqueued event	{"eventType": "*cacheevent.ReleaseAllocationsEvent", "event": {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task completed","ReleaseType":0}]}, "currentQueueSize": 0}
      2020-06-10T05:54:24.565Z	DEBUG	cache/partition_info.go:429	removing allocations	{"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c", "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
      2020-06-10T05:54:24.566Z	INFO	cache/partition_info.go:477	allocation removed	{"numOfAllocationReleased": 1, "partitionName": "[mycluster]default"}
      2020-06-10T05:54:24.566Z	DEBUG	rmproxy/rmproxy.go:65	enqueue event	{"event": {"RmID":"mycluster","ReleasedAllocations":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]}, "currentQueueSize": 0}
      2020-06-10T05:54:24.566Z	DEBUG	callback/scheduler_callback.go:44	callback received	{"updateResponse": "releasedAllocations:<UUID:\"3bf0a159-89ee-4bdc-ada1-c577ac2097d1\" message:\"task completed\" > "}
      2020-06-10T05:54:24.566Z	DEBUG	callback/scheduler_callback.go:119	callback: response to released allocations	{"UUID": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
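
      Each allocationsToRelease entry in the excerpt above carries the same four fields. A minimal Go sketch (illustrative only, not the actual scheduler-interface types) of that payload shape, serialising to the same JSON seen in the logs:

      // Illustrative struct mirroring the release entry fields in the log lines
      // above; the real YuniKorn scheduler-interface types differ.
      package main

      import (
          "encoding/json"
          "fmt"
      )

      type allocationToRelease struct {
          PartitionName string `json:"partitionName"`
          ApplicationID string `json:"applicationID"`
          UUID          string `json:"UUID"`
          Message       string `json:"message"`
      }

      func main() {
          rel := allocationToRelease{
              PartitionName: "[mycluster]default",
              ApplicationID: "spark-3a34f5a12bc54c24b7d5f02957cff30c",
              UUID:          "3bf0a159-89ee-4bdc-ada1-c577ac2097d1",
              Message:       "task completed",
          }
          out, _ := json.Marshal(rel)
          // prints the same entry that appears under allocationsToRelease above
          fmt.Println(string(out))
      }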
      

      Second release request from shim, roughly 16 seconds after the first request (05:54:24.564 vs 05:54:40.423):

      2020-06-10T05:54:40.423Z	DEBUG	cache/cluster_info.go:136	enqueued event	{"eventType": "*cacheevent.RMUpdateRequestEvent", "event": {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
      2020-06-10T05:54:40.423Z	DEBUG	scheduler/scheduler.go:191	enqueued event	{"eventType": "*schedulerevent.SchedulerAllocationUpdatesEvent", "event": {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
      2020-06-10T05:54:40.423Z	DEBUG	cache/cluster_info.go:136	enqueued event	{"eventType": "*cacheevent.ReleaseAllocationsEvent", "event": {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task completed","ReleaseType":0}]}, "currentQueueSize": 0}
      2020-06-10T05:54:40.423Z	DEBUG	cache/partition_info.go:429	removing allocations	{"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c", "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
      2020-06-10T05:54:40.423Z	DEBUG	cache/partition_info.go:442	no active allocations found to release	{"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c"}
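
      The last line above is the core rejecting the duplicate: the second lookup by UUID finds nothing left to release. A rough Go sketch of that idempotent, map-based removal (illustrative, not the actual partition_info.go code):

      // Illustrative only: removing an allocation by UUID from a map is naturally
      // idempotent, so a repeated release request finds nothing and is a no-op,
      // matching the "no active allocations found to release" log line.
      package main

      import "fmt"

      type allocation struct {
          uuid  string
          appID string
      }

      // removeAllocation returns the released allocation, or nil when the UUID
      // is no longer active (for example because it was already released).
      func removeAllocation(active map[string]*allocation, uuid string) *allocation {
          alloc, ok := active[uuid]
          if !ok {
              fmt.Println("no active allocations found to release")
              return nil
          }
          delete(active, uuid)
          return alloc
      }

      func main() {
          uuid := "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"
          active := map[string]*allocation{
              uuid: {uuid: uuid, appID: "spark-3a34f5a12bc54c24b7d5f02957cff30c"},
          }
          fmt.Println(removeAllocation(active, uuid) != nil) // true: first request releases it
          fmt.Println(removeAllocation(active, uuid) != nil) // false: second request is ignored
      }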
      

      The core scheduler handles this correctly and simply ignores the duplicate request. However, as the number of tasks in the shim grows, sending every release twice could have a significant performance impact, so we need to find out why the shim removes it twice.
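
      One direction to investigate is a missing idempotency guard on the shim side. A minimal sketch, assuming the duplicate comes from the same task-completion event being handled twice (the names below are hypothetical and not taken from the k8shim code):

      // Hypothetical sketch: a releaseTracker remembers which allocation UUIDs have
      // already been released, so a second "task completed" event for the same task
      // does not result in a duplicate release request to the core.
      package main

      import (
          "fmt"
          "sync"
      )

      type releaseTracker struct {
          sync.Mutex
          released map[string]bool // allocation UUID -> already released
      }

      func newReleaseTracker() *releaseTracker {
          return &releaseTracker{released: make(map[string]bool)}
      }

      // markReleased returns true only the first time a UUID is seen.
      func (t *releaseTracker) markReleased(uuid string) bool {
          t.Lock()
          defer t.Unlock()
          if t.released[uuid] {
              return false // duplicate: drop it instead of sending another request
          }
          t.released[uuid] = true
          return true
      }

      func main() {
          tracker := newReleaseTracker()
          uuid := "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"
          fmt.Println(tracker.markReleased(uuid)) // true: send the release request
          fmt.Println(tracker.markReleased(uuid)) // false: second event is ignored
      }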


    People

      Assignee: Weiwei Yang (wwei)
      Reporter: Wilfred Spiegelenburg (wilfreds)
      Votes: 0
      Watchers: 3
