Ignite / IGNITE-3558

Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: compute
    • Labels: None

    Description

      The test to reproduce:
      IgniteCacheLockPartitionOnAffinityRunWithCollisionSpiTest.testJobFinishing

      Root cause
      GridJobExecuteResponse isn't sent from the target node because GridJobWorker instances get mixed up in the CollisionContext.

      Suggestion
      The method GridJobProcessor.CollisionJobContext.cancel()
      uses passiveJobs.remove(jobWorker.getJobId(), jobWorker).
      passiveJobs is a ConcurrentHashMap, and GridJobWorker.equals() is implemented as a comparison of jobId.

      So, when two threads try to cancel two workers with the same jobId, the following can happen:

      • thread0: removes jobWorker0 and cancels it;
      • thread0: puts jobWorker1 (because jobWorker0 has already been removed);
      • thread1: still holds a reference to jobWorker0 and tries to cancel it;
      • thread1: removes jobWorker1 instead of jobWorker0 (because only jobId is used to identify a worker);
      • thread1: doesn't send ExecuteResponse because jobWorker0 has already been canceled.
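
The interleaving above can be replayed sequentially in a small sketch. The Worker class is a hypothetical stand-in for GridJobWorker and is not Ignite code. The only assumptions carried over from the description are that equals() compares jobId and that the map is a ConcurrentHashMap keyed by jobId.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class StaleCancelSketch {
    /** Hypothetical stand-in for GridJobWorker: equality is based on jobId only. */
    static final class Worker {
        final String jobId;
        boolean cancelled;

        Worker(String jobId) {
            this.jobId = jobId;
        }

        @Override public boolean equals(Object o) {
            return o instanceof Worker && jobId.equals(((Worker) o).jobId);
        }

        @Override public int hashCode() {
            return jobId.hashCode();
        }
    }

    /** Replays the steps in order; returns true if the stale reference evicted the new worker. */
    static boolean staleRemoveEvictsNewWorker() {
        ConcurrentMap<String, Worker> passiveJobs = new ConcurrentHashMap<>();

        Worker jobWorker0 = new Worker("job-1");
        Worker jobWorker1 = new Worker("job-1"); // same jobId, different instance

        passiveJobs.put(jobWorker0.jobId, jobWorker0);
        Worker stale = jobWorker0; // the reference thread1 is holding

        // thread0: removes jobWorker0 and cancels it.
        if (passiveJobs.remove(jobWorker0.jobId, jobWorker0))
            jobWorker0.cancelled = true;

        // jobWorker1 is registered under the same jobId.
        passiveJobs.put(jobWorker1.jobId, jobWorker1);

        // thread1: remove(key, value) compares values with equals(), so the
        // stale jobWorker0 reference matches jobWorker1 and evicts it.
        boolean removed = passiveJobs.remove(stale.jobId, stale);

        // jobWorker1 is gone from the map but was never cancelled, and
        // jobWorker0 is already cancelled, so nobody reports the job.
        return removed && passiveJobs.isEmpty() && !jobWorker1.cancelled && stale.cancelled;
    }

    public static void main(String[] args) {
        System.out.println("stale remove evicted new worker: " + staleRemoveEvictsNewWorker());
    }
}
```

In this sketch jobWorker1 ends up neither in the map nor cancelled, which corresponds to the missing response described above.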

      Proposal
      Use the default, identity-based Object.equals() for GridJobWorker.
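
A minimal sketch of why identity-based equality would help, assuming a hypothetical Worker class that does not override equals() or hashCode(). This only illustrates the ConcurrentHashMap behavior and is not the actual patch applied to Ignite.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class IdentityEqualsSketch {
    /** Hypothetical worker that keeps Object's identity-based equals() and hashCode(). */
    static final class Worker {
        final String jobId;

        Worker(String jobId) {
            this.jobId = jobId;
        }
    }

    /** Returns true if a stale reference can no longer remove a newer worker with the same jobId. */
    static boolean staleRemoveIsRejected() {
        ConcurrentMap<String, Worker> passiveJobs = new ConcurrentHashMap<>();

        Worker jobWorker0 = new Worker("job-1");
        Worker jobWorker1 = new Worker("job-1"); // same jobId, different instance

        passiveJobs.put(jobWorker0.jobId, jobWorker0);
        Worker stale = jobWorker0; // reference held by a second thread

        // First canceller removes jobWorker0, then jobWorker1 takes its place.
        passiveJobs.remove(jobWorker0.jobId, jobWorker0);
        passiveJobs.put(jobWorker1.jobId, jobWorker1);

        // With identity equality, the stale reference does not match jobWorker1,
        // so remove(key, value) leaves the map untouched and returns false.
        boolean removed = passiveJobs.remove(stale.jobId, stale);

        return !removed && passiveJobs.get("job-1") == jobWorker1;
    }

    public static void main(String[] args) {
        System.out.println("stale remove rejected: " + staleRemoveIsRejected());
    }
}
```

The map key is still the jobId, so lookups by id are unaffected; only the value comparison inside remove(key, value) changes.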

          People

            Assignee: tledkov-gridgain Taras Ledkov
            Reporter: tledkov-gridgain Taras Ledkov
            Votes: 0
            Watchers: 3

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 3h
