Kafka / KAFKA-12523

Need to improve handling of TimeoutException when committing offsets


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.8.0
    • Fix Version/s: 2.8.0
    • Component/s: streams
    • Labels:
      None

      Description

      Right now, if we catch a TimeoutException in TaskManager#commitOffsetsOrTransaction, then under ALOS we just rethrow it, while under EOS we rethrow it as a TaskCorruptedException. The problem is that commitOffsetsOrTransaction can be invoked from several places:

      1. Commit within the StreamThread main processing loop (either user requested or because the commit interval has elapsed): this is presumably the case we had in mind when deciding how to handle the TimeoutException in commitOffsetsOrTransaction, so no problem here
      2. Clean shutdown of application: a bit weird to throw a TaskCorruptedException in this case, but it’ll just end up being caught and forcing a closeDirty, so again no problem here
      3. From TaskManager#handleRevocation: in this case, it’s possible we hit a TimeoutException on a task that’s actually being revoked. This exception will be saved and rethrown from poll, so under EOS we would catch a TaskCorruptedException and then try to revive a task that we no longer own. Pretty sure this will cause an NPE in the TaskManager. Under ALOS, the rethrown TimeoutException will be bubbled up through poll again, but unlike TaskCorruptedException we don’t catch TimeoutException anywhere in the StreamThread loop. This will trigger the uncaught exception handler
      4. From TaskManager#handleTaskCorrupted: this method is itself invoked from within the catch TaskCorruptedException block of the StreamThread’s runLoop. If we throw TaskCorruptedException again then I believe we won’t even catch this in the safety net catch Throwable block of the runLoop – it’ll just be thrown directly up through run().
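Case 4 comes down to a plain Java rule: an exception thrown from inside a catch block is not handled by the sibling catch blocks of the same try. The following is a minimal, self-contained sketch of that behavior and not the actual Streams code. The class CommitTimeoutSketch, the nested exception types, and the methods runOnce and outcome are hypothetical stand-ins, and the boolean flags merely simulate whether a commit times out.

```java
public class CommitTimeoutSketch {

    // Hypothetical stand-ins for the exception types named in the description.
    static class TimeoutException extends RuntimeException {
        TimeoutException(String msg) { super(msg); }
    }

    static class TaskCorruptedException extends RuntimeException {
        TaskCorruptedException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Models the behavior described above: under ALOS the timeout is rethrown as-is,
    // under EOS it is rethrown as a TaskCorruptedException.
    static void commitOffsetsOrTransaction(boolean eosEnabled, boolean commitTimesOut) {
        if (!commitTimesOut) {
            return;
        }
        TimeoutException timeout = new TimeoutException("commit timed out");
        if (eosEnabled) {
            throw new TaskCorruptedException("commit timed out under EOS", timeout);
        }
        throw timeout;
    }

    // Models one run-loop iteration whose corrupted-task handler commits again (case 4).
    static String runOnce(boolean eosEnabled, boolean firstCommitTimesOut, boolean handlerCommitTimesOut) {
        try {
            commitOffsetsOrTransaction(eosEnabled, firstCommitTimesOut);
            return "committed";
        } catch (TaskCorruptedException e) {
            // Anything thrown from here bypasses the catch (Throwable) block below.
            commitOffsetsOrTransaction(eosEnabled, handlerCommitTimesOut);
            return "recovered";
        } catch (Throwable t) {
            // Simplified safety net; the real block's behavior is not modeled here.
            return "safety net: " + t.getClass().getSimpleName();
        }
    }

    // Reports whether an exception escaped runOnce entirely.
    static String outcome(boolean eosEnabled, boolean firstCommitTimesOut, boolean handlerCommitTimesOut) {
        try {
            return runOnce(eosEnabled, firstCommitTimesOut, handlerCommitTimesOut);
        } catch (RuntimeException escaped) {
            return "escaped: " + escaped.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(true, false, false)); // committed
        System.out.println(outcome(true, true, false));  // recovered
        System.out.println(outcome(true, true, true));   // escaped: TaskCorruptedException
    }
}
```

In the third call the second timeout happens inside the handler, so the resulting exception leaves runOnce without ever reaching the catch (Throwable) block. This suggests the handling of a commit timeout has to depend on the call site rather than being decided once inside commitOffsetsOrTransaction.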

            People

            • Assignee:
              ableegoldman A. Sophie Blee-Goldman
            • Reporter:
              ableegoldman A. Sophie Blee-Goldman
