KAFKA-1106

HighwaterMarkCheckpoint failure putting broker into a bad state

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

      Description

      I'm encountering a case where a broker gets stuck because HighwaterMarkCheckpoint fails to recover from reading what appear to be corrupted ISR entries. Once in this state, leader election can never succeed, stalling the entire cluster.

      Please see the detailed stack trace from the attached log. Perhaps failing fast when HighwaterMarkCheckpoint fails to read would force the broker to restart and recover.
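      The failure mode described above can be illustrated with a small sketch. This is a hypothetical Python model of the checkpoint-parsing flow, not Kafka's actual code; the file format assumed here (a version line, an entry count, then one "topic partition offset" line per partition) and the function name are illustrative. The point is that a blank or corrupt line surfaces as a parse error (Python's `ValueError`, the analogue of Java's `NumberFormatException`) rather than an I/O error:

```python
# Hypothetical sketch of parsing a highwater-mark checkpoint file.
# Assumed format: version line, entry count, then "topic partition offset" lines.
def read_highwatermark_checkpoint(lines):
    it = iter(lines)
    version = int(next(it))      # a blank or corrupt line raises ValueError here
    expected = int(next(it))
    offsets = {}
    for _ in range(expected):
        topic, partition, offset = next(it).split()
        offsets[(topic, int(partition))] = int(offset)
    return offsets

# Well-formed file parses cleanly:
print(read_highwatermark_checkpoint(["0", "1", "my-topic 0 42"]))

# A blank version line, as in the attached log, escapes as a plain
# parse error instead of an I/O error:
try:
    read_highwatermark_checkpoint(["", "1", "my-topic 0 42"])
except ValueError as e:
    print("unhandled parse error:", e)
```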

      1. kafka.log
        7 kB
        David Lao
      2. KAFKA-1106-patch
        4 kB
        David Lao

        Activity

        Jay Kreps added a comment -

        Yeah a corrupted offset file would lead to this (but could also be some other bug). We do shut down the broker on any I/O error (as that means we don't know the state of the data on disk and need to run recovery). Do you have the log from that previous shutdown?

        If the offset checkpoint is corrupt I think the desired behavior is for the node to crash. So in that case I think the problem is that we throw that number format exception which we probably don't handle right instead of IOException which would cause the broker to shoot itself in the head.

        Let's do this: I'll fix the parsing logic on trunk so that any unparsable file throws IOException. This will let us gracefully handle corruption in the file. I'm still not convinced that this is a file corruption thing and not just some bug in our code, but without the actual file it's a little hard to know. If you can reproduce it on another machine that proves it is a bug--if so grab the file, I suspect it will give a clue what is going on.
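        The fix described above, translating any parse failure into an I/O error so the broker's existing halt-and-recover path handles it, might be sketched as follows. This is an illustrative Python model under the same assumed file format as before, not the actual trunk change:

```python
# Sketch of the proposed fix: any unparsable checkpoint file is re-raised as
# an IOError, so the caller's existing I/O-error handling (shut down the
# broker and run recovery) applies. Names and format are illustrative.
def read_checkpoint_safely(lines):
    try:
        it = iter(lines)
        version = int(next(it))
        expected = int(next(it))
        offsets = {}
        for _ in range(expected):
            topic, partition, offset = next(it).split()
            offsets[(topic, int(partition))] = int(offset)
        return offsets
    except (ValueError, StopIteration) as e:
        # A corrupt file now looks like an I/O failure to the caller,
        # which already knows to fail fast on IOError.
        raise IOError(f"Malformed highwatermark checkpoint file: {e}") from e

try:
    read_checkpoint_safely(["", "1"])   # blank version line
except IOError as e:
    print("broker should halt and recover:", e)
```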

        David Lao added a comment -

        No, there is no chance of manual intervention. However, the broker node in question appears to have gone through a fail-fast-style exit and recovery a few hours prior, but it was working fine until hitting this bug. Could a corrupted file have led to this? If so, is failing fast the way to handle the situation?

        Jay Kreps added a comment -

        Is there any chance the file was manually modified? I don't see how we could get a blank line like that otherwise. The parsing code on trunk is a bit more robust (still not perfect) but I guess the question is whether kafka generated a corrupt file or it was munged manually...

        David Lao added a comment -

        Unfortunately the logs are no longer available, but I will post them if I see this again. This was on a production cluster with a canonical usage pattern.

        Jay Kreps added a comment -

        Do you have the highwatermark checkpoint file that caused this? Your patch makes things more tolerant of errors but I guess the question is how we got into that state...

        David Lao made changes -
        Attachment KAFKA-1106-patch [ 12610741 ]
        David Lao made changes -
        Field Original Value New Value
        Attachment kafka.log [ 12610737 ]
        David Lao created issue -

          People

          • Assignee: Unassigned
          • Reporter: David Lao
          • Votes: 0
          • Watchers: 2

          Dates

          • Created:
          • Updated: