Kafka / KAFKA-4763

Handle disk failure for JBOD (KIP-112)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels:
      None

        Activity

        uncleGen Genmao Yu added a comment -

        Please feel free to give feedback whenever you have time. What confuses me is why you think it is wrong to re-create those lost replicas on a good log directory.

        lindong Dong Lin added a comment -

        And BTW, the current design already does what you suggested – if you have removed the failed log directory from the log.dirs, the replica will be re-created on the good disks.

        lindong Dong Lin added a comment -

        Genmao Yu Sorry, I am not ready to start a long discussion here because I am currently working on something. If you think it is worth doing, please feel free to open a KIP and we can discuss there. Thanks!

        uncleGen Genmao Yu added a comment -

        Let me be clear: "remaining disk" means "remaining usable disks", i.e. I have more than one disk. If one disk breaks down, we can exclude it from "log.dirs" in the broker config and then restart the broker. So I think it is reasonable to recover the lost replicas (assuming the remaining disks have enough space to hold them). What is your opinion?
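
The exclusion step described above can be sketched as follows. This is only an illustration: it assumes `log.dirs` holds a comma-separated list of directories, and the helper name `remove_log_dir` and the example paths are hypothetical, not part of Kafka.

```python
def remove_log_dir(log_dirs_value, failed_dir):
    """Return a new log.dirs value with failed_dir dropped (hypothetical helper)."""
    # Split the comma-separated value and ignore empty entries.
    dirs = [d.strip() for d in log_dirs_value.split(",") if d.strip()]
    remaining = [d for d in dirs if d != failed_dir]
    if len(remaining) == len(dirs):
        raise ValueError("failed directory is not listed in log.dirs")
    if not remaining:
        raise ValueError("no usable log directory would remain")
    return ",".join(remaining)


# Example paths are made up for illustration.
new_value = remove_log_dir("/data1/kafka,/data2/kafka", "/data2/kafka")
print(new_value)  # /data1/kafka
```

The broker would then be restarted with the shortened value, as the comment proposes.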

        lindong Dong Lin added a comment -

        Genmao Yu The current design doesn't use the remaining disks, for simplicity.

        uncleGen Genmao Yu added a comment - - edited

        Dong Lin hmm, makes sense. What happens if the remaining disk space is enough? Is it OK to recover the lost replicas?

        lindong Dong Lin added a comment - - edited

        Genmao Yu Let's say we have a broker with 2 disks. Each disk has 10 GB capacity, and both disks currently hold 6 GB of data. If one of the disks fails, are you going to re-create the lost replicas on the other good disk?
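
The arithmetic behind this example can be checked with a short sketch. The function name `fits_on_remaining_disk` is hypothetical, and the check assumes the lost replicas would need as much space as they occupied on the failed disk.

```python
def fits_on_remaining_disk(capacity_gb, used_gb, lost_gb):
    """Return True if lost_gb of replica data fits in the free space of the surviving disk."""
    free_gb = capacity_gb - used_gb
    return lost_gb <= free_gb


# Numbers from the comment above: two 10 GB disks, each holding 6 GB.
capacity_gb = 10
used_on_good_disk_gb = 6
lost_from_failed_disk_gb = 6

free_gb = capacity_gb - used_on_good_disk_gb
can_recreate = fits_on_remaining_disk(
    capacity_gb, used_on_good_disk_gb, lost_from_failed_disk_gb
)
print(free_gb, can_recreate)  # 4 False
```

With only 4 GB free, the 6 GB of lost data would not fit on the surviving disk (6 GB + 6 GB = 12 GB > 10 GB), which appears to be the point of the question.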

        uncleGen Genmao Yu added a comment -

        Dong Lin IMHO, it is OK to re-create the lost replicas that were on a broken disk when the broker restarts. Is there something I missed?

        lindong Dong Lin added a comment -

        Genmao Yu Can you be more specific about which part or which sentence needs clarification?

        uncleGen Genmao Yu added a comment - - edited

        Dong Lin
        If a broker starts with some replicas unavailable because they are on a bad log directory, it will re-create those replicas on a good log directory when it receives LeaderAndIsrRequest from the controller. This is wrong. To avoid this, controller needs to know whether the replica has been created on the broker and explicitly specify whether broker should create replica in the LeaderAndIsrRequest.

        What does this mean? Thanks!
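
One way to read the quoted passage is as the rule sketched below. This is a minimal illustration, not the actual broker code: the `Broker` class, the `handle_leader_and_isr` method, and the `is_new` flag (standing in for the controller's explicit "create this replica" signal) are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    # Replicas found on healthy log directories at startup: partition -> directory.
    online_replicas: dict = field(default_factory=dict)
    # Log directories that are currently usable.
    good_log_dirs: list = field(default_factory=list)
    # Replicas assigned to this broker but not found on any healthy directory.
    offline_replicas: set = field(default_factory=set)

    def handle_leader_and_isr(self, partition, is_new):
        """Create a replica only when the controller explicitly marks it as new."""
        if partition in self.online_replicas:
            return "exists"
        if is_new and self.good_log_dirs:
            # The controller says this replica was never created here, so create it.
            self.online_replicas[partition] = self.good_log_dirs[0]
            return "created"
        # The replica should already exist but is missing, most likely because it
        # lived on a bad log directory: report it offline instead of re-creating it.
        self.offline_replicas.add(partition)
        return "offline"


broker = Broker(online_replicas={"t-0": "/data1"}, good_log_dirs=["/data1"])
print(broker.handle_leader_and_isr("t-1", is_new=False))  # offline
print(broker.handle_leader_and_isr("t-2", is_new=True))   # created
```

Without such a flag, the broker cannot tell a brand-new replica from one that was lost with a failed disk, so it would silently re-create the lost one as an empty replica.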

        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/kafka/pull/2929

        githubbot ASF GitHub Bot added a comment -

        GitHub user lindong28 opened a pull request:

        https://github.com/apache/kafka/pull/2929

        KAFKA-4763; Handle disk failure for JBOD (KIP-112)

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/lindong28/kafka KAFKA-4763

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/kafka/pull/2929.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #2929


        commit ab6302b82b6245d1bbf8d77d836e362b95750ca4
        Author: Dong Lin <lindong28@gmail.com>
        Date: 2017-04-03T00:46:34Z

        KAFKA-4763; Handle disk failure for JBOD (KIP-112)



          People

          • Assignee:
            lindong Dong Lin
          • Reporter:
            lindong Dong Lin
          • Votes:
            0
          • Watchers:
            5
