  1. Kafka
  2. KAFKA-1599

Change preferred replica election admin command to handle large clusters

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • 0.8.2.0
    • None
    • None

    Description

      On a cluster with 70k partitions, we could not trigger a preferred replica election for all topics and partitions using the admin tool. Investigation showed that the JSON object written to the admin znode, which tells the controller to start the election, was 1.8 MB in size. Because the default ZooKeeper data size limit is 1 MB, and changing it is non-trivial, we should come up with a better way to represent the list of topics and partitions for this admin command.

      I have several thoughts on this so far:
      1) Trigger the command for all topics and partitions with a JSON object that does not include an explicit list of them (i.e. a flag that says "all partitions").

      2) Use a more compact JSON representation. Currently, the JSON contains a 'partitions' key holding a list of dictionaries, each with a 'topic' and a 'partition' key, and there must be one list item per partition. This repeats the key names (and each topic name) needlessly. A format like the following would be much more compact:
      {'topics': {'topicName1': [0, 1, 2, 3], 'topicName2': [0, 1]}, 'version': 1}

      3) Use a representation other than JSON. Strings are inefficient; a binary format would be the most compact. This puts a greater burden on tools and scripts that do not use the built-in libraries, but that burden is not too high.

      4) Use a representation that involves multiple znodes. A structured tree under the admin command znode would probably provide the most complete solution. However, we would need to make sure a wide tree does not itself exceed the data size limit (the list of children for any single znode cannot exceed the ZooKeeper data size limit of 1 MB).
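
As a rough illustration of option 4, the sketch below splits the partition list into several JSON payloads that each fit under the 1 MB limit, so that each payload could be stored in its own child znode. It only shows the splitting step and does not talk to ZooKeeper. The function names, the per-chunk envelope with a 'version' key, and the example topic names are all assumptions made for illustration, not part of any existing Kafka tool.

```python
import json

# ZooKeeper's default data size limit as cited in the description (1 MB).
MAX_ZNODE_BYTES = 1024 * 1024


def encode(entries):
    """Serialize one chunk using a hypothetical envelope with a 'version' key."""
    doc = {"version": 1, "partitions": entries}
    return json.dumps(doc, separators=(",", ":")).encode("utf-8")


def split_into_payloads(entries, max_bytes=MAX_ZNODE_BYTES):
    """Greedily pack entries into payloads no larger than max_bytes.

    A single entry that is larger than max_bytes on its own would still
    produce an oversized payload; this sketch does not handle that case.
    """
    envelope = len(encode([]))  # bytes used by the empty wrapper
    payloads, batch, size = [], [], envelope
    for entry in entries:
        entry_size = len(json.dumps(entry, separators=(",", ":")).encode("utf-8"))
        extra = entry_size + (1 if batch else 0)  # +1 for the separating comma
        if batch and size + extra > max_bytes:
            payloads.append(encode(batch))
            batch, size = [], envelope
            extra = entry_size
        batch.append(entry)
        size += extra
    if batch:
        payloads.append(encode(batch))
    return payloads


# Hypothetical cluster: 700 topics x 100 partitions = 70,000 partitions.
entries = [
    {"topic": f"topic-{t:04d}", "partition": p}
    for t in range(700)
    for p in range(100)
]
payloads = split_into_payloads(entries)
print(f"{len(entries)} partitions -> {len(payloads)} payload(s)")
```

With only a handful of chunks, the list of child znodes stays far below the limit mentioned in option 4; a real design would still need to decide how the controller learns that all chunks have been written.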

      Option 1 could also be combined with a change in the representation, which would likely be appropriate as well.
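
To get a feel for how much option 2 might save, the following sketch encodes the same set of partitions in the list-of-dictionaries layout described above and in the proposed per-topic layout, then compares the encoded sizes. The topic names and the 700 x 100 cluster shape are made up, and the 'version' key in the current layout is an assumption, so the absolute numbers will differ from the 1.8 MB observed on the real cluster.

```python
import json


def current_format(assignments):
    """One {'topic', 'partition'} dictionary per partition, as described above."""
    return {
        "version": 1,  # assumed to be present in the current layout
        "partitions": [
            {"topic": topic, "partition": partition}
            for topic, partitions in sorted(assignments.items())
            for partition in partitions
        ],
    }


def compact_format(assignments):
    """Proposed layout: each topic name appears once, mapped to its partition ids."""
    return {
        "version": 1,
        "topics": {topic: list(parts) for topic, parts in sorted(assignments.items())},
    }


def encoded_size(doc):
    """Size in bytes of the JSON encoding without optional whitespace."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


# Hypothetical cluster: 700 topics x 100 partitions = 70,000 partitions.
assignments = {f"topic-{i:04d}": list(range(100)) for i in range(700)}

current_bytes = encoded_size(current_format(assignments))
compact_bytes = encoded_size(compact_format(assignments))
print(f"current layout: {current_bytes} bytes")
print(f"compact layout: {compact_bytes} bytes")
```

The saving comes from writing each topic name and key name once instead of once per partition, so clusters with long topic names and many partitions per topic benefit the most.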

            People

              Assignee: anigam Abhishek Nigam
              Reporter: toddpalino Todd Palino
              Votes: 1
              Watchers: 5