KAFKA-7286

Loading offsets and group metadata hangs with large group metadata records

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: None
    • Labels:
      None

      Description

      When a (Kafka-based) consumer group contains many members, its group metadata records (in the __consumer_offsets topic) can become quite large.

      Increasing message.max.bytes makes it possible to store these records.
      When a broker restarts, they are loaded via doLoadGroupsAndOffsets. However, this method relies on the offsets.load.buffer.size configuration to size the buffer that holds the records being loaded.

      If a group metadata record is too large for this buffer, the loading method will get stuck trying to load records (in a tight loop) into a buffer that cannot accommodate a single record.
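
      The failure mode can be illustrated with a toy model. This is a minimal sketch, not Kafka's actual code: the class and method names (FixedBufferLoadModel, recordsThatFit, load) and the iteration cap are hypothetical and exist only to make the stall observable in a test.

```java
/** Toy model of a loader that reads whole records into a fixed-size buffer. */
final class FixedBufferLoadModel {

    /** Counts how many consecutive whole records, starting at index next, fit into bufferSize bytes. */
    static int recordsThatFit(int[] recordSizes, int next, int bufferSize) {
        int used = 0;
        int count = 0;
        while (next + count < recordSizes.length
                && used + recordSizes[next + count] <= bufferSize) {
            used += recordSizes[next + count];
            count++;
        }
        return count;
    }

    /**
     * Loads all records batch by batch. Returns the number of records loaded,
     * or -1 if maxIterations passes were not enough. The cap is an assumption
     * of this sketch; without it, the loop would spin forever.
     */
    static int load(int[] recordSizes, int bufferSize, int maxIterations) {
        int next = 0;
        for (int i = 0; i < maxIterations; i++) {
            if (next >= recordSizes.length) {
                return next;
            }
            // If the next record is larger than the buffer, this adds 0 and
            // the position never advances: every later pass repeats the same read.
            next += recordsThatFit(recordSizes, next, bufferSize);
        }
        return next >= recordSizes.length ? next : -1;
    }

    public static void main(String[] args) {
        int bufferSize = 5_242_880; // illustrative buffer size in bytes
        System.out.println(load(new int[] {100, 200}, bufferSize, 1000));
        System.out.println(load(new int[] {100, 6_000_000, 100}, bufferSize, 1000));
    }
}
```

      In this model, the second call never gets past the 6,000,000-byte record, which mirrors the tight loop described above.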


      For example, if the __consumer_offsets-9 partition contains a record smaller than message.max.bytes but larger than offsets.load.buffer.size, the logs would show the following:

      ...
      [2018-08-13 21:00:21,073] INFO [GroupMetadataManager brokerId=0] Scheduling loading of offsets and group metadata from __consumer_offsets-9 (kafka.coordinator.group.GroupMetadataManager)
      ...
      

      However, the logs will never contain the expected "Finished loading offsets and group metadata from ..." line.

      Consumers whose groups are assigned to this partition will see "Marking the coordinator dead" and will never be able to stabilize and make progress.


      From what I could gather in the code, it seems that:


      It would be great to let the partition load even if a record is larger than the configured offsets.load.buffer.size limit. The fact that minOneMessage = true when reading records seems to indicate it might be a good idea for the buffer to accommodate at least one record.
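
      One way to picture this first idea is a buffer that grows on demand. The following is only a sketch under stated assumptions, not the change that was eventually merged: GrowableLoadBuffer and prepare are hypothetical names, and bytesNeeded stands for the size of whatever must be read in one piece.

```java
import java.nio.ByteBuffer;

/** Sketch of a load buffer that grows when a read needs more than the configured size. */
final class GrowableLoadBuffer {
    private final int configuredSize; // stands in for offsets.load.buffer.size
    private ByteBuffer buffer;

    GrowableLoadBuffer(int configuredSize) {
        this.configuredSize = configuredSize;
        this.buffer = ByteBuffer.allocate(configuredSize);
    }

    /**
     * Returns an empty buffer whose capacity is at least
     * max(configuredSize, bytesNeeded). The existing buffer is reused
     * when it is already large enough.
     */
    ByteBuffer prepare(int bytesNeeded) {
        int required = Math.max(configuredSize, bytesNeeded);
        if (buffer.capacity() < required) {
            // Grow past the configured size so that the read can complete.
            buffer = ByteBuffer.allocate(required);
        } else {
            buffer.clear();
        }
        return buffer;
    }
}
```

      With this approach the configured size acts as a soft limit: ordinary reads stay within it, and only an oversized record triggers a larger allocation.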

      If you think the limit should stay a hard limit, then it would at least help to add a log line indicating that offsets.load.buffer.size is not large enough and should be increased. Otherwise, one can only guess and dig through the code to figure out what is happening.
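
      Such a diagnostic could look roughly like the following. This is a hypothetical sketch: the class name, method name, and message wording are made up for illustration and are not taken from Kafka's logs or code.

```java
import java.util.Optional;

/** Sketch of a check that explains why a partition cannot be loaded. */
final class OversizedRecordWarning {

    /**
     * Returns a warning message when a record cannot fit into the load buffer,
     * or an empty Optional when the record fits.
     */
    static Optional<String> warningFor(String partition, int recordSizeBytes, int loadBufferSizeBytes) {
        if (recordSizeBytes <= loadBufferSizeBytes) {
            return Optional.empty();
        }
        // Hypothetical wording: name the partition, both sizes, and the setting to change.
        return Optional.of(String.format(
                "Cannot load %s: record of %d bytes exceeds offsets.load.buffer.size (%d bytes); "
                        + "increase offsets.load.buffer.size",
                partition, recordSizeBytes, loadBufferSizeBytes));
    }

    public static void main(String[] args) {
        warningFor("__consumer_offsets-9", 6_000_000, 5_242_880)
                .ifPresent(System.out::println);
    }
}
```

      Naming the configuration key in the message points an operator straight at the setting to change.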

      I will try to open a PR with the first idea (allowing large records to be read when needed) soon, but feedback from anyone who has hit the same issue in the past would be appreciated.

              People

              • Assignee:
                flavr Flavien Raynaud
                Reporter:
                flavr Flavien Raynaud