Kafka / KAFKA-6872

Doc for log.roll.* is wrong


    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: documentation
    • Labels:

      Description

      For log.roll.ms, the doc says, for example:

      The maximum time before a new log segment is rolled out (in milliseconds). If not set, the value in log.roll.hours is used

      In other parts (see https://kafka.apache.org/10/documentation.html#upgrade_10_1_breaking), it says:

      The log rolling time is no longer depending on log segment create time. Instead it is now based on the timestamp in the messages. More specifically. if the timestamp of the first message in the segment is T, the log will be rolled out when a new message has a timestamp greater than or equal to T + log.roll.ms

      which is wrong. More specifically, the wrong part is:

      if the timestamp of the first message in the segment is T

      The actual behavior is:

      if the timestamp of the last message in the segment is T

       

      A simple use case to reproduce this is to configure a single broker with:

      # One partition ... or any small number should be fine
      num.partitions=1
      # 1GB segment (1073741824 bytes)
      log.segment.bytes=1073741824
      # Delete old segments when their last addition is 24h old
      log.retention.hours=24
      # Check age of segments every 5 minutes
      log.retention.check.interval.ms=300000
      # Every hour (?!?!?), roll a new segment
      log.roll.hours=1
      

      and loop on sending a small message (a few bytes, so that the 1GB segment size is never reached during the test) every minute to one topic.

      After at least 24h of running, according to what is described in the doc, one would expect to see ~24 segments (one new segment rolled every hour).
      But in reality there is only one log segment containing all the records that were sent. Stop the producer for a bit more than one hour and restart it: a second segment is then created per partition, because at the moment the new record was appended, the previous record (the last one of what was the current segment) was more than 1h old.

      This proves that the doc should say:

      if the timestamp of the last message in the segment is T, the log will be rolled out when a new message has a timestamp greater than or equal to T + log.roll.ms
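The difference between the two rules can be checked with a short simulation. This is a hypothetical sketch in Python, not Kafka code; the function name and the message stream are made up for illustration:

```python
def count_segments(timestamps_ms, roll_ms, base_on_first):
    """Count log segments produced for a stream of message timestamps (ms).

    base_on_first=True  -> roll when new_ts >= first_ts_in_segment + roll_ms
                           (what the upgrade notes claim)
    base_on_first=False -> roll when new_ts >= last_ts_in_segment + roll_ms
                           (what is actually observed, per this issue)
    """
    segments = 1
    first_ts = last_ts = None
    for ts in timestamps_ms:
        if first_ts is None:
            first_ts = last_ts = ts
            continue
        base = first_ts if base_on_first else last_ts
        if ts >= base + roll_ms:
            segments += 1
            first_ts = ts
        last_ts = ts
    return segments

HOUR = 60 * 60 * 1000
# One small message per minute for 24 hours, as in the reproduction above:
stream = [m * 60 * 1000 for m in range(24 * 60)]
print(count_segments(stream, roll_ms=HOUR, base_on_first=True))   # -> 24
print(count_segments(stream, roll_ms=HOUR, base_on_first=False))  # -> 1
```

With first-message semantics the hourly roll happens as documented (~24 segments); with last-message semantics the one-minute gaps never reach one hour, so a single segment absorbs everything, matching the observed behavior.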

       

      Notes:

      • As a DevOps, I would prefer the doc to stay as it is and Kafka's behavior to be changed to match it. But I think both should be done: first update the doc so that users of current versions know what to expect (and avoid running into the problem we faced), then fix Kafka's behavior later. Indeed, with the default configuration (log.roll.hours=168 and log.segment.bytes=1073741824), Kafka can keep very old records: pushing one small (~1k) record a day means about a million records fit in the segment by size, and since the gap between records (1 day) never reaches log.roll.hours (7 days), the segment is never rotated.
      • I detected this on version 1.0.0 but assume it affects many more versions than that one (and very likely 1.1.0 too).
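As a rough check of the arithmetic in the first note (assuming a workload of one ~1 KiB record per day; the numbers are illustrative):

```python
# Back-of-the-envelope check, using the default broker settings cited above.
segment_bytes = 1073741824         # log.segment.bytes default (1 GiB)
record_bytes = 1024                # ~1 KiB per record (assumed workload)

records_per_segment = segment_bytes // record_bytes
print(records_per_segment)         # -> 1048576, about a million records

# At one record per day, the gap between consecutive records (1 day) never
# reaches the default 7-day roll interval, so under last-timestamp semantics
# the segment is never rolled by time; filling it by size would take:
print(records_per_segment // 365)  # -> 2872 (years)
```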

        People

        • Assignee: Unassigned
        • Reporter: fld (Fabien LD)
        • Votes: 0
        • Watchers: 2