Kafka / KAFKA-17020

After enabling tiered storage, residual logs are occasionally left on the replica


Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.7.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      After enabling tiered storage, residual log segments are occasionally left on the replica.
      Based on the observed behavior, the base offsets of the segments rolled by the replica and by the leader are not the same. As a result, the segments uploaded to S3 at a given time have no corresponding local log files on the replica side, so the replica's local logs can never be deleted.
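One way to confirm this mismatch (a hedged sketch; paths and the topic-partition name are hypothetical placeholders) is to compare segment file names on the leader and the replica, since Kafka names each segment file after its base offset:

```shell
# Segment files are named by base offset, zero-padded to 20 digits.
# Comparing these names across brokers shows whether the leader and the
# replica rolled segments at the same offsets (paths are placeholders):
#   ls /data01/kafka-logs/<topic>-<partition>/*.log
#   bin/kafka-dump-log.sh --files <segment>.log --print-data-log

# Decoding the base offset from a segment file name:
seg="00000000000012345678.log"
base=${seg%.log}
echo $((10#$base))   # -> 12345678
```

If the sorted lists of base offsets differ between the two brokers, the replica rolled its segments at different offsets than the leader did.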

      leader config:

      num.partitions=3
      default.replication.factor=2
      delete.topic.enable=true
      auto.create.topics.enable=false
      num.recovery.threads.per.data.dir=1
      offsets.topic.replication.factor=3
      transaction.state.log.replication.factor=2
      transaction.state.log.min.isr=1
      offsets.retention.minutes=4320
      log.roll.ms=86400000
      log.local.retention.ms=600000
      log.segment.bytes=536870912
      num.replica.fetchers=1
      log.retention.ms=15811200000
      remote.log.manager.thread.pool.size=4
      remote.log.reader.threads=4
      remote.log.metadata.topic.replication.factor=3
      remote.log.storage.system.enable=true
      remote.log.metadata.topic.retention.ms=180000000
      rsm.config.fetch.chunk.cache.class=io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCache
      rsm.config.fetch.chunk.cache.path=/data01/kafka-tiered-storage-cache
      
      # Pick some cache size, 32 GiB here:
      rsm.config.fetch.chunk.cache.size=34359738368
      rsm.config.fetch.chunk.cache.retention.ms=1200000
      # Prefetching size, 32 MiB here:
      rsm.config.fetch.chunk.cache.prefetch.max.size=33554432
      rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.s3.S3Storage
      rsm.config.storage.s3.bucket.name=
      rsm.config.storage.s3.region=us-west-1
      rsm.config.storage.aws.secret.access.key=
      rsm.config.storage.aws.access.key.id=
      rsm.config.chunk.size=8388608
      remote.log.storage.manager.class.path=/home/admin/core-0.0.1-SNAPSHOT/:/home/admin/s3-0.0.1-SNAPSHOT/
      remote.log.storage.manager.class.name=io.aiven.kafka.tieredstorage.RemoteStorageManager
      remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
      remote.log.metadata.manager.listener.name=PLAINTEXT
      rsm.config.upload.rate.limit.bytes.per.second=31457280
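As a sanity check on the byte-valued settings above (plain shell arithmetic, nothing Kafka-specific), the sizes decode as:

```shell
# Decode the byte values used in the config above:
echo $((512 * 1024 * 1024))         # log.segment.bytes                   -> 536870912 (512 MiB)
echo $((32 * 1024 * 1024 * 1024))   # fetch.chunk.cache.size              -> 34359738368 (32 GiB)
echo $((32 * 1024 * 1024))          # fetch.chunk.cache.prefetch.max.size -> 33554432 (32 MiB)
echo $((8 * 1024 * 1024))           # chunk.size                          -> 8388608 (8 MiB)
echo $((30 * 1024 * 1024))          # upload.rate.limit.bytes.per.second  -> 31457280 (30 MiB/s)
```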
      

       replica config:

      num.partitions=3
      default.replication.factor=2
      delete.topic.enable=true
      auto.create.topics.enable=false
      num.recovery.threads.per.data.dir=1
      offsets.topic.replication.factor=3
      transaction.state.log.replication.factor=2
      transaction.state.log.min.isr=1
      offsets.retention.minutes=4320
      log.roll.ms=86400000
      log.local.retention.ms=600000
      log.segment.bytes=536870912
      num.replica.fetchers=1
      log.retention.ms=15811200000
      remote.log.manager.thread.pool.size=4
      remote.log.reader.threads=4
      remote.log.metadata.topic.replication.factor=3
      remote.log.storage.system.enable=true
      #remote.log.metadata.topic.retention.ms=180000000
      rsm.config.fetch.chunk.cache.class=io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCache
      rsm.config.fetch.chunk.cache.path=/data01/kafka-tiered-storage-cache
      # Pick some cache size, 32 GiB here:
      rsm.config.fetch.chunk.cache.size=34359738368
      rsm.config.fetch.chunk.cache.retention.ms=1200000
      # Prefetching size, 32 MiB here:
      rsm.config.fetch.chunk.cache.prefetch.max.size=33554432
      rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.s3.S3Storage
      rsm.config.storage.s3.bucket.name=
      rsm.config.storage.s3.region=us-west-1
      rsm.config.storage.aws.secret.access.key=
      rsm.config.storage.aws.access.key.id=
      rsm.config.chunk.size=8388608
      remote.log.storage.manager.class.path=/home/admin/core-0.0.1-SNAPSHOT/*:/home/admin/s3-0.0.1-SNAPSHOT/*
      remote.log.storage.manager.class.name=io.aiven.kafka.tieredstorage.RemoteStorageManager
      remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
      remote.log.metadata.manager.listener.name=PLAINTEXT
      rsm.config.upload.rate.limit.bytes.per.second=31457280 

      topic config:

      Dynamic configs for topic xxxxxx are:
      local.retention.ms=600000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:local.retention.ms=600000, STATIC_BROKER_CONFIG:log.local.retention.ms=600000, DEFAULT_CONFIG:log.local.retention.ms=-2}
      remote.storage.enable=true sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:remote.storage.enable=true}
      retention.ms=15811200000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:retention.ms=15811200000, STATIC_BROKER_CONFIG:log.retention.ms=15811200000, DEFAULT_CONFIG:log.retention.hours=168}
      segment.bytes=536870912 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:segment.bytes=536870912, STATIC_BROKER_CONFIG:log.segment.bytes=536870912, DEFAULT_CONFIG:log.segment.bytes=1073741824} 
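The dynamic topic configuration shown above can be applied with the stock kafka-configs.sh tool; a sketch, assuming a broker at localhost:9092 and keeping the placeholder topic name:

```shell
# Enable tiered storage and matching retention on an existing topic (names are placeholders):
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name xxxxxx \
  --add-config remote.storage.enable=true,local.retention.ms=600000,retention.ms=15811200000,segment.bytes=536870912

# Verify the resulting dynamic configs:
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name xxxxxx
```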

       

      By examining the segment files for that time period in S3 for this topic, it can be seen that the segment indices (base offsets) of the two brokers differ.

      Searching the broker logs for the residual segments' indices shows no deletion records on either the leader or the replica. However, the S3 segments for the corresponding time period can be found in the leader's logs but not in the replica's. I therefore believe the issue is caused by the leader and the replica generating different log segment files.

      Restarting does not resolve the issue. The only workaround is to delete the log directory of the affected replica partition and then resynchronize it from the leader.
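The workaround can be sketched roughly as follows (commands are illustrative only; the service name and path are hypothetical, and step 2 is destructive):

```shell
# 1. Stop the affected broker (service name depends on the installation):
#      systemctl stop kafka
# 2. Delete the affected partition's log directory (placeholder path):
#      rm -rf /data01/kafka-logs/<topic>-<partition>
# 3. Restart the broker; the replica fetcher re-syncs the partition from the leader:
#      systemctl start kafka
```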

      Attachments

        1. image-2024-06-22-21-45-43-815.png
          398 kB
          Jianbin Chen
        2. image-2024-06-22-21-46-12-371.png
          495 kB
          Jianbin Chen
        3. image-2024-06-22-21-46-26-530.png
          340 kB
          Jianbin Chen
        4. image-2024-06-22-21-46-42-917.png
          271 kB
          Jianbin Chen
        5. image-2024-06-22-21-47-00-230.png
          463 kB
          Jianbin Chen


            People

              Assignee: Unassigned
              Reporter: Jianbin Chen
              Votes: 0
              Watchers: 3
