Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-15931

Cached transaction index gets closed if tiered storage read is interrupted

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Patch Available
    • Minor
    • Resolution: Unresolved
    • 3.6.0
    • None
    • Tiered-Storage
    • None

    Description

      This reproduces when reading from remote storage with the default fetch.max.wait.ms (500) or lower. isolation.level=read_committed is needed to trigger this.

      It's not easy to reproduce on local-only setups, unfortunately, because reads are fast and aren't interrupted.

      This error is logged

      [2023-11-29 14:01:01,166] ERROR Error occurred while reading the remote data for topic1-0 (kafka.log.remote.RemoteLogReader)
      org.apache.kafka.common.KafkaException: Failed read position from the transaction index <index_file_name>
          at org.apache.kafka.storage.internals.log.TransactionIndex$1.hasNext(TransactionIndex.java:235)
          at org.apache.kafka.storage.internals.log.TransactionIndex.collectAbortedTxns(TransactionIndex.java:171)
          at kafka.log.remote.RemoteLogManager.collectAbortedTransactions(RemoteLogManager.java:1359)
          at kafka.log.remote.RemoteLogManager.addAbortedTransactions(RemoteLogManager.java:1341)
          at kafka.log.remote.RemoteLogManager.read(RemoteLogManager.java:1310)
          at kafka.log.remote.RemoteLogReader.call(RemoteLogReader.java:62)
          at kafka.log.remote.RemoteLogReader.call(RemoteLogReader.java:31)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.nio.channels.ClosedChannelException
          at java.base/sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:150)
          at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:325)
          at org.apache.kafka.storage.internals.log.TransactionIndex$1.hasNext(TransactionIndex.java:233)
          ... 10 more
      

      and after that this txn index becomes unusable until the process is restarted.

      I suspect, it's caused by the reading thread being interrupted due to the fetch timeout. At least this code in AbstractInterruptibleChannel is called.

      Fixing may be easy: reopen the channel in TransactionIndex if it's close. However, off the top of my head I can't say if there are some less obvious implications of this change.

       

       

      Attachments

        Issue Links

          Activity

            People

              jeqo Jorge Esteban Quilcate Otoya
              ivanyu Ivan Yurchenko
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: