Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-17663

Ensure FileStreamTask cannot compromise shared channel proxy for system table when interrupted

    XMLWordPrintableJSON

Details

    Description

      This is very similar to what we fixed in CASSANDRA-17466 , just in a different part of the repair/streaming machine.

      NettyStreamingMessageSender is responsible for dispatching FileStreamTask on an executor it manages to stream files to its peers. When the sender is closed for any reason, like perhaps a peer blowing up while deserializing the stream, the executor it manages is shut down w/ interruption (i.e. shutdownNow()). This is problematic if we happen to have not gotten very far along in FileStreamTask#run(). If we're just about to call getOrCreateChannel(), which reads from the peers_v2 system table, the ChannelProxy read will throw a ClosedByInterruptedException and the proxy will be useless. The twist is that, since CASSANDRA-15666, this exception has essentially been swallowed by FileStreamTask's exception handling. So we don't see a ClosedByInterruptedException in the logs, but we do have things like this pop up when anything else tries to hit the peers table:

      ERROR 2022-05-19T21:49:23,218 [AntiEntropyStage:1] org.apache.cassandra.service.CassandraDaemon:601 - Exception in thread Thread[AntiEntropyStage:1,5,main]
      java.lang.RuntimeException: FSReadError in .../data/system/peers_v2-c4325fbb8e5e3bafbd070f9250ed818e/system-peers_v2-nb-101-big-Data.db
              at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:108) 
              at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
              at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:433)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) 
              at java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.58.Final.jar:4.1.58.Final]
              at java.lang.Thread.run(Thread.java:834)
      Caused by: org.apache.cassandra.io.FSReadError: java.nio.channels.ClosedChannelException
              at org.apache.cassandra.io.util.ChannelProxy.read(ChannelProxy.java:143) 
              at org.apache.cassandra.io.util.CompressedChunkReader$Standard.readChunk(CompressedChunkReader.java:115) 
              at org.apache.cassandra.io.util.BufferManagingRebufferer.rebuffer(BufferManagingRebufferer.java:79) 
              at org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:68) 
              at org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:210) 
              at org.apache.cassandra.io.util.FileHandle.createReader(FileHandle.java:151) 
      

      ...and...

      ERROR 2022-05-19T22:06:20,175 [CompactionExecutor:12] org.apache.cassandra.service.CassandraDaemon:601 - Exception in thread Thread[CompactionExecutor:12,1,main]
      org.apache.cassandra.io.FSReadError: java.nio.channels.ClosedChannelException
              at org.apache.cassandra.io.util.ChannelProxy.read(ChannelProxy.java:143) 
              at org.apache.cassandra.io.util.CompressedChunkReader$Standard.readChunk(CompressedChunkReader.java:115) 
              at org.apache.cassandra.io.util.BufferManagingRebufferer.rebuffer(BufferManagingRebufferer.java:79) 
              at org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:68) 
              at org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:210) 
              at org.apache.cassandra.io.sstable.format.big.BigTableScanner.seekToCurrentRangeStart(BigTableScanner.java:196) 
              at org.apache.cassandra.io.sstable.format.big.BigTableScanner.access$400(BigTableScanner.java:52) 
              at org.apache.cassandra.io.sstable.format.big.BigTableScanner$KeyScanningIterator.computeNext(BigTableScanner.java:305) 
              at org.apache.cassandra.io.sstable.format.big.BigTableScanner$KeyScanningIterator.computeNext(BigTableScanner.java:282) 
              at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:46) 
              at org.apache.cassandra.io.sstable.format.big.BigTableScanner.hasNext(BigTableScanner.java:261) 
      

      ...which obviously get us into trouble w/ the disk failure policy.

      The fix proposed here is just to get the peers table read out of the thread that can be interrupted. Specifically, NettyStreamingMessageSender materializes a connectTo address at stream task creation time. This seemed a better option than making shutdown non-interrupting, since that would mean changing how the actual file streaming responds to shutdown.

      Attachments

        Issue Links

          Activity

            People

              maedhroz Caleb Rackliffe
              maedhroz Caleb Rackliffe
              Caleb Rackliffe
              David Capwell, Jon Meredith, Marcus Eriksson
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 4h 20m
                  4h 20m