Cassandra
  1. Cassandra
  2. CASSANDRA-475

sending random data crashes thrift service

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 0.7 beta 1
    • Component/s: Core
    • Labels:
      None

      Description

      Use dd if=/dev/urandom count=1 | nc $host 9160 as a handy recipe for shutting a cassandra instance down.

      Thrift has spoken (see THRIFT-601), but "Don't Do That" is probably an insufficient answer for our users.

      1. trunk-475-src-4.txt
        0.9 kB
        Nate McCall
      2. trunk-475-src-3.txt
        4 kB
        Nate McCall
      3. trunk-475-config.txt
        0.8 kB
        Nate McCall

        Issue Links

          Activity

          Hide
          Jonathan Ellis added a comment -

          We don't have the resources to devote to properly fixing THRIFT-601, so we'll have to close this as wontfix until or unless something changes there.

          If you need to expose Cassandra to untrusted sources, I suggest helping with Avro integration. (Ask Eric for how, he is the closest to that.)

          Show
          Jonathan Ellis added a comment - We don't have the resources to devote to properly fixing THRIFT-601 , so we'll have to close this as wontfix until or unless something changes there. If you need to expose Cassandra to untrusted sources, I suggest helping with Avro integration. (Ask Eric for how, he is the closest to that.)
          Hide
          Jonathan Ellis added a comment -

          THRIFT-601 has reportedly been fixed. Not sure how painful their fix is for us, we should have a look.

          Show
          Jonathan Ellis added a comment - THRIFT-601 has reportedly been fixed. Not sure how painful their fix is for us, we should have a look.
          Hide
          Pedro Paixao added a comment -

          Do you know if their patch has been committed to to cassandra's trunk?

          Show
          Pedro Paixao added a comment - Do you know if their patch has been committed to to cassandra's trunk?
          Hide
          Jonathan Ellis added a comment -

          this ticket is open precisely to incorporate a newer thrift and make any other necessary changes to fix the problem

          Show
          Jonathan Ellis added a comment - this ticket is open precisely to incorporate a newer thrift and make any other necessary changes to fix the problem
          Hide
          Jonathan Ellis added a comment -

          looks like we need to set a transport max length and a protocol read length (besides upgrading to a new thrift jar).

          unclear if these would prohibit using large binary column values. may need to make them configurable. (would transport length be read length + some overhead? or is transport length just a buffer size that we can leave at some reasonably safe value and is transparent to the rest?)

          Show
          Jonathan Ellis added a comment - looks like we need to set a transport max length and a protocol read length (besides upgrading to a new thrift jar). unclear if these would prohibit using large binary column values. may need to make them configurable. (would transport length be read length + some overhead? or is transport length just a buffer size that we can leave at some reasonably safe value and is transparent to the rest?)
          Hide
          Nate McCall added a comment -

          TFramedTransport maxLength is just a frame buffer size. It defaults to 2^31-1 in the latest thrift. Im not sure tweaking that buys us much given the following.

          TBinaryProtocol.readLength is what needs to be set. NOTE: The value provided here will effectively be the max column size (there is no overhead due to thrift's delimitation granularity).

          This should be configurable with a sane default. How about thrift_max_message_length with a default of 1 (in MB). (This is the default max_packet_size in mysql, so might be easier to grok if we match that).

          Show
          Nate McCall added a comment - TFramedTransport maxLength is just a frame buffer size. It defaults to 2^31-1 in the latest thrift. Im not sure tweaking that buys us much given the following. TBinaryProtocol.readLength is what needs to be set. NOTE : The value provided here will effectively be the max column size (there is no overhead due to thrift's delimitation granularity). This should be configurable with a sane default. How about thrift_max_message_length with a default of 1 (in MB). (This is the default max_packet_size in mysql, so might be easier to grok if we match that).
          Hide
          Jonathan Ellis added a comment -

          so readLength is the max size of a single field, and maxLength the max size of the entire rpc call?

          is maxLength actually controlling a byte[] size somewhere, or not? if it is then we do need to set it, or if the first 4 bytes correspond to a large int we still OOM.

          Show
          Jonathan Ellis added a comment - so readLength is the max size of a single field, and maxLength the max size of the entire rpc call? is maxLength actually controlling a byte[] size somewhere, or not? if it is then we do need to set it, or if the first 4 bytes correspond to a large int we still OOM.
          Hide
          Nate McCall added a comment -

          Yes, maxLength does directly control the size of the input byte[] on TFramedTransport, so it is succeptible to the same issue.

          I would like to change thrift_framed_transport to thrift_framed_transport_size with 0 (in MB as well) as the default indicating TSocket over TFramedTransport

          Show
          Nate McCall added a comment - Yes, maxLength does directly control the size of the input byte[] on TFramedTransport, so it is succeptible to the same issue. I would like to change thrift_framed_transport to thrift_framed_transport_size with 0 (in MB as well) as the default indicating TSocket over TFramedTransport
          Hide
          Nate McCall added a comment -

          Looks like the thrift folks have changed the parameterization of TBase: THRIFT-759

          Unfortunately, this beat our patch for THRIFT-601 by a couple of days. Thoughts?

          Show
          Nate McCall added a comment - Looks like the thrift folks have changed the parameterization of TBase: THRIFT-759 Unfortunately, this beat our patch for THRIFT-601 by a couple of days. Thoughts?
          Hide
          Jonathan Ellis added a comment -

          If you're asking "is this worth having to rewrite a bunch of our code" then the answer is "yes, but this is why we upgrade thrift version so infrequently."

          Show
          Jonathan Ellis added a comment - If you're asking "is this worth having to rewrite a bunch of our code" then the answer is "yes, but this is why we upgrade thrift version so infrequently."
          Hide
          Nate McCall added a comment - - edited

          We are agreed on the importance of this. The changes on the surface seem trivial, since Comparator is widely used already. My concerns here are strictly schedule related given the process overhead involved in touching so many files in active development.

          I'm going to have to put this down temporarily, until I can focus on this and make the change a one-shot deal (or as close to that as possible).

          EDIT: I just added CASSANDRA-1266 to track the jar upgrade separately so we don't muddy the waters on this issue.

          Show
          Nate McCall added a comment - - edited We are agreed on the importance of this. The changes on the surface seem trivial, since Comparator is widely used already. My concerns here are strictly schedule related given the process overhead involved in touching so many files in active development. I'm going to have to put this down temporarily, until I can focus on this and make the change a one-shot deal (or as close to that as possible). EDIT: I just added CASSANDRA-1266 to track the jar upgrade separately so we don't muddy the waters on this issue.
          Hide
          Nate McCall added a comment -

          I can continue with the config modifications required if the changes mentioned above are cool

          Show
          Nate McCall added a comment - I can continue with the config modifications required if the changes mentioned above are cool
          Hide
          Nate McCall added a comment - - edited
          • trunk-475-config.txt updates the yaml config
          • trunk-475-src.txt updates java src

          EDIT: I dropped in a default setting of 256mb in the Converter when someone had framed transport configured in storage-conf. Was not sure what all to do there except add a sane default and a warning message.

          Show
          Nate McCall added a comment - - edited trunk-475-config.txt updates the yaml config trunk-475-src.txt updates java src EDIT: I dropped in a default setting of 256mb in the Converter when someone had framed transport configured in storage-conf. Was not sure what all to do there except add a sane default and a warning message.
          Hide
          Jonathan Ellis added a comment -

          do we need a sanity check that frame size must be < message length?

          Show
          Jonathan Ellis added a comment - do we need a sanity check that frame size must be < message length?
          Hide
          Nate McCall added a comment -

          I thought about that, but it's pretty clear exception wise when you go over the frame size that it is a transport issue

          Show
          Nate McCall added a comment - I thought about that, but it's pretty clear exception wise when you go over the frame size that it is a transport issue
          Hide
          Nate McCall added a comment -

          Come to think, nice to have completeness-wise and was easy enough to add.

          trunk-475-src-2.txt superseeds trunk-475-src.txt

          Show
          Nate McCall added a comment - Come to think, nice to have completeness-wise and was easy enough to add. trunk-475-src-2.txt superseeds trunk-475-src.txt
          Hide
          Jonathan Ellis added a comment -

          hmm, why isn't that check firing with the default frame of 256 and length of 1?

          Show
          Jonathan Ellis added a comment - hmm, why isn't that check firing with the default frame of 256 and length of 1?
          Hide
          Nate McCall added a comment -

          I'm gonna blame that one on your central-TX weather melting my brain.

          trunk-475-src-3.txt replaces trunk-475-src-2.txt, fixes comparisson induced by mild heat stroke.

          Show
          Nate McCall added a comment - I'm gonna blame that one on your central-TX weather melting my brain. trunk-475-src-3.txt replaces trunk-475-src-2.txt, fixes comparisson induced by mild heat stroke.
          Hide
          Jonathan Ellis added a comment -

          is thrift smart enough to re-use those byte[] buffers for multiple requests? allocating byte[] is expensive in java.

          Show
          Jonathan Ellis added a comment - is thrift smart enough to re-use those byte[] buffers for multiple requests? allocating byte[] is expensive in java.
          Hide
          Nate McCall added a comment -

          Both TBinaryProtocol and TFramedTransport use the limits in checks only. The size used to construct the byte[] is pulled from the first couple bytes of any thrift message. When junk was sent, this size was getting intepreted as extremely large values.

          Show
          Nate McCall added a comment - Both TBinaryProtocol and TFramedTransport use the limits in checks only. The size used to construct the byte[] is pulled from the first couple bytes of any thrift message. When junk was sent, this size was getting intepreted as extremely large values.
          Hide
          Jonathan Ellis added a comment -

          That makes sense.

          That would be an obvious optimization for them to make, then – allocate a buffer at the max size, and re-use it.

          Show
          Jonathan Ellis added a comment - That makes sense. That would be an obvious optimization for them to make, then – allocate a buffer at the max size, and re-use it.
          Hide
          Jonathan Ellis added a comment -

          committed

          Show
          Jonathan Ellis added a comment - committed
          Hide
          Brandon Williams added a comment -

          I'm occasionally getting this error with stress.py after doing a few million inserts:

          ERROR 13:30:12,158 Thrift error occurred during processing of message.
          org.apache.thrift.TException: Message length exceeded: 32
          at org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:384)
          at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:361)
          at org.apache.cassandra.thrift.Column.read(Column.java:498)
          at org.apache.cassandra.thrift.ColumnOrSuperColumn.read(ColumnOrSuperColumn.java:351)
          at org.apache.cassandra.thrift.Mutation.read(Mutation.java:346)
          at org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:16780)
          at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3041)
          at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2531)
          at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:619)

          Show
          Brandon Williams added a comment - I'm occasionally getting this error with stress.py after doing a few million inserts: ERROR 13:30:12,158 Thrift error occurred during processing of message. org.apache.thrift.TException: Message length exceeded: 32 at org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:384) at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:361) at org.apache.cassandra.thrift.Column.read(Column.java:498) at org.apache.cassandra.thrift.ColumnOrSuperColumn.read(ColumnOrSuperColumn.java:351) at org.apache.cassandra.thrift.Mutation.read(Mutation.java:346) at org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:16780) at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3041) at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2531) at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619)
          Hide
          Jon Hermes added a comment - - edited

          Hudson also found this an error in build 492.
          I assume the hadoop test that failed found it by doing a lot of inserts as above.

          Show
          Jon Hermes added a comment - - edited Hudson also found this an error in build 492. I assume the hadoop test that failed found it by doing a lot of inserts as above.
          Hide
          Nate McCall added a comment -

          checkReadLength in TBinaryProtocol looks like it is treating the readLength attribute as an instance variable and is therefore slowly getting decremented on each call!

          I'm verifying my findings w/ thrift folks now

          Show
          Nate McCall added a comment - checkReadLength in TBinaryProtocol looks like it is treating the readLength attribute as an instance variable and is therefore slowly getting decremented on each call! I'm verifying my findings w/ thrift folks now
          Hide
          Nate McCall added a comment -

          THRIFT-820 created and patch submitted. Waiting on their review and feedback.

          Show
          Nate McCall added a comment - THRIFT-820 created and patch submitted. Waiting on their review and feedback.
          Hide
          Hudson added a comment -
          Show
          Hudson added a comment - Integrated in Cassandra #493 (See http://hudson.zones.apache.org/hudson/job/Cassandra/493/ )
          Hide
          Nate McCall added a comment -

          Resets the input and output protocol on each after each successful call to process in CustomTThreadPoolServer

          passes nosetests and stress.py with 5 million rows

          Show
          Nate McCall added a comment - Resets the input and output protocol on each after each successful call to process in CustomTThreadPoolServer passes nosetests and stress.py with 5 million rows
          Hide
          Brandon Williams added a comment -

          Verified this solves the issue and checked for any adverse performance effects. Committed.

          Show
          Brandon Williams added a comment - Verified this solves the issue and checked for any adverse performance effects. Committed.
          Hide
          Jonathan Ellis added a comment -

          is this going to hurt performance? (not sure how heavyweight getProtocol is)

          Show
          Jonathan Ellis added a comment - is this going to hurt performance? (not sure how heavyweight getProtocol is)
          Hide
          Brandon Williams added a comment -

          I checked the performance and couldn't see any noticeable difference. The extra garbage might minorly exacerbate CASSANDRA-1014.

          Show
          Brandon Williams added a comment - I checked the performance and couldn't see any noticeable difference. The extra garbage might minorly exacerbate CASSANDRA-1014 .
          Hide
          Hudson added a comment -

          Integrated in Cassandra #496 (See http://hudson.zones.apache.org/hudson/job/Cassandra/496/)
          Reset the input and output protocol on each after each successful call. Patch by Nate McCall, reviewed by brandonwilliams for CASSANDRA-475

          Show
          Hudson added a comment - Integrated in Cassandra #496 (See http://hudson.zones.apache.org/hudson/job/Cassandra/496/ ) Reset the input and output protocol on each after each successful call. Patch by Nate McCall, reviewed by brandonwilliams for CASSANDRA-475
          Hide
          Jignesh Dhruv added a comment -

          Hello,

          I am using the latest source code from trunk. SVN 961952

          After few hundred thousand inserts cassandra crashes and is throwing 2 different types of exceptions:
          The first one being:
          org.apache.thrift.transport.TTransportException
          at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
          at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
          at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
          at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
          at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
          at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:369)
          at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:295)
          at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:202)
          at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:960)
          at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:944)
          at com.cbsi.pi.rtss.data.cassandra.CassandraDataManager.insert(CassandraDataManager.java:107)
          at com.cbsi.pi.rtss.service.bulk.BulkThread.run(BulkThread.java:59)
          at java.lang.Thread.run(Unknown Source)

          and the second one being:
          org.apache.thrift.transport.TTransportException: java.net.SocketException: Software caused connection abort: socket write error
          at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
          at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
          at org.apache.cassandra.thrift.Cassandra$Client.send_set_keyspace(Cassandra.java:441)
          at org.apache.cassandra.thrift.Cassandra$Client.set_keyspace(Cassandra.java:430)
          at com.cbsi.pi.rtss.data.cassandra.CassandraDataManager.insert(CassandraDataManager.java:106)
          at com.cbsi.pi.rtss.service.bulk.BulkThread.run(BulkThread.java:59)
          at java.lang.Thread.run(Unknown Source)
          Caused by: java.net.SocketException: Software caused connection abort: socket write error
          at java.net.SocketOutputStream.socketWrite0(Native Method)
          at java.net.SocketOutputStream.socketWrite(Unknown Source)
          at java.net.SocketOutputStream.write(Unknown Source)
          at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
          at java.io.BufferedOutputStream.write(Unknown Source)
          at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
          ... 6 more

          I was not getting this error yesterday but this morning when I updated my svn trunk I got update for following 2 files:
          U src/java/org/apache/cassandra/thrift/CustomTThreadPoolServer.java
          U src/java/org/apache/cassandra/scheduler/RoundRobinScheduler.java

          One more thing.

          Before I did a SVN update this morning, I checkout thrift and applied the patch suggested by Nate in bug THRIFT-820 https://issues.apache.org/jira/browse/THRIFT-820. When I rebuild thrift jar file and used it, I had no crashes or exceptions yesterday.

          But today with/without thrift patch, I am getting exceptions mentioned above.

          Thanks,
          Jignesh

          and that is causing the problem.

          Show
          Jignesh Dhruv added a comment - Hello, I am using the latest source code from trunk. SVN 961952 After few hundred thousand inserts cassandra crashes and is throwing 2 different types of exceptions: The first one being: org.apache.thrift.transport.TTransportException at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129) at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:369) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:295) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:202) at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:960) at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:944) at com.cbsi.pi.rtss.data.cassandra.CassandraDataManager.insert(CassandraDataManager.java:107) at com.cbsi.pi.rtss.service.bulk.BulkThread.run(BulkThread.java:59) at java.lang.Thread.run(Unknown Source) and the second one being: org.apache.thrift.transport.TTransportException: java.net.SocketException: Software caused connection abort: socket write error at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147) at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156) at org.apache.cassandra.thrift.Cassandra$Client.send_set_keyspace(Cassandra.java:441) at org.apache.cassandra.thrift.Cassandra$Client.set_keyspace(Cassandra.java:430) at com.cbsi.pi.rtss.data.cassandra.CassandraDataManager.insert(CassandraDataManager.java:106) at com.cbsi.pi.rtss.service.bulk.BulkThread.run(BulkThread.java:59) at java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketException: Software caused connection abort: socket write error at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(Unknown Source) at java.net.SocketOutputStream.write(Unknown Source) at java.io.BufferedOutputStream.flushBuffer(Unknown Source) at java.io.BufferedOutputStream.write(Unknown Source) at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145) ... 6 more I was not getting this error yesterday but this morning when I updated my svn trunk I got update for following 2 files: U src/java/org/apache/cassandra/thrift/CustomTThreadPoolServer.java U src/java/org/apache/cassandra/scheduler/RoundRobinScheduler.java One more thing. Before I did a SVN update this morning, I checkout thrift and applied the patch suggested by Nate in bug THRIFT-820 https://issues.apache.org/jira/browse/THRIFT-820 . When I rebuild thrift jar file and used it, I had no crashes or exceptions yesterday. But today with/without thrift patch, I am getting exceptions mentioned above. Thanks, Jignesh and that is causing the problem.
          Hide
          Jignesh Dhruv added a comment -

          Apologize for the confusion. I did another svn update and picked bunch of new files. I am not getting any cassandra crashes or exceptions any more.

          Show
          Jignesh Dhruv added a comment - Apologize for the confusion. I did another svn update and picked bunch of new files. I am not getting any cassandra crashes or exceptions any more.

            People

            • Assignee:
              Nate McCall
              Reporter:
              Eric Evans
            • Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development