Uploaded image for project: 'Apache Cassandra'
  1. Apache Cassandra
  2. CASSANDRA-440

get_key_range problems when a node is down

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Normal
    • Resolution: Fixed
    • 0.4, 0.5
    • None
    • None
    • 64-bit 4GB Rackspace-cloud boxes running FC11 (saw problem on 32-bit platform as well)

    • Normal

    Description

      I'm running Cassandra on 5 nodes using the
      OrderPreservingPartitioner, and have populated Cassandra with 78
      records, and I can use get_key_range via Thrift just fine. Then, if I
      manually kill one of the nodes (if I kill off node #5), the node (node
      #1) which I've been using to call get_key_range will timeout and the
      error:

      Thrift: Internal error processing get_key_range

      The Cassandra output traceback:

      ERROR - Encountered IOException on connection:
      java.nio.channels.SocketChannel[closed]
      java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
      at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
      at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
      at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
      WARN - Closing down connection java.nio.channels.SocketChannel[closed]
      ERROR - Internal error processing get_key_range
      java.lang.RuntimeException: java.util.concurrent.TimeoutException:
      Operation timed out.
      at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
      at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
      at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
      at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
      at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      at java.lang.Thread.run(Thread.java:675)
      Caused by: java.util.concurrent.TimeoutException: Operation timed out.
      at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
      at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
      ... 7 more

      The error starts as soon as the downed node #5 goes down and lasts
      until I restart the downed node #5.

      bin/nodeprobe cluster is accurate (it knows quickly when #5 is down,
      and when it is up again)

      Since I set the replication set to 3, I'm confused as to why (after
      the first few seconds or so) there is an error just because one host
      is down temporarily.

      (Jonathan Ellis and I discussed this on the mailing list, let me know if more information is needed.)

      Attachments

        1. 440.patch
          62 kB
          Jonathan Ellis

        Activity

          People

            jbellis Jonathan Ellis
            simongsmith Simon Smith
            Jonathan Ellis
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: