CASSANDRA-6352

Cluster does not respond to new SELECT query after a timeout


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Windows7, C* v2.0.xx, 4-node cluster, JVM 1.7.0_45-b18 Xmx16GB, Datastax Java Driver 1.0.4 and 2.0.0-beta2

    • Severity:
      Normal
    • Since Version:

      Description

      Hello,

      We have encountered the following issue three times. Here is a description:

      • Data is imported via the Datastax Java Driver (DJD) v2.0.0-b2 with BatchStatement (i.e., a batch of PreparedStatements). The performance is quite impressive.
      • If we query the cluster via cqlsh (C* 2.0.x) and DJD v1.0.4, everything goes well.
      • But when we query with DJD v2.0.0-b2, we get an exception:

        com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)

      • Afterward, no SELECT query works anymore:
        • all queries via cqlsh fail with rpc_timeout
        • all queries via DJD v1.0.4 fail with the same exception as with v2.0.0-b2
        • these queries worked perfectly before the first SELECT with DJD v2.0.0
      • nodetool status shows all nodes still Up and Normal
      • nodetool flush still works on all nodes

      Only a reboot of all nodes resolved the issue.
      Unfortunately, we don't have any useful information in the log files:
      Node1: the handshake at 11:28:48 is strange because we didn't reboot any node

      INFO [MemoryMeter:1] 2013-11-15 11:27:11,724 Memtable.java (line 444) CFS(Keyspace='hector', ColumnFamily='pdl_caching') liveRatio is 5.06951175012658 (just-counted was 4.902669365509605). calculation took 140ms for 57108 columns
      INFO [HANDSHAKE-/10.30.226.166] 2013-11-15 11:28:48,550 OutboundTcpConnection.java (line 386) Handshaking version with /10.30.226.166
      INFO [RMI TCP Connection(4)-10.30.224.229] 2013-11-15 11:32:29,256 ColumnFamilyStore.java (line 734) Enqueuing flush of Memtable-sstable_activity@2142066849(0/0 serialized/live bytes, 24 ops)
      INFO [FlushWriter:76] 2013-11-15 11:32:29,257 Memtable.java (line 328) Writing Memtable-sstable_activity@2142066849(0/0 serialized/live bytes, 24 ops)

      Node2: there is a hinted-handoff at 11:30:02...

      INFO [MemoryMeter:1] 2013-11-15 11:25:32,897 Memtable.java (line 444) CFS(Keyspace='hector', ColumnFamily='pdl_identity') liveRatio is 6.046071792095967 (just-counted was 5.493829833297251). calculation took 3ms for 608 columns
      INFO [HintedHandoff:1] 2013-11-15 11:30:02,656 HintedHandOffManager.java (line 322) Started hinted handoff for host: 2ce9f0a8-795c-4733-9d52-06057fcc690d with IP: /10.30.227.8
      INFO [HintedHandoff:1] 2013-11-15 11:30:12,663 HintedHandOffManager.java (line 449) Timed out replaying hints to /10.30.227.8; aborting (0 delivered)
      INFO [RMI TCP Connection(6)-10.30.224.229] 2013-11-15 11:35:20,096 ColumnFamilyStore.java (line 734) Enqueuing flush of Memtable-hints@581765413(1028/10280 serialized/live bytes, 2 ops)
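
      For anyone triaging similar logs, the handshake and hinted-handoff entries called out above can be pulled from a full system.log with a small filter. This is only a minimal sketch, assuming the log layout seen in the excerpts above; the names `parse` and `suspicious` are illustrative and not part of any Cassandra tooling.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts above:
# LEVEL [thread] yyyy-mm-dd HH:MM:SS,mmm Source.java (line N) message
LOG_LINE = re.compile(
    r"^\s*(?P<level>[A-Z]+)\s+\[(?P<thread>.+?)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<source>\S+) \(line (?P<line>\d+)\)\s+(?P<msg>.*)$"
)


def parse(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    event = m.groupdict()
    event["ts"] = datetime.strptime(event["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return event


def suspicious(lines, thread_prefixes=("HANDSHAKE", "HintedHandoff")):
    """Keep only entries whose thread name starts with one of the prefixes."""
    events = (parse(line) for line in lines)
    return [e for e in events if e and e["thread"].startswith(thread_prefixes)]


# Three lines copied from the excerpts above.
sample = [
    "INFO [HANDSHAKE-/10.30.226.166] 2013-11-15 11:28:48,550 OutboundTcpConnection.java (line 386) Handshaking version with /10.30.226.166",
    "INFO [FlushWriter:76] 2013-11-15 11:32:29,257 Memtable.java (line 328) Writing Memtable-sstable_activity@2142066849(0/0 serialized/live bytes, 24 ops)",
    "INFO [HintedHandoff:1] 2013-11-15 11:30:12,663 HintedHandOffManager.java (line 449) Timed out replaying hints to /10.30.227.8; aborting (0 delivered)",
]

hits = suspicious(sample)
for e in hits:
    print(e["ts"].time(), e["thread"], "|", e["msg"])
```

      Running the same filter over the logs of all four nodes and sorting by timestamp would show whether the handshake and the handoff timeout line up with the first failing SELECT.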

      It seems that the first SELECT query with DJD v2.0.0-b2 leaves the cluster in a "pending" or abnormal state, after which it no longer responds to further queries.

      I know that without logs it will be hard to reproduce.

      Thanks and regards,
      Minh
