Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Fix Version/s: None
    • Component/s: Core

      Description

      Currently, repair takes much longer to complete with vnodes than without them. This appears to be at least in part because repair uses one session per range and processes the ranges sequentially. With vnodes this also generates a lot of log spam, and while the sequential approach is gentler and lighter on hard-disk deployments, SSD-based deployments would often prefer that repair be as fast as possible.
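      To make the shape of the problem concrete, here is a minimal sketch (hypothetical names, not Cassandra's actual repair classes) contrasting today's one-session-per-range sequential loop with the bounded-parallel submission an SSD deployment would prefer:

      import java.util.*;
      import java.util.concurrent.*;

      public class RepairSketch
      {
          // Stand-in for one repair session's work on a single range.
          static void repairRange(int range) throws Exception
          {
              Thread.sleep(100); // placeholder for merkle validation + streaming
          }

          public static void main(String[] args) throws Exception
          {
              int numRanges = 256; // num_tokens=256 means 256 sessions instead of 1

              // Current behaviour: strictly sequential, so total time grows with
              // the number of vnode ranges even when each range holds little data.
              for (int r = 0; r < numRanges; r++)
                  repairRange(r);

              // Alternative: overlap sessions with bounded concurrency.
              ExecutorService pool = Executors.newFixedThreadPool(4);
              List<Future<Void>> pending = new ArrayList<>();
              for (int r = 0; r < numRanges; r++)
              {
                  final int range = r;
                  pending.add(pool.submit(() -> { repairRange(range); return null; }));
              }
              for (Future<Void> f : pending)
                  f.get(); // surface any per-range failure
              pool.shutdown();
          }
      }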

      1. 5220-yourkit.tar.bz2
        2.18 MB
        Ryan McGuire
      2. 5220-yourkit.png
        32 kB
        Ryan McGuire


          Activity

          Yuki Morishita added a comment - edited

          The reason repair runs almost sequentially is to synchronize merkle tree creation across the nodes (CASSANDRA-2816). If we could form groups of nodes that do not overlap for several ranges, we would be able to parallelize merkle tree creation/validation across those groups.
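          A rough sketch of that grouping idea (hypothetical names, not project code): greedily batch ranges whose replica sets are disjoint, so every range in a batch can have its merkle tree created and validated concurrently without any node working on two ranges at once.

          import java.util.*;

          public class DisjointRangeGroups
          {
              // replicasByRange: range id -> replica endpoints for that range.
              static List<List<String>> group(Map<String, Set<String>> replicasByRange)
              {
                  List<List<String>> groups = new ArrayList<>();
                  List<Set<String>> nodesInGroup = new ArrayList<>();
                  for (Map.Entry<String, Set<String>> e : replicasByRange.entrySet())
                  {
                      boolean placed = false;
                      for (int i = 0; i < groups.size() && !placed; i++)
                      {
                          if (Collections.disjoint(nodesInGroup.get(i), e.getValue()))
                          {
                              groups.get(i).add(e.getKey());
                              nodesInGroup.get(i).addAll(e.getValue());
                              placed = true;
                          }
                      }
                      if (!placed)
                      {
                          groups.add(new ArrayList<>(Arrays.asList(e.getKey())));
                          nodesInGroup.add(new HashSet<>(e.getValue()));
                      }
                  }
                  return groups; // ranges within one group can be validated in parallel
              }
          }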

          Jonathan Ellis added a comment -

          I suspect and hope that CASSANDRA-5351 will speed up repair enough that we won't need to tweak around the edges like this.

          Brandon Williams added a comment -

          I think this is still worth looking at.

          Without vnodes:
          repair_test.TestRepair.simple_repair_test: 103.3584

          With vnodes:
          repair_test.TestRepair.simple_repair_test: 348.8908

          So we're 3-3.5x slower in the simple case. I'll admit that doing the order-preserving repair test with vnodes is whack, but that one takes 10x longer with vnodes.

          Donald Smith added a comment - edited

          We ran "nodetool repair" on a 3 node cassandra cluster with production-quality hardware, using version 2.0.3. Each node had about 1TB of data. This is still testing. After 5 days the repair job still hasn't finished. I can see it's still running.

          Here's the process:

          root     30835 30774  0 Dec17 pts/0    00:03:53 /usr/bin/java -cp /etc/cassandra/conf:/usr/share/java/jna.jar:/usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/apache-cassandra-2.0.3.jar:/usr/share/cassandra/lib/apache-cassandra-clientutil-2.0.3.jar:/usr/share/cassandra/lib/apache-cassandra-thrift-2.0.3.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/guava-15.0.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.9.1.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.2.0.jar:/usr/share/cassandra/lib/metrics-core-2.2.0.jar:/usr/share/cassandra/lib/netty-3.6.6.Final.jar:/usr/share/cassandra/lib/reporter-config-2.1.0.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.11.jar:/usr/share/cassandra/lib/snappy-java-1.0.5.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/lib/stress.jar:/usr/share/cassandra/lib/thrift-server-0.3.2.jar -Xmx32m -Dlog4j.configuration=log4j-tools.properties -Dstorage-config=/etc/cassandra/conf org.apache.cassandra.tools.NodeCmd -p 7199 repair -pr as_reports
          

          The log output has just:

          xss =  -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8192M -Xmx8192M -Xmn2048M -XX:+HeapDumpOnOutOfMemoryError -Xss256k
          [2013-12-17 23:26:48,144] Starting repair command #1, repairing 256 ranges for keyspace as_reports
          

          Here's the output of "nodetool tpstats":

          cass3 /tmp> nodetool tpstats
          xss =  -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8192M -Xmx8192M -Xmn2048M -XX:+HeapDumpOnOutOfMemoryError -Xss256k
          Pool Name                    Active   Pending      Completed   Blocked  All time blocked
          ReadStage                         1         0       38083403         0                 0
          RequestResponseStage              0         0     1951200451         0                 0
          MutationStage                     0         0     2853354069         0                 0
          ReadRepairStage                   0         0        3794926         0                 0
          ReplicateOnWriteStage             0         0              0         0                 0
          GossipStage                       0         0        4880147         0                 0
          AntiEntropyStage                  1         3              9         0                 0
          MigrationStage                    0         0             30         0                 0
          MemoryMeter                       0         0            115         0                 0
          MemtablePostFlusher               0         0          75121         0                 0
          FlushWriter                       0         0          49934         0                52
          MiscStage                         0         0              0         0                 0
          PendingRangeCalculator            0         0              7         0                 0
          commitlog_archiver                0         0              0         0                 0
          AntiEntropySessions               1         1              1         0                 0
          InternalResponseStage             0         0              9         0                 0
          HintedHandoff                     0         0           1141         0                 0
          
          Message type           Dropped
          RANGE_SLICE                  0
          READ_REPAIR                  0
          PAGED_RANGE                  0
          BINARY                       0
          READ                       884
          MUTATION               1407711
          _TRACE                       0
          REQUEST_RESPONSE             0
          

          The cluster has some write traffic to it. We decided to test it under load.
          This is the busiest column family, as reported by "nodetool cfstats":

             Read Count: 38084316
                  Read Latency: 9.409910464927346 ms.
                  Write Count: 2850436738
                  Write Latency: 0.8083138546641199 ms.
                  Pending Tasks: 0
          ....
              Table: data_report_details
                          SSTable count: 592
                          Space used (live), bytes: 160644106183
                          Space used (total), bytes: 160663248847
                          SSTable Compression Ratio: 0.5296494510512617
                          Number of keys (estimate): 51015040
                          Memtable cell count: 311180
                          Memtable data size, bytes: 46275953
                          Memtable switch count: 6100
                          Local read count: 6147
                          Local read latency: 154.539 ms
                          Local write count: 750865416
                          Local write latency: 0.029 ms
                          Pending tasks: 0
                          Bloom filter false positives: 265
                          Bloom filter false ratio: 0.06009
                          Bloom filter space used, bytes: 64690104
                          Compacted partition minimum bytes: 30
                          Compacted partition maximum bytes: 10090808
                          Compacted partition mean bytes: 5267
                          Average live cells per slice (last five minutes): 1.0
                          Average tombstones per slice (last five minutes): 0.0
          

          We're going to restart the node. We barely do deletes or updates (only if a report is re-uploaded), so we suspect we can get by without doing repairs. Correct us if we're wrong about that.

          nodetool compactionstats outputs:

          xss =  -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8192M -Xmx8192M -Xmn2048M -XX:+HeapDumpOnOutOfMemoryError -Xss256k
          pending tasks: 166
          compaction type      keyspace      table                                    completed         total  unit  progress
               Compaction    as_reports     data_report_details_below_threshold      971187148    1899419306  bytes    51.13%
               Compaction    as_reports     data_report_details_below_threshold      950086203    1941500979  bytes    48.94%
               Compaction    as_reports     data_hierarchy_details                  2968934609    5808990354  bytes    51.11%
               Compaction    as_reports     data_report_details_below_threshold      945816183    1900166474  bytes    49.78%
               Compaction    as_reports     data_report_details_below_threshold      899143344    1943534395  bytes    46.26%
               Compaction    as_reports     data_report_details_below_threshold      856329840    1946566670  bytes    43.99%
               Compaction    as_reports     data_report_details                      195235688     915395763  bytes    21.33%
               Compaction    as_reports     data_report_details_below_threshold      982460217    1931001761  bytes    50.88%
               Compaction    as_reports     data_report_details_below_threshold      896609409    1931075688  bytes    46.43%
               Compaction    as_reports     data_report_details_below_threshold      869219044    1928977382  bytes    45.06%
               Compaction    as_reports     data_report_details_below_threshold      870931112    1901729646  bytes    45.80%
               Compaction    as_reports     data_report_details_below_threshold      879343635    1939491280  bytes    45.34%
               Compaction    as_reports     data_report_details_below_threshold      981888944    1893024439  bytes    51.87%
               Compaction    as_reports     data_report_details_below_threshold      871785587    1884652607  bytes    46.26%
               Compaction    as_reports     data_report_details_below_threshold      902340327    1913280943  bytes    47.16%
               Compaction    as_reports     data_report_details_below_threshold     1025069846    1901568674  bytes    53.91%
               Compaction    as_reports     data_report_details_below_threshold      920112020    1893272832  bytes    48.60%
               Compaction    as_reports     data_hierarchy_details                  2962138268    5774762866  bytes    51.29%
               Compaction    as_reports     data_report_details_below_threshold      790782860    1918640911  bytes    41.22%
               Compaction    as_reports     data_hierarchy_details                  2972501409    5885217724  bytes    50.51%
               Compaction    as_reports     data_report_details_below_threshold     1611697659    1939040337  bytes    83.12%
               Compaction    as_reports     data_report_details_below_threshold      943130526    1943713837  bytes    48.52%
               Compaction    as_reports     data_report_details_below_threshold      911127302    1952885196  bytes    46.66%
               Compaction    as_reports     data_report_details_below_threshold      911230087    1927967871  bytes    47.26%
          

          Now "nodetool tpstats" says:

          Pool Name                    Active   Pending      Completed   Blocked  All time blocked
          AntiEntropyStage                  1         3              9         0                 0
          

          We ran "nodetool repair -pr" on 10.1.40.43. Here are references to it. So, maybe the nodetool repair job hung.

          cass3 /var/log/cassandra> grep -i repair system.log.? | grep -i merkle
          system.log.1: INFO [AntiEntropySessions:1] 2013-12-17 23:26:48,459 RepairJob.java (line 116) [repair #c1540f60-67b5-11e3-b8b7-fb178cd88033] requesting merkle trees for data_report_details_by_uus (to [/10.1.40.42, dc1-cassandra-staging-03.dc01.revsci.net/10.1.40.43])
          system.log.1: INFO [AntiEntropyStage:1] 2013-12-17 23:26:48,807 RepairSession.java (line 157) [repair #c1540f60-67b5-11e3-b8b7-fb178cd88033] Received merkle tree for data_report_details_by_uus from /10.1.40.42
          system.log.1: INFO [AntiEntropyStage:1] 2013-12-17 23:26:49,091 RepairSession.java (line 157) [repair #c1540f60-67b5-11e3-b8b7-fb178cd88033] Received merkle tree for data_report_details_by_uus from /10.1.40.43
          system.log.1: INFO [AntiEntropyStage:1] 2013-12-19 03:58:31,007 RepairJob.java (line 116) [repair #c1540f60-67b5-11e3-b8b7-fb178cd88033] requesting merkle trees for data_hierarchy_details (to [/10.1.40.42, dc1-cassandra-staging-03.dc01.revsci.net/10.1.40.43])
          system.log.1: INFO [AntiEntropySessions:5] 2013-12-19 03:58:31,012 RepairJob.java (line 116) [repair #e0ff9ba0-68a4-11e3-b8b7-fb178cd88033] requesting merkle trees for data_report_details_by_uus (to [/10.1.40.41, dc1-cassandra-staging-03.dc01.revsci.net/10.1.40.43])
          system.log.1: INFO [AntiEntropyStage:1] 2013-12-19 03:58:31,316 RepairSession.java (line 157) [repair #e0ff9ba0-68a4-11e3-b8b7-fb178cd88033] Received merkle tree for data_report_details_by_uus from /10.1.40.41
          system.log.1: INFO [AntiEntropyStage:1] 2013-12-19 03:58:31,431 RepairSession.java (line 157) [repair #e0ff9ba0-68a4-11e3-b8b7-fb178cd88033] Received merkle tree for data_report_details_by_uus from /10.1.40.43
          
          Robert Coli added a comment - edited

          So we're 3-3.5x slower in the simple case.

          So, if:

          1) the default for gc_grace_seconds is how frequently we want people to repair
          2) and vnodes make repair 3-3.5x slower in the simple case
          3) and vnodes are enabled by default

          ... why has the default for gc_grace_seconds not been increased by 3-3.5x? (CASSANDRA-5850)

          Ryan McGuire added a comment - edited

          As of today, on cassandra-2.0 HEAD

          repair_test.TestRepair.simple_repair_test:

          without vnodes: Repair time: 5.10s

          with vnodes: Repair time: 562.97s

          That's roughly 100x slower than without vnodes, so I'm not sure what happened here since Brandon Williams ran this in November.

          Seeing the same slowdown in 2.1 HEAD.

          EDIT: Actually, this isn't the test length, this is the time the actual repair command took in that same dtest.

          Jonathan Ellis added a comment -

          Worth bisecting?

          Brandon Williams added a comment -

          without vnodes: Repair time: 5.10s

          That honestly sounds too fast to be believable to me; when I was tracking the dtest times, repair was always one of the longest, if not the longest.

          Brandon Williams added a comment -

          After talking with Ryan, I'm convinced that I just didn't have an accurate measure of actual repair time when I filed this, and the problem is even worse than I initially thought.

          Richard Low added a comment -

          It's going to be a lot slower when there's little data, because there is num_tokens times as much work to do. But when there is lots of data, the times should be pretty much independent of num_tokens, because most of repair is spent reading and hashing data. I ran some tests when we were developing vnodes (sorry, I no longer have the data) and this was the case. Something might have regressed, though.
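          That intuition fits a simple back-of-envelope model (the constants below are illustrative assumptions, not measurements): repair time ~ num_tokens * per-range overhead + data size / validation throughput.

          public class RepairCostModel
          {
              public static void main(String[] args)
              {
                  double perRangeOverheadSec = 2.0;  // assumed fixed cost per session (setup + messaging)
                  double hashBytesPerSec = 100e6;    // assumed read+hash throughput during validation

                  long[] dataSizes = { 100_000_000L, 1_000_000_000_000L }; // 100 MB vs 1 TB
                  int[] tokenCounts = { 1, 256 };
                  for (long bytes : dataSizes)
                      for (int tokens : tokenCounts)
                      {
                          double secs = tokens * perRangeOverheadSec + bytes / hashBytesPerSec;
                          System.out.printf("data=%dB num_tokens=%d -> ~%.0fs%n", bytes, tokens, secs);
                      }
                  // With little data the num_tokens term dominates (hundreds of times slower);
                  // at ~1 TB the hashing term dominates and num_tokens barely matters.
              }
          }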

          Jonathan Ellis added a comment -

          I just talked to some people who were seeing an 8-node (256 vnodes each) repair with about 1GB/node take two days.

          I would suggest doing some more digging to see where all the overhead is coming from, before guessing at solutions.

          Lyuben Todorov added a comment -

          I'll have a shot at adding some logging to the repair process so we can get a better idea of how much time is being spent in the different repair stages.

          Ryan McGuire added a comment - edited

          This is a YourKit comparison between the two runs of the dtest: Old Time is without vnodes, New Time is with vnodes. I turned off internode_compression because that run showed all the time being spent in Snappy compressor code, but even without compression the repair time didn't improve (much); see the attached 5220-yourkit.png.

          Ryan McGuire added a comment -

          I attached 5220-yourkit.tar.bz2, which contains YourKit snapshots of various runs of the dtest. Maybe someone with more YourKit experience can glean some more info out of it.

          Yuki Morishita added a comment -

          Thanks, Ryan.
          The time increase in Incoming/OutboundTcpConnection indicates repair is spending more time in messaging.
          It is understandable that messaging takes more than 200x as long when repairing 256x as many ranges.

          One possible solution is to repair multiple ranges at once.
          I have two ideas in mind:

          1. Build a two-level MerkleTree covering multiple ranges. The lower level holds the regular per-range MTs; the upper level is an MT whose leaves are the root hashes of the lower MTs. That way we can carry multiple MTs in one message round trip.
          2. Send the validation request once for all ranges; the replica node builds the MT for each range one by one and sends each MT back as it is built (sketched below).
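          For concreteness, a loose sketch of idea 2 (all types hypothetical, not the project's actual messaging classes): one validation request carries every range, and the replica streams one merkle tree back per range as it is built, replacing 256 request/response round trips with one request.

          import java.util.*;

          public class BatchedValidation
          {
              static class ValidationRequest
              {
                  final List<String> ranges; // all ranges to validate, in one message
                  ValidationRequest(List<String> ranges) { this.ranges = ranges; }
              }

              interface TreeSink { void send(String range, byte[] merkleTree); }

              static void handleValidation(ValidationRequest req, TreeSink replyTo)
              {
                  for (String range : req.ranges)           // build sequentially to stay gentle on I/O...
                  {
                      byte[] tree = buildMerkleTree(range); // ...but pay the messaging setup only once
                      replyTo.send(range, tree);            // reply per tree as soon as it is ready
                  }
              }

              static byte[] buildMerkleTree(String range) { return new byte[0]; } // stand-in
          }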
          Ryan McGuire added a comment -

          YourKit also listed some potential deadlocks, which apparently it doesn't save to the snapshot:

          Frozen threads found (potential deadlock)
           
          It seems that the following threads have not changed their stack for more than 10 seconds.
          These threads are possibly (but not necessarily!) in a deadlock or hung.
           
          Thread-10 <--- Frozen for at least 48 sec
          sun.nio.ch.FileDispatcherImpl.read0(FileDescriptor, long, int)
          sun.nio.ch.SocketDispatcher.read(FileDescriptor, long, int)
          sun.nio.ch.IOUtil.readIntoNativeBuffer(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.IOUtil.read(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.SocketChannelImpl.read(ByteBuffer)
          sun.nio.ch.SocketAdaptor$SocketInputStream.read(ByteBuffer)
          sun.nio.ch.ChannelInputStream.read(byte[], int, int)
          org.xerial.snappy.SnappyInputStream.hasNextChunk()
          org.xerial.snappy.SnappyInputStream.read()
          java.io.DataInputStream.readInt()
          org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion()
          org.apache.cassandra.net.IncomingTcpConnection.run()
          
          
          
          Thread-11 <--- Frozen for at least 1m 17 sec
          sun.nio.ch.FileDispatcherImpl.read0(FileDescriptor, long, int)
          sun.nio.ch.SocketDispatcher.read(FileDescriptor, long, int)
          sun.nio.ch.IOUtil.readIntoNativeBuffer(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.IOUtil.read(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.SocketChannelImpl.read(ByteBuffer)
          sun.nio.ch.SocketAdaptor$SocketInputStream.read(ByteBuffer)
          sun.nio.ch.ChannelInputStream.read(byte[], int, int)
          org.xerial.snappy.SnappyInputStream.hasNextChunk()
          org.xerial.snappy.SnappyInputStream.read()
          java.io.DataInputStream.readInt()
          org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion()
          org.apache.cassandra.net.IncomingTcpConnection.run()
          
          
          
          Thread-12 <--- Frozen for at least 48 sec
          sun.nio.ch.FileDispatcherImpl.read0(FileDescriptor, long, int)
          sun.nio.ch.SocketDispatcher.read(FileDescriptor, long, int)
          sun.nio.ch.IOUtil.readIntoNativeBuffer(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.IOUtil.read(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.SocketChannelImpl.read(ByteBuffer)
          sun.nio.ch.SocketAdaptor$SocketInputStream.read(ByteBuffer)
          sun.nio.ch.ChannelInputStream.read(byte[], int, int)
          org.xerial.snappy.SnappyInputStream.hasNextChunk()
          org.xerial.snappy.SnappyInputStream.read()
          java.io.DataInputStream.readInt()
          org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion()
          org.apache.cassandra.net.IncomingTcpConnection.run()
          
          
          
          Thread-13 <--- Frozen for at least 48 sec
          sun.nio.ch.FileDispatcherImpl.read0(FileDescriptor, long, int)
          sun.nio.ch.SocketDispatcher.read(FileDescriptor, long, int)
          sun.nio.ch.IOUtil.readIntoNativeBuffer(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.IOUtil.read(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.SocketChannelImpl.read(ByteBuffer)
          sun.nio.ch.SocketAdaptor$SocketInputStream.read(ByteBuffer)
          sun.nio.ch.ChannelInputStream.read(byte[], int, int)
          org.xerial.snappy.SnappyInputStream.hasNextChunk()
          org.xerial.snappy.SnappyInputStream.read()
          java.io.DataInputStream.readInt()
          org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion()
          org.apache.cassandra.net.IncomingTcpConnection.run()
          
          
          
          Thread-3 <--- Frozen for at least 1m 21 sec
          sun.nio.ch.FileDispatcherImpl.read0(FileDescriptor, long, int)
          sun.nio.ch.SocketDispatcher.read(FileDescriptor, long, int)
          sun.nio.ch.IOUtil.readIntoNativeBuffer(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.IOUtil.read(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.SocketChannelImpl.read(ByteBuffer)
          sun.nio.ch.SocketAdaptor$SocketInputStream.read(ByteBuffer)
          sun.nio.ch.ChannelInputStream.read(byte[], int, int)
          org.xerial.snappy.SnappyInputStream.hasNextChunk()
          org.xerial.snappy.SnappyInputStream.read()
          java.io.DataInputStream.readInt()
          org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion()
          org.apache.cassandra.net.IncomingTcpConnection.run()
          
          
          
          Thread-7 <--- Frozen for at least 1m 21 sec
          sun.nio.ch.FileDispatcherImpl.read0(FileDescriptor, long, int)
          sun.nio.ch.SocketDispatcher.read(FileDescriptor, long, int)
          sun.nio.ch.IOUtil.readIntoNativeBuffer(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.IOUtil.read(FileDescriptor, ByteBuffer, long, NativeDispatcher)
          sun.nio.ch.SocketChannelImpl.read(ByteBuffer)
          sun.nio.ch.SocketAdaptor$SocketInputStream.read(ByteBuffer)
          sun.nio.ch.ChannelInputStream.read(byte[], int, int)
          org.xerial.snappy.SnappyInputStream.hasNextChunk()
          org.xerial.snappy.SnappyInputStream.read()
          java.io.DataInputStream.readInt()
          org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion()
          org.apache.cassandra.net.IncomingTcpConnection.run()
          
          Jonathan Ellis added a comment -

          Send validation request once for all ranges, replica node builds MT for each range one by one, and sent back MT as it is built.

          This is a fairly straightforward extension, isn't it? I'd favor that approach.

          Juho Mäkinen added a comment -

          In addition, the repair operation gives poor status on its progress, so it would be nice if some additional logging about repair progress were added, both to log4j and to JMX.
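          A hedged illustration of that suggestion using plain JMX (names hypothetical, not Cassandra's actual MBeans): emit a notification per completed range so operators can watch progress without log scraping.

          import javax.management.*;

          interface RepairProgressMBean { int getCompletedRanges(); }

          public class RepairProgress extends NotificationBroadcasterSupport implements RepairProgressMBean
          {
              private volatile int completedRanges = 0;
              private long seq = 0;

              public int getCompletedRanges() { return completedRanges; }

              // Called by the (hypothetical) repair coordinator after each range finishes.
              public void rangeFinished(String range, int done, int total)
              {
                  completedRanges = done;
                  sendNotification(new Notification("repair.progress", this, ++seq,
                                                    "repaired " + range + " (" + done + "/" + total + " ranges)"));
              }
          }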

          Jeffery Schnick added a comment -

          I see this is fixed in 2.1 rc1, but is there a patch, or could I be pointed to the git commit where this was addressed?

          Thanks

          Cyril Scetbon added a comment -

          Jeffery Schnick It's not fixed yet. We just hope it'll be fixed in version 2.1.1, and people are currently digging to find where the overhead that slows down repair is located.

          Jeffery Schnick added a comment -

          Cyril Scetbon Thank you for the info.

          Yuki Morishita added a comment -

          Re-targeting this to 3.0, since my current work involves a messaging change to handle multiple ranges at once and is based on the major refactoring being done in CASSANDRA-6455.

          Jonathan Ellis added a comment -

          I just talked to some people who were seeing an 8-node (256 vnodes each) repair with about 1GB/node take two days.

          I'm still not sure we have a good handle on this. Is this reproducible? I'm not convinced "spending more time in messaging" is an adequate explanation.

          Yuki Morishita added a comment -

          I'm inclined to mark this 'Later' in favor of incremental repair and internal refactoring such as CASSANDRA-6455.
          In particular, incremental repair should decrease the time needed for validating data, which is one of the major heavy-lifting parts of repair.
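          Loosely, the win looks like this (hypothetical types, not the CASSANDRA-5351 implementation): only sstables not yet marked repaired feed validation, so the merkle-tree pass scales with new data rather than total data.

          import java.util.*;

          public class IncrementalValidation
          {
              static class SSTable
              {
                  final boolean repaired; // set once a previous repair has covered this sstable
                  SSTable(boolean repaired) { this.repaired = repaired; }
              }

              static List<SSTable> toValidate(List<SSTable> all)
              {
                  List<SSTable> pending = new ArrayList<>();
                  for (SSTable s : all)
                      if (!s.repaired)
                          pending.add(s); // unrepaired data is all that needs hashing
                  return pending;
              }
          }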

          Jonathan Ellis added a comment -

          Marking Later as suggested.

          Jeremy Hanna added a comment -

          I think it's important to reiterate that the project devs recognize that these inefficiencies are impacting many users. However, lots of parallel work is getting done on repair. As Yuki pointed out, with incremental repair (CASSANDRA-5351) already in 2.1 and improving the concurrency of the repair process (CASSANDRA-6455) coming in 3.0, many of the problems seen in this ticket will be resolved.

          Until 2.1/3.0, sub-range repair (CASSANDRA-5280) is helpful for parallelizing and repairing more efficiently with virtual nodes. See http://www.datastax.com/dev/blog/advanced-repair-techniques for details about the efficiency gains with sub-range repair; it's just more tedious to track. Saving repair data to a system table (CASSANDRA-5839) will help track that in Cassandra itself.
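          For reference, a sub-range invocation looks roughly like this, run once per token sub-range (the token values and keyspace name are placeholders to be computed from the ring):

          nodetool repair -st <start_token> -et <end_token> my_keyspace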


            People

            • Assignee:
              Yuki Morishita
              Reporter:
              Brandon Williams
            • Votes:
              13
              Watchers:
              36
