Details

    • Type: Bug
    • Status: Awaiting Feedback
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 3.0.x, 3.11.x, 4.x
    • Component/s: Compaction
    • Labels: None

      Description

      After an upgrade from 2.2.6 to 3.0.15 (sstable version la to mc), when I run upgradesstables, most sstables upgrade fine, but on several nodes I see the exception below and the operation doesn't complete.

      CASSANDRA-12717 looks similar, but the stack trace is not the same, so I assume it is not the same issue. The various nodes this happens on all give the same trace.

      It may be notable that this is an analytics cluster with some large partitions, in the multi-GB range.

      error: Out of range: 7316844981
      -- StackTrace --
      java.lang.IllegalArgumentException: Out of range: 7316844981
      at com.google.common.primitives.Ints.checkedCast(Ints.java:91)
      at org.apache.cassandra.db.RowIndexEntry$IndexedEntry.promotedSize(RowIndexEntry.java:329)
      at org.apache.cassandra.db.RowIndexEntry$Serializer.serialize(RowIndexEntry.java:133)
      at org.apache.cassandra.io.sstable.format.big.BigTableWriter$IndexWriter.append(BigTableWriter.java:409)
      at org.apache.cassandra.io.sstable.format.big.BigTableWriter.afterAppend(BigTableWriter.java:120)
      at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:157)
      at org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:125)
      at org.apache.cassandra.db.compaction.writers.MaxSSTableSizeWriter.realAppend(MaxSSTableSizeWriter.java:88)
      at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.append(CompactionAwareWriter.java:109)
      at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:195)
      at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
      at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:89)
      at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:61)
      at org.apache.cassandra.db.compaction.CompactionManager$5.execute(CompactionManager.java:424)
      at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:311)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
      at java.lang.Thread.run(Thread.java:748)
      

          Activity

          dkinder Dan Kinder added a comment -

           Thanks Jeff Jirsa, I'll give it a shot.

           jjirsa Jeff Jirsa added a comment - edited

           Thanks for the feedback Sylvain Lebresne. I really appreciate you taking the time to respond. The risky feeling here is why I'm moving slowly on this myself - it seems straightforward, but obviously the potential for unexpected surprises here is pretty high.

           Better messaging on that assert makes a lot of sense. It's also good to see a second confirmation that changing column_index_size_in_kb is a good workaround. The drawback is that it's instance-wide: if you have just a handful of wide rows (like this user - the histogram shows their 99% size is less than 1MB, but their max size is 394GB), you take a disk penalty on all keyspaces/tables/rows in order to not crash on the one bad row.

           Dan Kinder, if you need to unblock yourself right now, changing column_index_size_in_kb on your instance to 256 (7GB of index data needs to come in under 2GB, so a multiplying factor of at least 4) PROBABLY gets you past this issue, but expect a bit more disk IO (particularly reads) after the change (plus an upgradesstables run).
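
           For reference, that's this one line in cassandra.yaml (64 is the shipped default; 256 is just the value suggested above, so adjust to your data):

           # Granularity of the per-partition row index; larger values mean fewer
           # IndexInfo entries per partition, at the cost of coarser-grained reads.
           column_index_size_in_kb: 256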

          slebresne Sylvain Lebresne added a comment -

          do you two feel this is safe?

           I can't think of something this would hard-break off the top of my head. That said, and for what it's worth, my more complete initial reflection is that:

           • on 3.0/3.X, this feels a tad risky: we're adding new code to the file indexing (granted, not excessively complex code) and it doesn't get much more critical than that. It could also change the performance profile, and while it might be on the good side in many cases, it may not always be (especially when the patch relies on 2 new settings whose defaults may happen to not be the right ones for someone's practical workload). As few people will run into this problem in the first place, asking those rare users to change column_index_size_in_kb would probably be safer overall (tbc, I'm suggesting here to improve the error message of the checked cast to point people to that work-around, not to leave people on their own as done currently; a rough sketch of what such a message could look like follows this list).
           • on 4.0, we already have CASSANDRA-11206 (which is in fact in 3.11 as well) to help work with large indexes, and things like CASSANDRA-9754 are supposed to make that even better, so the memory benefits of this aren't that clear. CASSANDRA-11206 doesn't solve the AssertionError of this ticket, but we could move the index size from int to long (or varint) for that. Which isn't necessarily saying we shouldn't do this, but adding multiple ways to fix the same problem, each with its own new config setting (CASSANDRA-11206 added one, this patch adds 2), doesn't feel ideal, so that's to be taken into consideration imo.
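
           As a rough illustration of that error-message suggestion (a self-contained sketch only - the class, method name and message text are made up here, this is not actual Cassandra code), the checked cast in promotedSize() could fail with something actionable instead of the bare "Out of range":

           final class PromotedSizeCheck
           {
               // Hypothetical stand-in for the bare Ints.checkedCast(size) call: the same
               // overflow check, but the failure points operators at the
               // column_index_size_in_kb work-around discussed in this ticket.
               static int checkedPromotedSize(long size)
               {
                   if (size > Integer.MAX_VALUE)
                       throw new IllegalArgumentException(String.format(
                           "Serialized row index is %d bytes, over the 2GB limit of the current " +
                           "sstable format; consider raising column_index_size_in_kb so fewer " +
                           "index entries are written per partition, then re-run upgradesstables.",
                           size));
                   return (int) size; // safe: size fits in an int at this point
               }
           }
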
          jjirsa Jeff Jirsa added a comment -

          Branch here for a concept I think works, but I want some more opinions.

          Sylvain Lebresne / Aleksey Yeschenko in particular, do you two feel this is safe?

           Basically, while we build the ColumnIndex, we hold the List<IndexHelper.IndexInfo> columnsIndex in memory as a sorted list. At the time we call close(), we check whether either there are too many of those objects, OR they use too much memory (large keys). If so, we downsample the list, in place, by a constant factor.

          In ascii art:

          [a...c][d...f][e...h][i...k][l...n][o...p][q...r][s...u][v...x][y...z]
          

          If we set a target size of 3, we'd downsample this by a factor of 3 into:

          [a...c][d...k][l...r][s...z]
          

           On merge we'll keep the latest/last lastName + deletion marker, keep the earliest/first firstName + offset, and recalculate the width.
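
           To make that merge rule concrete, here's a stripped-down, self-contained sketch of the idea (toy code, not the branch itself - the Entry class just mimics the fields of IndexHelper.IndexInfo, and it returns a new list rather than mutating in place, purely to keep it short):

           import java.util.ArrayList;
           import java.util.Collections;
           import java.util.List;

           final class IndexDownsampler
           {
               // Toy stand-in for IndexHelper.IndexInfo: first/last clustering of the block,
               // its offset and width in the data file, and an optional open deletion marker.
               static final class Entry
               {
                   final String firstName, lastName;
                   final long offset, width;
                   final Object endOpenMarker; // a DeletionTime in the real code

                   Entry(String firstName, String lastName, long offset, long width, Object endOpenMarker)
                   {
                       this.firstName = firstName;
                       this.lastName = lastName;
                       this.offset = offset;
                       this.width = width;
                       this.endOpenMarker = endOpenMarker;
                   }
               }

               // Merge every `factor` consecutive entries into one: keep the first entry's
               // firstName + offset, the last entry's lastName + endOpenMarker, and recompute
               // the width so it spans the whole merged group. Walking from the tail gives
               // the same grouping as the ascii art above.
               static List<Entry> downsample(List<Entry> sortedIndex, int factor)
               {
                   List<Entry> merged = new ArrayList<>();
                   for (int end = sortedIndex.size(); end > 0; end -= factor)
                   {
                       int start = Math.max(0, end - factor);
                       Entry first = sortedIndex.get(start);
                       Entry last = sortedIndex.get(end - 1);
                       long width = (last.offset + last.width) - first.offset;
                       merged.add(new Entry(first.firstName, last.lastName, first.offset, width, last.endOpenMarker));
                   }
                   Collections.reverse(merged); // built back-to-front; restore ascending order
                   return merged;
               }
           }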

           Not to be ignored: this has an added bonus of making large partitions much less painful if you crank down those column index targets - a little more IO, but much less GC. Running this in a test cluster, I was able to do reads/writes to 80GB partitions with a 5GB heap, which is a pretty cool side effect.

           jjirsa Jeff Jirsa added a comment - edited

           Those knobs are actually for a slightly different index (the partition summary itself). In yaml / Config.java, though, there's column_index_size_in_kb (defaults to 64), which I'm pretty sure does this, but not sure enough to tell you to change it without testing in a lab. In particular, I don't know exactly how it behaves if you change this with pre-existing data; I'd hope it'd do the right thing, but I certainly haven't personally tested it, so this is not a recommendation, only a pointer to a config option that exists. In fact, CASSANDRA-5454 notes that this used to cause corruption, so I'm definitely not saying you should do this, though it's possible that CASSANDRA-5454 makes this safe to do.

          (FWIW, on the index summary, we downsample if we get over 2GB - see CASSANDRA-12014 - and I think instead of changing this to a long, we should consider that here as well, until CASSANDRA-9754 is done, which makes this a moot point).

          dkinder Dan Kinder added a comment -

           I see... messing with min/max_index_interval wouldn't help, would it?

          jjirsa Jeff Jirsa added a comment -

          OK, leaving some quick notes for whoever gets around to handling this (maybe me, not self assigning because I don't have bandwidth right now, and to be honest I haven't thought about the right fix yet).

           The code here is trying to calculate the serialized size, so it can write the index out as it rewrites that partition to a data file:

                  long size = TypeSizes.sizeofUnsignedVInt(headerLength)
                            + DeletionTime.serializer.serializedSize(deletionTime)
                            + TypeSizes.sizeofUnsignedVInt(columnsIndex.size()); // number of entries
                  for (IndexHelper.IndexInfo info : columnsIndex)
                      size += idxSerializer.serializedSize(info);
                  size += columnsIndex.size() * TypeSizes.sizeof(0);
                  return Ints.checkedCast(size);
          

           With 394GB and an index entry every 64k, you're going to write something like 6169617 index markers, and the field there to handle it is a (signed) integer (4 bytes), giving you a maximum size for all of the index markers of 2147483647 bytes (Integer.MAX_VALUE) - about 348 bytes per marker. The size of a marker is here:

                          long size = clusteringSerializer.serializedSize(info.firstName)
                                    + clusteringSerializer.serializedSize(info.lastName)
                                    + TypeSizes.sizeofUnsignedVInt(info.offset)
                                    + TypeSizes.sizeofVInt(info.width - WIDTH_BASE)
                                    + TypeSizes.sizeof(info.endOpenMarker != null);
          
                          if (info.endOpenMarker != null)
                              size += DeletionTime.serializer.serializedSize(info.endOpenMarker);
                          return size;
          

           Note that each marker holds both the first and the last clustering - so for you not to overflow, assuming no range tombstones (which would take up even more space), your clustering values would have to average less than ~165 bytes each, which clearly isn't happening, so we overflow that int and stop.
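
           To make that arithmetic concrete, a tiny self-contained calculation (not Cassandra code; the partition size is the "Compacted partition maximum bytes" from the tablestats output further down, and the 64k-per-marker figure matches the approximation above):

           public class IndexOverflowMath
           {
               public static void main(String[] args)
               {
                   long partitionBytes = 394855527800L;  // compacted partition maximum bytes
                   long bytesPerMarker = 64 * 1000L;     // roughly one IndexInfo per 64k of row data
                   long markers = partitionBytes / bytesPerMarker;      // ~6,169,617 index markers
                   long budgetPerMarker = Integer.MAX_VALUE / markers;  // ~348 bytes allowed per marker
                   System.out.printf("markers=%d, byte budget per marker=%d%n", markers, budgetPerMarker);
                   // Each marker serializes both a first and a last clustering, so those would
                   // have to average well under ~165 bytes each (plus offset/width/flag overhead)
                   // for the total to stay below Integer.MAX_VALUE.
               }
           }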

           That's the short version of what's happening. I'm not sure why it's an int instead of a long, and I'm not immediately sure why you're hitting it here with upgradesstables when you didn't hit it previously.

          dkinder Dan Kinder added a comment -

           The partition key is a string, and there are 4 clustering columns (text, text, text, timestamp). The partition key and two of the clustering columns are small on average, on the order of 10 bytes. The third can be larger but has a maximum of 2048 bytes.

          Also, I don't recall running upgradesstables before, if that matters.

          jjirsa Jeff Jirsa added a comment -

          Ok, so a 394GB row on a 5TB table. Can you tell me just a little bit about your clustering columns - how many and how long are they typically? I'm pretty sure I see what the issue is, but I haven't yet checked the difference against 2.2 to see why it's only a problem now, and wasn't before.

          dkinder Dan Kinder added a comment -

          Jeff Jirsa

          This happens on about 12 out of 16 nodes, much greater than RF. (No real need to anonymize the names.)

          $ nodetool tablehistograms walker links
          walker/links histograms
          Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                                        (micros)          (micros)           (bytes)                  
          50%             4.00             51.01        1155149.91               258                 6
          75%             5.00             73.46        3449259.15              1597                42
          95%             5.00            152.32        7152383.77             73457              2299
          98%             5.00            182.79        8582860.53            315852              9887
          99%             5.00            219.34       53142810.15            943127             24601
          Min             0.00              8.24             88.15                43                 0
          Max             5.00        1155149.91       53142810.15      394855527800         962624926
          
          $ nodetool tablestats walker.links
          Keyspace: walker
                  Read Count: 72932
                  Read Latency: 834.917616670323 ms.
                  Write Count: 15501219
                  Write Latency: 0.4392410533003888 ms.
                  Pending Flushes: 0
                          Table: links
                          SSTable count: 14269
                          SSTables in each level: [0, 9, 68, 349, 2843, 11000, 0, 0, 0]
                          Space used (live): 5310634842695
                          Space used (total): 5310634842695
                          Space used by snapshots (total): 12868175419329
                          Off heap memory used (total): 16339462
                          SSTable Compression Ratio: 0.0
                          Number of partitions (estimate): 20112302
                          Memtable cell count: 2216835
                          Memtable data size: 347672840
                          Memtable off heap memory used: 0
                          Memtable switch count: 4
                          Local read count: 80395
                          Local read latency: 747.603 ms
                          Local write count: 16891401
                          Local write latency: 0.118 ms
                          Pending flushes: 0
                          Bloom filter false positives: 178
                          Bloom filter false ratio: 0.00081
                          Bloom filter space used: 14912616
                          Bloom filter off heap memory used: 14798464
                          Index summary off heap memory used: 1540998
                          Compression metadata off heap memory used: 0
                          Compacted partition minimum bytes: 43
                          Compacted partition maximum bytes: 394855527800
                          Compacted partition mean bytes: 249068
                          Average live cells per slice (last five minutes): 11010.451219512195
                          Maximum live cells per slice (last five minutes): 11864
                          Average tombstones per slice (last five minutes): 1.0
                          Maximum tombstones per slice (last five minutes): 1
          
          jjirsa Jeff Jirsa added a comment -

          Is this happening on exactly 3 (or RF) nodes, or more than that?

          Can you paste (anonymized) nodetool tablestats or nodetool tablehistograms for that table?


            People

            • Assignee: jjirsa Jeff Jirsa
            • Reporter: dkinder Dan Kinder
            • Votes: 0
            • Watchers: 3
