CASSANDRA-11206

Support large partitions on the 3.0 sstable format


      Description

      Cassandra saves a sample of IndexInfo objects that store the offset, within each partition, of every 64KB (by default) range of rows. To find a row, we binary search this sample, then scan only the matching 64KB range of the partition.

      The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize the entire set of IndexInfo objects, which both creates a lot of GC overhead (as noted in CASSANDRA-9754) and causes non-negligible I/O (relative to reading a single 64KB row range) as partitions get truly large.

      We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo bsearch while only deserializing IndexInfo that we need to compare against, i.e. log(N) deserializations.
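
      To make the contrast concrete, here is a minimal, self-contained sketch (simplified stand-in types, not the actual Cassandra classes): the eager variant models today's cache-miss behaviour of deserializing every IndexInfo before the binary search, while the lazy variant models the offset-map approach that deserializes only the ~log2(N) entries it compares against.

          import java.util.*;

          class IndexSampleSketch
          {
              // stand-in for IndexInfo: first clustering name covered by a ~64KB block
              record Entry(String firstName, long offsetInPartition) {}

              // today: O(N) deserializations on a cache miss, then binary search the materialized list
              static Entry findEager(List<Entry> allDeserialized, String target)
              {
                  int i = Collections.binarySearch(allDeserialized, new Entry(target, -1),
                                                   Comparator.comparing(Entry::firstName));
                  return allDeserialized.get(i >= 0 ? i : Math.max(0, -i - 2));
              }

              // with an offset map: binary search positions, deserializing one entry per probe (~log2(N) total)
              interface Reader { Entry deserializeAt(int index); }

              static Entry findLazy(Reader reader, int count, String target)
              {
                  int lo = 0, hi = count - 1;
                  Entry candidate = null;
                  while (lo <= hi)
                  {
                      int mid = (lo + hi) >>> 1;
                      Entry e = reader.deserializeAt(mid);       // only this entry is materialized
                      if (e.firstName().compareTo(target) <= 0) { candidate = e; lo = mid + 1; }
                      else hi = mid - 1;
                  }
                  return candidate;                               // last block whose first name is <= target
              }
          }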

      Attachments

      1. 11206-gc.png (426 kB, Robert Stupp)
      2. trunk-gc.png (482 kB, Robert Stupp)


          Activity

          jbellis Jonathan Ellis added a comment -

          The offset map is written AFTER the serialized IndexInfo, but since we write out the size of both up front, we can still access the map without deserializing everything. Here's the code from RIE that writes it out:

                      out.writeUnsignedVInt(rie.position);
                      out.writeUnsignedVInt(rie.promotedSize(idxSerializer));
                      ...
                      out.writeUnsignedVInt(rie.columnsIndex().size());
                      ... [write out the IndexInfo and compute offsets map as we go] ...
                      for (int off : offsets)
                          out.writeInt(off);
          

          (There is no code yet that reads it back in because CASSANDRA-9738 got put on the back burner.)

          Thus the offsets map starts at the total size, minus count * sizeof(int).

          So we can read the middle offsetmap entry, deserialize the IndexInfo it points to, compare with the row we're looking for, and repeat until bsearch is done.
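
          A hedged sketch of the corresponding read side, following the layout above. This assumes "in" is a FileDataInput-style reader positioned at the start of the serialized entry, 4-byte int offsets as in the writer, and offsets relative to the start of the promoted payload; the elided "..." fields and helper names are assumptions, not the actual deserializer.

              long position     = in.readUnsignedVInt();         // rie.position
              int  promotedSize = (int) in.readUnsignedVInt();   // size of everything that follows
              long payloadStart = in.getFilePointer();           // start of the "promoted" payload
              // ... skip the fields elided ("...") in the writer excerpt above ...
              int  count        = (int) in.readUnsignedVInt();   // rie.columnsIndex().size()

              // "the offsets map starts at the total size, minus count * sizeof(int)"
              long offsetsStart = payloadStart + promotedSize - count * 4L;

              // probe the middle entry: read one offset, deserialize one IndexInfo, compare, repeat (bsearch)
              int mid = count / 2;
              in.seek(offsetsStart + mid * 4L);
              int indexInfoOffset = in.readInt();                // assumed relative to payloadStart
              in.seek(payloadStart + indexInfoOffset);
              IndexInfo midInfo = idxSerializer.deserialize(in);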

          jbellis Jonathan Ellis added a comment -

          (Note that I am a big fan of the proposal in CASSANDRA-9754; this is intended as a simpler approach that we can ship quickly and replace when 9754 is ready. /cc Michael Kjellman)

          mkjellman Michael Kjellman added a comment -

          The IndexEntry objects are currently variable length, Jonathan Ellis, which might make this a bit complicated on the read path. Also, how many elements would need to be deserialized at minimum? Whatever the bucket size used for the skip list implementation?

          jbellis Jonathan Ellis added a comment -

          The offset map is what allows us to deal with variable length index entries. So you only deserialize exactly as many IndexInfo as needed to locate the right 64KB row block. Then scanning for the row w/in the block is unchanged.

          kohlisankalp sankalp kohli added a comment -

          +1 for stop gap.

          snazy Robert Stupp added a comment -

          A brief outline of what I am planning ("full version"):

          For partitions < 64k (partitions without an IndexInfo object) we could skip the read-time indirection via RowIndexEntry entirely by extending the IndexSummary to directly store the offset into the data file. (This also flattens the IndexedEntry vs. RowIndexEntry class hierarchy and removes some if-else constructs.) Maybe also use vint encoding in IndexSummary to save some space in memory and on disk (looks possible from a brief look). Eventually also add the partition deletion time to the summary, if it's worth doing (not sure about this - it's in IndexedEntry but not in RowIndexEntry).

          For other partitions we use the offset information in IndexedEntry and only read those IndexInfo entries that are really necessary during the binary search. It doesn't really matter whether we are reading cold or hot data as cold data has to be read from disk anyway and hot data should already be in the page cache.

          Having the offset into the data file in the summary, we can remove the key cache.

          Tests for CASSANDRA-9738 have shown that there is not much benefit keeping the full IndexedEntry + IndexInfo structure in memory (off heap). So this ticket would supersede CASSANDRA-9738 and CASSANDRA-10320.

          Downside of this approach is that it changes the on-disk format of IndexSummary, which might be an issue in 3.x - so there's a "plan B version":

          • Leave IndexSummary untouched
          • Remove IndexInfo from the key cache (not from the index file on disk, of course)
          • Change IndexSummary and remove the whole key cache in a follow-up ticket for 4.x

          /cc Sylvain Lebresne Ariel Weisberg Aleksey Yeschenko

          aweisberg Ariel Weisberg added a comment - edited

          I'm summarizing to make sure I remember correctly what the key cache miss read path for a table looks like.
          1. Binary search index summary to find location of partition index entry in index
          2. Lookup index entry which may just be a pointer to the data file, or it may be a sampled index of rows in the partition
          3. Look up the partition contents based on the index entry

          The index summary is a sampling of the index so most of the time we aren't going to get a hit into the data file right? We have to scan the index to find the RIE and that entire process is what the key cache saves us from.

          If I remember correctly, what I was thinking was that the key cache, instead of storing a copy of the RIE, would store an offset into the index that is the location of the RIE. Then the RIE could be accessed off heap via a memory mapping without doing any allocations or copies.

          I agree that for partitions that aren't indexed the key cache could point straight to the data file and skip the index lookup since there doesn't need to be additional data there. I don't follow the path you are describing to completely remove the key cache without restructuring index summaries and indexes into something that is either traversed differently or doesn't summarize/sample.

          An aside. Is RowIndexEntry named incorrectly? Should it be PartitionIndexEntry?

          jbellis Jonathan Ellis added a comment - edited

          For partitions < 64k (partitions without an IndexInfo object) we could skip the indirection during reads via RowIndexEntry at all by extending the IndexSummary and directly store the offset into the data file

          Since the idea here is to do something simple that we can be confident about shipping in 3.6 if CASSANDRA-9754 isn't ready, let's avoid making changes to the on disk layout.

          To clarify for others following along,

          Remove IndexInfo from the key cache (not from the index file on disk, of course)

          This sounds scary but it's core to the goal here: if we're going to support large partitions, we can't afford the overhead either of keeping the entire summary on heap, or of reading it from disk in the first place. (If we're reading a 1KB row, then reading 2MB of summary first on a cache miss is a huge overhead.) Moving the key cache off heap (CASSANDRA-9738) would have helped with the first but not the second.

          So one approach is to go back to the old strategy of only caching the partition key location, and then go through the index bsearch using the offsets map every time. For small partitions this will be trivial and I hope negligible to the performance story vs the current cache. (If not, we can look at a hybrid strategy, but I'd like to avoid that complexity if possible.)

          what I was thinking was that the key cache instead of storing a copy of the RIE it would store an offset into the index that is the location of the RIE. Then the RIE could be accessed off heap via a memory mapping without doing any allocations or copies

          I was thinking that even the offsets alone for a 4GB partition are going to be 256KB, so we don't want to cache the entire offsets map. But the other side there is that if you have a bunch of 4GB partitions you won't have very many of them. 16TB of data would be 1GB of offsets which is within the bounds of reasonable when off heap. And your approach may require less logic changes than the one above, since we're still "caching" the entire summary, sort of; only adding an extra indirection to read the IndexInfo entries. So that might well be simpler.
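
          For reference, the arithmetic behind those numbers (assuming the default 64KB index blocks and 4-byte int offsets):

              4 GB partition / 64 KB per block    = 65,536 IndexInfo entries
              65,536 entries * 4 bytes per offset = 256 KB of offsets for that one partition

              16 TB of data / 4 GB per partition  = 4,096 such partitions
              4,096 partitions * 256 KB           = 1 GB of offsets in total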

          Edit: but switching to a per-row cache (from per-partition) would be a much bigger change and I don't see the performance implications as straightforward at all, so let's not do that.

          snazy Robert Stupp added a comment -

          Quick progress status:

           • refactored the code to handle "flat byte structures" (i.e. a byte[] at the moment) as a prerequisite to accessing the index file directly
           • IndexInfo is only used from AbstractSSTableIterator.IndexState - a handle to an open index file is available there, so removing the byte[] and accessing the index file directly is the next step.
           • unit and dtests are mostly passing (i.e. there are some flaky ones on cassci, which pass locally). Still need to identify what's going on with the failing paging dtests.
           • cstar tests show similar results compared to current trunk
           • IndexInfo is also used from UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound (CASSANDRA-8180) - not sure whether it's worth deserializing the index for this functionality, as it is currently restricted to the entries that are present in the key cache. I tend to remove this access. (/cc Stefania)

          Observations:

          • accesses to IndexInfo objects are "random" during the binary search operation (as expected)
          • accesses to IndexInfo objects are "nearly sequential" during scan operations - "nearly" means, it accesses index N, then index N-1, then index N+1 before it actually moves ahead - but does some random accesses to previously accessed IndexInfo instances afterwards. Therefore IndexState "caches" the already deserialised IndexInfo objects. These should stay in new-gen as these are only referenced during the lifetime of the actual read. Alternatively it is possible to use a plain & boring LRU like cache for the 10 last IndexInfo objects in IndexState.
          • index-file writes (flushes/compactions) also used IndexInfo objects - replaced with a buffered write (DataOutputBuffer)
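
           As mentioned above, a plain LRU over the last handful of deserialized entries would also work. A minimal, self-contained sketch of such a cache (illustrative only - the real IndexState keeps its own structure; T would be IndexInfo in practice):

               import java.util.LinkedHashMap;
               import java.util.Map;

               // Keeps only the N most recently accessed entries (e.g. N=10), so "nearly sequential"
               // scans that briefly jump back to index N-1 / N+1 stay cheap without keeping the
               // whole IndexInfo sample on heap.
               final class RecentEntryCache<T>
               {
                   private final LinkedHashMap<Integer, T> lru;

                   RecentEntryCache(final int maxEntries)
                   {
                       // accessOrder=true makes iteration order "least recently used first"
                       this.lru = new LinkedHashMap<Integer, T>(16, 0.75f, true)
                       {
                           @Override
                           protected boolean removeEldestEntry(Map.Entry<Integer, T> eldest)
                           {
                               return size() > maxEntries;   // evict once we exceed the cap
                           }
                       };
                   }

                   T get(int indexBlock)             { return lru.get(indexBlock); }
                   void put(int indexBlock, T value) { lru.put(indexBlock, value); }
               }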

          Assumptions:

           • heap pressure due to the vast amount of IndexInfo objects is already handled by this patch (exchanged for one byte[] at the moment) both for reads and flushes/compactions
          • after replacing the byte[] with index file access, we could lower the (default) key-cache size since we then no longer cache IndexInfo objects on heap

          So the next step is to remove the byte[] from IndexedEntry and replace it with index-file access from IndexState.

          Stefania Stefania added a comment -

          IndexInfo is also used from UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound (CASSANDRA-8180) - not sure whether it's worth to deserialize the index for this functionality, as it is currently restricted to the entries that are present in the key cache. I tend to remove this access.

          If I am not mistaken, when the sstable iterator is created the partition should be added to the key cache if not already present. Please have a look at BigTableReader iterator() and getPosition() to confirm. The reason we need the index info is that the lower bounds in the sstable metadata do not work for tombstones. This is the only lower bound we have for tombstones. If it's removed then the optimization of CASSANDRA-8180 no longer works in the presence of tombstones (whether this is acceptable is up for discussion).

          Can't we add the partition bounds to the offset map?

          For completeness, I also add that we don't necessarily need a lower bound for the partition; it can be a lower bound for the entire sstable if easier. However it should work for tombstones, that is, it should be an instance of ClusteringPrefix rather than an array of ByteBuffer as it is currently stored in the sstable metadata.

          snazy Robert Stupp added a comment -

          Note: utests and dtests are fine now (did nothing but a rebase and re-run).

          partition should be added to the key cache if not already present

          Yes and no. This ticket will add a shallow version of IndexedEntry to the key cache (without the IndexInfo objects, as these cause a lot of heap pressure). So, when the IndexInfo objects are actually needed, they will be read from disk. My understanding of UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound is that it uses the IndexInfo objects that are already in the key cache and will go to disk if there is a key-cache miss. If we re-read the IndexInfo objects in UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound, this would add overhead. Or did I get it wrong and UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound accesses the same partition as IndexState does? If that's the case, we can maybe pass the current, "fully accessible" IndexedEntry to UnfilteredRowIteratorWithLowerBound (not checked that yet).

          We could (in theory) add stuff to the partition summary or change the serialized index - but unfortunately not in 3.x.

          Stefania Stefania added a comment -

          My understanding of UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound is, that it uses the IndexInfo objects that are already in the key-cache and will go to disk if there is a key-cache miss.

          Yes. Except previously it had to do this anyway because of the partition deletion, whereas now the partition deletion will be available but not the full IndexInfo objects.

          We could (in theory) add stuff to the partition summary or change the serialized index - but unfortunately not in 3.x.

          I think it's reasonable to wait until the new major version to improve on the optimization of CASSANDRA-8180. So I'm happy with this compromise. Shall we open a ticket for this?

          snazy Robert Stupp added a comment -

          Alright - opened CASSANDRA-11369 as a follow-up for UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound

          Stefania Stefania added a comment -

          Thank you!

          doanduyhai DOAN DuyHai added a comment -

          I have some questions related to the outcome of this JIRA.

          Since 2.1, incremental repair only repairs chunks of a partition (i.e. the chunks that are in the un-repaired SSTables set), so even in case of a mismatch we no longer stream the entire partition. And using paging we can read through very wide partitions. With the improvement brought by this JIRA, does it mean that we can now handle virtually unbounded partitions, or partitions exceeding 2×10^9 physical columns?

          I'm asking because it will greatly impact the way we model data. There are still some points that can cause trouble with ultra-wide partitions:

          • bootstrapping/adding new nodes to the cluster --> streaming of ultra-wide partitions. What happens if the streaming fails in the middle? Do we restart the streaming of the whole partition or can we resume at the last clustering?
          • compaction. With LCS, ultra-wide partitions can create overly huge SSTables. In general, how will compacting ultra-wide partitions impact node stability?
          • read path with STCS --> more SSTables to touch on disk
          snazy Robert Stupp added a comment - edited

          I just finished most of the coding for this ticket - i.e. "shallow" RowIndexEntry without IndexInfo - and ran a poor-man's comparison of current trunk against 11206 using different partition sizes, covering writes, a major compaction and reads. The results are really promising, especially with big and huge partitions (tested up to 8G partitions).

          Reads against big partitions really benefit from 11206. For example, with 11206 it takes a couple of seconds for 5000 random reads against 8G partitions vs. many minutes (not a typo) on current trunk. At the same time, on trunk the heap is quite full and causes a lot of GC pressure.

          Compactions also benefit from 11206 GC-wise - but not CPU- or I/O-wise since it's still the same amount of work to be done. 11206 "just" reduces GC pressure.

          Flushes also benefit, since it can "forget" IndexInfo objects sooner.

          This ticket will not raise the limit on cells.

          DOAN DuyHai, you're right. Having the ability to handle big partitions has a direct influence on data modeling. I'd not say "you are no longer limited by the size of your partitions". This ticket eases the current limitation WRT GC pressure and read performance. In theory the limit went away, but as you say, compaction gets even more important and other operational tasks like replacing nodes or changing topology need to be considered.

          My next steps are:

          • fix some unit tests that no longer work as they relied on the old implementation (expected to have IndexInfo on heap)
          • cleanup the code
          • run some tests on cstar

          I only ran a poor-man's comparison - on my laptop with a small-ish 3G heap and default unit test settings. That's why I did not note exact numbers. But I'd like to show the GC pressure of the same test run against trunk (took 3 hours) and 11206 (took 1 hour) - see the attached trunk-gc.png and 11206-gc.png:

          jbellis Jonathan Ellis added a comment -

          Very promising!

          snazy Robert Stupp added a comment -

          Pushed the latest version to the git branch. CI results (testall, dtest) and cstar results (see below) look good.

          The initial approach was to “ban” all IndexInfo instances from the key cache. Although this is a great option for big partitions, “moderately” sized partitions suffer from that approach (see the “0kB” cstar run below). So, as a compromise, a new cassandra.yaml option, column_index_cache_size_in_kb, has been introduced that defines when IndexInfo objects should not be kept on heap. The new option defaults to 2 kB. It is possible to tune it to lower values (down to 0) and higher values (an illustrative yaml snippet follows the list below). Some thoughts about both directions:

          • Setting the value to 0 or some other very low value will obviously reduce GC pressure at the cost of high I/O
          • The cost of accessing index samples on disk is twofold: first, there’s the obvious I/O cost via a RandomAccessReader. Second, each RandomAccessReader instance has its own buffer (which can be off- or on-heap, but seems to default to off-heap) - so there seems to be some (quite small) overhead to borrow/release that buffer.
          • The higher the value of column_index_cache_size_in_kb, the more objects will be on the heap - and therefore more GC pressure.
          • Note that the parameter refers to the serialized size and not the number of IndexInfo objects. This was chosen to give a more obvious relation between the size of the IndexInfo objects and the amount of heap consumed - the size of IndexInfo objects is mostly determined by the size of the clustering keys.
          • Also note that some internal system/schema tables - especially those for LWTs - use clustering keys and therefore index samples.
          • For workloads with a huge amount of random reads against a large data set, small values for column_index_cache_size_in_kb (like the default value) are beneficial if the key cache is always full (i.e. it is evicting a lot).
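
          For illustration, this is how the option would look in cassandra.yaml (the option name and the 2 kB default are from this ticket; the comment is a paraphrase, not the shipped documentation):

              # IndexInfo objects for a partition are kept on heap (and in the key cache) only while
              # their serialized size stays below this threshold; larger index samples are read from
              # the index file on demand. Setting this to 0 keeps no IndexInfo on heap at all.
              column_index_cache_size_in_kb: 2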

          Some local tests with the new LargePartitionTest on my Macbook (time machine + indexing turned off) indicate that caching seems to work for shallow indexed entries.

          I’ve scheduled some cstar runs against the trades workload. Only the result with column_index_cache_size_in_kb: 0 (which means that no IndexInfo will be kept on heap or in the key cache) shows a performance regression. The default value of 2kb for column_index_cache_size_in_kb was chosen as a result of this experiment.

          Other cstar runs (here, here and here) have shown that there’s no change for some plain workloads.

          Daily regression tests show a similar performance: compaction, repair, STCS, DTCS, LCS, 1 MV, 3 MV, rolling upgrade

          Summary of the changes:

          • Ungenerified RowIndexEntry
          • RowIndexEntry now has a method to create an accessor to the IndexInfo objects on disk - that accessor requires an instance of FileDataInput
          • RowIndexEntry now has three subclasses: ShallowIndexedEntry, which is basically the old IndexedEntry with the IndexInfo array-list removed but only responsible for index files with an offsets-table, and LegacyShallowIndexedEntry which is responsible for index files without an offsets-table (so pre-3.0). IndexedEntry keeps the IndexInfo objects in an array - used if the serialized size of the RIE’s payload is less than the new cassandra.yaml parameter column_index_cache_size_in_kb.
          • RowIndexEntry.IndexInfoRetriever is the interface to access IndexInfo on disk using a FileDataInput. It has concrete implementations: one for sstable versions with offsets and one for legacy sstable versions. This one is only used from AbstractSSTableIterator.IndexState.
          • Added “cache” of already deserialized IndexInfo instances in the base class of IndexInfoRetriever for “shallow” indexed entries. This is not necessary for binary-search but for many other access patterns, which sometimes appear to “jump around” in the IndexInfo objects. Since IndexState is a short lived object, these cached IndexInfo instances get garbage collected early.
          • Writing of index files is also changed. It now switches to serialization into a byte buffer instead of collecting an array-list of IndexInfo objects, when column_index_cache_size_in_kb is hit.
          • Bumped version of serialized key-cache from d to e. The key cache and its serialized form no longer contain IndexInfo objects for indexed entries that exceed column_index_cache_size_in_kb but need the position in the index file. Therefore, the serialized format of the key cache has changed.
          • Serializers (which is an instance per CFMetaData) keeps a “singleton” IndexInfo.Serializer instance for BigFormat.latestVersion and constructs and keeps instances for other versions. For “shallow” RIEs we need an instance of IndexInfo.Serializer to read IndexInfo from disk - a “singleton” further reduces the number of objects on heap. TBC we create(d) a lot of these instances (roughly one per IndexInfo instance/operation). We could also reduce the number of IndexSerializer instances in the future - but it felt not to be necessary for this ticket.
          • Merged RIE’s IndexSerializer interface and Serializer class (that interface had only a single implementation)
          • Added methods to IndexSerializer to handle the special serialization for saved key caches
          • Added specialized deserializePosition method to IndexSerializer as some invocations just need the position in the data file.
          • Moved IndexInfo binary-search into AbstractSSTableIterator.IndexState class (the only place, where it’s used)
          • Added some more skip methods in various places. These are required to calculate the offsets array for legacy sstable versions.
          • Classes ColumnIndex and IndexHelper have been removed (functionality moved); IndexInfo is now a top-level class.
          • Added some Pre_11206_* classes that are copies of the previous implementations into RowIndexEntryTest
          • Added new PagingQueryTest to test paged queries
          • Added new LargePartitionsTest to test/compare various partition sizes (to be run explicitly, otherwise ignored)
          • Added test methods in KeyCacheTest and KeyCacheCqlTest for shallow/non-shallow indexed entries.
          • Also re-added the behavior of CASSANDRA-8180 for IndexedEntry (but not for ShallowIndexedEntry)
          tjake T Jake Luciani added a comment -

          I haven't dug into this much but on the surface this effectively breaks CASSANDRA-7443 since you removed all generics from the IndexEntry.
          I don't see any reason you can't support a serializer implementation per format.

          snazy Robert Stupp added a comment -

          Pushed a commit that re-adds the generics.

          tjake T Jake Luciani added a comment -
          • You need to change the version of sstable since this change alters the Index component.
          • Please run dtests/unit test with column_index_cache_size_in_kb: 0
          • Does the AutoSavingCache change require a step on the user's part or will it naturally skip the saved cache on startup?
          • The 0,1,2 magic bytes that encode what type of index entry this is should be made constants
          snazy Robert Stupp added a comment -

          need to change the version of sstable

          The change does not change the index sstable format - just the format of the saved key cache.

          AutoSavingCache change require a step on the users part

          No, all that happens is that you lose the contents of the old saved key cache. This is because the change requires some more information for shallow indexed entries (the offset in the index file).

          0,1,2 magic bytes

          Made these constants and pushed a commit for this.

          dtests/unit test with column_index_cache_size_in_kb: 0

          I've setup a new branch 11206-large-part-0kb-trunk and triggered CI for this. testall dtest

          tjake T Jake Luciani added a comment -

          I think it would make sense to expose a metric of what kind of index cache hit we have: Shallow or Regular.

          snazy Robert Stupp added a comment -

          Pushed another commit for the metrics. The intention of the metrics is to find the sweet spot for column_index_cache_size_in_kb; in order to find that sweet spot you need to know the size of the entries. The metrics below org.apache.cassandra.metrics:type=Index,name=RowIndexEntry are updated on each call to openWithIndex (a hedged example of reading one of them over JMX follows the list below). But again, configuring column_index_cache_size_in_kb too high would result in GC pressure and probably in a bad key-cache hit ratio.

          • IndexedEntrySize - histogram of the size of IndexedEntry (every type)
          • IndexInfoCount - histogram of the number of IndexInfo objects per IndexedEntry (every type)
          • IndexInfoGets - histogram of the number of gets of IndexInfo objects per IndexedEntry (every type), for example the number of gets for a binary search
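
          For completeness, a hedged sketch of reading one of these histograms over JMX from a client. The javax.management calls are standard, but the exact ObjectName layout and attribute names of the new metrics are assumptions here - verify against a live node (e.g. with jconsole) first.

              import javax.management.MBeanServerConnection;
              import javax.management.ObjectName;
              import javax.management.remote.JMXConnector;
              import javax.management.remote.JMXConnectorFactory;
              import javax.management.remote.JMXServiceURL;

              public class IndexMetricsProbe
              {
                  public static void main(String[] args) throws Exception
                  {
                      // 7199 is Cassandra's default JMX port
                      JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
                      JMXConnector connector = JMXConnectorFactory.connect(url);
                      try
                      {
                          MBeanServerConnection mbs = connector.getMBeanServerConnection();
                          // assumed name pattern for the new histograms under type=Index,name=RowIndexEntry
                          ObjectName name = new ObjectName(
                                  "org.apache.cassandra.metrics:type=Index,scope=RowIndexEntry,name=IndexInfoGets");
                          System.out.println("count = " + mbs.getAttribute(name, "Count"));
                          System.out.println("p99   = " + mbs.getAttribute(name, "99thPercentile"));
                      }
                      finally
                      {
                          connector.close();
                      }
                  }
              }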
          tjake T Jake Luciani added a comment -

          Looks like you still have ColumnIndex but it's been refactored into RowIndexWriter.
          I think RowIndexWriter should be moved to and replace ColumnIndex since there is
          no need to move it.

          In BTW.addIndexBlock() the indexOffsets[0] is always 0, since it's always skipped in the null case and columnIndexCount is incremented.
          It looks like it was intentional but it's not easy to understand. I think it works out because indexSamplesSerializedSize is 0 anyway.

          Please explain in RowIndexEntry.create why you are returning each of the types. It's not clear why indexSamples == null && columnIndexRow > 1 is significant.

          It seems like you don't need indexOffsets once you reach column_index_cache_size_in_kb; it's only used for the non-indexed entries. Does that mean the offsets aren't being written to the index properly? In the RIE example they are all appended to the end.

          snazy Robert Stupp added a comment -

          have ColumnIndex but it's been refactored into RowIndexWriter

          Yea - it doesn't look the same any more. So I went ahead and moved it into BTW since it's the only class from which it's being used. Could move that to o.a.c.io.sstable.format.big, where BTW is.

          BTW.addIndexBlock() the indexOffsets[0] is always 0

          Put some comments in the code for that.

          explain in RowIndexEntry.create why you are returning each of the types

          Put some comments in the code for that.

          don't need indexOffsets once you reach column_index_cache_size_in_kb

          It's needed for both cases (shallow and non-shallow RIEs). Put a comment in the code for that.

          Also ran some cstar tests to compare a version with and without the metrics with column_index_cache_size_in_kb 0kB and 2kB on taylor and blade_11_b:
          • 2kB on taylor
          • 2kB on blade_11_b
          • 0kB on taylor
          • 0kB on blade_11_b

          Commits pushed and CI triggered.

          tjake T Jake Luciani added a comment -

          Looks like the offsets are written every time now in CI.close() - thx https://github.com/apache/cassandra/commit/aad9988701ca49bc905d1933c1f4b2ecb3ba84d8

          Thanks for the clarifying comments etc. I think this patch is good to commit barring CI results. +1

          snazy Robert Stupp added a comment -

          Thanks!
          Rebased again and triggered CI for that before commit.

          snazy Robert Stupp added a comment -

          Thanks for the review!

          Fixed a last issue that would have broken CASSANDRA-11183. Finally, CI looks good after that fix.

          Committed as ef5bbedd687d75923e9a20fde9d2f78b4535241d to trunk.

          mkjellman Michael Kjellman added a comment - edited

          Going through the changes and have some questions:

          1. RowIndexEntry$serializedSize used to return the size of the index for the entire row. As the IndexInfo elements are variable length, I'm having trouble understanding how the new/current implementation does this:

                 private static int serializedSize(DeletionTime deletionTime, long headerLength, int columnIndexCount)
                 {
                     return TypeSizes.sizeofUnsignedVInt(headerLength)
                            + (int) DeletionTime.serializer.serializedSize(deletionTime)
                            + TypeSizes.sizeofUnsignedVInt(columnIndexCount);
                 }
            
          2. In the class-level Javadoc for IndexInfo there is a lot of commentary about serialization format changes, and even a comment "Serialization format changed in 3.0", yet I don't see any corresponding changes in BigFormat$BigVersion
          3. I see a class named *Pre_C_11206_RowIndexEntry* in RowIndexEntryTest which has a lot of the logic that used to be in RowIndexEntry. I don't see the logic outside of the test classes though.
          snazy Robert Stupp added a comment -

          RowIndexEntry$serializedSize used to return the size of the index for the entire row.

          The meaning of this method changed but hasn't been renamed accordingly - my bad. It just returns the serialized size of these fields, so without the actual "index payload".

          Javadoc for IndexInfo

          The only real new thing in the 3.0 index format is the table with the offsets to the IndexInfo objects. The rest has changed mostly by switching to vint encoding - "hidden" by the "ma" version's note, "store rows natively".

          Pre_C_11206_RowIndexEntry

          You can safely ignore (or even remove) the Pre-C-11206 stuff in RowIndexEntryTest. It just felt safer to have it initially as it was meant to ensure that the new implementation is binary compatible with the old one.


            People

            • Assignee:
              snazy Robert Stupp
              Reporter:
              jbellis Jonathan Ellis
              Reviewer:
              T Jake Luciani
            • Votes:
              1
              Watchers:
              30
