Details

      Description

      The row index contains entries for configurably sized blocks of a wide row. For a row of appreciable size, the row index ends up directing the third seek (1. index, 2. row index, 3. content) to near the first column of a scan.

      Since the row index is always used for wide rows, and since it contains information that tells us whether the third seek is necessary at all (the column range or name we are trying to slice may not exist in a given sstable), promoting the row index into the sstable index would drop the maximum number of seeks for wide rows back to two and, more importantly, would allow sstables to be eliminated using only the index.

      An example use case that benefits greatly from this change is time-series data in wide rows, where data is appended to the beginning or end of the row. Our existing compaction strategy gets lucky and clusters the oldest data in the oldest sstables: for queries to recently appended data, we would be able to eliminate wide rows using only the sstable index, rather than needing to seek into the data file to determine that they aren't interesting. For narrow rows, this change would have no effect, as they will not reach the threshold for indexing anyway.

      A first-cut design for this change would look very similar to the file format design proposed on #674 (http://wiki.apache.org/cassandra/FileFormatDesignDoc): row keys clustered, column names clustered, and offsets clustered and delta-encoded.
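      The elimination check described above can be sketched in a few lines. This is an illustrative model only (the class and method names here are hypothetical, not Cassandra code): an index entry that carries a row's column range lets a slice query skip an sstable without touching its data file.

```java
// Hypothetical sketch: a promoted index entry carrying enough column-range
// information to eliminate an sstable using only the index file.
public class PromotedIndexSketch {
    static class IndexEntry {
        final long dataOffset;   // position of the row's block in the data file
        final String minColumn;  // first column name present in this row
        final String maxColumn;  // last column name present in this row

        IndexEntry(long dataOffset, String minColumn, String maxColumn) {
            this.dataOffset = dataOffset;
            this.minColumn = minColumn;
            this.maxColumn = maxColumn;
        }

        // True if the requested slice [start, end] could overlap this row's columns.
        boolean mayContain(String start, String end) {
            return start.compareTo(maxColumn) <= 0 && end.compareTo(minColumn) >= 0;
        }
    }

    public static void main(String[] args) {
        IndexEntry e = new IndexEntry(0L, "2011-01", "2011-06");
        // A slice for newer data falls outside the covered range: the sstable is
        // eliminated without any seek into the data file.
        System.out.println(e.mayContain("2011-07", "2011-12"));
        System.out.println(e.mayContain("2011-05", "2011-12"));
    }
}
```

      For the time-series case above, older sstables whose ranges end before the queried slice are dropped at index time, which is exactly the "eliminate using only the index" win.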

      1. version-g-lzf.txt
        2 kB
        Stu Hood
      2. version-g.txt
        2 kB
        Stu Hood
      3. version-f.txt
        2 kB
        Stu Hood
      4. promotion.pdf
        25 kB
        Stu Hood
      5. 2319-v2.tgz
        62 kB
        Stu Hood
      6. 2319-v1.tgz
        64 kB
        Stu Hood

        Issue Links

          Activity

          Sylvain Lebresne added a comment -

          does column_index_size_in_kb actually affect row keys

          No, it doesn't. In fact, this patch doesn't change anything about what composes the indexes, and in particular the setting has the same meaning. The only difference is that the column indexes are in the on-disk index file instead of the on-disk data file.

          Jonathan Ellis added a comment -

          does column_index_size_in_kb actually affect row keys as well as column names, or are those still special-cased to always be present in the (on-disk) index?

          Sylvain Lebresne added a comment -

          which makes tuning complicated

          No, you're right, that part didn't change. However, a very easy fix would be to turn index_interval into index_interval_in_kb, as I said above. Without deeper changes, the index interval would have to include at least one key (i.e. the index of one key), but aside from that it would give pretty much what we want in terms of making tuning work equally well for narrow and wide rows. And it's a very simple change, which doesn't preclude deeper changes later if we so wish, but in the meantime would give us 91.32% (my best guess) of the benefits. We could even rename the settings to, say, disk_index_interval_in_kb and memory_index_interval_in_kb (and internally both would really mean that the interval is min({disk,memory}_index_interval_in_kb, 1 row)).

          Jonathan Ellis added a comment -

          One of the setting is about how sparse is the main sstable index ... the other is how sparse is the in-memory summary of the on-disk sparse index.

          The problem is that one only applies to row keys, and the other only to column names, which makes tuning complicated.

          Or did that change, leaving the setting names obsolete?

          Sylvain Lebresne added a comment -

          Sure, I understand. What I'm saying is that the initial discussion was about removing a setting, but we can't. One of the settings is about how sparse the main sstable index is (it's already sparse, in fact: we're not indexing every column, but every min(1 row, column_index_size_in_kb)); the other is how sparse the in-memory summary of the on-disk sparse index is.
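          The granularity described here (an index entry every column_index_size_in_kb of serialized data, but at least one entry per row) amounts to a simple min. A tiny sketch with illustrative names, not actual Cassandra code:

```java
// Sketch of the effective indexing interval: entries are written every
// column_index_size_in_kb of data, but a row always gets at least one entry,
// so the effective interval is min(1 row, configured size).
public class IndexGranularity {
    static long effectiveIntervalBytes(long rowSizeBytes, int columnIndexSizeInKb) {
        return Math.min(rowSizeBytes, (long) columnIndexSizeInKb * 1024);
    }

    public static void main(String[] args) {
        // A 2 KB row with a 64 KB setting: one entry covers the whole row.
        System.out.println(effectiveIntervalBytes(2 * 1024, 64));
        // A 1 MB wide row: entries every 64 KB.
        System.out.println(effectiveIntervalBytes(1024 * 1024, 64));
    }
}
```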

          Stu Hood added a comment - edited

          I.e. the raison d'être of both index_interval and column_index_size_in_kb is not because we have the notion of rows in the on-disk format.

          If I'm understanding what Ellis is suggesting, it is that the entire sstable index could become sparse: that would mean that column_index_size_in_kb could be renamed to index_size_in_kb. index_interval would not change.

          EDIT: Well, the meaning of index_interval might change, because even in a sparse index, you might want to store something even sparser in memory.

          Sylvain Lebresne added a comment -

          I still don't see how that would make us remove one of the two config options we're talking about. Even if you basically switch to an on-disk format where there is no notion of a row, you still have to say (1) how big the blocks you index in the index file are, and (2) how many of those index entries you load in memory (unless you decide to load the full index in memory, but I don't think that's what we're talking about). I.e. the raison d'être of both index_interval and column_index_size_in_kb is not that we have the notion of rows in the on-disk format.

          Don't get me wrong, I'm not saying that having a file format where the row key is no longer special is a bad idea at all (though I'll admit I'm less convinced that moving C* to such a file format should be a priority), but it seems to me that's a completely different debate.

          Stu Hood added a comment -

          What if we dropped the "main" index and just kept the "sample" index of every 1/128 columns?

          This is what the original 674-v1 patch did, and it worked out fairly well. In the long run, it would be a win in terms of config and code complexity, but it will probably be about the same performance-wise.

          From 674-v1 (where ColumnKey is a full path to a column: (key, name1, name2, name3, ...)):

          +/**
          + * An entry in the SSTable index file which points to a block in the data file.
          + * Each entry contains the full path of the column at the beginning of the block
          + * in the SSTable, and two file positions: the offset of the serialized version
          + * of this object in the index, and the offset of the block in the data file.
          + *
          + * To find a key in the data file, we first look at IndexEntries in memory, and find
          + * the last entry less than the key we want. We then seek to the position of that
          + * entry in the index file, read forward until we find the last entry less than the
          + * key, and then seek to the position of that entry's block in the data file.
          + */
          +public class IndexEntry extends ColumnKey
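          The lookup the javadoc describes can be sketched with in-memory stand-ins for the two levels (the classes and names below are illustrative, not the 674-v1 code): find the last sampled entry <= the key, then read forward in the "index file" to the last entry <= the key, then use that entry's block offset.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoLevelLookup {
    // memorySample: every Nth index entry -> its position in the index "file".
    // indexFile: sorted (key -> block offset in the data "file") entries.
    static long findBlockOffset(TreeMap<String, Integer> memorySample,
                                List<Map.Entry<String, Long>> indexFile,
                                String key) {
        // Last in-memory sample <= key tells us where to start reading the index.
        int start = memorySample.floorEntry(key).getValue();
        long offset = -1;
        // Read forward until we pass the key; keep the last entry <= key.
        for (int i = start; i < indexFile.size(); i++) {
            if (indexFile.get(i).getKey().compareTo(key) > 0) break;
            offset = indexFile.get(i).getValue();
        }
        return offset; // seek here in the data file
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> index = List.of(
                Map.entry("a", 0L), Map.entry("f", 100L),
                Map.entry("k", 200L), Map.entry("p", 300L));
        TreeMap<String, Integer> sample = new TreeMap<>();
        sample.put("a", 0); // sample every 2nd entry
        sample.put("k", 2);
        System.out.println(findBlockOffset(sample, index, "m"));
    }
}
```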
          
          Sylvain Lebresne added a comment -

          What if we dropped the "main" index and just kept the "sample" index of every 1/128 columns? Seems like we'd trade a little more seq i/o to do less random i/o, and being able to get rid of the index sampling phase on startup...

          I am not sure I follow.

          Jonathan Ellis added a comment -

          I don't see an easy way to merge those 2 settings into 1 if that was what you were hinting to.

          Yes, that's where I was going.

          What if we dropped the "main" index and just kept the "sample" index of every 1/128 columns? Seems like we'd trade a little more seq i/o to do less random i/o, and being able to get rid of the index sampling phase on startup...

          Sylvain Lebresne added a comment -

          we have the row index_interval measured in rows, but for columns we have column_index_size_in_kb

          Yes, I suppose we could switch index_interval to the size_in_kb style, which would better limit the amount of data we may end up reading in the index now that row index entries may have more widely varying sizes. However, I don't see an easy way to merge those two settings into one, if that was what you were hinting at.

          So, this isn't quite true

          I really just meant that the committed code has no interaction with the compression code. You are right, though, that having compression blocks not aligned with data boundaries can mean that sometimes we'll have to decompress one more block than would be strictly necessary, but that's more a drawback of compression blocks not being aligned to any data boundaries than anything related to this actual patch. I don't know how much that matters in practice, though.

          Stu Hood added a comment -

          We ended up making compression at the SequentialWriter level, i.e. transparent to the data and index format, so this ends up having no interaction whatsoever with compression.

          So, this isn't quite true: because the column/tuple index is unaware of blocks, and vice versa, the current scheme has to deal with alignment mismatches: tuples that fall on block boundaries require reading two blocks' worth of data in order to decompress one tuple. Not a very frequent occurrence when the tuple size is small enough... but anyway.
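          The cost of the alignment mismatch is easy to quantify: a tuple spanning a compression-block boundary touches two compressed blocks. A back-of-envelope sketch (plain arithmetic, not Cassandra code):

```java
// How many fixed-size compression blocks does a tuple at
// [offset, offset + len) in the uncompressed stream touch?
public class BlockAlignment {
    static int blocksTouched(long offset, long len, long blockSize) {
        long first = offset / blockSize;
        long last = (offset + len - 1) / blockSize;
        return (int) (last - first + 1);
    }

    public static void main(String[] args) {
        // A 200-byte tuple well inside a 64 KB block: one decompression.
        System.out.println(blocksTouched(100, 200, 64 * 1024));
        // A 20-byte tuple straddling a block boundary: two decompressions.
        System.out.println(blocksTouched(64 * 1024 - 10, 20, 64 * 1024));
    }
}
```

          With small tuples relative to the block size, the straddling case is rare (roughly len/blockSize of tuples), matching Stu's "not a very frequent occurrence" remark.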

          Jonathan Ellis added a comment -

          Ah, what I really meant was getColumnIndexSize. It seems like the next step should be to treat rows and columns more uniformly – we have the row index_interval measured in rows, but for columns we have column_index_size_in_kb.

          Sylvain Lebresne added a comment -

          Yes, that method is dead code but that's actually not new from this patch. I've created CASSANDRA-4076.

          Jonathan Ellis added a comment -

          I see getIndexedReadBufferSizeInKB is still around, is that dead code now?

          Sylvain Lebresne added a comment -

          You're right; I've pushed the trivial fix with commit 65c33fa.

          Dave Brosius added a comment -

          Perhaps I am reading this wrong, but this patch appears to make the SSTableIdentityIterator constructor expect sstable to be non-null, which is not the case for this constructor:

          public SSTableIdentityIterator(CFMetaData metadata, DataInput file, DecoratedKey key, long dataStart, long dataSize, IColumnSerializer.Flag flag)
          throws IOException
          {
              this(metadata, file, key, dataStart, dataSize, false, null, flag);
          }

          which is called from IncomingStreamReader.streamIn.

          I would expect an NPE in this case.
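          The failure mode Dave describes can be reproduced with simplified stand-ins (the classes below are illustrative, not the real SSTableIdentityIterator): a convenience constructor delegates with null for a collaborator, and any code path that later dereferences it throws.

```java
// Stand-alone illustration of an NPE introduced by a delegating constructor
// that passes null for a collaborator another code path dereferences.
public class NullDelegationDemo {
    static class Table {
        String name() { return "t"; }
    }

    static class IdentityIterator {
        final Table table;

        IdentityIterator(Table table, boolean checked) {
            this.table = table; // no null check here
        }

        // Convenience constructor mirroring the streaming path: passes null.
        IdentityIterator() {
            this(null, false);
        }

        String describe() {
            return table.name(); // NPE when constructed via IdentityIterator()
        }
    }

    public static void main(String[] args) {
        try {
            new IdentityIterator().describe();
        } catch (NullPointerException e) {
            System.out.println("NPE as expected");
        }
    }
}
```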

          Sylvain Lebresne added a comment -

          Committed, thanks

          Stu Hood added a comment -

          And truth is, I think it's a nice property that we should probably maintain unless we have a good reason not to

          Totally agreed. My mind blipped here, and for some reason I thought that you had removed the metadata in addition to the BF/column-index.

          +1 #shipit

          Sylvain Lebresne added a comment -

          Pushed a new commit on the github branch to address the following remarks:

          • ColumnIndexer
            • Consider renaming ColumnIndexer -> ColumnIndex, and moving all creation logic into factory functions... ColumnIndexer.Result is redundant
            • Will creating a BloomFilter with estimated size 0 do the right thing here? Perhaps explicitly ask for the empty bloom filter?

          Agreed, that's much cleaner; done.

          Consider adding an EMPTY/LIVE singleton DeletionInfo

          Not really related to this patch but good idea, done.

          the null checks in create() probably mean create() should be two overloaded methods

          We could, but the truth is create() is only ever called from one place, where it takes all its arguments. In fact the null check should never be triggered in the current code, but I figured it doesn't cost much to code defensively in case someone wants to use null instead of an empty index.

          the if (dis instanceof FileDataInput) skipFully logic should be static somewhere

          Right, that was copy-pasted code from the old ColumnIndexer, but in fact FileUtils.skipBytesFully() already does the right thing, so I just removed the whole if by using that.

          This class is probably in the wrong package

          I've put it in db because the old ColumnIndexer (from which RowIndexEntry inherits some of its code) was there too. Overall the db package already has sstable-related stuff, so I don't care too much and don't feel like changing it as I'm writing this. But I may move it during commit if I remember to.

          • IndexedSliceReader
            file = originalInput == null ? sstable.getFileDataInput(positionToSeek) : originalInput;
            file.seek(positionToSeek);
            

            could probably be cleaner

          I'm open to suggestions

          sstable.decodeKey(ByteBufferUtil.readWithShortLength(file)); -> ByteBufferUtil.skipShortLength

          Done

          When we decide not to store metadata in the file, the datafile is no longer enough to recover the data in case of lost indexes, which is probably fine if streaming has headed in a direction where we always send the index as well. Once we've passed that point of no-return, we could also move metadata out for narrow rows and consider removing the key from the data file row header, leaving the datafile containing nothing but columns (minor compression bonus)

          The thing is, we don't yet send the index while streaming. Even with this patch, the data file still has enough information to recompute the index file. And truth is, I think that's a nice property that we should probably maintain unless we have a good reason not to (it is true that with compression the data file alone is not enough, as you need to know the compression algorithm and the block size, but those are properties of the CF even if we lose the compression component).

          Stu Hood added a comment -
          • ColumnIndexer
            • Consider renaming ColumnIndexer -> ColumnIndex, and moving all creation logic into factory functions... ColumnIndexer.Result is redundant
            • Will creating a BloomFilter with estimated size 0 do the right thing here? Perhaps explicitly ask for the empty bloom filter?
          • DeletionInfo
            • Consider adding an EMPTY/LIVE singleton DeletionInfo
          • RowIndexEntry
            • the null checks in create() probably mean create() should be two overloaded methods, one of which checks the columnIndex size
            • the if (dis instanceof FileDataInput) skipFully logic should be static somewhere, and should definitely use DataInput.skipBytes in a loop rather than allocating a buffer
            • This class is probably in the wrong package... it's very sstable specific. Also, class javadoc.
          • IndexedSliceReader
            • file = originalInput == null ? sstable.getFileDataInput(positionToSeek) : originalInput;
              file.seek(positionToSeek);

              could probably be cleaner

          • SimpleSliceReader
            • sstable.decodeKey(ByteBufferUtil.readWithShortLength(file)); -> ByteBufferUtil.skipShortLength
          • General comments
            • When we decide not to store metadata in the file, the datafile is no longer enough to recover the data in case of lost indexes, which is probably fine if streaming has headed in a direction where we always send the index as well. Once we've passed that point of no-return, we could also move metadata out for narrow rows and consider removing the key from the data file row header, leaving the datafile containing nothing but columns (minor compression bonus).

          This looks good Sylvain... moving the bloom filter into the index as well looks like it ended up being a simple win.

          Sylvain Lebresne added a comment -

          Would it simplify things to just say "row deletion times are always in the index" and avoid having to have two code paths for that?

          Not really, at least not unless we also keep a column index entry for the narrow case. Otherwise we have to seek to the beginning of the row anyway, so we have two code paths anyway (and we'll end up deserializing the deletion times anyway). Besides, we also have to be backward compatible, so in fact we wouldn't really gain much, since the new narrow case is almost the same as the old one.
          Now, it is true that after this patch it could be worth making the distinction static CF versus dynamic CF rather than narrow versus wide, as it would make sense to keep a column index entry in the dynamic case (even if the row is narrow for that sstable), but I think this warrants some experimentation first, and I'd be in favor of leaving that to a follow-up ticket.

          Jonathan Ellis added a comment -

          for narrow [rows], we'll end up seeking at the beginning of the row, where the deletion times are anyway

          Would it simplify things to just say "row deletion times are always in the index" and avoid having to have two code paths for that?

          Sylvain Lebresne added a comment -

          For wide rows, the index entry also ships with the row deletion times

          Why just wide rows?

          Because for narrow ones, we'll end up seeking at the beginning of the row, where the deletion times are anyway.

          How did this work out, did it end up simplifying compression?

          We ended up implementing compression at the SequentialWriter level, i.e. transparent to the data and index formats, so this ends up having no interaction whatsoever with compression.

          Jonathan Ellis added a comment -

          For wide rows, the index entry also ships with the row deletion times

          Why just wide rows?

          I'm in favor of trying this (honestly more because I believe this could make checksumming and compression easier to add than anything else)

          How did this work out, did it end up simplifying compression?

          Sylvain Lebresne added a comment -

          I've put a version of this issue at https://github.com/pcmanus/cassandra/commits/2319_index_promotion (against current trunk). Contrary to the previously attached patches, this doesn't change the file format much. It pretty literally does what the issue title says: it promotes the column index from the data file to the index file. Note that the patch is split into 3 commits that have some form of logical separation, but the code only compiles with all 3 commits.

          So this removes the column index and bloom filter from the row header in the data file and moves them into the index file along with the (key, position) pair. There are a number of choices/details worth mentioning:

          • Only wide rows have a column index and bloom filter. So one difference with the current implementation is that skinny rows have no column bloom filter. I figure that it's probably not worth the space in the index file in the latter case (but I'm fine discussing that point)
          • The key cache now keeps the whole information from the index file for a given row. This means that for wide rows, the column index and bf are cached along with the position. Which is imo a good thing, but does mean the size of a key cache entry is not constant anymore (the estimation of the key cache memory size will have to be modified accordingly, but the current patch doesn't do it).
          • For wide rows, the index entry also ships with the row deletion times. This is necessary since we won't seek to the beginning of the row anymore.
          • In the column indexes, offsets are relative to the beginning of the row in the data file, rather than to the beginning of the index as is the case now.
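
          As a rough illustration of the bullets above, a promoted index entry could be serialized along these lines. This is a hypothetical sketch only — the field names, widths, and ordering are illustrative assumptions, not the patch's actual on-disk format:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: (key, position) for every row, and for wide rows only,
// the row deletion times plus a column index whose block offsets are relative
// to the row's start position in the data file.
public class IndexEntrySketch {
    public static byte[] serialize(byte[] key, long position, boolean wide,
                                   long markedForDeleteAt, int localDeletionTime,
                                   byte[][] blockNames, long[] blockOffsets) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeShort(key.length);
        out.write(key);
        out.writeLong(position);              // row start in the data file
        out.writeBoolean(wide);
        if (wide) {
            out.writeLong(markedForDeleteAt); // deletion times ride along so a
            out.writeInt(localDeletionTime);  // wide-row read can skip the row header
            out.writeInt(blockOffsets.length);
            for (int i = 0; i < blockOffsets.length; i++) {
                out.writeShort(blockNames[i].length);
                out.write(blockNames[i]);     // first column name of the block
                out.writeLong(blockOffsets[i]); // relative to `position`, not absolute
            }
        }
        return bytes.toByteArray();
    }
}
```

          Note how the narrow case stops after the boolean, which matches the point above that skinny rows pay almost nothing extra in the index file.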

          Some other implementation points:

          • EchoedRow is removed. It would be possible to echo rows following this patch, but we would need to echo the column index too; that felt complicated enough that it can be left to a later ticket if we consider it worth it.
          • I didn't find a way to implement this patch that wasn't overly complicated/inefficient without using seek() instead of just file marks. So in particular MappedFileDataInput gets a seek() method, even though that method throws an exception if we seek outside the segment (which should never happen).

          I did a short (and honestly not very scientific) benchmark of a time-series-like workload, with a number of threads inserting time series columns into a bunch of rows and other threads reading the tail of those rows (as expected, the performance degrades as more sstables are added and improves with compaction). As soon as more than 1 sstable was present, the performance with this patch was around 30-40% better than without it. I'll note that the test was very short and ran entirely on localhost, so again the exact benefits may vary, but the ability to discard sstables based on index info (saving a seek) seems to be a clear boost in that case.

          I didn't see any noticeable difference (good or bad) on a normal stress run, as should be expected.

          Note that this patch paves the way to removing the two-phase compaction of LazilyCompactedRow, but that is left to a follow-up ticket.

          Ryan King added a comment -

          I haven't followed that ticket closely, but I think the answer is yes. For wide row use cases this patch lets you eliminate SStables with only the info in the index (because we know what range(s) of columns for a row are in that file).

          Jonathan Ellis added a comment -

          Would this allow us to do CASSANDRA-2498 style optimizations on slice queries?

          Stu Hood added a comment - - edited

          Attaching a version of this patch that has been rebased atop 674. Together they exhibit very good performance.

          Regarding the key-cache: in this latest revision the SSTableReader will only cache narrow rows (based on the number of blocks they contain). IMO, this is a reasonable temporary middle ground, since wide rows can be eliminated using only the index, and narrow rows continue to enjoy the benefits of the key cache. The long term goal would still be to cache the block headers for the file individually.

          EDIT: The fundamental change in this revision is that in order to gain random access to a portion of a row, an SSTableReader.getPositions method now returns a collection of row headers, which are essentially equivalent to the row index in the existing format. This information is enough to eliminate a datafile.

          Stu Hood added a comment - - edited

          Rebased 2319 without changes: applies atop CASSANDRA-2336 and CASSANDRA-2398.

          Our internal deployment of the UUID based counters is blocked on good read performance for compressed wide rows, so our next step will be to integrate this patch with CASSANDRA-674.

          Stu Hood added a comment - - edited

          Output from some stress.java runs is attached as version-*.txt:

          • version-f: trunk
          • version-g: promoted index without LZF, but with type specific compression (for offsets and value lengths)
          • version-g-lzf: promoted index with all the features of 2398

          A slightly larger scale wide row test is described in promotion.pdf

          • 90/10 writes/reads for 6 hours
            • appending to 30,000 keys
            • reading the tail of the row
          • sstables-per-read histogram for a wide-row workload
          • trunk (unpromoted) vs 2319 (promoted)
          • ~12GB of data, ~3GB of page cache

          Two takeaways:

          1. Promotion allowed one less sstable to be accessed on average
            • At the end of both runs, the largest sstable was 6GB, which was also the only file large enough for promotion to kick in (for 30,000 keys and column_index_size=64KB, it takes an sstable larger than 1.92GB for the nested index to start promoting column names)
          2. The promoted index was able to accomplish 25% more reads during the time period covered by the test, likely due to hitting one less file
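
          The 1.92GB figure in the first takeaway is just the quoted parameters multiplied out: a row only gets promoted column-name entries once it exceeds column_index_size, so with 30,000 keys the sstable as a whole must exceed 30,000 × 64KB. A sketch of that arithmetic (method name is illustrative):

```java
public class PromotionThreshold {
    // A row's column names are only promoted once the row exceeds
    // column_index_size, so an sstable must exceed keys * column_index_size
    // before the nested index starts promoting names anywhere.
    public static long thresholdKB(long keys, long columnIndexSizeKB) {
        return keys * columnIndexSizeKB;
    }

    public static void main(String[] args) {
        // 30,000 keys * 64KB = 1,920,000 KB, i.e. the ~1.92GB quoted above
        System.out.println(thresholdKB(30_000, 64));
    }
}
```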
          Stu Hood added a comment -

          The attached patch depends on 2336 and 2398, and implements a compressed, promoted index (called NestedIndex) containing enough information to eliminate wide rows without seeking into the data file. Only some of the features from the issue description are actually implemented: the primary goal of this first version was to begin eliminating sstables for wide-row (time series) usecases.

          The nested index contains nearly as much information as our current row header: the only thing missing at this point is the bloom filter (which I think needs more thought to get right, and wasn't critical for our slice usecase).

          Narrow rows (rows < column_index_size) are represented in the index by their key, offset and two bits to indicate that neither metadata nor column names have been stored in the index. Rows wider than column_index_size will have an entry containing their metadata and one or more column entries, with a bit per column entry to indicate their ownership.

          The key cache stores RowHeaders, which have a base implementation that is essentially just a boxed long offset (matching our existing impl in memory usage). Wide rows use the NestedRowHeader subclass, which additionally contains row metadata and the min and max columns in the row, which is enough to eliminate a row for slices. The intention is that as our file format migrates toward being block based, RowHeader will evolve into a BlockHeader describing a seekable point in the data file (and possibly cached in a sorted structure as described above).

          Data in the index mostly uses the design from [FileFormatDesignDoc], and is compressed using either type specific compression or LZF (via CASSANDRA-2398). The end result is that for a narrow-row usecase and simple keys, the index is half the size of the existing implementation: complex keys will likely see larger benefits.
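
          The type-specific compression of clustered offsets mentioned above can be sketched as plain delta encoding: since offsets within a row index are sorted, storing the first value and then successive gaps yields small numbers that a varint or LZF pass compresses well. The names here are illustrative, not the patch's actual encoder:

```java
import java.util.Arrays;

// Sketch of delta encoding for a sorted run of offsets.
public class OffsetDeltas {
    public static long[] encode(long[] offsets) {
        long[] deltas = offsets.clone();
        for (int i = deltas.length - 1; i > 0; i--)
            deltas[i] -= deltas[i - 1]; // each entry becomes a gap, not an absolute
        return deltas;
    }

    public static long[] decode(long[] deltas) {
        long[] offsets = deltas.clone();
        for (int i = 1; i < offsets.length; i++)
            offsets[i] += offsets[i - 1]; // running sum restores absolutes
        return offsets;
    }

    public static void main(String[] args) {
        long[] offsets = {65536, 131200, 196990};
        System.out.println(Arrays.toString(encode(offsets)));
    }
}
```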

          Stu Hood added a comment -

          Regarding the sorted cache structure: it looks like it could be accomplished by using a MapMaker LRU map, and an eviction listener to maintain a CSLM: fairly heavy, but...

          Stu Hood added a comment -

          A solution for the key cache is to allow for fuzzy cache entry matches via a sorted cache structure (if it existed, something like ConcurrentLinkedSkipListMap would be ideal).

          The key cache as it exists gives us exact matches for keys, but when the resolution of the cache increases to columns, the chance of hitting the same column twice (while reasonably high) is not high enough. Ideally we'd be able to fuzzily hit a nearby/next-highest cache entry that represents a range or block of columns.

          An example: the cache contains an entry for columns in the range:

          {("user1","entry0100"), ("user1","entry0200")}

          If a query comes in for a slice starting from ("user1", "entry0150"), we would perform a fuzzy/floor lookup in the cache and hit our entry. A lookup that doesn't fall into the range covered by a cache entry would be a miss, and would result in reading from the index, and the smallest range bounding the lookup being added to the cache.
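
          The floor-lookup behaviour described above can be sketched with the JDK's ConcurrentSkipListMap (the ConcurrentLinkedSkipListMap wished for earlier doesn't exist, but the skip-list map provides the needed floorEntry). All names and the Long "block position" value are placeholders, not the patch's actual cache entry type:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of a fuzzy/floor cache: maps the start column of a cached block
// to (end column, block position); a lookup hits only if the queried
// column lands inside a cached range.
public class FuzzyColumnCache {
    private final ConcurrentSkipListMap<String, Map.Entry<String, Long>> cache =
            new ConcurrentSkipListMap<>();

    public void put(String start, String end, long blockPosition) {
        cache.put(start, new SimpleEntry<>(end, blockPosition));
    }

    /** Returns the cached block position covering `column`, or null on a miss. */
    public Long lookup(String column) {
        Map.Entry<String, Map.Entry<String, Long>> e = cache.floorEntry(column);
        if (e == null)
            return null; // no cached range starts at or before this column
        // hit only if the column is <= the cached range's end column
        return column.compareTo(e.getValue().getKey()) <= 0 ? e.getValue().getValue() : null;
    }
}
```

          With the entry {("user1","entry0100"), ("user1","entry0200")} cached, a lookup for "entry0150" floors to "entry0100" and hits, while "entry0250" floors to the same entry but falls past its end and misses.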

          Sylvain Lebresne added a comment -

          Agreed... my point is simply that the number of columns-per-key and the number of keys are inversely proportional: if you have more columns-per-key, you have less keys, and vice-versa. The index will grow proportionally with the total number of columns, not with the number of keys.

          I do not share your confidence that this is axiomatic. It is certainly not axiomatic to the data model. Anyway, that was just a remark, not a criticism of the approach.

          Yea... the key cache as it exists does not necessarily need to change, but at some point we'll want to update it to include the improvements from this ticket.

          Maybe there is a misunderstanding here. I assumed that promoting the row index implied removing the row index (in favor of a richer sstable index). And even though a first iteration of this doesn't necessarily imply this removal, I'll still assume it, because I believe this would be weird to keep in the long run even if we keep it in the short run.

          So if you don't have a row index, caching the row key position as the actual key cache does will be counter-productive for any non-narrow row, since looking at the sstable index would get you closer to the column. So it would make the key cache as it exists only useful for narrow rows (which makes it less useful, though not useless).

          It depends on the number of unique queries to the row, but I'm willing to bet that the number of unique queries to a row is relatively low.

          Take time series (which I doubt can be called a niche use case). If the start of your slice query depends on the current time, almost all the queries will be unique. Or if you page on the time series and it has a reasonably high rate of inserts, then the pages will always be changing and thus so will your queries. Given how long it took me to come up with those two examples (which I did personally use, btw; it's not just my imagination running wild), I suspect there are a number of other similar cases.

          Will those be a minority of all the queries on wide rows? I don't know, probably for some people but maybe not for others. People come up with new ways to use the Cassandra data model all the time; let's not base our reasoning on unchecked assumptions about the kind of queries people do.

          Is that a big deal considering that in promoting the row index we will be at 2 seeks for those cases but we're already at 2 seeks on a key cache hit? Probably not (though for the pleasure of nitpicking I'll add that the 2 seeks in the current case of a key cache hit are closer on disk than the 2 seeks with the promoted index would be).

          Am I willing to not keep those cases in mind because Stu is willing to bet it doesn't matter? Certainly not (which doesn't mean I don't love you).

          Anyways, I'm in favor of trying this (honestly more because I believe this could make checksumming and compression easier to add than anything else).

          Stu Hood added a comment -

          > What is the definition of a narrow row?
          I expect that the definition will still be approximately "less than column_index_size_in_kb". For those users, the change would have no effect, since they didn't have a row index in the first place.

          Edward Capriolo added a comment -

          Is the plan to let users choose the type on a per-CF basis? If so, by version 0.8.0 will a user choose between standard | compressed | wide when creating a CF?

          I think the use case and description make sense. However:

          The index will grow proportionally with the total number of columns, not with the number of keys.

          This worries me. We tend to do getSlice for entire keys. So having these large indexes does not benefit this use case much.

          For narrow rows, this change would have no effect, as they will not reach the threshold for indexing anyway.

          What is the definition of a narrow row?

          Stu Hood added a comment - - edited

          > I'm not saying anything else. What I'm saying is that there is potentially
          > orders of magnitude more 'tuples' than there are keys.
          Agreed... my point is simply that the number of columns-per-key and the number of keys are inversely proportional: if you have more columns-per-key, you have less keys, and vice-versa. The index will grow proportionally with the total number of columns, not with the number of keys.

          > Also, as tjake remarked, it is unclear how to update the key cache with that proposal.
          Yea... the key cache as it exists does not necessarily need to change, but at some point we'll want to update it to include the improvements from this ticket.
          > You could cache column position ('tuple') directly, but that will be much less useful.
          I'm not sure that this would be less useful: promoting the index takes us from 3 to 2 seeks for wide rows: making this particular change to the key cache would take it from 2 to 1 seek for wide rows (get cached position, seek directly to column). It depends on the number of unique queries to the row, but I'm willing to bet that the number of unique queries to a row is relatively low.

          Sylvain Lebresne added a comment -

          The important thing to remember is that the distinction between columns and keys should be very fuzzy: columns are a suffix on keys, and treating them otherwise leads to complications. In this case, we shouldn't be holding every 128th "key" in memory, but instead every 128th-512th tuple: that way wide rows are handled naturally.

          I'm not saying anything else. What I'm saying is that there are potentially orders of magnitude more 'tuples' than there are keys. So far I doubt many people have ever changed the index_interval value. I suppose this will change if we do this (I mean, we have advertised that you can have 2 billion columns per row, after all), and we may even be willing to make index_interval configurable per-CF. In turn, this will be more things to consider for the user and a bigger chance for them to OOM if they are not careful. Probably nothing horrible, but let's just make sure we understand as many of the pros and cons as possible.

          Also, as tjake remarked, it is unclear how to update the key cache with that proposal. You could cache column position ('tuple') directly, but that will be much less useful. Maybe the key cache could be kept but limited to skinny rows. Something to consider anyway.

          T Jake Luciani added a comment -

          This would make the key cache much less effective, no? Also, the index file would now contain an entry for every row rather than every Nth.

          Stu Hood added a comment -

          > I think this would also mean we can remove the back-seek from sstable writing.
          > In which case I am a huge fan in principle.
          I suspect that this is more challenging: the back-seek also writes the row length. We'd need to move to a blocked design like the one on #674 to replace the row length with a block length in the data file.
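A minimal sketch of the blocked idea (hypothetical layout, not the #674 proposal itself): each block is buffered and written with its own length prefix, so the writer never needs the back-seek that filling in a total row length requires — the stream stays append-only.

```python
# Illustrative only: write columns as length-prefixed blocks. The block
# length is known before the block is written, so no seek-back is needed.
import io
import struct

def write_row_blocked(out, columns, block_size=3):
    """Write `columns` (a list of bytes values) as length-prefixed blocks."""
    for i in range(0, len(columns), block_size):
        block = b"".join(columns[i:i + block_size])
        out.write(struct.pack(">I", len(block)))  # 4-byte block length
        out.write(block)

buf = io.BytesIO()
write_row_blocked(buf, [b"a", b"bb", b"ccc", b"dddd"])
# The buffer now holds two blocks: a 6-byte block ("abbccc") and a
# 4-byte block ("dddd"), each preceded by its length.
```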

          > This will remove as a consequence the row bloom filter
          Not necessarily, but I think this ticket does highlight the fact that the column bloom filter is ill-positioned: it can prevent the 3rd seek, but only for names queries, which (I suspect) are less likely on wide rows. Nonetheless, I can imagine wanting to do a point query for a secondary index to determine whether a particular row matches the index, so we should probably consider promoting it as well.
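For context, a toy column-name bloom filter (illustrative Python, not Cassandra's implementation) showing how a promoted filter could veto the data-file seek for a names query — a definite "no" means the sstable can be skipped without touching the data file:

```python
# Toy bloom filter: a single integer bitmap plus k hash positions per name.
# False positives are possible; false negatives are not.
import hashlib

class TinyBloom:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.bitmap = bits, hashes, 0

    def _positions(self, name: bytes):
        for i in range(self.hashes):
            h = hashlib.md5(bytes([i]) + name).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, name: bytes):
        for p in self._positions(name):
            self.bitmap |= 1 << p

    def might_contain(self, name: bytes) -> bool:
        return all(self.bitmap >> p & 1 for p in self._positions(name))

bf = TinyBloom()
bf.add(b"col1")
# bf.might_contain(b"col1") is True; for names never added it is almost
# always False, letting a reader skip the seek into the data file.
```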

          > There will be new trade-offs: either you index 'often' or the index becomes potentially much bigger
          The important thing to remember is that the distinction between columns and keys should be very fuzzy: columns are a suffix on keys, and treating them otherwise leads to complications. In this case, we shouldn't be holding every 128th "key" in memory, but instead every 128th-512th tuple: that way wide rows are handled naturally.

          This also normalizes our indexing: the size of your index depends on the total number of columns, instead of on the width and number of your rows.
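A sketch of what sampling every Nth tuple (rather than every Nth key) could look like; the names and structure here are hypothetical, not Cassandra's index:

```python
# Illustrative only: sample every `interval`-th (row_key, column_name) tuple
# from a sorted sequence, then binary-search the samples to bound a scan.
# A single wide row contributes many samples, so it is handled the same way
# as many narrow rows.
import bisect

def build_tuple_index(tuples, interval):
    """tuples: sorted (row_key, column_name) pairs; returns (tuple, position)."""
    return [(t, i) for i, t in enumerate(tuples) if i % interval == 0]

def nearest_sample(samples, target):
    """Greatest sampled tuple <= target, or None if target precedes them all."""
    keys = [t for t, _ in samples]
    i = bisect.bisect_right(keys, target) - 1
    return samples[i] if i >= 0 else None

# One wide row "k" with 10 columns, sampled every 4th tuple.
samples = build_tuple_index([("k", "c%03d" % j) for j in range(10)], interval=4)
# nearest_sample(samples, ("k", "c005")) lands on the sample at position 4,
# bounding the forward scan to at most `interval` tuples.
```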

          Sylvain Lebresne added a comment -

          I'm reasonably in favor of this too, given that it would also give us a good place to add checksumming (which should be a reasonably short term priority imho) and compression.

          There are, however, a few things to consider:

          • As a consequence, this will remove the row bloom filter (maybe it makes sense to promote it too, in some form, to the sstable level, but I'm not sure). The row bloom filter is not the most useful trick we have around (only useful for names queries, for starters). Still something to keep in mind.
          • There will be new trade-offs: either you index 'often' or the index becomes potentially much bigger (in which case you push the problem to the sparse in-memory index). I suspect it will be harder to have a one-configuration-fits-all setting.
          Jonathan Ellis added a comment -

          I think this would also mean we can remove the back-seek from sstable writing. In which case I am a huge fan in principle.


            People

            • Assignee:
              Sylvain Lebresne
            • Reporter:
              Stu Hood
            • Reviewer:
              Stu Hood