Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Later
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None

      Description

      Bitmap indexes are a very efficient structure for dealing with immutable data. We can take advantage of the fact that SSTables are immutable by attaching them directly to SSTables as a new component (supported by CASSANDRA-1471).

      1. 0.7-1472-v5.tgz
        67 kB
        Stu Hood
      2. 0.7-1472-v6.tgz
        70 kB
        Stu Hood
      3. 1472-v3.tgz
        52 kB
        Stu Hood
      4. 1472-v4.tgz
        57 kB
        Stu Hood
      5. 1472-v5.tgz
        68 kB
        Stu Hood
      6. anatomy.png
        73 kB
        Stu Hood
      7. ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1472-rebased-to-0.7-branch.txt
        276 kB
        T Jake Luciani
      8. ASF.LICENSE.NOT.GRANTED--0019-Rename-bugfixes-and-fileclose.txt
        15 kB
        T Jake Luciani
      9. v4-bench-c32.txt
        3 kB
        Stu Hood

        Issue Links

          Activity

          Stu Hood created issue -
          Stu Hood made changes -
          Field Original Value New Value
          Attachment 0001-Add-OpenBitSet.patch [ 12453891 ]
          Attachment 0002-Optionally-preserve-observe-deserialized-row-in-orde.patch [ 12453892 ]
          Attachment 0003-Add-BitmapIndexWriter.-Naive-buffers-all-contents-to.patch [ 12453893 ]
          Stu Hood made changes -
          Attachment 0004-Add-BitmapIndexReader-and-BMIR.Iterator.patch [ 12453894 ]
          Attachment 0005-Move-getIndexScanPosition-into-IndexSummary-and-add-.patch [ 12453895 ]
          Attachment 0006-Add-IndexType.KEYS_BITMAP-and-use-it-to-open-BitmapI.patch [ 12453896 ]
          Stu Hood made changes -
          Attachment 0007-Remove-SSTableUtils-dependency-on-writer.append-byte.patch [ 12453897 ]
          Attachment 0008-Add-SSTableReader.scan-and-a-simple-test.patch [ 12453898 ]
          Stu Hood made changes -
          Link This issue is blocked by CASSANDRA-1471 [ CASSANDRA-1471 ]
          Hide
          Stu Hood added a comment -
          • 0001 - Lucene's OpenBitSet class added in utils.obs, with minimal changes and the intention to submit it to Google's Guava
          • 0002 - ColumnObserver class which collects values for a column (in order to index them, in this case). Hooks ColumnObservers into compaction (TODO: but doesn't handle SSTable recovery yet)
          • 0003 - Implementation of a binned bitmap index writer using Avro persistence. TODOs:
            • Currently buffers entire index in memory during SSTable writing: should flush and merge (in a 'scratch' SSTable component, or a separate scratch location?)
            • Uses naive bin selection: techniques to improve this are mentioned in the original paper. [1]
          • 0004 - Reader/iterator for binned bitmap index files
          • 0005 - IndexSummary operation to lookup rowids by an AbstractBound
          • 0006 - Adds the KEYS_BITMAP index type, and uses the new Component infrastructure to load them
          • 0007 - Removes test.SSTableUtils dependency on the BMT write path, which had caused bugs in the past, and isn't yet supported for this implementation of secondary indexing
          • 0008 - Implements SSTableReader.scan, a low level interface to be incorporated behind CFS.scan. Provides an iterator over the set of keys which probably match the given index expression. The paper specifies a really neat way to remove these false positives, which we can implement at a later date. TODO:
            • Not incorporated into ColumnFamilyStore.scan at all yet.
            • Needs support for ORing indexes together to support operators other than IndexOperator.EQ
            • Needs support for ANDing indexes to take advantage of multiple expressions

          [1] http://scholar.google.com/scholar?q=Breaking+the+curse+of+high+cardinality

          Show
          Stu Hood added a comment - 0001 - Lucene's OpenBitSet class added in utils.obs, with minimal changes and the intention to submit it to Google's Guava 0002 - ColumnObserver class which collects values for a column (in order to index them, in this case). Hooks ColumnObservers into compaction (TODO: but doesn't handle SSTable recovery yet) 0003 - Implementation of a binned bitmap index writer using Avro persistence. TODOs: Currently buffers entire index in memory during SSTable writing: should flush and merge (in a 'scratch' SSTable component, or a separate scratch location?) Uses naive bin selection: techniques to improve this are mentioned in the original paper. [1] 0004 - Reader/iterator for binned bitmap index files 0005 - IndexSummary operation to lookup rowids by an AbstractBound 0006 - Adds the KEYS_BITMAP index type, and uses the new Component infrastructure to load them 0007 - Removes test.SSTableUtils dependency on the BMT write path, which had caused bugs in the past, and isn't yet supported for this implementation of secondary indexing 0008 - Implements SSTableReader.scan, a low level interface to be incorporated behind CFS.scan. Provides an iterator over the set of keys which probably match the given index expression. The paper specifies a really neat way to remove these false positives, which we can implement at a later date. TODO: Not incorporated into ColumnFamilyStore.scan at all yet. Needs support for ORing indexes together to support operators other than IndexOperator.EQ Needs support for ANDing indexes to take advantage of multiple expressions [1] http://scholar.google.com/scholar?q=Breaking+the+curse+of+high+cardinality
          Stu Hood made changes -
          Attachment 0008-Add-SSTableReader.scan-and-a-simple-test.patch [ 12453898 ]
          Stu Hood made changes -
          Attachment 0007-Remove-SSTableUtils-dependency-on-writer.append-byte.patch [ 12453897 ]
          Stu Hood made changes -
          Attachment 0006-Add-IndexType.KEYS_BITMAP-and-use-it-to-open-BitmapI.patch [ 12453896 ]
          Stu Hood made changes -
          Attachment 0005-Move-getIndexScanPosition-into-IndexSummary-and-add-.patch [ 12453895 ]
          Stu Hood made changes -
          Attachment 0004-Add-BitmapIndexReader-and-BMIR.Iterator.patch [ 12453894 ]
          Stu Hood made changes -
          Attachment 0003-Add-BitmapIndexWriter.-Naive-buffers-all-contents-to.patch [ 12453893 ]
          Stu Hood made changes -
          Attachment 0003-Add-BitmapIndexWriter.-Naive-buffers-all-contents-to.patch [ 12453914 ]
          Attachment 0004-Add-BitmapIndexReader-and-SegmentIterator-for-iterat.patch [ 12453915 ]
          Attachment 0005-Move-getIndexScanPosition-into-IndexSummary-and-add-.patch [ 12453916 ]
          Hide
          Stu Hood added a comment -

          Updates to 0003-0008 to fix a dumb mistake in bin expansion, and switch bin management to arraylist + binarysearch.

          Show
          Stu Hood added a comment - Updates to 0003-0008 to fix a dumb mistake in bin expansion, and switch bin management to arraylist + binarysearch.
          Stu Hood made changes -
          Attachment 0006-Add-IndexType.KEYS_BITMAP-and-use-it-to-open-BitmapI.patch [ 12453917 ]
          Attachment 0007-Remove-SSTableUtils-dependency-on-writer.append-byte.patch [ 12453918 ]
          Attachment 0008-Add-SSTableReader.scan-and-a-simple-test.patch [ 12453919 ]
          Stu Hood made changes -
          Attachment 0008-Add-SSTableReader.scan-and-a-simple-test.patch [ 12453919 ]
          Stu Hood made changes -
          Attachment 0007-Remove-SSTableUtils-dependency-on-writer.append-byte.patch [ 12453918 ]
          Stu Hood made changes -
          Attachment 0006-Add-IndexType.KEYS_BITMAP-and-use-it-to-open-BitmapI.patch [ 12453917 ]
          Stu Hood made changes -
          Attachment 0005-Move-getIndexScanPosition-into-IndexSummary-and-add-.patch [ 12453916 ]
          Stu Hood made changes -
          Attachment 0004-Add-BitmapIndexReader-and-SegmentIterator-for-iterat.patch [ 12453915 ]
          Stu Hood made changes -
          Attachment 0003-Add-BitmapIndexWriter.-Naive-buffers-all-contents-to.patch [ 12453914 ]
          Stu Hood made changes -
          Attachment 0003-Add-BitmapIndexWriter.-Naive-buffers-all-contents-to.patch [ 12453985 ]
          Attachment 0004-Add-BitmapIndexReader-and-SegmentIterator-for-iterat.patch [ 12453986 ]
          Attachment 0005-Move-getIndexScanPosition-into-IndexSummary-and-add-.patch [ 12453987 ]
          Stu Hood made changes -
          Attachment 0006-Add-IndexType.KEYS_BITMAP-and-use-it-to-open-BitmapI.patch [ 12453988 ]
          Attachment 0007-Remove-SSTableUtils-dependency-on-writer.append-byte.patch [ 12453989 ]
          Attachment 0008-Add-SSTableReader.scan-and-a-simple-test.patch [ 12453990 ]
          Hide
          Stu Hood added a comment -

          Updates 0003 through 0008 to clean up class structure and improve the tests. Adds 0009:

          • Adds JoiningIterator for OR and AND operations, to join multiple indexes or bins from the same index.
          • Uses JoiningIterator.OR to implement index range queries.

          This is approaching reviewability. TODOs, in order of importance:

          • Implement flush/merge
          • Integrate with SSTable recovery
          • Incorporate into CFS.scan
          • Improve bin selection
          Show
          Stu Hood added a comment - Updates 0003 through 0008 to clean up class structure and improve the tests. Adds 0009: Adds JoiningIterator for OR and AND operations, to join multiple indexes or bins from the same index. Uses JoiningIterator.OR to implement index range queries. This is approaching reviewability. TODOs, in order of importance: Implement flush/merge Integrate with SSTable recovery Incorporate into CFS.scan Improve bin selection
          Stu Hood made changes -
          Attachment 0009-Add-JoiningIterator-to-implement-GLT-E-will-also-joi.patch [ 12453991 ]
          Stu Hood made changes -
          Attachment 0009-Add-JoiningIterator-to-implement-GLT-E-will-also-joi.patch [ 12453991 ]
          Stu Hood made changes -
          Attachment 0008-Add-SSTableReader.scan-and-a-simple-test.patch [ 12453990 ]
          Stu Hood made changes -
          Attachment 0007-Remove-SSTableUtils-dependency-on-writer.append-byte.patch [ 12453989 ]
          Stu Hood made changes -
          Attachment 0006-Add-IndexType.KEYS_BITMAP-and-use-it-to-open-BitmapI.patch [ 12453988 ]
          Stu Hood made changes -
          Attachment 0005-Move-getIndexScanPosition-into-IndexSummary-and-add-.patch [ 12453987 ]
          Stu Hood made changes -
          Attachment 0004-Add-BitmapIndexReader-and-SegmentIterator-for-iterat.patch [ 12453986 ]
          Stu Hood made changes -
          Attachment 0003-Add-BitmapIndexWriter.-Naive-buffers-all-contents-to.patch [ 12453985 ]
          Stu Hood made changes -
          Attachment 0002-Optionally-preserve-observe-deserialized-row-in-orde.patch [ 12453892 ]
          Stu Hood made changes -
          Attachment 0001-Add-OpenBitSet.patch [ 12453891 ]
          Hide
          Stu Hood added a comment -

          Attaching a third version of the patch. Since the last revision:

          • 0001 through 0009 were updated to fix bugs, but didn't experience any significant changes
          • Integrated KEYS_BITMAP indexes into ColumnFamilyStore.scan (0012, 0013)
            • Added abstract db.secindex.SecondaryIndex, with KeysIndex and KeysBitmapIndex subclasses
              • SecondaryIndex.certainty(expression) returns false if the index can return false positives
              • SecondaryIndex.iterator(bounds, expression) iterates keys matching the index
            • KeysBitmapIndex implements the probable key iteration by merging the results from the sstable bitmap indexes with the results of a brute-force filtering of all memtables (see TODOs)
          • Indexes larger than memory are supported (0010)
            • Bitmap indexes are recreated during usage of SSTableWriter (no change there)
            • The 'io.sstable.bitidx.ScratchStack' object implements merging for bitidxs which is very similar to SSTable compaction, but it occurs synchronously while the SSTableWriter is in use:
              1. A segment of the index is built in memory, flushed to disk, and added to a stack
              2. When enough segments of the same size exist, the stack merges them
            • Although IO is still mostly sequential, SSTable write performance drops by about 25% per index added. Haven't looked into bottlenecks here.

          TODOs, in order of importance:

          • Integrate rebuilding bitidxs into SSTable recovery
          • Replace Memtable brute-force filtering with an in-memory index attached to the Memtable
          • Improve bin selection

          I think this is probably ready for review if anyone has time: incorporating KEYS_BITMAP indexes into SSTable recovery is probably the last blocker.

          Show
          Stu Hood added a comment - Attaching a third version of the patch. Since the last revision: 0001 through 0009 were updated to fix bugs, but didn't experience any significant changes Integrated KEYS_BITMAP indexes into ColumnFamilyStore.scan ( 0012 , 0013 ) Added abstract db.secindex.SecondaryIndex, with KeysIndex and KeysBitmapIndex subclasses SecondaryIndex.certainty(expression) returns false if the index can return false positives SecondaryIndex.iterator(bounds, expression) iterates keys matching the index KeysBitmapIndex implements the probable key iteration by merging the results from the sstable bitmap indexes with the results of a brute-force filtering of all memtables (see TODOs) Indexes larger than memory are supported ( 0010 ) Bitmap indexes are recreated during usage of SSTableWriter (no change there) The 'io.sstable.bitidx.ScratchStack' object implements merging for bitidxs which is very similar to SSTable compaction, but it occurs synchronously while the SSTableWriter is in use: A segment of the index is built in memory, flushed to disk, and added to a stack When enough segments of the same size exist, the stack merges them Although IO is still mostly sequential, SSTable write performance drops by about 25% per index added. Haven't looked into bottlenecks here. TODOs, in order of importance: Integrate rebuilding bitidxs into SSTable recovery Replace Memtable brute-force filtering with an in-memory index attached to the Memtable Improve bin selection I think this is probably ready for review if anyone has time: incorporating KEYS_BITMAP indexes into SSTable recovery is probably the last blocker.
          Stu Hood made changes -
          Attachment 1472-v3.tgz [ 12454918 ]
          Stu Hood made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Stu Hood added a comment -

          Attaching a diagram of the call hierarchy for KEYS_BITMAP indexes.

          Show
          Stu Hood added a comment - Attaching a diagram of the call hierarchy for KEYS_BITMAP indexes.
          Stu Hood made changes -
          Attachment anatomy.png [ 12454957 ]
          Stu Hood made changes -
          Link This issue is blocked by CASSANDRA-1415 [ CASSANDRA-1415 ]
          Hide
          Stu Hood added a comment -

          Linking with 1415, which refactors SSTable recovery, and adds support for Migrations creating indexes.

          Show
          Stu Hood added a comment - Linking with 1415, which refactors SSTable recovery, and adds support for Migrations creating indexes.
          Stu Hood made changes -
          Link This issue is related to CASSANDRA-1531 [ CASSANDRA-1531 ]
          Hide
          Stu Hood added a comment -

          Attaching v4:

          • Cleaned up the join between primary and secondary indexes: now accomplished by the SSTR.PrimaryToSecondary iterator
          • Turned on lightweight compression by default, since it gave faster numbers on my CPU bound machine
          • Fixed lots of bugs related to larger than memory indexes
          • Added 0014 to skip short byte arrays rather than allocating any memory
          Show
          Stu Hood added a comment - Attaching v4: Cleaned up the join between primary and secondary indexes: now accomplished by the SSTR.PrimaryToSecondary iterator Turned on lightweight compression by default, since it gave faster numbers on my CPU bound machine Fixed lots of bugs related to larger than memory indexes Added 0014 to skip short byte arrays rather than allocating any memory
          Stu Hood made changes -
          Attachment 1472-v4.tgz [ 12455223 ]
          Attachment anatomy.png [ 12455224 ]
          Stu Hood made changes -
          Attachment anatomy.png [ 12454957 ]
          Stu Hood made changes -
          Attachment anatomy.png [ 12455224 ]
          Stu Hood made changes -
          Attachment anatomy.png [ 12455225 ]
          Hide
          Stu Hood added a comment -

          Attaching some benchmark numbers generated by CASSANDRA-1531.

          v4-bench-c32.txt is for cardinality == 32, which is a bit of a magic number in the current implementation of KEYS_BITMAP indexes: it's the maximum number of bins that our current implementation will create, meaning that every value gets its own bin, so there are no false positives returned by the sstables.

          v4-bench-c50.txt is for cardinality == 50, meaning that some of the bins are multi-valued, leading to a larger number of false positives. Since our bin selection is still naive, it is generating lumpy buckets: you can see a long tail for KEYS_BITMAP reads, since one thread got stuck with higher cardinality bins.


          TODOs:

          • Integrate with 1415 now that it is in trunk
          • Improve bin selection to eliminate lumpy bins (as seen with c==50)
          • Replace Memtable brute-force filtering with an in-memory index attached to the Memtable (I didn't do any tests with memtables, but I'm assuming this should still be a priority)
          • Implement the "OrBiC" projection described in the paper to eliminate false positives
            • Multi-valued bins would have their values projected into another SSTable component
          Show
          Stu Hood added a comment - Attaching some benchmark numbers generated by CASSANDRA-1531 . v4-bench-c32.txt is for cardinality == 32, which is a bit of a magic number in the current implementation of KEYS_BITMAP indexes: it's the maximum number of bins that our current implementation will create, meaning that every value gets its own bin, so there are no false positives returned by the sstables. v4-bench-c50.txt is for cardinality == 50, meaning that some of the bins are multi-valued, leading to a larger number of false positives. Since our bin selection is still naive, it is generating lumpy buckets: you can see a long tail for KEYS_BITMAP reads, since one thread got stuck with higher cardinality bins. TODOs: Integrate with 1415 now that it is in trunk Improve bin selection to eliminate lumpy bins (as seen with c==50) Replace Memtable brute-force filtering with an in-memory index attached to the Memtable (I didn't do any tests with memtables, but I'm assuming this should still be a priority) Implement the "OrBiC" projection described in the paper to eliminate false positives Multi-valued bins would have their values projected into another SSTable component
          Stu Hood made changes -
          Attachment v4-bench-c32.txt [ 12455227 ]
          Attachment v4-bench-c50.txt [ 12455228 ]
          Stu Hood made changes -
          Attachment v4-bench-c50.txt [ 12455228 ]
          Stu Hood made changes -
          Attachment v4-bench-c32.txt [ 12455227 ]
          Hide
          Stu Hood added a comment -

          Reattaching benchmarks with a baseline no-index run for comparison.

          Show
          Stu Hood added a comment - Reattaching benchmarks with a baseline no-index run for comparison.
          Stu Hood made changes -
          Attachment v4-bench-c50.txt [ 12455229 ]
          Attachment v4-bench-c32.txt [ 12455230 ]
          Hide
          Jonathan Ellis added a comment -

          What's your benchmark setup? I would have expected KEYS index to have a bigger effect on writes which makes me wonder if something's off.

          Show
          Jonathan Ellis added a comment - What's your benchmark setup? I would have expected KEYS index to have a bigger effect on writes which makes me wonder if something's off.
          Hide
          Stu Hood added a comment -

          SSD, dual core @ 2.40GHz, 2GB ram

          Show
          Stu Hood added a comment - SSD, dual core @ 2.40GHz, 2GB ram
          Stu Hood made changes -
          Attachment v4-bench-c50.txt [ 12455229 ]
          Stu Hood made changes -
          Attachment v4-bench-c32.txt [ 12455230 ]
          Stu Hood made changes -
          Attachment 1472-v4.tgz [ 12455223 ]
          Hide
          Stu Hood added a comment -

          Rebased v4 for trunk, added a test for BitmapIndexWriter, and reran the C32 benchmark. Still planning to improve the bin selection this week.

          Show
          Stu Hood added a comment - Rebased v4 for trunk, added a test for BitmapIndexWriter, and reran the C32 benchmark. Still planning to improve the bin selection this week.
          Stu Hood made changes -
          Attachment 1472-v4.tgz [ 12455592 ]
          Attachment v4-bench-c32.txt [ 12455593 ]
          Hide
          Jonathan Ellis added a comment -

          patch 06 fails to apply for me

          Show
          Jonathan Ellis added a comment - patch 06 fails to apply for me
          Stu Hood made changes -
          Attachment 1472-v4.tgz [ 12455592 ]
          Hide
          Stu Hood added a comment -

          Rebased v4.

          Show
          Stu Hood added a comment - Rebased v4.
          Stu Hood made changes -
          Attachment 1472-v4.tgz [ 12456391 ]
          Hide
          Jonathan Ellis added a comment -

          why does avro make sense here?

          Show
          Jonathan Ellis added a comment - why does avro make sense here?
          Hide
          Stu Hood added a comment -

          > why does avro make sense here?
          Forwards compatibility, varint encoding, one-liner compression (until we get around to implementing something smarter), and in general, it has a very nice Java API. Finally, I'm probably going to push to get the Avro datafile format into our SSTables, so this was a bit of a warmup.

          Show
          Stu Hood added a comment - > why does avro make sense here? Forwards compatibility, varint encoding, one-liner compression (until we get around to implementing something smarter), and in general, it has a very nice Java API. Finally, I'm probably going to push to get the Avro datafile format into our SSTables, so this was a bit of a warmup.
          Hide
          Stu Hood added a comment -

          I think repair / sstable recovery is the last real blocker for getting this merged, so I can work on that this weekend if you think the rest is alright.

          Show
          Stu Hood added a comment - I think repair / sstable recovery is the last real blocker for getting this merged, so I can work on that this weekend if you think the rest is alright.
          Hide
          Jonathan Ellis added a comment -

          I feel really bad about this because I encouraged Stu to write the code but unless someone else has time to review this week we'll have to push this to 0.7.1.

          Show
          Jonathan Ellis added a comment - I feel really bad about this because I encouraged Stu to write the code but unless someone else has time to review this week we'll have to push this to 0.7.1.
          Jonathan Ellis made changes -
          Fix Version/s 0.7.1 [ 12315199 ]
          Fix Version/s 0.7.0 [ 12315212 ]
          Hide
          Stu Hood added a comment -

          Attaching v5, which continues migrating secondary index support into the db.secindex package in 0016 and 0017, and adds support for adding and removing KEYS_BITMAP indexes in 0014-0018.

          Adding a bitmap index is accomplished via a new SSTable.clone method (which makes a hardlinked copy of an SSTable with a new generation number) and improvements to SSTableWriter.Builder to allow for conditionally writing components.

          TODOs (repeated from previous comments) to be tackled post-merge:

          • Better bin selection
          • Replace memtable filtering with in-memory index
          • Implement OrBiC projection

          This is 100% ready for review now.

          Show
          Stu Hood added a comment - Attaching v5, which continues migrating secondary index support into the db.secindex package in 0016 and 0017 , and adds support for adding and removing KEYS_BITMAP indexes in 0014 - 0018 . Adding a bitmap index is accomplished via a new SSTable.clone method (which makes a hardlinked copy of an SSTable with a new generation number) and improvements to SSTableWriter.Builder to allow for conditionally writing components. TODOs (repeated from previous comments) to be tackled post-merge: Better bin selection Replace memtable filtering with in-memory index Implement OrBiC projection This is 100% ready for review now.
          Stu Hood made changes -
          Attachment 1472-v5.tgz [ 12457946 ]
          Hide
          Nick Bailey added a comment -

          Just briefly looking at this, one good thing to include would be some jmx metrics along with the bitmap index. One of the nice things of the cf implementation is those come for free. Probably would go in another ticket though.

          Show
          Nick Bailey added a comment - Just briefly looking at this, one good thing to include would be some jmx metrics along with the bitmap index. One of the nice things of the cf implementation is those come for free. Probably would go in another ticket though.
          Stu Hood made changes -
          Attachment 1472-v5.tgz [ 12457946 ]
          Hide
          Stu Hood added a comment -

          Rebased v5.

          Show
          Stu Hood added a comment - Rebased v5.
          Stu Hood made changes -
          Attachment 1472-v5.tgz [ 12458531 ]
          Hide
          Stu Hood added a comment -

          Another TODO item for a future ticket is to push down evaluation of multiple index clauses to the bitmap indexes where possible.

          Show
          Stu Hood added a comment - Another TODO item for a future ticket is to push down evaluation of multiple index clauses to the bitmap indexes where possible.
          Hide
          Stu Hood added a comment -

          Waiting to hear back from Dragos, but it's possible that I revived a variant of CASSANDRA-1623 with this rendition.

          Show
          Stu Hood added a comment - Waiting to hear back from Dragos, but it's possible that I revived a variant of CASSANDRA-1623 with this rendition.
          Hide
          Stu Hood added a comment -

          Rebased 1472-v5 without changes... waiting until this hits trunk to make any other changes.

          Show
          Stu Hood added a comment - Rebased 1472-v5 without changes... waiting until this hits trunk to make any other changes.
          Stu Hood made changes -
          Attachment 1472-v5.tgz [ 12465328 ]
          Stu Hood made changes -
          Attachment 1472-v5.tgz [ 12458531 ]
          Jonathan Ellis made changes -
          Assignee Stu Hood [ stuhood ]
          Reviewer tjake
          T Jake Luciani made changes -
          Link This issue is blocked by CASSANDRA-1337 [ CASSANDRA-1337 ]
          T Jake Luciani made changes -
          Link This issue is blocked by CASSANDRA-1337 [ CASSANDRA-1337 ]
          T Jake Luciani made changes -
          Link This issue blocks CASSANDRA-1337 [ CASSANDRA-1337 ]
          T Jake Luciani made changes -
          Link This issue blocks CASSANDRA-1339 [ CASSANDRA-1339 ]
          Hide
          Stu Hood added a comment -

          Working on rebasing... should post something this weekend at the latest.

          Show
          Stu Hood added a comment - Working on rebasing... should post something this weekend at the latest.
          Stu Hood made changes -
          Attachment 1472-v5.tgz [ 12465328 ]
          Hide
          Stu Hood added a comment -

          Rebased 1472-v5 with no serious changes.

          Rather than adding more features to this particular patch, I punted on online index addition/removal: the current implementation logs warnings that the index will not be available until after restart, at which point the previously described db.secindex.KeysBitmapIndex.preload method synchronously performs the indexing.

          Show
          Stu Hood added a comment - Rebased 1472-v5 with no serious changes. Rather than adding more features to this particular patch, I punted on online index addition/removal: the current implementation logs warnings that the index will not be available until after restart, at which point the previously described db.secindex.KeysBitmapIndex.preload method synchronously performs the indexing.
          Stu Hood made changes -
          Attachment 1472-v5.tgz [ 12467525 ]
          Hide
          T Jake Luciani added a comment -

          I can't seem to figure out how to apply this patchset? git apply just throws errors. this is against cassandra-0.7 branch correct?

          I untarred the dir and ran "git apply 1472/*"

          Show
          T Jake Luciani added a comment - I can't seem to figure out how to apply this patchset? git apply just throws errors. this is against cassandra-0.7 branch correct? I untarred the dir and ran "git apply 1472/*"
          Hide
          Jonathan Ellis added a comment -

          you want "git am," apply is just for a single patch

          Show
          Jonathan Ellis added a comment - you want "git am," apply is just for a single patch
          Hide
          T Jake Luciani added a comment - - edited

          Now i get this:

          error: java/org/apache/cassandra/db/CompactionManager.java: does not exist in index
          error: java/org/apache/cassandra/io/AbstractCompactedRow.java: does not exist in index
          error: java/org/apache/cassandra/io/CompactionIterator.java: does not exist in index
          error: java/org/apache/cassandra/io/LazilyCompactedRow.java: does not exist in index
          error: java/org/apache/cassandra/io/PrecompactedRow.java: does not exist in index
          error: java/org/apache/cassandra/io/sstable/SSTableIdentityIterator.java: does not exist in index
          error: java/org/apache/cassandra/io/sstable/SSTableWriter.java: does not exist in index

          The 0001 patch header looks like this:

          .../org/apache/cassandra/db/CompactionManager.java | 23 +++-
          .../apache/cassandra/io/AbstractCompactedRow.java | 6 +-
          .../org/apache/cassandra/io/ColumnObserver.java | 124 ++++++++++++++++++++
          .../apache/cassandra/io/CompactionIterator.java | 15 ++-
          .../apache/cassandra/io/LazilyCompactedRow.java | 9 +-
          .../org/apache/cassandra/io/PrecompactedRow.java | 23 +++-
          .../io/sstable/SSTableIdentityIterator.java | 13 ++-
          .../apache/cassandra/io/sstable/SSTableWriter.java | 22 ++++-
          .../cassandra/io/LazilyCompactedRowTest.java | 19 +++-

          Show
          T Jake Luciani added a comment - - edited Now i get this: error: java/org/apache/cassandra/db/CompactionManager.java: does not exist in index error: java/org/apache/cassandra/io/AbstractCompactedRow.java: does not exist in index error: java/org/apache/cassandra/io/CompactionIterator.java: does not exist in index error: java/org/apache/cassandra/io/LazilyCompactedRow.java: does not exist in index error: java/org/apache/cassandra/io/PrecompactedRow.java: does not exist in index error: java/org/apache/cassandra/io/sstable/SSTableIdentityIterator.java: does not exist in index error: java/org/apache/cassandra/io/sstable/SSTableWriter.java: does not exist in index The 0001 patch header looks like this: .../org/apache/cassandra/db/CompactionManager.java | 23 +++- .../apache/cassandra/io/AbstractCompactedRow.java | 6 +- .../org/apache/cassandra/io/ColumnObserver.java | 124 ++++++++++++++++++++ .../apache/cassandra/io/CompactionIterator.java | 15 ++- .../apache/cassandra/io/LazilyCompactedRow.java | 9 +- .../org/apache/cassandra/io/PrecompactedRow.java | 23 +++- .../io/sstable/SSTableIdentityIterator.java | 13 ++- .../apache/cassandra/io/sstable/SSTableWriter.java | 22 ++++- .../cassandra/io/LazilyCompactedRowTest.java | 19 +++-
          Hide
          Jonathan Ellis added a comment -

          looks like the patch was generated from some weird non-root directory, you probably need some combination of -p or --directory (as in git-apply)

          Show
          Jonathan Ellis added a comment - looks like the patch was generated from some weird non-root directory, you probably need some combination of -p or --directory (as in git-apply)
          Hide
          Stu Hood added a comment -

          Attaching a version of 1472-v5 rebased for the 0.7 branch.

          Show
          Stu Hood added a comment - Attaching a version of 1472-v5 rebased for the 0.7 branch.
          Stu Hood made changes -
          Attachment 0.7-1472-v5.tgz [ 12467616 ]
          Hide
          T Jake Luciani added a comment -

          This breaks the secondaryCleanupTest

          Testcase: testSecondaryCleanup(org.apache.cassandra.db.SecondaryIndexCleanupTest): Caused an ERROR
          [junit] org.apache.cassandra.db.ColumnFamilyStore.buildSecondaryIndexes(Ljava/util/Collection;Ljava/util/SortedSet;)V
          [junit] java.lang.NoSuchMethodError: org.apache.cassandra.db.ColumnFamilyStore.buildSecondaryIndexes(Ljava/util/Collection;Ljava/util/SortedSet;)V
          [junit] at org.apache.cassandra.db.SecondaryIndexCleanupTest.fillCF(SecondaryIndexCleanupTest.java:142)
          [junit] at org.apache.cassandra.db.SecondaryIndexCleanupTest.testSecondaryCleanup(SecondaryIndexCleanupTest.java:91)

          Show
          T Jake Luciani added a comment - This breaks the secondaryCleanupTest Testcase: testSecondaryCleanup(org.apache.cassandra.db.SecondaryIndexCleanupTest): Caused an ERROR [junit] org.apache.cassandra.db.ColumnFamilyStore.buildSecondaryIndexes(Ljava/util/Collection;Ljava/util/SortedSet;)V [junit] java.lang.NoSuchMethodError: org.apache.cassandra.db.ColumnFamilyStore.buildSecondaryIndexes(Ljava/util/Collection;Ljava/util/SortedSet;)V [junit] at org.apache.cassandra.db.SecondaryIndexCleanupTest.fillCF(SecondaryIndexCleanupTest.java:142) [junit] at org.apache.cassandra.db.SecondaryIndexCleanupTest.testSecondaryCleanup(SecondaryIndexCleanupTest.java:91)
          Hide
          T Jake Luciani added a comment -

          [junit] WARN 09:29:31,942 Online addition of a KEYS_BITMAP index is not yet supported! Please restart this node to initialize the index.

          On the fly indexes is something you should fix in this ticket.

          Show
          T Jake Luciani added a comment - [junit] WARN 09:29:31,942 Online addition of a KEYS_BITMAP index is not yet supported! Please restart this node to initialize the index. On the fly indexes is something you should fix in this ticket.
          Hide
          T Jake Luciani added a comment -

          Overall, this is a really great refactoring of the secondary index stuff!

          After reviewing CASSANDRA-1916 I was disappointed with the secondary index code, this makes me much happier!

          Notes on the work so far:

          • I renamed KEYS_BITMAP to just BITMAP, fixed some spots that could leak files, and fixed a compaction bug related to 1916 with testcase.
          • There are some changes in here that seem to be bug fixes for other issues, specifically the changes to CFMetaData.java, this should probably go into it's own ticket?
          • I see in SSTableWriter that BMT will fail on secondary indexed CFs now. Why fail though? Can't they just be built on restart?
          • The whole BitmapIndexWriter Scratch space has me slightly concerned. I think writing lots of segment files to disk then merging them to one seems, well like more compaction. I guess it's the downside of not using SSTables like keys does.
          • AVRO, I don't see the value here. The structures are so simple any you bypass it for bitset's anyway. The value of using our BRAF is you have all the work to avoid polluting the page cache done so this code doesn't affect live reads. As it stands this is going to go against all that work. I think you should serialize this natively.
          • I mentioned above that on the fly indexes should be allowed, however this can happen in a subsequent ticket if you prefer.
          • As Nick mentioned it would be nice to have some stats on the index available in JMX, for a subsequent ticket.
          • I think this implementation should probably be the only secondary index format we support (What's the value of keeping KEYS over this?) We can remove KEYS after.

          But this is really nicely thought out, good stuff!

          Show
          T Jake Luciani added a comment - Overall, this is a really great refactoring of the secondary index stuff! After reviewing CASSANDRA-1916 I was disappointed with the secondary index code, this makes me much happier! Notes on the work so far: I renamed KEYS_BITMAP to just BITMAP, fixed some spots that could leak files, and fixed a compaction bug related to 1916 with testcase. There are some changes in here that seem to be bug fixes for other issues, specifically the changes to CFMetaData.java, this should probably go into it's own ticket? I see in SSTableWriter that BMT will fail on secondary indexed CFs now. Why fail though? Can't they just be built on restart? The whole BitmapIndexWriter Scratch space has me slightly concerned. I think writing lots of segment files to disk then merging them to one seems, well like more compaction. I guess it's the downside of not using SSTables like keys does. AVRO, I don't see the value here. The structures are so simple any you bypass it for bitset's anyway. The value of using our BRAF is you have all the work to avoid polluting the page cache done so this code doesn't affect live reads. As it stands this is going to go against all that work. I think you should serialize this natively. I mentioned above that on the fly indexes should be allowed, however this can happen in a subsequent ticket if you prefer. As Nick mentioned it would be nice to have some stats on the index available in JMX, for a subsequent ticket. I think this implementation should probably be the only secondary index format we support (What's the value of keeping KEYS over this?) We can remove KEYS after. But this is really nicely thought out, good stuff!
          T Jake Luciani made changes -
          Attachment 0019-Rename-bugfixes-and-fileclose.txt [ 12467919 ]
          Hide
          Stu Hood added a comment -

          > I renamed KEYS_BITMAP to just BITMAP, fixed some spots that could leak files, and fixed a compaction bug related to 1916 with testcase.
          I incorporated your changes into the latest tarball as 0018, and fixed some silliness in 0019 and 0020.

          > There are some changes in here that seem to be bug fixes for other issues, specifically the changes to CFMetaData.java
          Dropped from this patch, and added on CASSANDRA-1962

          > I see in SSTableWriter that BMT will fail on secondary indexed CFs now. Why fail though? Can't they just be built on restart?
          Yes, probably: but the naive approach is not very elegant, since when we see the first BMT append, we'll already have the secondary indexes open, so we need to null them out. A better approach would need to indicate to the SSTW constructor/factory that we were intending to write without certain component types... I think this can go in another ticket?

          > The whole BitmapIndexWriter Scratch space has me slightly concerned.
          There is an alternative to the layout I've implemented here, but it is slower for the most common query type (equality on one bucket), and only slightly faster for extremely general index queries (LT/GT involving most/all of the buckets). We can measure the actual overhead on a single sstable if you'd like.

          > AVRO, I don't see the value here. [...] The value of using our BRAF is you have all the work to avoid polluting the page cache
          I could go either way on this point: on one hand, this is an extremely simple structure. On the other hand, we get large benefits from compression here, and I'm fairly certain we should use Avro for the rest of the sstable.

          Also, it's very simple to use our FileDataInput implementations here via Avro's SeekableInput interface, so we don't necessarily need to throw away any effort. See https://github.com/stuhood/cassandra/commit/1a5c9115cb1410519eff15dd3089772b1e550ae7

          > I mentioned above that on the fly indexes should be allowed, however this can happen in a subsequent ticket if you prefer.
          Yes, I'd prefer that. It will likely be the highest priority of the 4-5 tickets we need to create if/when this issue goes in.

          > As Nick mentioned it would be nice to have some stats on the index available in JMX, for a subsequent ticket.
          Agreed.

          > I think this implementation should probably be the only secondary index format we support (What's the value of keeping KEYS over this?)
          Agreed, pending the optimizations mentioned in previous comments.

          Show
          Stu Hood added a comment - > I renamed KEYS_BITMAP to just BITMAP, fixed some spots that could leak files, and fixed a compaction bug related to 1916 with testcase. I incorporated your changes into the latest tarball as 0018, and fixed some silliness in 0019 and 0020. > There are some changes in here that seem to be bug fixes for other issues, specifically the changes to CFMetaData.java Dropped from this patch, and added on CASSANDRA-1962 > I see in SSTableWriter that BMT will fail on secondary indexed CFs now. Why fail though? Can't they just be built on restart? Yes, probably: but the naive approach is not very elegant, since when we see the first BMT append, we'll already have the secondary indexes open, so we need to null them out. A better approach would need to indicate to the SSTW constructor/factory that we were intending to write without certain component types... I think this can go in another ticket? > The whole BitmapIndexWriter Scratch space has me slightly concerned. There is an alternative to the layout I've implemented here, but it is slower for the most common query type (equality on one bucket), and only slightly faster for extremely general index queries (LT/GT involving most/all of the buckets). We can measure the actual overhead on a single sstable if you'd like. > AVRO, I don't see the value here. [...] The value of using our BRAF is you have all the work to avoid polluting the page cache I could go either way on this point: on one hand, this is an extremely simple structure. On the other hand, we get large benefits from compression here, and I'm fairly certain we should use Avro for the rest of the sstable. Also, it's very simple to use our FileDataInput implementations here via Avro's SeekableInput interface, so we don't necessarily need to throw away any effort. See https://github.com/stuhood/cassandra/commit/1a5c9115cb1410519eff15dd3089772b1e550ae7 > I mentioned above that on the fly indexes should be allowed, however this can happen in a subsequent ticket if you prefer. Yes, I'd prefer that. It will likely be the highest priority of the 4-5 tickets we need to create if/when this issue goes in. > As Nick mentioned it would be nice to have some stats on the index available in JMX, for a subsequent ticket. Agreed. > I think this implementation should probably be the only secondary index format we support (What's the value of keeping KEYS over this?) Agreed, pending the optimizations mentioned in previous comments.
          Stu Hood made changes -
          Attachment 0.7-1472-v6.tgz [ 12467994 ]
          Hide
          Stu Hood added a comment -

          Summary of tickets to open post commit:

          • Online indexing
          • Use sampling to improve bin selection and eliminate lumpy bins
          • Consider/benchmark replacing memtable brute-force-filtering with an in-memory secondary index
          • Implement OrBiC projection (note that this also benefits from compression and a block based format as provided by Avro)
          • Expose index metrics via JMX
          • Handle BMT gracefully
          Show
          Stu Hood added a comment - Summary of tickets to open post commit: Online indexing Use sampling to improve bin selection and eliminate lumpy bins Consider/benchmark replacing memtable brute-force-filtering with an in-memory secondary index Implement OrBiC projection (note that this also benefits from compression and a block based format as provided by Avro) Expose index metrics via JMX Handle BMT gracefully
          Hide
          Stu Hood added a comment - - edited

          >> I think this implementation should probably be the only secondary index format we support (What's the value of keeping KEYS over this?)
          > Agreed, pending the optimizations mentioned in previous comments.
          I need to retract this statement: I don't think we should remove the KEYS index, for a few reasons:

          1. The false positives generated by these bitmap indexes have worst case behaviour with rapidly changing data: if a row matches the index in any sstable, we have to read all sstables that contain the row in order to resolve the data and determine whether it actually matches. On the other hand, they will be damn near optimal when you are doing a smaller fraction of updates
          2. The reason the implementation is called KEYS is that it is a traditional index, containing only a pointer to the base data (a key). They are only 1 step away from being fully materialized views, which eliminates a seek at the cost of disk usage
          3. KEYS indexes will gain very nice benefits from the compression introduced by CASSANDRA-674 (since they contain value-sorted data)
          Show
          Stu Hood added a comment - - edited >> I think this implementation should probably be the only secondary index format we support (What's the value of keeping KEYS over this?) > Agreed, pending the optimizations mentioned in previous comments. I need to retract this statement: I don't think we should remove the KEYS index, for a few reasons: The false positives generated by these bitmap indexes have worst case behaviour with rapidly changing data: if a row matches the index in any sstable , we have to read all sstables that contain the row in order to resolve the data and determine whether it actually matches. On the other hand, they will be damn near optimal when you are doing a smaller fraction of updates The reason the implementation is called KEYS is that it is a traditional index, containing only a pointer to the base data (a key). They are only 1 step away from being fully materialized views, which eliminates a seek at the cost of disk usage KEYS indexes will gain very nice benefits from the compression introduced by CASSANDRA-674 (since they contain value-sorted data)
          Hide
          T Jake Luciani added a comment -

          Regarding AVRO again, I think we Cassandra developers need to decide in general if we are sticking with Avro or not.

          This relates to this issue, CASSANDRA-1015 and CASSANDRA-674

          Show
          T Jake Luciani added a comment - Regarding AVRO again, I think we Cassandra developers need to decide in general if we are sticking with Avro or not. This relates to this issue, CASSANDRA-1015 and CASSANDRA-674
          Hide
          Gary Dusbabek added a comment -

          In the context of CASSANDRA-1015, I think using avro is not the right choice. For interoperability between nodes, we will need to have the ability to interpret the bytes as they are being serialized/deserialized. Switching over to avro/thrift/msgpack removes that choice.

          I haven't reviewed the patches on this ticket, but I think we should avoid handing off serialization with respect to our on-disk storage. Once they become binary structures, the construction of which we don't care about because a library is doing it for us, is the moment we lose control of a very fundamental thing--knowing exactly what we're reading and writing.

          Show
          Gary Dusbabek added a comment - In the context of CASSANDRA-1015 , I think using avro is not the right choice. For interoperability between nodes, we will need to have the ability to interpret the bytes as they are being serialized/deserialized. Switching over to avro/thrift/msgpack removes that choice. I haven't reviewed the patches on this ticket, but I think we should avoid handing off serialization with respect to our on-disk storage. Once they become binary structures, the construction of which we don't care about because a library is doing it for us, is the moment we lose control of a very fundamental thing--knowing exactly what we're reading and writing.
          Hide
          Stu Hood added a comment -

          > Switching over to avro/thrift/msgpack removes that choice. [...] knowing exactly what we're reading and writing.
          Except that Avro has a specification. We can re-implement a file format identical to the specification (been there, done that for 674), but since there is an almost-standard for immutable data, I'd prefer not to invent that wheel.

          Show
          Stu Hood added a comment - > Switching over to avro/thrift/msgpack removes that choice. [...] knowing exactly what we're reading and writing. Except that Avro has a specification. We can re-implement a file format identical to the specification (been there, done that for 674), but since there is an almost-standard for immutable data, I'd prefer not to invent that wheel.
          Hide
          Gary Dusbabek added a comment -

          We can re-implement a file format identical to the specification...

          That's not so different than rolling our own serialization. At that point, we've gained nothing but interop (how much of a goal is that? If it is, it needs to be stated) by using avro structures and serialization.

          Show
          Gary Dusbabek added a comment - We can re-implement a file format identical to the specification... That's not so different than rolling our own serialization. At that point, we've gained nothing but interop (how much of a goal is that? If it is, it needs to be stated) by using avro structures and serialization.
          Hide
          Stu Hood added a comment -

          Gary, Jake: would it be reasonable to commit this patch with Avro's file format, and then to discuss replacing the file format in a separate issue, to be completed before 0.7.1? Essentially, writing our own file format is an optimization that we can do later, and I'd prefer to treat it as such.

          This question also preemptively applies to CASSANDRA-674: we gain a lot of agility by developing on top of a framework, and then removing the framework.

          Show
          Stu Hood added a comment - Gary, Jake: would it be reasonable to commit this patch with Avro's file format, and then to discuss replacing the file format in a separate issue, to be completed before 0.7.1? Essentially, writing our own file format is an optimization that we can do later, and I'd prefer to treat it as such. This question also preemptively applies to CASSANDRA-674 : we gain a lot of agility by developing on top of a framework, and then removing the framework.
          Hide
          T Jake Luciani added a comment -

          I'd be more inclined todo this if it was trunk, but I don't want this to go into 0.7 branch with all these unfinished tasks. If we did, this would end up a blocker.

          I'm happy to help change the serializers and complete the other major task (online indexing) starting next week.

          What do you think?

          Show
          T Jake Luciani added a comment - I'd be more inclined todo this if it was trunk, but I don't want this to go into 0.7 branch with all these unfinished tasks. If we did, this would end up a blocker. I'm happy to help change the serializers and complete the other major task (online indexing) starting next week. What do you think?
          Hide
          Stu Hood added a comment -

          > What do you think?
          I think that implementing a compressible block based file format is a non-trivial task, and that before we commit to re-implementing Avro's (in a bounded timeframe especially), we should review our requirements. This decision needs to be made for technical reasons and not grounded in NIH.

          After reviewing the Avro spec again, and having written a very similar file format for #674, there is nothing I would change about the format:

          • Has a header to store file format version, compression information and any other arbitrary data (including Avro's schema)
          • Is blocked based, with framing around the blocks for fast skipping, and with synchronization points for recovering a corrupt file
          • Implements object reuse: iterating over a file requires a single record object, which is re-filled with data in the file

          In the interest of full disclosure, Avro is lacking one serialization feature I would like to see (AVRO-679), but there is a fair chance it will be implemented in a future version, and until then we can trivially implement it above Avro.

          http://avro.apache.org/docs/1.4.1/spec.html#Object+Container+Files

          Show
          Stu Hood added a comment - > What do you think? I think that implementing a compressible block based file format is a non-trivial task, and that before we commit to re-implementing Avro's (in a bounded timeframe especially), we should review our requirements. This decision needs to be made for technical reasons and not grounded in NIH. After reviewing the Avro spec again, and having written a very similar file format for #674, there is nothing I would change about the format: Has a header to store file format version, compression information and any other arbitrary data (including Avro's schema) Is blocked based, with framing around the blocks for fast skipping, and with synchronization points for recovering a corrupt file Implements object reuse: iterating over a file requires a single record object, which is re-filled with data in the file In the interest of full disclosure, Avro is lacking one serialization feature I would like to see ( AVRO-679 ), but there is a fair chance it will be implemented in a future version, and until then we can trivially implement it above Avro. http://avro.apache.org/docs/1.4.1/spec.html#Object+Container+Files
          T Jake Luciani made changes -
          Hide
          Gary Dusbabek added a comment - - edited

          would it be reasonable to commit this patch with Avro's file format, and then to discuss replacing the file format in a separate issue

          Not a good idea, imo.

          writing our own file format is an optimization

          I don't think it is an optimization at all. It's a way of protecting us from ourselves. The only way to do that and still use avro is to create our own readers according to the spec, no?

          This decision needs to be made for technical reasons and not grounded in NIH.

          The reasons are technical. The fact is that the way we use avro is constantly breaking on us (I'm referring to migrations). Some of this is probably self-inflicted. We don't have much experience yet with what happens when records change slightly or radically over time and how avro copes with that. The experience we do have leads me to believe that unless we are extremely cautious, we'll end up in a hole. See CASSANDRA-2001 for an example. Two seemly innocuous changes have both managed to break migration deserialization.

          Show
          Gary Dusbabek added a comment - - edited would it be reasonable to commit this patch with Avro's file format, and then to discuss replacing the file format in a separate issue Not a good idea, imo. writing our own file format is an optimization I don't think it is an optimization at all. It's a way of protecting us from ourselves. The only way to do that and still use avro is to create our own readers according to the spec, no? This decision needs to be made for technical reasons and not grounded in NIH. The reasons are technical. The fact is that the way we use avro is constantly breaking on us (I'm referring to migrations). Some of this is probably self-inflicted. We don't have much experience yet with what happens when records change slightly or radically over time and how avro copes with that. The experience we do have leads me to believe that unless we are extremely cautious, we'll end up in a hole. See CASSANDRA-2001 for an example. Two seemly innocuous changes have both managed to break migration deserialization.
          Hide
          Stu Hood added a comment -

          So. I don't want this one to get stale, but 674 is a higher priority for me this week. tjake: were you planning to implement a file format for use here?

          Show
          Stu Hood added a comment - So. I don't want this one to get stale, but 674 is a higher priority for me this week. tjake: were you planning to implement a file format for use here?
          Hide
          T Jake Luciani added a comment -

          Yeah, I'll take care of the first pass.

          Show
          T Jake Luciani added a comment - Yeah, I'll take care of the first pass.
          Jonathan Ellis made changes -
          Fix Version/s 0.7.1 [ 12315199 ]
          Jonathan Ellis made changes -
          Fix Version/s 0.7.2 [ 12316100 ]
          T Jake Luciani made changes -
          Link This issue blocks CASSANDRA-1337 [ CASSANDRA-1337 ]
          Brandon Williams made changes -
          Fix Version/s 0.7.3 [ 12316182 ]
          Fix Version/s 0.7.2 [ 12316100 ]
          Jonathan Ellis made changes -
          Fix Version/s 0.7.4 [ 12316241 ]
          Fix Version/s 0.7.3 [ 12316182 ]
          Hide
          Jeremy Hanna added a comment -

          Any update on this? Seems like this really opens up possibilities for secondary indexes.

          Show
          Jeremy Hanna added a comment - Any update on this? Seems like this really opens up possibilities for secondary indexes.
          Hide
          T Jake Luciani added a comment -

          I'm supposed to be working on a reusable file format as an alternative of avro. I havent done this yet but I do have a plan that I can get feedback on.

          Lucene has a nice easy DataInput DataOutput classes with Vint, Vlong, and Checksums. We could start with this add the Varray stuff Stu had showed me previously. We did something similar with the OpenBitSet (adopting the source into our codebase).

          http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/store/

          As for compression I was thinking we should look at adopting Tatu's nice LZF implementation that is also ASL.

          https://github.com/ning/compress

          We would use these classes as a foundation for a DataWriter and DataReader class...

          What am I missing?

          Show
          T Jake Luciani added a comment - I'm supposed to be working on a reusable file format as an alternative of avro. I havent done this yet but I do have a plan that I can get feedback on. Lucene has a nice easy DataInput DataOutput classes with Vint, Vlong, and Checksums. We could start with this add the Varray stuff Stu had showed me previously. We did something similar with the OpenBitSet (adopting the source into our codebase). http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/store/ As for compression I was thinking we should look at adopting Tatu's nice LZF implementation that is also ASL. https://github.com/ning/compress We would use these classes as a foundation for a DataWriter and DataReader class... What am I missing?
          Hide
          Stu Hood added a comment -

          Sounds reasonable, but there is more to think about for the format. And to reiterate: I really don't think we should be going down this path.

          > What am I missing?
          You'll need to put some thought into how blocks will be recorded in the file: each block would need the equivalent of opening a new (Lucene) ChecksumIndexInput, because you need to be able to checksum at a smaller granularity than "the entire stream".

          Also, the Lucene classes don't give any thought to compression, so you need to consider where in the stack you will add it. The compressed data should likely be checksummed, rather than the raw data, so I think you'll need to port the LZF code to fit in the stack between the caller and the checksumming.

          Show
          Stu Hood added a comment - Sounds reasonable, but there is more to think about for the format. And to reiterate: I really don't think we should be going down this path. > What am I missing? You'll need to put some thought into how blocks will be recorded in the file: each block would need the equivalent of opening a new (Lucene) ChecksumIndexInput, because you need to be able to checksum at a smaller granularity than "the entire stream". Also, the Lucene classes don't give any thought to compression, so you need to consider where in the stack you will add it. The compressed data should likely be checksummed, rather than the raw data, so I think you'll need to port the LZF code to fit in the stack between the caller and the checksumming.
          Hide
          Jonathan Ellis added a comment -

          Does it really need to be this complicated? It sounds like we're starting down a boil-the-oceans path. I'd rather do something simple that works and then optimize as necessary.

          Show
          Jonathan Ellis added a comment - Does it really need to be this complicated? It sounds like we're starting down a boil-the-oceans path. I'd rather do something simple that works and then optimize as necessary.
          Hide
          Stu Hood added a comment -

          Both the buckets and the OrBiC projection need compression. Checksumming is arguably not necessary for this ticket, but will be for 674, so a design that doesn't allow for it will give us a serious headache.

          Show
          Stu Hood added a comment - Both the buckets and the OrBiC projection need compression. Checksumming is arguably not necessary for this ticket, but will be for 674, so a design that doesn't allow for it will give us a serious headache.
          Hide
          T Jake Luciani added a comment -

          I'm definitely not wanting to boil the ocean here. We could take the easy approach but at the cost of supporting something we know will soon change.

          Perhaps this should all go into another set of tickets that we make blockers to this issue.

          Overall I am thinking about our current file format issues and what we know we need: checksums + compression. The lucene and compression code I mentioned is quite straight forward so I think it would be pretty easy to get something working here, and it would serve as a basis for our new sstable format whatever that ends up being or simply adding these features to our current format.

          The main blocker of this ticket is the avro dependency so just trying to get enough functionality to support the most basics of a disk format here.

          Show
          T Jake Luciani added a comment - I'm definitely not wanting to boil the ocean here. We could take the easy approach but at the cost of supporting something we know will soon change. Perhaps this should all go into another set of tickets that we make blockers to this issue. Overall I am thinking about our current file format issues and what we know we need: checksums + compression. The lucene and compression code I mentioned is quite straight forward so I think it would be pretty easy to get something working here, and it would serve as a basis for our new sstable format whatever that ends up being or simply adding these features to our current format. The main blocker of this ticket is the avro dependency so just trying to get enough functionality to support the most basics of a disk format here.
          Hide
          Stu Hood added a comment - - edited

          tjake: Yea: opening a separate ticket to discuss a generic file format makes sense.


          I had a realization about this implementation of secondary indexes: I was originally thinking we'd be able to push all boolean queries down to the indexes on a per-sstable basis, but this is unfortunately not the case. We will not be able to push 'AND' on separate indexes down to the sstables themselves: we'd need to join the index from all sstables, since a row might contain one clause in one sstable, and another clause in another sstable.

          EDIT: This is roughly equivalent to what we'd need to do with a KEYS index (seek-wise), meaning that the advantage is mostly in space utilization and lack of locks.
          EDIT2: So, there is a way to execute AND queries directly per SSTable, but it involves some uncertainty. For a particular row, if a value involved in a multi-clause-query is NULL in a particular SSTable, then you have to accept the row as a possible match, and resolve the uncertainty later. I'm sure there is a way to incoporate the CASSANDRA-2498 timestamp resolution as well, although it doesn't occur to me at the moment.

          Show
          Stu Hood added a comment - - edited tjake: Yea: opening a separate ticket to discuss a generic file format makes sense. I had a realization about this implementation of secondary indexes: I was originally thinking we'd be able to push all boolean queries down to the indexes on a per-sstable basis, but this is unfortunately not the case. We will not be able to push 'AND' on separate indexes down to the sstables themselves: we'd need to join the index from all sstables, since a row might contain one clause in one sstable, and another clause in another sstable. EDIT: This is roughly equivalent to what we'd need to do with a KEYS index (seek-wise), meaning that the advantage is mostly in space utilization and lack of locks. EDIT2: So, there is a way to execute AND queries directly per SSTable, but it involves some uncertainty. For a particular row, if a value involved in a multi-clause-query is NULL in a particular SSTable, then you have to accept the row as a possible match, and resolve the uncertainty later. I'm sure there is a way to incoporate the CASSANDRA-2498 timestamp resolution as well, although it doesn't occur to me at the moment.
          Hide
          T Jake Luciani added a comment -

          This might be off topic a bit but was wondering if you had thought about using Lucene indexes Stu?

          1. You would get the lucene search api
          2. You could keep a single index per CF and let lucene manage the segment file management.
          3. index.optimize() could be run along size compaction
          4. while a SSTable was in the memtable form we could rely on a RamDirectory then merge with the on disk index after SSTable flush.

          The tricky part would be turing the validators into Analyzers I think...

          Show
          T Jake Luciani added a comment - This might be off topic a bit but was wondering if you had thought about using Lucene indexes Stu? 1. You would get the lucene search api 2. You could keep a single index per CF and let lucene manage the segment file management. 3. index.optimize() could be run along size compaction 4. while a SSTable was in the memtable form we could rely on a RamDirectory then merge with the on disk index after SSTable flush. The tricky part would be turing the validators into Analyzers I think...
          Jonathan Ellis made changes -
          Fix Version/s 0.8 [ 12314820 ]
          Fix Version/s 0.7.4 [ 12316241 ]
          Jonathan Ellis made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.0 [ 12316349 ]
          Fix Version/s 0.8 [ 12314820 ]
          Hide
          Feng added a comment -

          Stu,

          Since the bitmap index isn't a good fit for rapidly changing data, is there any plan for implementing a b-tree index? Currently the lack of range scan for secondary indexes is only thing preventing me from using this otherwise fantastic product. Thanks.

          Show
          Feng added a comment - Stu, Since the bitmap index isn't a good fit for rapidly changing data, is there any plan for implementing a b-tree index? Currently the lack of range scan for secondary indexes is only thing preventing me from using this otherwise fantastic product. Thanks.
          Hide
          donal zang added a comment -

          I just have some test on the hash way secondary indexes, and found it really slow both for insertion and query, and there's no count APIs.
          So I really hope there will be some more efficient way to do the secondary index.
          Though people are using CF way to do secondary index by there own (http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html), but this loose the map-reduce advantage, and users may have to do lots of things by themselves.

          Show
          donal zang added a comment - I just have some test on the hash way secondary indexes, and found it really slow both for insertion and query, and there's no count APIs. So I really hope there will be some more efficient way to do the secondary index. Though people are using CF way to do secondary index by there own ( http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html ), but this loose the map-reduce advantage, and users may have to do lots of things by themselves.
          Hide
          Jonathan Ellis added a comment -

          Donal is referring to his mailing list thread where I point out that read-before-update is required for ANY approach to indexes. There is no way around that, and there is no way around it killing your performance when your data set exceeds what you can cache. (If anything the current, CF-based indexes are a best-case here because at least the bloom filter will avoid i/o on inserts, as opposed to updates.)

          Switching to SSDs is the only magic wand you'll find.

          Show
          Jonathan Ellis added a comment - Donal is referring to his mailing list thread where I point out that read-before-update is required for ANY approach to indexes. There is no way around that, and there is no way around it killing your performance when your data set exceeds what you can cache. (If anything the current, CF-based indexes are a best-case here because at least the bloom filter will avoid i/o on inserts, as opposed to updates.) Switching to SSDs is the only magic wand you'll find.
          Hide
          Stu Hood added a comment -

          There is no way around that, and there is no way around it killing your performance when your data set exceeds what you can cache.

          This index implementation does not require read before write.

          Show
          Stu Hood added a comment - There is no way around that, and there is no way around it killing your performance when your data set exceeds what you can cache. This index implementation does not require read before write.
          Hide
          Jonathan Ellis added a comment -

          Then how does it differentiate between "new row X has indexed value Y" and "old row X has indexed value Z and no longer has indexed value Y?"

          Show
          Jonathan Ellis added a comment - Then how does it differentiate between "new row X has indexed value Y" and "old row X has indexed value Z and no longer has indexed value Y?"
          Hide
          Stu Hood added a comment - - edited

          By returning the row as a probable match, and resolving using the base data (the brute force filtering): this is why it reads are less performant for high update ratios.

          With the timestamp work from CASSANDRA-2498, we'd be able to push slightly more down to this index (especially if we added timestamps to it). If we matched a particular sstable with the index, and the match has a higher timestamp than any other file, then we can return immediately with 2498.

          Show
          Stu Hood added a comment - - edited By returning the row as a probable match, and resolving using the base data (the brute force filtering): this is why it reads are less performant for high update ratios. With the timestamp work from CASSANDRA-2498 , we'd be able to push slightly more down to this index (especially if we added timestamps to it). If we matched a particular sstable with the index, and the match has a higher timestamp than any other file, then we can return immediately with 2498.
          Jonathan Ellis made changes -
          Assignee Stu Hood [ stuhood ]
          Stu Hood made changes -
          Link This issue is related to CASSANDRA-2897 [ CASSANDRA-2897 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.0 [ 12316349 ]
          Hide
          Stu Hood added a comment -

          Had a discussion about this ticket tonight: now that CASSANDRA-2498 is in (per-sstable timestamp tracking), some optimizations I mentioned above are now possible. In particular, if a row matches a clause in the most recent SSTable, and the column for the clause could not have a higher version in an older file (by 2498), then the row is a precise (rather than probable) match.

          Show
          Stu Hood added a comment - Had a discussion about this ticket tonight: now that CASSANDRA-2498 is in (per-sstable timestamp tracking), some optimizations I mentioned above are now possible. In particular, if a row matches a clause in the most recent SSTable, and the column for the clause could not have a higher version in an older file (by 2498), then the row is a precise (rather than probable) match.
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12519740 ] patch-available, re-open possible [ 12749005 ]
          Gavin made changes -
          Workflow patch-available, re-open possible [ 12749005 ] reopen-resolved, no closed status, patch-avail, testing [ 12753993 ]
          Hide
          Alexander Shutyaev added a comment -

          What is the current status of the issue? Is it planned to be implemented? If yes then are there any timeframes?

          Show
          Alexander Shutyaev added a comment - What is the current status of the issue? Is it planned to be implemented? If yes then are there any timeframes?
          Hide
          Jonathan Ellis added a comment -

          Nobody is actively working on this.

          Show
          Jonathan Ellis added a comment - Nobody is actively working on this.
          Jonathan Ellis made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Reviewer tjake
          Resolution Later [ 7 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Stu Hood
            • Votes:
              15 Vote for this issue
              Watchers:
              35 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development