Cassandra / CASSANDRA-4885

Remove or rework per-row bloom filters

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.0 beta 1
    • Component/s: Core
    • Labels:
      None

      Description

      Per-row bloom filters may be a misfeature.

      On small rows we don't create them.

      On large rows we essentially only do slice queries that can't take advantage of it.

      And on very large rows, if we ever did deserialize one, the performance hit of doing so would outweigh the benefit of skipping the actual read.

      1. 0001-CASSANRDA-4885-Remove-per-row-bloom-filter.patch
        25 kB
        Jason Brown
      2. 0002-CASSANRDA-4885-update-test.patch
        2 kB
        Jason Brown
      3. 4885-v1.patch
        74 kB
        Jason Brown
      4. 4885-v2.patch
        21 kB
        Jason Brown

          Activity

          Jason Brown added a comment -

          OK, so I've removed the bloom filter from the row header and the key index. For the row header, on reads we just skip over the BF section that's on disk (if the sstable is any version earlier than the new 'ja'), and we never write it out when serializing. For the index, it's more or less the same solution. I think I've caught all the places in the logic to make sure it does the right thing (don't misread existing files, never write out the BF in new files).
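
          In sketch form (the surrounding variable names here are assumptions, and
          the version is treated as its two-letter string):

              // Sketch only: pre-'ja' sstables still carry a serialized per-row
              // bloom filter on disk, so readers must consume those bytes without
              // building the filter; 'ja' and later never write them.
              if (descriptor.version.compareTo("ja") < 0)
                  IndexHelper.skipBloomFilter(in); // old format: BF bytes present, skip them
              // nothing to do otherwise: the BF section no longer exists on disk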

          However, I did run into a problem with one of the unit tests. ScrubTest.testScrubFile() expects a file that is corrupted so it can then attempt to fix it. The existing corrupt file (test/data/corrupt-sstables/Keyspace1-Super5-f-2-Data.db) has a corrupted BF that throws an exception the unit test is expecting. Now that we're skipping over the BF in the row header, however, the exception never gets thrown and the test fails because the file is 'no longer corrupt'. I hacked up the code to trip a failure (by changing the columns count), but can somebody recommend a good way to create a corrupt file?

          Jonathan Ellis added a comment -

          Maybe reopen it at the halfway point after writing a healthy one, and scribble over 10% of it with zeros?

          Jason Brown added a comment -

          Yeah, I figured the 'whack it in the kneecaps with a baseball bat' technique would work, but wasn't sure if there would be a more controlled way. Digging out the sports equipment now...

          Jason Brown added a comment -

          OK, it took some work, but I (re-)corrupted the target data file enough to trigger the scrub. I tried deleting large blocks, scribbling over vast swathes, and even making minor surgical changes (changing the length of column names), but in the end the best approach was to just overwrite the first 64 bytes with 0x00.
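
          For reference, a sketch of that final zero-out step (the class name is
          made up; the path is the test file mentioned above):

              import java.io.RandomAccessFile;

              public class CorruptSSTableForScrubTest
              {
                  public static void main(String[] args) throws Exception
                  {
                      // Zero the first 64 bytes of the data file to (re-)corrupt it.
                      try (RandomAccessFile raf = new RandomAccessFile(
                              "test/data/corrupt-sstables/Keyspace1-Super5-f-2-Data.db", "rw"))
                      {
                          raf.seek(0);
                          raf.write(new byte[64]); // a fresh byte[] is zero-filled
                      }
                  }
              }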

          Sylvain Lebresne added a comment -

          On large rows we essentially only do slice queries

          Now that I think about it, with CQL3 we do potentially do name queries on wide rows fairly often (basically any time you request just one CQL3 row in a table that has a composite PK, unless there is a collection). So it may be worth being a bit prudent about removing this completely too quickly.

          For instance, maybe we could have a columnBF_fp_chance like the table-level BF's, and recognize a special value of it (say, 1) as 'don't generate any BF at all'? That could even be the default, but at least someone who, say, often needs to test the existence of a CQL3 row in a table with a composite PK could activate the column BF and probably get some benefit from it.
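
          Roughly, something like this sketch (all names are hypothetical, and the
          factory call just stands in for whatever builds the real filter):

              // columnBF_fp_chance == 1 would mean "don't build a per-row bloom
              // filter at all"; anything lower builds one with that fp chance.
              IFilter makeColumnBloomFilter(double columnBfFpChance, long estimatedColumns)
              {
                  if (columnBfFpChance >= 1.0)
                      return new AlwaysPresentFilter(); // no BF: lookups always answer "maybe"
                  return FilterFactory.getFilter(estimatedColumns, columnBfFpChance);
              }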

          Jonathan Ellis added a comment (edited) -

          We could use our knowledge of the CQL3 schema to only generate BF for wide rows...

          But I still think it should default to off: CQL3 wide rows are almost always used to support denormalized, clustered result sets, which are still going to correspond to slices.

          Sylvain Lebresne added a comment -

          But I still think it should default to off

          I agree. And that's also why I'm not sure we should try to be too smart here, because the usefulness of the BF depends more on how the table is used than on what the schema looks like. As said above, I'd be in favor of a columnBF_fp_chance defaulting to off, since 1) it's similar to our current row BF tunable, so it will be familiar to users, and 2) it's fine-grained, which cannot hurt. But I'm open to discussion; it just seems the best option so far to me.

          Jason Brown added a comment (edited) -

          This seems reasonable to me. Let me do some more reading/research, and I think this should be doable. Let me rework my patch, then.

          Jonathan Ellis added a comment -

          Still looking at this, Jason?

          Jason Brown added a comment -

          Yes, I was hoping to get back to this one next week. Sorry for the delay.

          Jonathan Ellis added a comment -

          Ping again. I can ask Aleksey to take a look at this if you like. (In a bit of a hurry since it's blocking CASSANDRA-4180.)

          Jason Brown added a comment -

          I'm about 98% done - just need to fix one last unit test. I will post a patch today.

          Jason Brown added a comment -

          Patch 4885-v1: Added a new field in Descriptor to declare whether the sstable version contains an extra byte per row (which declares whether the row will have a column-level bloom filter). The main functionality for writing or not writing out the BF lies in RIE.serialize()/deserialize() and ColumnIndex.Builder. Most other changes are tooling support (cql3, cqlsh, and cli) and the attendant changes in avro and thrift.

          Much of the bulk in the patch (oh, maybe 60% or so) is just the diff in the auto-generated thrift classes. I wasn't sure whether I should have excluded them.

          Jonathan Ellis added a comment -

          [marking Patch Available so it shows up in my review queue]

          Jason Brown added a comment -

          Strange, I thought it was Patch Available. Perhaps that's why I haven't gotten any feedback for a while. I'll be more careful about the status in the future.

          Jonathan Ellis added a comment -

          LGTM in principle. Details:

          • Should bump the Thrift interface minor version
          • Don't need to update the avro definition; it's only used for upgrading 1.0 schemas
          • Probably cleaner to use an AlwaysPresentFilter rather than special-casing nulls
          • Similarly, should deserialize to APF. This will allow cleaning up some special-casing of BF in SSTNamesIterator.

          Nits:

          • space before open paren of if, please.
          • the BF byte in RIE.serialize would be clearer as a single writeByte(ternary expression), making it explicit that we always write that byte (sketched below)
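
          A sketch of those last two points (the surrounding names and the
          deserialize call are stand-ins, not the exact API):

              // Serialization: the presence byte is always written, in one statement.
              out.writeByte(bloomFilter == null ? 0 : 1);

              // Deserialization: map "no filter" to an AlwaysPresentFilter instead of
              // null, so callers never need a special case.
              IFilter filter = in.readByte() == 0
                             ? new AlwaysPresentFilter()
                             : FilterFactory.deserialize(in);
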
          Jonathan Ellis added a comment -

          Sylvain Lebresne, do you still think we should provide this as an option? After letting it simmer for a couple of months I'd still like to just rip it out:

          • a rare simplification of our increasingly baroque storage engine code
          • even if you're in the rare case of reading single (CQL) rows from a large partition, it's far from clear that a BF is going to be a win for you, since you have to deserialize the BF in its entirety to do any queries on it. So probably only useful for "wide partitions, but not too wide." Pretty narrow benefit and one I'll bet few users will get right.
          • BF computation is part of the GC pain that LCR causes (CASSANDRA-5344)
          Jason Brown added a comment -

          After writing the code that allows the option, and now having thought on it for a while, I agree with Jonathan about removing the row-level BFs and the ability to tweak them. I'm not fully convinced that having another knob to play with will, in the end, help users get more performance. I'm afraid it might cause confusion (yet another knob), or they'll spin their wheels trying to use it.

          Sylvain Lebresne added a comment -

          Let's say that I don't feel extremely strongly either way. On the one hand, I agree that they are almost surely of limited use in practice, but at the same time, keeping it a disabled-by-default option for one or two versions wouldn't cost us much and seems safer to me. If no one complains of a performance drop, then cool, we drop them in a few releases. But if some people do experience a performance degradation in some of their workloads, then at least we have the option to check whether it is indeed due to the column BFs. And if it is, we might learn they are more useful than we thought.

          a rare simplification of our increasingly baroque storage engine code

          I hear you. But at the same time, column BFs don't add much complexity. Besides, all I'm suggesting is a more incremental/prudent way to remove them.

          in the rare case of reading single (CQL) rows from a large partition

          I note that it can also help for static rows, as it might save you from checking the data file at all in some cases. Not that I'd pretend this would be very common either.

          Jason Brown added a comment -

          While I lean more towards removing the row-level BFs for the sake of code simplification and knob reduction, I think Sylvain Lebresne raises a good point here. To that side of the argument, Jonathan Ellis, if we default to 'off' for the BF and use the AlwaysPresentFilter, won't that alleviate the BF computation that you are concerned about wrt LCR? APF looks rather performant.
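
          For context, AlwaysPresentFilter is essentially a no-op filter; roughly
          (a sketch, not the exact source, assuming Cassandra's IFilter interface):

              import java.nio.ByteBuffer;

              // No bits and no hashing: queries are constant time and always answer
              // "maybe present", and adds cost nothing, which is why it sidesteps
              // the BF computation during compaction.
              public class AlwaysPresentFilter implements IFilter
              {
                  public boolean isPresent(ByteBuffer key) { return true; } // never rules a key out
                  public void add(ByteBuffer key) { }                       // nothing to update
                  public void clear() { }
              }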

          Jonathan Ellis added a comment -

          Here is how I put it in perspective:

          If we were writing Cassandra from scratch today, would we include an option to have row-level BFs? No. It might help with some niche workloads, but you can come up with workloads that the row cache benefits, too. On balance neither was a good idea, and most of the time, if you let people enable them, they will make things worse. Let's put them out of their misery.

          Jason Brown added a comment -

          OK, will revise my patch to remove the row-level BFs, rather than make them optional. The patch should come out at less than 75 kB this time.

          Sylvain Lebresne added a comment -

          If we were writing Cassandra from scratch today

          But we're not. I do think that providing an option and then removing it is different from never implementing it in the first place, because in the former case some people may have started relying on it. I also don't share your confidence that row-level BFs are only ever useful in some niche workloads that we don't care about (I do suspect that's largely true, but I'm not totally confident). Anyway, I can agree to disagree, so do go on. After all, I don't disagree they should be disabled by default; I just disagree that it's not worth keeping them as an option, just in case we had underestimated how useful they can be.

          you can come up with workloads that row cache benefits too.

          Well, I happen to disagree about the removal of the row cache (until we have something better, that is), so I guess I'm consistent here.

          Jason Brown added a comment -

          Attached a new patch that removes the row-level BFs. Also changed the places where we defreeze the BF to just skip the bloom filter instead, since defreeze builds up the whole BF in memory only to throw it away (it was never referenced, and is now not needed).
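
          The before/after shape of that change, as a sketch (the surrounding
          context is hypothetical):

              // Before: materialize the filter only to discard it.
              //     IFilter bf = FilterFactory.deserialize(in); // allocates the whole bitset
              // After: skip the serialized bytes without allocating anything.
              IndexHelper.skipBloomFilter(in);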

          Jonathan Ellis added a comment -

          Ship it!

          Jason Brown added a comment -

          Committed! Thanks

          Carl Yeksigian added a comment -

          This won't upgrade properly.

          The current IndexHelper.skipBloomFilter() only handles Type.SHA SSTable bloom filters. This is because the number of bytes used depends on the scheme. If the filter type is not SHA, we need to read the byte length from the input, then skip that many bytes. This means that after the update, if a row filter was written, we will not skip over it properly. Upgrading from 1.2 to 2.0 will cause a CorruptSSTableException since we aren't advancing far enough.
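
          The shape of the fix, as a sketch (the signature and the legacy-SHA
          helper are hypothetical; the length-prefixed skip is the new part):

              public static void skipBloomFilter(DataInput in, FilterFactory.Type type) throws IOException
              {
                  if (type == FilterFactory.Type.SHA)
                  {
                      skipShaBloomFilter(in); // the existing fixed-format skip, unchanged
                  }
                  else
                  {
                      int size = in.readInt();            // other filters are length-prefixed
                      FileUtils.skipBytesFully(in, size); // advance past the filter bytes
                  }
              }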

          I'm reopening and posting a patch which works for this case; it will not currently work with the scrub test.

          Jonathan Ellis added a comment -

          Is this still a problem or was it fixed by the CASSANDRA-5385 stuff?

          Jason Brown added a comment -

          CASSANDRA-5385 resolved this guy, as well.


            People

            • Assignee: Jason Brown
            • Reporter: Jonathan Ellis
            • Reviewer: Jonathan Ellis
            • Votes: 1
            • Watchers: 7
