Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: regionserver, wal
    • Labels: None

      Description

       While HBASE-4608 allows for key-intelligent log compression & custom extensions, Hadoop already provides native compression support in SequenceFile. We should allow the user to enable this functionality through the config. Besides backwards compatibility & the existing stabilized code, allowing users to enable record-level WAL compression also adds support for compressing values.
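
       A minimal sketch of what turning this on could look like against the stock Hadoop SequenceFile API. The config key and the writable classes below are illustrative placeholders, not anything shipped in HBase:

         import java.io.IOException;

         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.FileSystem;
         import org.apache.hadoop.fs.Path;
         import org.apache.hadoop.io.BytesWritable;
         import org.apache.hadoop.io.SequenceFile;
         import org.apache.hadoop.io.SequenceFile.CompressionType;
         import org.apache.hadoop.io.compress.DefaultCodec;

         public class CompressedWalSketch {
           public static SequenceFile.Writer open(Configuration conf, Path hlogPath) throws IOException {
             FileSystem fs = hlogPath.getFileSystem(conf);
             // "hbase.regionserver.wal.compress.records" is a made-up key standing in
             // for whatever switch a patch for this issue would add.
             boolean compress = conf.getBoolean("hbase.regionserver.wal.compress.records", false);
             // RECORD compresses each key/value pair on its own, so one WAL entry
             // (one transaction) can be recovered without reading its neighbours.
             CompressionType type = compress ? CompressionType.RECORD : CompressionType.NONE;
             return SequenceFile.createWriter(fs, conf, hlogPath,
                 BytesWritable.class, BytesWritable.class,   // stand-ins for HLogKey / WALEdit
                 type, new DefaultCodec());
           }
         }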


          Activity

          apurtell Andrew Purtell added a comment -

          Stale issue. Reopen if still relevant.

          sershe Sergey Shelukhin added a comment -

          Note that HBASE-7413 removes the usage of sequence files in favor of PB+cells in the WAL.
          Can we put this compression in the new compression pipeline that was added to IPC (see buildCellBlock) and reuse it in WAL instead?
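
          For context, the IPC cell-block path wraps the serialized cells in a Hadoop CompressionCodec stream. The sketch below shows that general pattern only (it is not the actual buildCellBlock code), as one way the same codec plumbing could be reused for a PB+cells WAL:

            import java.io.ByteArrayOutputStream;
            import java.io.IOException;

            import org.apache.hadoop.io.compress.CompressionCodec;
            import org.apache.hadoop.io.compress.CompressionOutputStream;

            public class CellBlockCompressionSketch {
              /** Compress an already-serialized block of cells with the configured codec. */
              public static byte[] compress(CompressionCodec codec, byte[] serializedCells) throws IOException {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                CompressionOutputStream cos = codec.createOutputStream(out);
                try {
                  cos.write(serializedCells);
                  cos.finish();          // flush whatever the codec is still buffering
                } finally {
                  cos.close();
                }
                return out.toByteArray();
              }
            }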

          nspiegelberg Nicolas Spiegelberg added a comment -

          [stack] I think we want to limit this to record compression, since that is the exact same granularity as a transaction. I'm not sure how to perform reliable transaction recovery with block compression. I think it would require larger records (probably 4k+) for big impact, but note that a lot of the compression algorithms we use fall back to storing data uncompressed when they notice that compression isn't helping.

          Some follow-up information: I pushed this change to our dark-launch, high-throughput cluster. It achieved the expected 3x compression and added <1ms to latency (no throughput difference). The HLogs now take 3x longer to roll, so you may need to tune some other configs if log roll speed matters. See the 89-fb patch at: http://svn.apache.org/viewvc?view=revision&revision=1465082
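
          One example of the kind of knob meant by "tune some other configs" (values below are illustrative only): HBase's time-based roll period, hbase.regionserver.logroll.period, can be shortened so rolls still happen on a schedule even though compressed logs fill roughly 3x more slowly:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.hbase.HBaseConfiguration;

            public class LogRollTuning {
              public static Configuration withFasterRolls() {
                Configuration conf = HBaseConfiguration.create();
                // Time-based roll period (default one hour). The 20-minute value here is
                // purely illustrative; pick whatever matches your recovery and replication needs.
                conf.setLong("hbase.regionserver.logroll.period", 20 * 60 * 1000L);
                return conf;
              }
            }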

          stack stack added a comment -

          So, we should make it so you could do either. Wouldn't we have to do record-based compression if we want replication to work (my guess is that it would get messed up if block-based compression were enabled)? We would have to have big records for this to have an effect – >10k or so? Seems like a small change for a big win to me.

          nspiegelberg Nicolas Spiegelberg added a comment -

          Andrew Purtell Let's do the latter for feature isolation.

          apurtell Andrew Purtell added a comment -

          > Did you want to merge this change into HBASE-7544? The core change is trivial (~20 lines)

          Nicolas Spiegelberg Or we could make the change here and I can fix it up there.

          nspiegelberg Nicolas Spiegelberg added a comment -

          Andrew Purtell: It looks like the SequenceFileLogWriter without encryption "crypto/main/without" does not have this compression. Did you want to merge this change into HBASE-7544? The core change is trivial (~20 lines)

          apurtell Andrew Purtell added a comment -

          Stack SF can compress either by record or block. I enable SequenceFile.RECORD compression as part of HBASE-7544 and drop in an encryption codec masquerading as a compression codec, and it works as expected.
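
          The masquerade works because the writer only ever sees the CompressionCodec interface. A hedged sketch of that plug-in point, where "wal.codec.class" and the writable classes are placeholders rather than real HBase or HBASE-7544 names:

            import java.io.IOException;

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.io.BytesWritable;
            import org.apache.hadoop.io.SequenceFile;
            import org.apache.hadoop.io.SequenceFile.CompressionType;
            import org.apache.hadoop.io.compress.CompressionCodec;
            import org.apache.hadoop.io.compress.DefaultCodec;
            import org.apache.hadoop.util.ReflectionUtils;

            public class PluggableRecordCodec {
              public static SequenceFile.Writer open(Configuration conf, Path walPath) throws IOException {
                // Any class implementing CompressionCodec can be dropped in here: a real
                // compressor, or an encryption codec that exposes the same interface.
                Class<? extends CompressionCodec> codecClass =
                    conf.getClass("wal.codec.class", DefaultCodec.class, CompressionCodec.class);
                CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
                return SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(walPath),
                    SequenceFile.Writer.keyClass(BytesWritable.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(CompressionType.RECORD, codec));
              }
            }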

          stack stack added a comment -

          Nicolas Spiegelberg SF compresses by the block. A block may be made of many entries. Enabling compression means we could lose more data – up to a whole block rather than just the last entry – if we fail just as we are about to round out the block (or we've written a complete block and the compressor is still working on flushing it out). Is your thinking this is ok as long as the configuration carries sufficient warnings? Thanks.

          nspiegelberg Nicolas Spiegelberg added a comment -

          In high-bandwidth systems (our transactional records average 10KB), this gives substantial size savings. Preliminary results show 3x compression with LZO and minimal throughput impact. This got more attention recently because it also facilitates longer log retention for backup & replication purposes.


            People

            • Assignee: Unassigned
            • Reporter: nspiegelberg Nicolas Spiegelberg
            • Votes: 0
            • Watchers: 9
