Hadoop Common
HADOOP-54

SequenceFile should compress blocks, not individual entries

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.6.0
    • Component/s: io
    • Labels:
      None

      Description

      SequenceFile will optionally compress individual values. But both compression and performance would be much better if sequences of keys and values are compressed together. Sync marks should only be placed between blocks. This will require some changes to MapFile too, so that all file positions stored there are the positions of blocks, not entries within blocks. Probably this can be accomplished by adding a getBlockStartPosition() method to SequenceFile.Writer.

      Attachments

      1. enum-54.patch
        85 kB
        Owen O'Malley
      2. SequenceFile.20060821.patch
        85 kB
        Arun C Murthy
      3. SequenceFile.20060821.perfomance.txt
        4 kB
        Arun C Murthy
      4. SequenceFile.20060822.tgz
        17 kB
        Arun C Murthy
      5. SequenceFile.updated.final.patch
        84 kB
        Arun C Murthy
      6. SequenceFile20060824.tgz
        17 kB
        Arun C Murthy
      7. SequenceFiles.final.patch
        80 kB
        Arun C Murthy
      8. SequenceFiles.patch
        67 kB
        Arun C Murthy
      9. SequenceFilesII.patch
        77 kB
        Arun C Murthy
      10. VIntCompressionResults.txt
        6 kB
        Arun C Murthy


          Activity

          eric baldeschwieler added a comment -

          I've been thinking about this. I think we should use a very simple scheme like the current sequence file. The writer would take a configurable buffer, maybe 10 meg by default, fill that with key/values and then compress them. The result would be stored with start markers like the current sequence file, so the partitioning logic could remain unchanged.

          Within a block, I think we should compress the keys together in a block and the values in a following block. Both blocks' lengths should be stored, so that one can quickly scan the keys and then scan only as far as needed in the value blocks. This would allow very efficient sub-sampling when you have large data blocks, which could be a life saver in some of our typical apps. You'll also get better compression, since you'll be compressing like items together.

          Thoughts?

          eric baldeschwieler added a comment -

          Also, we should be able to specify a custom compressor for keys or values.

          Doug Cutting added a comment -

          I like this proposal. Long-term, custom compressors make good sense, but for the first version let's just use a fixed algorithm and add that as a separate pass, ok? That could be a separate issue in Jira.

          Arun C Murthy added a comment -

          Here are some thoughts on how to go about this... inputs are much appreciated!

          <thoughts>

          The key idea is to compress blocks and not individual 'values' as is done (optionally) today in a SequenceFile.

          The plan is to have a configurable buffer (say default of 1MB? or 10MB?), fill it up with key/value pairs and then compress them. When the buffer is (almost) full we compress the keys together into a block and values into the following block and then write them out to file along with necessary headers and markers. The point of compressing keys and values separately (as Eric points out) is:
          a) (hopefully) better compression since 'like' items are compressed together.
          b) if need be, iterating over 'keys' itself is faster since we don't need to uncompress 'values'.

          We could also write out 'sync' markers every time the whole key-value compressed buffer is written out to dfs, or we can combine the sync and end-of-block markers so that the sync markers also double up as end-of-compressed-block markers. Of course the 'sync' marker is similar to the one used today in SequenceFiles. (thoughts?)

          E.g.
          a) configured buffer size - 4096b
          b) key/values
          keyA - 32b, valueA - 1024b
          keyB - 64b, valueB - 2048b
          c) compressedSize(keyA+keyB) - 75b
          d) compressedSize(valueA+valueB) - 2500b

          On disk (one block):

          sync-marker
          2                         (no. of k/v pairs)
          32 64                     (key sizes)
          75                        (c)
          1024 2048                 (value sizes)
          2500                      (d)
          compressedKeys (blob)
          compressedValues (blob)
          sync-marker

          Non-graphical version:
          sync,2,32,64,75,1024,2048,2500,compressedKeys,compressedValuesBlob,sync.

          Clarification: All lengths above are uncompressed and as is on disk. (write/read via writeInt/readInt apis).

          Points to ponder:
          -----------------
          a) Since we need to store keys and values in separate (compressed) blocks on disk does it make sense to:
          i) use 2 buffers (one for each) to store them before compressing i.e. before we hit the configured (1/10/* MB) limit (both buffers combined) - fairly simple to implement!
          ii) interleave them in the buffer and then make 2 passes to compress and write to disk.
          b) Strategy for buffer-size vis-a-vis dfs-block-size? Should we pad out after compressing and before writing to disk? Better to ignore this for now and let dfs handle it better in future?
          c) Thoughts by Eric/Doug w.r.t custom compressors for keys/values.

          </thoughts>

          thanks,
          Arun

          Bryan Pendleton added a comment -

          Another feature that might be useful: include an (optional) un-compressed copy of the last key in a given compression spill. Why? Because, for sorted SequenceFiles (specifically, for the data file of a MapFile), when seeking through to find a given key/value, knowing the last key in a given chunk allows skipping of decompressing the entire key array.

          It should, of course, be optional, both because keys can potentially be large, and because SequenceFiles aren't all sorted. But, in the MapFile case, it could reduce the cost of finding a hit during lookups.

          Might also be useful to try to do some DFS block-aligning, to avoid block requests and CRC calculations for data that's not really going to be used. That sounds like it might be tricky to get, though, because default GFS blocks are so large, and we're probably talking about much smaller compression chunks. Does DFS have variable-length block writing yet?

          Arun C Murthy added a comment -

          A couple of implementation thoughts/refinements (thanks to Eric) I wanted to bring to everyone's attention:

          a) We could use BytesWritables for keys/values (which maintain their own sizes), thus the header would then only be:
          sync,no.of k/v pairs, compressed-keys-size, compressed-values-size, compressedKeysBlob, compressedValuesBlob, sync

          b) Use the 'zero-compressed-int-length' that Milind is exporting from recordio for the 'integer' values in the header to save some bytes.

          Thoughts? Other refinements?

          thanks,
          Arun

          Owen O'Malley added a comment -

          Ok, I talked with Eric about this.

          The thought about BytesWritable was just about trying to get rid of the redundant lengths in the SequenceFile format. (Both the Writable object and the SequenceFile encode the object's length.) This redundancy is caused because Writable.readFields is not given the length and yet the SequenceFile wants to be able to know the length so that it can copy/skip keys and values during copy and sorting. Unfortunately, to remove the redundancy would require extending the Writable interface, which would break a lot of application code. (We could make an optional extension that provides a mechanism for getting the length out of the serialized form or copying the raw bytes of the proper length and allow Writable types to implement the sub-interface if they want to have smaller files.)

          For now, I'd suggest mixing in the length in a zero compressed format.

          So inside the compressed key block would look like:
          <key1-length><key1-bytes><key2-length><key2-bytes><key3-length><key3-bytes>

          And the inside of the compressed value block would look like:
          <value1-length><value1-bytes><value2-length><value2-bytes><value3-length><value3-bytes>

          The blocks would look like:
          <sync><num-records><compressed-key-bytes><compressed-value-bytes><compressed-keys-data><compressed-values-data>

          With the blocks padded up to io.sequencefile.pad.percent to align to the DFS block boundary.

          Does that sound reasonable?
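
          For illustration, a minimal writer-side sketch of flushing one block in this layout, assuming hypothetical helpers (writeVInt for the zero-compressed int encoding, compress for the configured codec); the actual SequenceFile internals may differ:

          // Hypothetical sketch only; helper names are illustrative.
          void writeBlock(DataOutputStream out) throws IOException {
            out.write(SYNC_MARKER);              // <sync>
            writeVInt(out, numRecords);          // <num-records>
            byte[] keys = compress(keyBuffer);   // keys serialized as <len><bytes>... then compressed
            byte[] vals = compress(valBuffer);   // values serialized as <len><bytes>... then compressed
            writeVInt(out, keys.length);         // <compressed-key-bytes>
            writeVInt(out, vals.length);         // <compressed-value-bytes>
            out.write(keys);                     // <compressed-keys-data>
            out.write(vals);                     // <compressed-values-data>
          }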

          Doug Cutting added a comment -

          SequenceFile already has an API for reading and writing raw keys and values:

          http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Reader.html#next(org.apache.hadoop.io.DataOutputBuffer)
          http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Writer.html#append(byte[],%20int,%20int,%20int)

          So it's easy to write an InputFormat or OutputFormat for entries that do not encode their own length.

          Writable implementations must encode their length in order to be easily nestable: if a key contains a struct with string, int and long values, then the lengths of the nested strings must be explicitly written.

          If we're really worried about redundant lengths, we could add a RawWritable sub-interface that adds something like rawLength(), rawWrite(DataOutput), and rawRead(DataInput, int length). (This is roughly what Owen referred to.)

          I don't think we should worry about padding to DFS block boundaries in the first implementation, but rather leave that as a subsequent optimization.

          I'm +1 for Owen's proposed format.
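
          For reference, a sketch of the kind of RawWritable sub-interface described above; the method names come from this comment, but the exact shape is hypothetical:

          // Hypothetical sketch; not an existing Hadoop interface at this point.
          public interface RawWritable extends Writable {
            int rawLength();                                            // length of the serialized form
            void rawWrite(DataOutput out) throws IOException;           // write raw bytes, no length prefix
            void rawRead(DataInput in, int length) throws IOException;  // read exactly 'length' bytes
          }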

          Arun C Murthy added a comment -

          +1 for Owen's proposal.

          An unrelated issue: the 'append' method in SequenceFile.Writer is passed 2 Writables: key and value. The Writable interface doesn't have a 'getLength' interface. This means one would have to write out the key/value to a temporary buffer to actually figure out its 'length'. The lengths are particularly relevant here to ensure that the key/value pair can be put into the keyBuffer/valueBuffer without violating the 'configured' maxBufferSize...

          To get around this issue: how about making the 'configured' bufferSize the 'lower_bound' instead of the 'upper_bound'? This will ensure we can write out the key/value and then check the buffer size, and if need be go ahead and compress etc. This will save the construction of the temporary buffer for getting the key/value lengths. Related gain: it's far simpler with this scheme to deal with outlier/rogue keys/values which are larger than bufferSize itself.

          Logical next step: make this 'bufferSize' configurable per SequenceFile, this will let applications control it depending on the sizes of their keys/values. I propose to introduce a new constructor with this as an argument for SequenceFile.Writer. This will then be written out as a part of the file-header (along with compression details) and the SequenceFile.Reader can pick this up and read accordingly. (Of course there will be a system-wide default if unspecified per file).

          Thoughts?

          thanks,
          Arun

          Arun C Murthy added a comment -

          Actually we have an 'oops' here... afaics there isn't a way to even construct
          <key1-length><key1-bytes><key2-length><key2-bytes><key3-length><key3-bytes>
          <value1-length><value1-bytes><value2-length><value2-bytes><value3-length><value3-bytes>
          without constructing a temporary object from the Writable key/value, since there isn't a way to figure out the key/value length at all.

          I feel constructing a temporary object will be a huge overhead, i.e. it introduces an extra copy from Writable to temp-buffer and then from Writable/tempBuffer to keyBuffer/valueBuffer for compression...

          Any way to do this without an extra copy?

          -

          One alternative I can think of to avoid the extra copy is slightly convoluted, though this still won't be able to construct Owen's proposal as-is. (Will look very similar to my older proposal)

          Maintain 2 auxiliary arrays which keep track of actual key/value lengths. The way to maintain lengths is to compute the difference in size of keyBuffer/valBuffer before and after insertion of each key/value and then store that difference.

          E.g.

          int oldKeyBufferLength = keyBuffer.length();
          key.write(keyBuffer);
          int newKeyBufferLength = keyBuffer.length();
          keySizes.insert(newKeyBufferLength - oldKeyBufferLength); // save the last key size

          // same for 'val'

          if ((newKeyBufferLength + newValueBufferLength) > minBufferSize) {
            // lower_bound instead of 'higher_bound'
            // Compress both keyBuffer and valueBuffer
            // Write out keySizes array to disk in zero-compressed format
            // Write out valueSizes array to disk in zero-compressed format
            // Write out compressedKeyBufferSize in zero-compressed format
            // Write out compressed keyBuffer
            // Write out compressedValueBufferSize in zero-compressed format
            // Write out compressed valueBuffer
            // Reset keyBuffer, valueBuffer, keySizes and valueSizes
          } else {
            // Done
            return;
          }

          -

          Appreciate any inputs/alternatives/refinements...

          thanks,
          Arun

          Doug Cutting added a comment -

          > Any way to do this without an extra copy?

          There's no way to length-prefix things without knowing the length, and the length is only known after the value is serialized. I doubt the extra copy will significantly affect overall write performance.

          Note that folks who use the raw API, directly passing bytes rather than Writables, should not suffer the extra-copy penalty when adding items to a file.

          eric baldeschwieler added a comment -

          I think this is enough of an argument to return to a format with key lengths segregated though. We proposed interleaved because we thought it would be simpler to code. Since it clearly will not be, we might as well segregate lengths. This has the extra advantage that it will compress better (because we will be grouping like data together).

          I think this brings us full circle back to Arun's original proposal (but all wiser). So I'm now proposing:

          sync-marker
          numPairs (delta compressed int)
          KeyLenBlockCompressedLen (delta compressed int)
          KeyBlockCompressedLen (delta compressed int)
          ValueLenBlockCompressedLen (delta compressed int)
          ValueBlockCompressedLen (delta compressed int)
          keyLengths... (gzipped & delta compressed)
          keys... (gzipped or custom compressed)
          valueLengths... (gzipped & delta compressed)
          values... (gzipped or custom compressed)

          This will allow one to scan the keys and skip the values if desired. It will yield pretty good compression. It's not that hard to understand if well documented...

          I like Arun's suggestion of considering the target block size the minimum. That will keep things simpler.

          I suggest we proceed this way, unless anyone objects?
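
          As an illustration of how a reader might walk this layout, here is a sketch of parsing one block header, assuming a readVInt-style helper for the zero-compressed ints (all names hypothetical):

          // Hypothetical sketch only; readVInt stands in for the zero-compressed int decoding.
          void readBlockHeader(DataInput in) throws IOException {
            in.readFully(syncCheck);                   // sync-marker
            int numPairs        = readVInt(in);        // numPairs
            int keyLenBlockCLen = readVInt(in);        // KeyLenBlockCompressedLen
            int keyBlockCLen    = readVInt(in);        // KeyBlockCompressedLen
            int valLenBlockCLen = readVInt(in);        // ValueLenBlockCompressedLen
            int valBlockCLen    = readVInt(in);        // ValueBlockCompressedLen
            // To scan keys only: decompress the next keyLenBlockCLen + keyBlockCLen bytes,
            // then skip valLenBlockCLen + valBlockCLen bytes of value data.
          }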

          Doug Cutting added a comment -

          Re documentation: perhaps we should add file-format documentation to the javadoc as a part of this change. This could look something like Lucene's file-formats documentation (except simpler). That mixes a pseudo-formal BNF-like syntax with commentary:

          http://lucene.apache.org/java/docs/fileformats.html

          This should link to other javadoc to describe things like the zero-compressed int format.

          Eric's proposal looks good to me. I assume zero-compressed ints are meant most places he says 'delta compressed'. Is that right? It may make sense to delta compress the lists of key and value lengths, but it probably does not make sense to delta compress numPairs and the block lengths.

          Arun C Murthy added a comment -

          Here's a test run analysing compression of VInts (exported from recordio).

          Essentially we get almost 50% savings (either with zlib/gzip) of compressed-VInts v/s uncompressed raw integers on disk (4bytes per int) ...

          Sounds like a good reason to go ahead with Eric's proposal to compress not only keys/values but also keyLengthsBlock/valueLengthsBlock? I'll go ahead with this for now unless anyone objects...

          thanks,
          Arun

          Doug Cutting added a comment -

          Just to be clear, delta compression means something different to me than zero-compression. The former represents a list of integers with their differences, the latter elides leading zeros in integers. They're not exclusive. A sorted list of integers is smaller when delta compressed and zero-compressed. A random list of integers is probably not helped by delta compression, but is helped by zero compression. If values are in a narrow range, then delta compression may help. Thus it may be useful for lists of key lengths and value lengths.

          You provide some benchmarks showing the advantage of zero compression for random sequences. Eric said delta compression, but I think he meant zero compression. I agree that we should use zero compression everywhere. The only question is if we should also use delta compression anywhere.
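
          A small illustrative example of the distinction (numbers invented):

          // Sorted offsets: delta compression shrinks the magnitudes, and
          // zero compression (VInt-style) then stores the small deltas in one byte each.
          int[] sortedOffsets = {1000, 1004, 1009};
          int[] deltas        = {1000, 4, 5};      // delta compressed
          // Random lengths: deltas do not help, but zero compression still
          // drops the leading zero bytes of each 4-byte int.
          int[] randomLengths = {1000, 7, 523};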

          eric baldeschwieler added a comment -

          I meant zero compressed. So we're all on the same page.

          (Of course recording key & value lengths rather than offsets is delta compression, so we are actually doing both)

          Arun C Murthy added a comment -

          Sounds good... looks like we're all on the same page; I'll get going on this.

          Appreciate both of you spending time... I'll also keep in mind Doug's thoughts on documenting the file-format for SequenceFile.

          thanks!
          Arun

          Arun C Murthy added a comment -

          Issues which I came across while implementing the above proposal...

          1. The implementation of the public interface

          SequenceFile.Writer.append(byte[] data, int start, int length, int keyLength)

          as it exists today, does not honour 'deflateValues', i.e. it does not compress 'values' at all. I feel this is contrary to users' expectations, since the other 'append' interface does compress values, and it breaks the abstraction of 'compressed' sequence files. I propose we remedy this now and add the necessary support here too. (I understand that it might be a break with existing behaviour, but I feel we should correct this right away... we need to fix it some time or the other.)

          I will also go ahead and add a 'rawAppend' public interface if the existing functionality (just write data to disk without heeding 'deflateValues') is deemed necessary.

          2. I propose we add a public interface:

          void flush() throws IOException

          to SequenceFile.Writer to let the user explicitly compress and flush existing data in key/value buffers.

          This api will also be used from existing 'close' (flush remaining data in buffers) and 'append' (flush buffers to dfs after they exceed the configured size) apis internally... the only point of contention is whether I should make this api 'public'.

          3. Afaik I can't see a way to 'configure' the default 'minimum buffer size' since the SequenceFile class, as it exists, does not seem to have any access to a 'Configuration' object...
          (... in my previous life Owen pointed out that making a 'comparator' class implement the 'Configurable' interface ensured that its 'configure' api would be called by the framework; will this trick work again?!)

          I don't want to hardcode any values for 'minimum buffer size' nor does the idea of adding a new constructor with a 'Configuration' object as one of the params look very appealing...

          -

          Thoughts?

          Owen O'Malley added a comment -

          The "raw" append/next interface for SequenceFile is intended to get the raw bytes from the file. Its intended use was for doing things like merging and sorting where the values don't need to be instantiated. So the lack of decompression was deliberate. However, in the switch to block compression, that doesn't make sense. In the new block compression reader and writer, just treat them as a key or value that has already been serialized for you.

          Owen O'Malley added a comment -

          One more thing, the "append" isn't appending bytes, but a preserialized key/value pair. The interface is a little unfortunate, because it forces the key and value to be in the same buffer. A more general interface would be something like:

          append(byte[] key, int keyOffset, int keyLength, byte[] value, int valueOffset, int valueLength)

          The advantage of that interface is that if the application has the key and value in different buffers they don't need to be copied into a single buffer before being copied to the SequenceFile.Writer's buffer.
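
          For illustration, a caller with pre-serialized keys and values in separate DataOutputBuffer-style buffers could then append without the intermediate copy (keyOut and valueOut are hypothetical buffer names):

          // Illustrative use of the proposed interface; keyOut/valueOut are assumed
          // to already hold the serialized key and value bytes.
          writer.append(keyOut.getData(), 0, keyOut.getLength(),
                        valueOut.getData(), 0, valueOut.getLength());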

          eric baldeschwieler added a comment -

          Sounds like we better be careful here.

          This raw interface is presumably used mainly by the framework? So we can probably change it without breaking the universe?

          I think we should probably change it to deal with "serializedRef" objects or some other new type that points to the buffer and keeps the info on if the data is compressed (and with which class).

          Otherwise application code is going to need to deal with tracking the state of each object and finding the right compress/decompress calls to make. A frequent scenario will be moving things from a perhaps block compressed file to an item compressed format for sorting. That should be efficient if possible.

          I like this approach because it makes what is going on very explicit, vs the current interface, which is obviously confusing. The alternative seems to be copious documentation in the classes and all use cases and frequent discussions on the list...

          eric baldeschwieler added a comment -

          Arun, what does flush do exactly? Does it create a block boundary? I'd vote for not expanding the interface until we have a compelling use case. Although I can see how this could arise.

          PS. I'm all for hardcoding a value of 2MB or something else sane for the block size. Keeping things simple is often much more important than making everything configurable. As long as the API lets you set the buffer size on creation, I'm happy. My experience with such systems is that at some point the number of configuration options becomes a weakness, not a strength. Happy to be out-voted on this point. (In this case)

          Owen O'Malley added a comment -

          -1 on adding flush to the public api.

          I just checked and the only users of SequenceFile.Writer.append(byte[], ...) in both Hadoop and Nutch are in SequenceFile itself. Once I have my RawSequenceInputFormat, it will also be a user of this interface.

          Arun C Murthy added a comment -

          Rebuttals:

          1. append

          I like Owen's idea about generalising the interface to:

          append(byte[] key, int keyOffset, int keyLength, byte[] value, int valueOffset, int valueLength)

          with its associated 'clarity' for the user and the advantage that it precludes an extra copy into a single buffer... couple of +1s and I'll take this path.

          @Owen: I understand that this interface currently appends a 'preserialized' key/value pair, but as you point out with 'compressed blocks' this gets worrisome in the long run (things like 'custom compression' will require 'serializedRef' like objects soon enough) ...

          How about letting the user pass in the preserialized key/value and then we still go ahead and honour 'deflateValues' in the append? Honouring 'compress' directives will ensure consistent behaviour with the rest of the apis (read: append(Writable key, Writable value)), and uncompressing in the SequenceFile.Reader.next call will ensure the what-you-store-is-what-you-get contract.

          Otherwise a true 'rawAppend' will mean (especially in 'compressed blocks' context) that I will need to create a 'block' with a single key/value pair and write out to disk ...

          Summarising: We can switch to the 'general' append interface and honour 'compress' directives in it... ensuring consistency & clarity. (I also volunteer to fix 'older' append calls in SequenceFile.java; Owen can then

          2. flush

          I should have worded things more carefully... I was looking to see if there is a compelling use case for this 'already'.
          Looks like there isn't... I'll drop this.

          (Adding a public 'flush' later is infinitely easier than adding now and removing later... )

          @Eric: Yep, the 'flush' does create a block boundary, it's used internally in 2 cases for now: (a) sizeof(keyBuffer+valueBuffer) exceeds minBlockSize (b) When the 'close' api is called.

          3. configuration

          I concur with need to keep things simple... I'll just hardcode a 'sane' value for now.
          (Yes, there is a way via the constructor to set the buffersize on creation.)

          (PS: I do hear bells in my head when I see, as it exists, SequenceFile.Reader gets a 'Configuration' object via the constructor and the 'symmetric' SequenceFile.Writer doesn't... but that's topic for another discussion.)

          Doug Cutting added a comment -

          I suggest adding the binary append API suggested by Owen and deprecating the old binary append API, but making it work back-compatibly. Thus it should accept pre-compressed (if compression is enabled) values, de-compress them, then call the new append method. We should update all existing binary appends in Hadoop and prepare a patch for Nutch to do the same. Then we should file a bug to remove the deprecated method in the next release.

          We unfortunately lose the ability to move individual compressed values around. If a mapper does not touch values, it would be best to only decompress values on reduce nodes, rather than decompress and recompress them on map nodes, since compression can be computationally expensive. But I don't see how to avoid this if we want to compress multiple values together. I think this argues that we might still permit the existing single-value compression, since that might be most efficient for large-valued files that are not touched during maps.

          Also, please add a public SequenceFile.Writer() constructor that accepts a Configuration. We should probably also deprecate the unconfigured constructor and remove it in the next release. I agree with Eric that things can be over-configurable, but it's easier to make them configurable in the code from the start, and only add them to hadoop-default.xml as needed, so that folks who have not read the code can tweak them.

          I also agree that flush should not be public.

          Arun C Murthy added a comment -

          > I suggest adding the binary append API suggested by Owen and deprecating the old binary append API, but making it work back-compatibly. Thus it should accept pre-compressed (if compression is enabled) values, de-compress them, then call the new append method.

          My hunch is that we do not need to worry about 'pre-compressed' values since as of today both the 'raw' apis do not honour it... is this true?

          In fact we could take the route in which 'append' compresses whatever data is passed along, thus possibly compressing data twice. With the 'symmetric' call to next (which decompresses) we give back the data which the user passed along in the first place... I did have a chat with Owen about this and we both felt this could work.

          > We unfortunately lose the ability to move individual compressed values around. If a mapper does not touch values, it would be best to only decompress values on reduce nodes, rather than decompress and recompress them on map nodes, since compression can be computationally expensive.

          I can see a way around this if it really will make a difference...

          We can take the path that values are decompressed only 'on demand' i.e. a series of calls to SequenceFile.Reader.next(Writable key) does not need to decompress 'valBuffer' (or even valLengthsBuffer). Hence when we read a compressed 'block' we need not decompress values till we see a call to either SequenceFile.Reader.next(Writable key, Writable value) or SequenceFile.Reader.next(DataOutputBuffer buffer).

          Implementing this 'lazy decompression' of values is slightly more complex... worth it?
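
          A rough sketch of what such lazy decompression could look like in the reader, with all helper names hypothetical (and glossing over the bookkeeping needed to keep the value stream in step with the key stream, which is where the extra complexity lies):

          // Hypothetical sketch only.
          public boolean next(Writable key) throws IOException {
            if (keysLeftInBlock == 0) {
              if (!readNextBlockHeaderAndKeys()) return false; // decompress keys, leave values compressed
              valuesDecompressed = false;
            }
            key.readFields(keyIn);
            keysLeftInBlock--;
            return true;
          }

          public boolean next(Writable key, Writable val) throws IOException {
            if (!next(key)) return false;
            if (!valuesDecompressed) {        // decompress valBuffer only on first demand
              decompressValues();
              valuesDecompressed = true;
            }
            val.readFields(valIn);
            return true;
          }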

          -

          PS:

          1. SequenceFile.Reader.next(DataOutputBuffer buffer) should be changed to
          SequenceFile.Reader.next(DataOutputBuffer keyBuffer, DataOutputBuffer valBuffer) similar to the 'raw' append api?

          2. Does it make sense to have compression configurable for both keys and values separately? i.e. let the user specify (during creation) whether he prefers to compress 'keys' or 'values' or both? Overkill for now? Maybe it makes sense once we move to custom compressors for each?

          eric baldeschwieler added a comment -

          Folks...

          I'm happy with everything on this thread now, except for the raw API as discussed. Could folks please consider my suggestion in:

          http://issues.apache.org/jira/browse/HADOOP-54#action_12422716

          I think this addresses all the concerns about sometimes-compressed data and avoids the loss of current functionality etc. I also think that it removes a very dangerous ambiguity that the current and Owen's versions of the raw API permit (what is raw? who can use it...).

          Please let me know what you think of this....

          Owen O'Malley added a comment -

          Eric, I don't see how to implement both block compression, which is a huge win, and access to a pre-decompression representation. Especially if what you want to do with the pre-decompression representation is sorting or merging. Therefore, I was (and am) proposing that the "raw" access is a little less raw and that the byte[] representation is always decompressed. Am I missing something? This is a semantic change to the "raw" SequenceFile API, but I think it is required to get block-level compression.

          On a slight tangent, I think that the SequenceFile.Reader should not decompress the entire block but just enough to get the next key/value pair.

          eric baldeschwieler added a comment -

          I completely agree that you should incrementally decompress. The right answer might be just enough for the next entry, or a small buffer; we should performance-test that.

          My point on raw is that you can return a reference tuple in an object:

          <raw bytes, is-compressed flag, compressor class> used in a reference

          Then you read the bytes, decompressed if they come from a block compressed or an uncompressed file, compressed if they come from an item compressed file.

          Then you pass this reference to the target sequence file's raw write method. The target then compresses or decompresses as needed.

          Since you package all of this up behind an API, folks will not get confused into using this essentially internal API to do the wrong thing, and it will efficiently pass item-compressed objects from one such stream to another if given the chance.

          This may be worth considering, since sorts and merges may often operate on item compressed values and this will avoid a lot of spurious decompression/compression.

          PS we probably should only bother doing this for values.
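          Purely as illustration, the reference could be a small class like the following; every name here is an assumption, not anything from the attached patches:

          // Illustrative only: a value reference that may still be in its on-disk form.
          public class ValueRef {
            private final byte[] bytes;           // raw value bytes as read from the source file
            private final boolean compressed;     // true if 'bytes' are item-compressed
            private final Class compressorClass;  // codec that produced 'bytes', if compressed

            public ValueRef(byte[] bytes, boolean compressed, Class compressorClass) {
              this.bytes = bytes;
              this.compressed = compressed;
              this.compressorClass = compressorClass;
            }

            public byte[] getBytes()          { return bytes; }
            public boolean isCompressed()     { return compressed; }
            public Class getCompressorClass() { return compressorClass; }
          }

          The target file's raw write method would then look at isCompressed() and getCompressorClass(): if the target is item-compressed with the same codec it copies the bytes through untouched, otherwise it decompresses or recompresses as needed.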

          Owen O'Malley added a comment -

          My point is that the raw bytes are useless except in their original context.

          Say my value is compressed as the byte stream: 12, 34, 56, 78
          If I'm merging 100 files, I can't write 12, 34, 56, 78 to the output file and expect it to work, because naturally the compressed bytes depend on the state of the compressor.

          So your reference tuple, would need to be:

          <raw bytes, compressor class, compressor state>

          where the compressor state is some compressor-specific data. In the case of gzip, it is the last 32k of decompressed bytes or whatever.

          And that assumes that no one ever tries to use a compression algorithm that uses partial bytes.

          It looks to me like you'd add a lot of complexity for very little gain. You'd only win if you had large compressed values that you didn't really need to look at or use for anything. (For example, if you wanted to take a table that was url -> html document and generate the number of urls in each domain.)

          eric baldeschwieler added a comment -

          The point is that you can carry object compressed data through the system compressed. Block compressed data clearly needs to be uncompressed.

          In a lot of situations this makes block compression undesirable. We don't want to lose an important current performance optimization to add block compression if we can avoid it.

          Arun C Murthy added a comment -

          Regarding 'incremental decompress' in SequenceFile.Reader:

          Maybe I'm missing something here - but isn't one decompression (of the whole block) followed by n reads of keys (or values) going to lead to the same amortized cost as m (where m < n) decompressions and n reads? In that case I don't think the complexity of managing this complicated beast (tracking how much has been decompressed, possibly having to decompress multiple times to get large values, etc.) is worth it...

          Owen O'Malley added a comment -

          Sorry for being dense. I missed the fact that you wanted to preserve key-value pair compression as an option.

          I'd propose splitting the classes like:

          SequenceFile.Writer // uncompressed
          SequenceFile.RecordCompressWriter extends Writer
          SequenceFile.BlockCompressWriter extends Writer

          They would have the current interface, with the following new functions:

          void append(byte[] key, int keyOffset, int keyLength, byte[] value, int valueOffset, int valueLength);
          boolean canAppendCompressed();
          void appendCompressed(byte[] key, int keyOffset, int keyLength,
          byte[] value, int valueOffset, int valueLength);

          when we add custom compressors, we can add the compressor to the constructors.

          The Reader should have methods like:

          boolean next(DataOutputStream key, DataOutputStream value);
          boolean canReadCompressed();
          void readCompressed(DataOutputStream key, DataOutputStream value);

          when we add custom compressors, we can add a getter for them like:
          StreamCompressor getCompressor();

          As an implementation, I'd consider having SequenceFile.Reader be a bridge to the class that is doing the reading based on the how it is compressed.

          Thoughts?

          Owen O'Malley added a comment -

          Ok, after talking it over with Eric, here is what is hopefully a last pass at this.

          All in SequenceFile:

          public static class Writer {
            ... current stuff ...

            /**
             * Append an uncompressed representation of the key and a raw representation
             * of the value as the next record.
             */
            public void appendRaw(byte[] key, int keyOffset, int keyLength, RawValue value);
          }

          public static class RecordCompressWriter extends Writer {
            ... constructor and some overriding methods ...
          }

          public static class BlockCompressWriter extends Writer {
            ... constructor and some overriding methods ...
          }

          public static class Reader {
            ... current stuff ...

            /**
             * Read the next key into the key buffer and return the value as a RawValue.
             * @param key a buffer to store the uncompressed serialized key in as a sequence of bytes
             * @returns NULL if there are no more key/value pairs in the file
             */
            public RawValue readRaw(DataOutputStream key);
          }

          public static interface RawValue {
            // writes the uncompressed bytes to the outStream
            public void writeUncompressedBytes(DataOutputStream outStream);

            // is this raw value compressed (using zip)?
            public boolean canWriteCompressed();

            // write the (zip) compressed bytes. note that it will NOT compress the bytes
            // if they are not already compressed
            // throws IllegalArgumentException if the value is not already compressed
            public void writeCompressedBytes(DataOutputStream outStream);

            // when we add custom compressors, we would add:
            public boolean canWriteCompressed(Class compressorClass);
            public void writeCompressedBytes(Class compressorClass, DataOutputStream outStream);
          }
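          To illustrate how this would be driven, here is a hedged sketch of a copy/merge loop against the proposed API; 'reader' and 'writer' stand for the Reader and Writer above, and none of this is committed code:

          DataOutputBuffer keyBuf = new DataOutputBuffer();
          while (true) {
            RawValue value = reader.readRaw(keyBuf);   // key comes back uncompressed in keyBuf
            if (value == null) {
              break;                                   // no more key/value pairs
            }
            // an item-compressed value can travel through untouched; the target writer decides
            // whether to call writeCompressedBytes or writeUncompressedBytes on it
            writer.appendRaw(keyBuf.getData(), 0, keyBuf.getLength(), value);
            keyBuf.reset();
          }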
          Doug Cutting added a comment -

          I mostly have minor naming quibbles.

          The Writer method should be named just 'next', not 'nextRaw'.

          The new Writer subclasses should not be public, but rather should be created by a factory method.

          The RawValue class might better be named 'ValueBytes', and its methods can simply be called writeCompressed(), writeUncompressed(), etc.

          Finally, a substantive remark: we should not allocate a new RawValue for each key read. So the new Reader methods should be:

          public ValueBytes createValueBytes();
          public void next(DataOutputStream key, ValueBytes value);
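          For example, a sort pass could reuse one key buffer and one ValueBytes for the whole run; this is only a sketch, and the loop bound (recordCount) is an assumption made for illustration:

          DataOutputBuffer keyBuf = new DataOutputBuffer();
          ValueBytes value = reader.createValueBytes();   // allocated once, reused for every record
          for (int i = 0; i < recordCount; i++) {         // recordCount assumed known to the caller
            keyBuf.reset();
            reader.next(keyBuf, value);                   // fills both in place, no new objects
            // ... compare/sort on keyBuf, later write the value out via ValueBytes ...
          }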

          Owen O'Malley added a comment -

          I'm not set on the name nextRaw, but I think the semantics are different enough that it deserves a different name; whatever people want, though.

          I think the 3 Writer classes should be public, because they are actually doing different things and will likely end up with different parameter lists on the constructors. I'd hate to make the factory take the superset of all parameters that any of the Writers want.

          ValueBytes is fine.

          I wasn't really intending to create a RawValue/ValueBytes for each iteration, and your interface makes that clearer, so I like that.

          Arun C Murthy added a comment -

          Addendum:

          I spoke to Owen to confirm that it makes sense to implement 'lazy decompression' of 'values' in block compressed files i.e. a series of calls to:
          SequenceFile.Reader.next(Writable key)
          will not decompress 'value' blocks until a call to either
          SequenceFile.Reader.next(Writable key, Writable val) or
          SequenceFile.Reader.getCurrentValue(Writable val)

          {explained below}

          Going along the same trajectory, it makes sense to add a 'getCurrentValue' API to the Reader, enabling the user to look at the 'key' and only then decide whether to fetch the 'value' (lazy decompression of the 'value' holds here too, with the associated performance benefit).
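          For example, a consumer that filters on keys could look roughly like this; isInteresting and process are hypothetical application callbacks, and the Reader methods are the ones proposed above:

          Text key = new Text();
          Text val = new Text();
          while (reader.next(key)) {          // cheap: no value block is decompressed here
            if (isInteresting(key)) {         // hypothetical application-side filter
              reader.getCurrentValue(val);    // value block inflated lazily, only now
              process(key, val);              // hypothetical application callback
            }
          }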

          Thoughts?

          eric baldeschwieler added a comment -

          +1

          p sutter added a comment -

          +1

          Doug Cutting added a comment -

          +1

          Owen O'Malley added a comment -

          +1

          Hairong is doing a patch today that will add next(key) and getCurrentValue(value) to Reader because she needed it for a patch she is working on. (She hasn't filed the jira yet, or I'd go ahead and link it.)

          Arun C Murthy added a comment -

          An implementation detail which I would like to bring to everyone's attention:

          With the 'raw' values now being concrete objects (implementing the 'ValueBytes' interface), the 'base' sort in SequenceFile.Sorter could potentially have to hold millions of 'rawValue' objects in memory (assuming a decent-sized SequenceFile with 'small' records).

          As it exists today, the 'sort' implementation in SequenceFile.Sorter uses 'io.sort.mb' bytes to 'segment' the input for sorting (and later merges the segments).

          Owen suggested we also use a 'no. of records' limit to augment the above, in order to prevent situations where we might have to store millions of 'ValueBytes' objects in memory... thus we have a limit on the number of records (i.e. ValueBytes objects) per segment, used along with 'io.sort.mb' to segment the input for sorting; a sketch of the combined condition is below.
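          A minimal sketch of the combined condition, assuming a hypothetical property name for the record limit ('io.sort.mb' is the existing byte limit):

          private final int memoryLimit = conf.getInt("io.sort.mb", 100) * 1024 * 1024;
          private final int recordLimit = conf.getInt("io.sort.record.limit", 1000000); // hypothetical knob

          private boolean spillNeeded(long bytesBuffered, int recordsBuffered) {
            // spill whichever limit is hit first, so at most recordLimit ValueBytes
            // objects are ever held in memory for one sort segment
            return bytesBuffered >= memoryLimit || recordsBuffered >= recordLimit;
          }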

          Thoughts?

          Arun C Murthy added a comment -

          Here's the patch for block compressed SequenceFiles.

          Main attractions:
          a) Three different writers for SequenceFiles: Writer, RecordCompressWriter, BlockCompressWriter (with a factory to create the different writers).
          b) The 'raw' apis are significantly different (as per previous discussions here)
          c) Fixed SequenceFile.Sorter to take advantage of new 'raw' apis and some minor tweak/enhancements there.

          As per Owen's suggestion, I've only uploaded the patch... once I get some feedback, I'll go ahead and change the 'status' to 'Patch Available'.

          Owen O'Malley added a comment -

          I have a couple of comments. I had more, but jira ate them.

          (It really sucks that jira throws your comment away if your login times out. sigh)

          1. The Writer class should have the unneeded compression stuff taken out.
          2. The Writer.compressed and blockCompressed fields should be taken out and replaced with methods.
          3. The sync bytes in the block compressed writer should be written before the block rather than after (except for the first block). The goal is to get them between blocks, you don't really want one at the end (or beginning) of the file.
          4. The CompressedBytes class should be private.
          5. The private Writer constructor on line 307 is not used.
          6. The static field VERSION_4 should be renamed to BLOCK_COMPRESS_VERSION and it should be marked final.
          7. I'd rename the "byte[] version" to "syncBlock" and make a new field "byte version" that will contain just the last byte, which is the file version.
          8. We really need to move to the Text class instead of UTF8. This has a couple of changes:
          A. in writeFileHeader, the "new UTF8(...).write(out);" should be "Text.writeString(out, ...);"
          B. in init, reading the class names strings is the reverse: keyClass = Text.readString(in);
          C. we have to support the UTF8 string encodings for old file versions, so you'll need to switch behavior based on the version we are reading (see the sketch after this list).
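          For 8C, something along these lines; this is a sketch under the assumption that 'version' holds the file's format byte and BLOCK_COMPRESS_VERSION is the first Text-based version, not the committed code:

          String keyClassName;
          String valClassName;
          if (version < BLOCK_COMPRESS_VERSION) {
            keyClassName = UTF8.readString(in);    // old files wrote class names with UTF8
            valClassName = UTF8.readString(in);
          } else {
            keyClassName = Text.readString(in);    // newer files write them with Text
            valClassName = Text.readString(in);
          }
          keyClass = Class.forName(keyClassName);
          valClass = Class.forName(valClassName);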

          Arun C Murthy added a comment -

          Owen, appreciate your time...

          Responses inline:

          > 1. The Writer class should have the unneeded compression stuff taken out.
          > 3. The sync bytes in the block compressed writer should be written before the block rather than after (except for the first block). The goal is to get them between blocks, you don't really want one at the end (or beginning) of the file.
          > 4. The CompressedBytes class should be private.
          > 5. The private Writer constructor on line 307 is not used.
          > 6. The static field VERSION_4 should be renamed to BLOCK_COMPRESS_VERSION and it should be marked final.
          > 8. We really need to move to the Text class instead of UTF8. This has a couple of changes:
          A. in writeFileHeader, the "new UTF8(...).write(out);" should be "Text.writeString(out, ...);"
          B. in init, reading the class names strings is the reverse: keyClass = Text.readString(in);
          C. we have to support the UTF8 string encodings for old file versions, so you'll need to switch behavior based on the version we are reading.

          All done and incorporated into latest patch.

          > 2. The Writer.compressed and blockCompressed fields should be taken out and replaced with methods.

          Not clear - let's discuss this.

          > 7. I'd rename the "byte[] version" to "syncBlock" and make a new field "byte version" that will contain just the last byte, which is the file version.

          I don't agree about renaming it to 'syncBlock' since it isn't a sync block. I don't mind doing the "byte version" field, but the advantages aren't very clear.

          _

          I've attached a new patch (SequenceFilesII.patch) which incorporates Owen's suggestions and also fixes the fallout of the latest SequenceFile in other parts of MR universe.

          Also, I've a first cut of the SequenceFile Formats' documentation up:
          http://wiki.apache.org/lucene-hadoop/SequenceFile

          Arun C Murthy added a comment -

          Oops...

          > 1. The Writer class should have the unneeded compression stuff taken out.

          I've kept this around for this version (deprecated) to ensure existing applications don't break; I plan to get rid of it one Hadoop release later.

          Arun C Murthy added a comment -

          Please find SequenceFile.final.patch which incorporates all of Owen's feedback...

          Doug Cutting added a comment -

          I am not convinced we want to encourage folks to add new SequenceFile.Writer subclasses. Thus we should change the (new) protected fields to package private. This will also simplify the javadoc. I'd also prefer that these subclasses not be public either, making everyone use the factory. This would expose far less of the implementation and further simplify the javadoc. Owen has disputed this earlier. Owen?

          I note that when you updated uses of SequenceFile.Writer you consistently used the factory in favor of explicitly constructing the subclasses. Thanks! That makes hiding the subclasses easy. We can always make them public later, if needed, but, once they're public, it is hard to remove them. If we think the factory method has too many confusing parameters, then we could use a typesafe enumeration, e.g. something like:

          writer = SequenceFile.createWriter(fs, conf, path, Compress.BLOCK);
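          For reference, one shape such a typesafe enumeration could take (a pre-1.5 sketch; the class name CompressionType and its toString/valueOf methods are assumptions here, while Doug's example above calls it Compress):

          public static class CompressionType {
            public static final CompressionType NONE   = new CompressionType("NONE");
            public static final CompressionType RECORD = new CompressionType("RECORD");
            public static final CompressionType BLOCK  = new CompressionType("BLOCK");

            private final String name;
            private CompressionType(String name) { this.name = name; }

            public String toString() { return name; }            // handy for debugging

            public static CompressionType valueOf(String name) { // 1.5-style lookup by name
              if (NONE.name.equals(name))   return NONE;
              if (RECORD.name.equals(name)) return RECORD;
              if (BLOCK.name.equals(name))  return BLOCK;
              throw new IllegalArgumentException("Unknown compression type: " + name);
            }
          }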

          Sameer Paranjpye added a comment -

          +1 on using an enumeration to represent the compression method. It's not necessary but makes code much more readable.

          Arun C Murthy added a comment -

          Doug - here's the latest patch.

          It incorporates all your comments (remove protected, type-safe enum and non-public subclasses). I agree we can make the subclasses 'public' at a later date if need be, hence I've made them package private.

          Doug Cutting added a comment -

          I think this is nearly ready.

          A minor improvement: the typesafe enumeration instances should probably have a toString() method, to facilitate debugging.

          Running the TestSequenceFile unit test caused my 515MB Ubuntu box to swap horribly and it didn't complete. I grabbed a stack trace and saw:

          [junit] at java.util.zip.Inflater.init(Native Method)
          [junit] at java.util.zip.Inflater.<init>(Inflater.java:75)
          [junit] at java.util.zip.Inflater.<init>(Inflater.java:82)
          [junit] at org.apache.hadoop.io.SequenceFile$CompressedBytes.<init>(SequenceFile.java:231)
          [junit] at org.apache.hadoop.io.SequenceFile$CompressedBytes.<init>(SequenceFile.java:227)
          [junit] at org.apache.hadoop.io.SequenceFile$Reader.createValueBytes(SequenceFile.java:1195)
          [junit] at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.run(SequenceFile.java:1459)
          [junit] at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:1413)
          [junit] at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:1386)
          [junit] at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:1406)
          [junit] at org.apache.hadoop.io.TestSequenceFile.sortTest(TestSequenceFile.java:178)

          Since sorting should not do any inflating, the Inflater should probably not be created in this case. So maybe we should lazily initialize this field?
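          A minimal sketch of that lazy initialization inside CompressedBytes, using java.util.zip.Inflater (field and accessor names are illustrative):

          private Inflater inflater;          // no longer created in the constructor

          private Inflater getInflater() {
            if (inflater == null) {
              inflater = new Inflater();      // created only on the first real decompression
            }
            return inflater;
          }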

          More generally, before we commit this we should ensure that performance is comparable to what it was before. Creating a new ValueBytes wrapper per entry processed when sorting looks expensive to me, but this may in fact be insignificant. If it is significant, then we might replace the ValueBytes API with a compressor API, where the bytes to be compressed are passed explicitly.

          Arun C Murthy added a comment -

          Doug - here's another patch incorporating the fix (lazy initialization of the inflater) plus your other suggestions.

          I've also attached results of a quick test run of SequenceFile-v3 vs. SequenceFile-v4 (2 runs of 10k and 100k records respectively). Within that framework it seems reasonable to assume that creating one ValueBytes object per record doesn't show up blatantly (of course we can iterate on this further); in any case there is a safety net for extreme corner cases (http://issues.apache.org/jira/browse/HADOOP-54#action_12425953).

          PS: The existing TestSequenceFile.java in trunk doesn't pass along the 'compress' flag to mergeTest - I ran the tests after fixing it; the patch reflects that too. Just fyi.

          Owen O'Malley added a comment -

          It would also be good to see the number of objects (and bytes) allocated during the test. Is there an easy way to get that?

          Doug Cutting added a comment -

          Benchmarks should be run w/o the -check option and with the -fast option, no?

          Arun C Murthy added a comment -

          Here's a patch with a minor tweak to the Sorter (to reuse ValueBytes objects across spills).

          I've also attached results of another set of benchmark runs (w/o -check and with -fast), this time also run against instrumented code to show the number of ValueBytes objects and the total bytes allocated for raw values (note: this doesn't imply all of them are allocated at the same time; it's just the total of bytes allocated in the CompressedBytes.reset/UncompressedBytes.reset calls over the lifetime of each test case).

          Owen O'Malley added a comment -

          A minor quibble is that when you are basically implementing an Enum, we should probably use the name "valueOf(String)" instead of "getCompressionType(String)" to be forward-compatible with the Java 1.5 signature for that functionality.

          I'd also like to see some performance numbers for straight reads and writes of the seq3 and seq4 block compressed files.

          Arun C Murthy added a comment -

          Here's the latest patch incorporating Owen's quibble about the enum; I've also added a -rwonly option to TestSequenceFile.java and attached results of read/write-only tests on seq3 and seq4.

          Owen O'Malley added a comment -

          I'm ok with the final patch, but I think that we should just go ahead and use the 1.5 enums. I made the trivial edit to do that and re-rolled the patch.

          Arun C Murthy added a comment -

          Great... thanks Owen!

          PS: We should probably send a separate mail to hadoop-dev@ announcing the patch to build.xml to spare everyone a surprise...

          Doug Cutting added a comment -

          I just committed this.

          Owen, when you converted things to use the Java 1.5 enum you also threw out most of the javadoc on that class. I restored & improved this before I committed.


            People

            • Assignee:
              Arun C Murthy
              Reporter:
              Doug Cutting
            • Votes:
              0
              Watchers:
              2
