Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: java, spec
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      We should add support for at least one compression codec to data files.

        Attachments

      1. AVRO-135.patch.txt
        34 kB
        Philip Zeyliger
      2. AVRO-135.patch.txt
        34 kB
        Philip Zeyliger
      3. AVRO-135.patch.txt
        34 kB
        Philip Zeyliger
      4. AVRO-135.patch.txt
        25 kB
        Philip Zeyliger


          Activity

          Doug Cutting added a comment -

          The first codec to implement is probably zlib, since it has built-in support in Java and Python and its license is compatible with Apache, so we can use it from C and C++ too. I propose that we name this codec "zlib".

          The second codec to implement might be LZF. Like LZO, this is much faster than zlib/gzip, without achieving quite as much compression. Unlike LZO, its license is compatible with Apache.

          http://software.schmorp.de/pkg/liblzf.html

          There's a Java port of LZF, so we would not need to use native code from Java.

          http://h2database.googlecode.com/svn/trunk/h2/src/main/org/h2/compress/CompressLZF.java

          Unfortunately I can't find a Python binding. How hard is it to call C from Python?

          Henry Robinson added a comment -

          It's pretty easy to call C from Python - there's a very full-featured C API, since the standard Python runtime is written in C (see http://docs.python.org/c-api/). The API takes care of connecting Python calls to their C equivalents. You then call into the API to unmarshal the arguments into C variables, which you can pass on to the LZF library.

          The tricky part is getting reference counting right, but if this is a thin layer between Python and a C LZF library, it probably doesn't need to worry too much about Python object lifetimes. (Concurrency is also a pain, but again probably not a big issue here.)

          Doug Cutting added a comment -

          Looks like there's also a Java version of FastLZ, and FastLZ may be added to Hadoop in HADOOP-6349.

          http://code.google.com/p/jfastlz/
          http://code.google.com/p/fastlz/

          FastLZ claims better decompression time than LZF, but the changelog for LZF also claims it's now faster than FastLZ:

          http://www.fastlz.org/lzf.htm
          http://oldhome.schmorp.de/marc/liblzf.html

          It probably doesn't matter too much which one we choose.

          Ismael Juma added a comment -

          There's also HADOOP-6389.

          Philip Zeyliger added a comment -

          With the progress on AVRO-160, I propose we consider this a blocker for the next release. AVRO-160 breaks the API already (and there's some discussion there about storing lengths and such), so it seems important to make sure that it works for the compressed use case while it's really easy to fix.

          I'm happy to spend some cycles on it as soon as AVRO-160 is committed.

          Doug Cutting added a comment -

          +1 for making this a blocker. I'm currently leaning towards fastlz. Adding the java version of that looks really easy. Subsequently we can add support for calling the C implementation from Java. Adding python will be a bit more work, since we'll need to call native code, but is probably required before we release, especially if we want to enable compression by default.

          Should we enable compression by default? I think this might make sense if compression is substantially faster than object serialization/deserialization and if all implementations of Avro support the anointed default codec.

          Philip Zeyliger added a comment -

          Let's hold off on enabling cheap compression by default until there are two language implementations? (It's going to be simpler to have the uncompressed version working in two languages than the compressed version.)

          We may want to reserve the right to change the default compression over time, which I think we'll be able to do.

          – Philip

          Doug Cutting added a comment -

          > Let's hold off on enabling cheap compression by default until there are two language implementations?

          Yes, of course. We should not make a codec the default that's not implemented in every language.

          Scott Carey added a comment -

          Let's just put deflate or gzip in here for this release. This is the least amount of work, and so long as we make it 'gzip 1' equivalent it isn't that slow. 'gzip 1' is about 4x faster compressing than the normal default of 'gzip 6', with 'gzip 1' on today's CPUs typically between 35MB/sec and 70MB/sec throughput on compression.

          I have some code I can contribute for Java that replaces GZIPOutputStream to allow control over the compression level and exposes the number of bytes written (compressed and uncompressed). This can use the Hadoop-optimized PureJavaCRC32 for best performance as well.
          At this point, I am eager to use 1.3 and will write such a codec anyway if it is not supported (not sure if gzip or deflate, but that is a minor issue on ~64k blocks).
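
          [For illustration only, not Scott's code: a minimal sketch of getting 'gzip 1'-style behaviour from java.util.zip by constructing the Deflater explicitly, which is the level control that GZIPOutputStream does not expose.]

          import java.io.ByteArrayOutputStream;
          import java.io.IOException;
          import java.util.zip.Deflater;
          import java.util.zip.DeflaterOutputStream;

          // Minimal sketch only: compress a block at level 1 (the 'gzip 1' equivalent).
          // Deflater#getBytesRead()/getBytesWritten() expose the uncompressed and
          // compressed byte counts that GZIPOutputStream hides.
          public class FastDeflateSketch {
            public static byte[] compressFast(byte[] input) throws IOException {
              Deflater deflater = new Deflater(Deflater.BEST_SPEED, true); // level 1, raw deflate
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              try (DeflaterOutputStream dos = new DeflaterOutputStream(out, deflater)) {
                dos.write(input);
              } finally {
                deflater.end(); // release native zlib resources
              }
              return out.toByteArray();
            }
          }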

          LZF/LZO/FastLZ would be nice, but that is more involved.
          I've been working on some pure Java LZF implementations as an experiment. I chose this over FastLZ because the code was a lot easier to understand and much better documented (though both are lacking). Additionally, FastLZ warns on their site that the format may change at any time, which also kept me away.
          Short story – the JIT isn't good enough in Java 6 or Java 7 to do the right low level optimizations to catch up to native code yet, but I can get compression rates about 80 to 120MB/sec and decompression between 100 and 160MB/sec with it and compression ratios just slightly worse than LZO but better than the C LZF code.

          If FastLZ's Java library is used, it could use some performance improvements.

          Doug Cutting added a comment -

          > Let's just put deflate or gzip in here for this release.

          That would be fine with me too.

          > I've been working on some pure java LZF implementations as an experiment.
          > I can get compression rates about 80 to 120MB/sec and decompression between 100 and 160MB/sec with it

          That's awesome! Good performance without native code would be really nice to have.

          Philip Zeyliger added a comment -

          Who's on first? Which is to say, Scott or Doug, if you're working on this already, awesome, and, if not, I'll take a stab.

          – Philip

          Doug Cutting added a comment -

          If you're game to work on it, please do!

          Philip Zeyliger added a comment -

          Attaching a patch that attempts gzip compression.

          The patch itself introduces an enum (CompressionCodec), and introduces switch statements in DataFileWriter and DataFileStream. Compression involves writing the original data into a buffer, then compressing that in one go into another buffer, and finally writing the length of that second buffer followed by its contents into the output stream. For decompression, there's a filter that restricts the input stream to a certain length. I've modified TestDataFile to be parameterized across all possible values of codec, and added a simpler test that I used to hammer out the early bugs. I've added a setCompressionCodec() API to DataFileWriter.
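
          [For readers following along, a minimal sketch of the write-path shape described above. This is not the attached patch; the patch uses Avro's own encoders and a varint length rather than writeInt.]

          import java.io.ByteArrayOutputStream;
          import java.io.DataOutputStream;
          import java.io.IOException;
          import java.util.zip.Deflater;
          import java.util.zip.DeflaterOutputStream;

          // Sketch: a block is serialized into a buffer elsewhere, compressed in one go,
          // then the compressed length is written, followed by the compressed bytes.
          class BlockWriteSketch {
            static void writeBlock(byte[] serializedBlock, DataOutputStream out) throws IOException {
              ByteArrayOutputStream compressed = new ByteArrayOutputStream();
              Deflater deflater = new Deflater();
              try (DeflaterOutputStream dos = new DeflaterOutputStream(compressed, deflater)) {
                dos.write(serializedBlock);
              } finally {
                deflater.end();
              }
              out.writeInt(compressed.size()); // the patch writes this length with Avro's varint encoding
              compressed.writeTo(out);         // then the contents of the second buffer
            }
          }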

          In theory, for inflate/deflate and for gzip, we don't actually need to store the length. (See http://www.gzip.org/format.txt .) It's irritating to do so with the GZIPInputStream API, but we could fix it so that the bytes that weren't used in decompression are returned to the stream. (The Inflater API has a getRemaining() method that helps do just that, and, in fact, it's used by GZIPInputStream in readTrailer().)

          Do folks have an opinion on whether we should store the length or not? If we don't store it, it's possible we could avoid a memory copy or two by writing straight into the output buffer, but it'll complicate the read path a little bit.

          Do folks have an opinion on gzip vs deflate? Gzip costs a 10-byte header, a 4-byte CRC-32, and a 4-byte uncompressed size.

          Philip Zeyliger added a comment -

          After sleeping on it, I think it's a good idea to leave the compressed length in there. If we do that, then we could implement parallel (as in one thread per block) decompression.

          Doug Cutting added a comment -

          Looks good. Some comments:

          • Should we, instead of a switch, have an abstract base class for codecs? It could have methods like:
            • String getName();
            • Decoder expand(Decoder in);
            • void compress(byte[] buffer, int start, int end, Encoder out);
              We can have a factory that creates a codec given a name, so that folks can experiment with non-standard codecs without changing Avro. (A rough sketch of this shape is below.)
          • in the spec, you use the term "varint", not used elsewhere in that document.
          • in the spec, we should be very clear on whether we're using gzip or deflate, as these are often confused. my slight preference would be for deflate, since it's more minimal, but i also realize that adding another level of CRC will give lots of folks warm fuzzy feelings.
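
          [A rough sketch of the base class shape proposed above, plus the no-op codec discussed later in the thread. Only getName/expand/compress come from Doug's comment; the class names here (Codec, NullCodec) are illustrative.]

          import java.io.IOException;
          import org.apache.avro.io.Decoder;
          import org.apache.avro.io.Encoder;

          // Illustrative sketch of the proposed abstract base class.
          abstract class Codec {
            abstract String getName();
            abstract Decoder expand(Decoder in) throws IOException;
            abstract void compress(byte[] buffer, int start, int end, Encoder out) throws IOException;
          }

          // The default (no-op) codec passes bytes and decoders straight through.
          class NullCodec extends Codec {
            String getName() { return "null"; }
            Decoder expand(Decoder in) { return in; }
            void compress(byte[] buffer, int start, int end, Encoder out) throws IOException {
              out.writeBytes(buffer, start, end - start); // written as a length-prefixed block
            }
          }
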
          Philip Zeyliger added a comment -

          I spent some time thinking about an interface for Codec and want to digest a bit longer.

          DataFileWriter gets a DatumWriter and a datum, and then it uses DatumWriter.write(D datum, Encoder) to write the value. In turn, the Encoder (a BinaryEncoder in this case) writes to an OutputStream. The current approach is to encode into a ByteBufferOutputStream, and, when that reaches a certain size, copy it into the final output. I'm trying to figure out where a Codec interface fits in here. It could:

          • Pretend to be an OutputStream, i.e., be used when constructing BinaryEncoder(), and offer an uncompressedSize() method, as well as a writeTo() method. So, a replacement for ByteBufferOutputStream.
          • Pretend to be an Encoder. The advantage here is that you could build a compression scheme that was schema-aware (e.g., semi-columnar or PAX-like), without re-parsing the data.

          I'm leaning towards the former right now. What do you mean by compress(byte[], int, int, Encoder) above?

          Doug Cutting added a comment -

          > Pretend to be an OutputStream [ ... ]

          I just looked at what your switch statements were doing in each case to devise the API. The API I provided above would make both the default and gzip codecs very small, while using an OutputStream I think would take considerably more code than is in your current patch.

          I wouldn't worry about making a codec API that's useful for other stuff. Making it an abstract class rather than an interface means we can revise it down the road without breaking folks.

          > What do you mean by compress(byte[], int, int, Encoder) above?

          That #compress() method would compress the bytes passed in and write them as a compressed block to the Encoder. The default no-op codec's impl would just write the bytes to the Encoder.

          The #expand() method above would provide a decoder for a single compression block, reading it from the provided Decoder. The default no-op codec would just return the Decoder passed in.

          Scott Carey added a comment -

          in the spec, we should be very clear on whether we're using gzip or deflate, as these are often confused. my slight preference would be for deflate, since it's more minimal, but i also realize that adding another level of CRC will give lots of folks warm fuzzy feelings.

          I lean towards deflate. Though we have to be very clear what we mean by that: there are two kinds of 'deflate' interpretations.
          The first is the raw compressed stream, which has no CRC or header (RFC 1951). In Java, this is the 'unwrapped' deflate variant. Some web browsers call this "deflate".
          The second is a deflate stream with an Adler-32 checksum, and is the format in which a *.zip file stores its individual entries (RFC 1950). This is also known as the "ZLIB" format, but some web browsers simply call it 'deflate'. It has a 6-byte overhead (2-byte header, 4-byte Adler-32 checksum).

          Lastly, there is gzip (RFC 1952), which wraps raw deflate with a header and footer that typically add about 20 bytes of overhead.

          In the past, I've leaned towards gzip because if a file is written in this format, all sorts of utilities can read it. But we are storing compressed blocks within our own file format, so there is no advantage to using gzip. Furthermore, the Java API for gzip annoyingly removes the ability to set the compression level and to find out the number of bytes output. I think that control over the compression level is highly important for users.
          The Deflater API in Java does allow control over the compression level.

          The 'ZLIB' deflate format has an Adler-32 checksum and a 2-byte header (and is standardized), so if we want a checksum we can choose that instead of gzip.
          Otherwise, the raw deflate stream, perhaps with the uncompressed size prepended, would be great.
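
          [For reference, a sketch of how the three framings map onto java.util.zip; the overhead figures in the comments are the nominal values Scott cites above, not measurements.]

          import java.io.ByteArrayOutputStream;
          import java.io.IOException;
          import java.util.zip.Deflater;
          import java.util.zip.DeflaterOutputStream;
          import java.util.zip.GZIPOutputStream;

          class DeflateFramings {
            static void demo() throws IOException {
              ByteArrayOutputStream sink = new ByteArrayOutputStream();

              // RFC 1951: raw deflate, no header or checksum ('nowrap' = true).
              new DeflaterOutputStream(sink, new Deflater(Deflater.DEFAULT_COMPRESSION, true));

              // RFC 1950 (zlib): 2-byte header plus 4-byte Adler-32 trailer ('nowrap' = false, the default).
              new DeflaterOutputStream(sink, new Deflater(Deflater.DEFAULT_COMPRESSION, false));

              // RFC 1952 (gzip): roughly 10-byte header plus 8-byte trailer (CRC-32 and size);
              // GZIPOutputStream also exposes no compression-level setter.
              new GZIPOutputStream(sink);
            }
          }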

          Philip Zeyliger added a comment -

          Attaching a patch which ditches the switch statements in exchange for "Codec" classes. I also switched to deflate (the rfc1951 variant).

          I ended up having a package-private Codec abstract class (with two extant implementations) and a public CodecOption class. The latter is necessary for users to set what codec they're using. One thing that doesn't currently work is setting compression options for append. We don't recover what compression strength you used when you re-open a file.

          I spent a good 30+ minutes scratching my head when I switched from GZIPInput/OutputStream to the Deflater/Inflater streams. Whereas the GZIP classes are named GZIPInputStream and GZIPOutputStream, the deflate classes are named DeflaterOutputStream and InflaterInputStream.

          Scott, thanks for the clear explanation of rfc1950 vs rfc1951. Do you know what Java library implements 1950? Java's docs are confusing as ever here (at least to me).

          Scott Carey added a comment -

          Do you know what Java library implements 1950? Java's docs are confusing as ever here (at least to me).

          The same Deflater class; it is a constructor flag – 'boolean nowrap'.
          http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/Deflater.html

          Specifying wrapped or not wrapped is the same process as setting the compression level – construct the Deflater and pass it in to the DeflaterOutputStream constructor.

          Or, the Deflater class can be used 'raw', without an output stream, on byte arrays. The source code for DeflaterOutputStream is instructive for this if it is necessary. Codecs could be defined to function on byte[] rather than input/output streams.
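
          [A minimal sketch of that byte-array usage, not from the patch, with 'nowrap' = true for raw RFC 1951 data; it assumes the destination buffers are large enough for the whole result.]

          import java.util.zip.DataFormatException;
          import java.util.zip.Deflater;
          import java.util.zip.Inflater;

          class RawDeflateSketch {
            // Compress src into dst; returns the number of compressed bytes written.
            static int deflate(byte[] src, byte[] dst) {
              Deflater deflater = new Deflater(Deflater.BEST_SPEED, true); // raw RFC 1951
              try {
                deflater.setInput(src);
                deflater.finish();
                return deflater.deflate(dst);
              } finally {
                deflater.end();
              }
            }

            // Decompress src into dst; returns the number of uncompressed bytes written.
            static int inflate(byte[] src, byte[] dst) throws DataFormatException {
              Inflater inflater = new Inflater(true); // must match the 'nowrap' used to compress
              try {
                inflater.setInput(src);
                return inflater.inflate(dst);
              } finally {
                inflater.end();
              }
            }
          }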

          If I recall correctly, I had to decipher manually that 'nowrap' was indeed just rfc1950 with the header/footer removed – a.k.a. rfc1951 – because the documentation doesn't say much. I was doing this for a servlet compression filter, which has to know that some browser user-agents only accept ZLIB ('nowrap = false') when they claim to support "Content-Encoding: deflate", while others mean bare deflate ('nowrap = true'). The HTTP spec doesn't specify. Avro won't make that mistake.

          Philip Zeyliger added a comment -

          Oy! So, the patch just uploaded is inconsistent: I thought I was doing 1951, whereas it's doing 1950 (nowrap is not set). As far as I can tell, python's zlib uses 1950 by default. Passing a negative argument to python's decompress magically does the right thing, but oy.

          [1]doorstop::~(29777)$python
          Python 2.5.1 (r251:54863, Feb  6 2009, 19:02:12) 
          [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
          Type "help", "copyright", "credits" or "license" for more information.
          >>> import zlib
          >>> zlib.compress("foo")
          'x\x9cK\xcb\xcf\x07\x00\x02\x82\x01E'
          >>> a = zlib.compress("foo")
          >>> b = a[2:-4]
          >>> zlib.decompress(a)
          'foo'
          >>> zlib.decompress(b, -15)
          'foo'
          >>> zlib.decompress(b)
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
          zlib.error: Error -3 while decompressing data: incorrect header check
          

          I'll upload a new patch which does set nowrap.

          Philip Zeyliger added a comment -

          RFC-1951 for reals, we hope.

          I haven't had to do anything about the following javadoc note:

          ' Note: When using the 'nowrap' option it is also necessary to provide an extra "dummy" byte as input. This is required by the ZLIB native library in order to support certain optimizations. '
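
          [If that note ever did need handling, the workaround the java.util.zip.Inflater javadoc implies would be roughly this; a sketch, not part of the patch: append one dummy byte to the input given to a 'nowrap' Inflater.]

          import java.util.Arrays;
          import java.util.zip.DataFormatException;
          import java.util.zip.Inflater;

          class NowrapDummyByteSketch {
            static byte[] inflateRaw(byte[] compressed, int uncompressedSize) throws DataFormatException {
              Inflater inflater = new Inflater(true); // 'nowrap' = raw RFC 1951
              try {
                // The extra "dummy" byte the javadoc asks for when 'nowrap' is used.
                inflater.setInput(Arrays.copyOf(compressed, compressed.length + 1));
                byte[] out = new byte[uncompressedSize];
                int n = inflater.inflate(out);
                return Arrays.copyOf(out, n);
              } finally {
                inflater.end();
              }
            }
          }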
          
          Doug Cutting added a comment -
          • can we change the terminology in the spec from "supported codecs" to "required codecs"? I think what we want to say is that using these names has defined semantics. Other strings are permitted, but we won't make any guarantees about what happens then.
          • why isn't Codec public?
          • write/uncompress are an odd pair of method names. why not compress/expand or somesuch?
          • i don't see the point of adviseBufferSize(). why not just initialize compressionBuffer to new ByteArrayOutputStream(buffer.size()) the first time. ByteArrayOutputStream will double it if it needs to, but it probably won't, since the output should be substantially smaller than the input.
          • can we rename CodecOption to Codecs or CodecFactory?
          • can we add a static method to define a new Codec?
          Philip Zeyliger added a comment -

          can we change the terminology in the spec from "supported codecs" to "required codecs"? I think what we want to say is that using these names has defined semantics. Other strings are permitted, but we won't make any guarantees about what happens then.

          Done.

          why isn't Codec public?

          I'm hesitant to make that API public. It's fine for stream-based compression, but I suspect that more awesome compression will come along that will make that API not be reasonable. I'd rather cross the API bridge when someone requests it, but I could be convinced.

          write/uncompress are an odd pair of method names. why not compress/expand or somesuch?

          compress/decompress it is.

          i don't see the point of adviseBufferSize(). why not just initialize compressionBuffer to new ByteArrayOutputStream(buffer.size()) the first time. ByteArrayOutputStream will double it if it needs to, but it probably won't, since the output should be substantially smaller than the input.

          Fair enough; done.

          can we rename CodecOption to Codecs or CodecFactory?

          CodecFactory it is.

          can we add a static method to define a new Codec?

          Done. Though because Codec isn't public, it's not of much use.
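
          [For context, a hypothetical usage sketch of the resulting API; the names used (CodecFactory.deflateCodec, DataFileWriter.setCodec) follow this discussion and later Avro releases, and may not match the patch exactly as committed.]

          import java.io.File;
          import java.io.IOException;
          import org.apache.avro.Schema;
          import org.apache.avro.file.CodecFactory;
          import org.apache.avro.file.DataFileWriter;
          import org.apache.avro.generic.GenericDatumWriter;

          // Hypothetical usage sketch; method names are assumptions, not the committed API.
          class DeflateWriterSketch {
            static void write(Schema schema, Iterable<Object> records, File file) throws IOException {
              DataFileWriter<Object> writer =
                  new DataFileWriter<Object>(new GenericDatumWriter<Object>(schema));
              writer.setCodec(CodecFactory.deflateCodec(1)); // deflate at level 1
              writer.create(schema, file);
              for (Object record : records) {
                writer.append(record);
              }
              writer.close();
            }
          }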

          Scott Carey added a comment -

          I'm hesitant to make that API public. It's fine for stream-based compression, but I suspect that more awesome compression will come along that will make that API not be reasonable. I'd rather cross the API bridge when someone requests it, but I could be convinced.

          I agree with the caution here for now. It would not surprise me if we came to prefer a different API than Streams later. Streams are convenient but constraining. I think the first step is to have a basic, common form of compression that all languages should have access to. We can worry about the API flexibility and customization points after we have more codec examples to draw from in formulating the common requirements.

          Doug Cutting added a comment -

          +1 this looks good.

          • Does LengthLimitedInputStream need to be public? If not, let's keep it package private.
          • RandomData#main() should accept a codec argument, and the generate-interop-data build target can be altered to write a file using deflate too. This could be done as a separate issue if you like.
          Philip Zeyliger added a comment -

          I made LengthLimitedInputStream package-private, filed AVRO-365 for the RandomData issue, and committed this. Thanks for the discussion and reviews, everyone!


            People

            • Assignee: Philip Zeyliger
            • Reporter: Doug Cutting
            • Votes: 0
            • Watchers: 8
