Avro
  1. Avro
  2. AVRO-386

Python implementation of compression

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.0
    • Component/s: python
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change

      Description

      AVRO-135 introduced compression. AVRO-368 changed some metadata keys. The patch to follow adds this functionality to the python implementation.

      1. AVRO-386.patch.txt
        14 kB
        Philip Zeyliger
      2. AVRO-386.patch.txt
        14 kB
        Philip Zeyliger

        Activity

        Hide
        Philip Zeyliger added a comment -

        Attaching patch. Note that this implementation is using the AVRO-380 change as well.

        I did a minor amount of manual compatibility checking like so:

        gseq 300 | sed -e 's/^/"/' -e 's/$/"/' | java -jar ../java/build/avro-tools-1.2.0-dev.jar fromjson '"string"' -  --codec "deflate" > /tmp/z
        PYTHONPATH=src:lib python src/avro/tool.py  dump - < /tmp/z
        

        Which is to say I wrote a data file with 300 string entries and made sure they came out ok the other end. I ended up adding a "tool.py" to drive this command-line utility; it's quite minimal.

        Jeff/Sharad: this was my first time with the python API; nice job... wasn't hard to figure out at all.

        Show
        Philip Zeyliger added a comment - Attaching patch. Note that this implementation is using the AVRO-380 change as well. I did a minor amount of manual compatibility checking like so: gseq 300 | sed -e 's/^/"/' -e 's/$/"/' | java -jar ../java/build/avro-tools-1.2.0-dev.jar fromjson '"string"' - --codec "deflate" > /tmp/z PYTHONPATH=src:lib python src/avro/tool.py dump - < /tmp/z Which is to say I wrote a data file with 300 string entries and made sure they came out ok the other end. I ended up adding a "tool.py" to drive this command-line utility; it's quite minimal. Jeff/Sharad: this was my first time with the python API; nice job... wasn't hard to figure out at all.
        Hide
        Philip Zeyliger added a comment -

        Jeff, when JIRA was acting up, suggested the following feedback:

        -----------

        Some comments:

        • Remove the TODO's that you implemented
        • In line 143, you set uncompressed_data = self.buffer_writer.getvalue(); use this variable in line 149 instead of calling self.buffer_writer.getvalue() again?
        • Extra newline added after self.meta = {} in DataFileWriter#init_: remove
        • Will "assert" raise a DataFileException? I haven't tried it.
        • Default "codec" to "'null'" instead of "None" in DFW#_init_?
        • What are you up to in line 213 with self.codec = self.codec?
        • Line 263 is wider than 80 characters
        • Line 281: if you're skipping a long, why not use "skip_long()"? Not a big deal.
        • Line 285: comment on what the argument of "-15" means to zlib.decompress

        Thanks for tackling this one, Philip!

        Show
        Philip Zeyliger added a comment - Jeff, when JIRA was acting up, suggested the following feedback: ----------- Some comments: Remove the TODO's that you implemented In line 143, you set uncompressed_data = self.buffer_writer.getvalue() ; use this variable in line 149 instead of calling self.buffer_writer.getvalue() again? Extra newline added after self. meta = { } in DataFileWriter# init _: remove Will "assert" raise a DataFileException? I haven't tried it. Default "codec" to "'null'" instead of "None" in DFW#_ init _? What are you up to in line 213 with self.codec = self.codec ? Line 263 is wider than 80 characters Line 281: if you're skipping a long, why not use "skip_long()"? Not a big deal. Line 285: comment on what the argument of "-15" means to zlib.decompress Thanks for tackling this one, Philip!
        Hide
        Philip Zeyliger added a comment -

        Thanks for the review. Uploading new patch. If you want to see just what's changed since the last one, http://github.com/philz/avro/commit/3832584eebe545dfe7431ae8cc7bfa85e58ec430 .

        Remove the TODO's that you implemented

        Removed one.

        In line 143, you set uncompressed_data = self.buffer_writer.getvalue(); use this variable in line 149 instead of calling self.buffer_writer.getvalue() again?

        Done.

        Extra newline added after self.meta = {} in DataFileWriter#init_: remove

        Done.

        Will "assert" raise a DataFileException? I haven't tried it.

        It would have thrown an AssertionError (or something like that). Changed to use DataFileException.

        Default "codec" to "'null'" instead of "None" in DFW#init?

        I'm using "null" on purpose; that's the name of the "null codec" in AVRO.

        What are you up to in line 213 with self.codec = self.codec?

        No idea; removed.

        Line 263 is wider than 80 characters

        Done.

        Line 281: if you're skipping a long, why not use "skip_long()"? Not a big deal.

        Java doesn't have skip_long because it's the same performance. Switching to skip; it's better.

        Show
        Philip Zeyliger added a comment - Thanks for the review. Uploading new patch. If you want to see just what's changed since the last one, http://github.com/philz/avro/commit/3832584eebe545dfe7431ae8cc7bfa85e58ec430 . Remove the TODO's that you implemented Removed one. In line 143, you set uncompressed_data = self.buffer_writer.getvalue(); use this variable in line 149 instead of calling self.buffer_writer.getvalue() again? Done. Extra newline added after self.meta = {} in DataFileWriter#init_: remove Done. Will "assert" raise a DataFileException? I haven't tried it. It would have thrown an AssertionError (or something like that). Changed to use DataFileException. Default "codec" to "'null'" instead of "None" in DFW# init ? I'm using "null" on purpose; that's the name of the "null codec" in AVRO. What are you up to in line 213 with self.codec = self.codec? No idea; removed. Line 263 is wider than 80 characters Done. Line 281: if you're skipping a long, why not use "skip_long()"? Not a big deal. Java doesn't have skip_long because it's the same performance. Switching to skip; it's better.
        Hide
        Jeff Hammerbacher added a comment -

        Looks great, thanks!

        One last comment:

        Default "codec" to "'null'" instead of "None" in DFW#init?

        What I meant here: you assign "None" to the codec parameter as its default value, but in the body of the function, the first thing you do with codec is check to see if it's "None", and if so, set it to "'null'". It's not a big deal, but I can see an argument for just using "'null'" as the default value and skipping the check to see if it's "None". You lose the ability to check if the parameter is unset, but you aren't using that ability right now. To be honest, more of a question than a recommendation.

        Otherwise, send it on in!

        Show
        Jeff Hammerbacher added a comment - Looks great, thanks! One last comment: Default "codec" to "'null'" instead of "None" in DFW#init? What I meant here: you assign "None" to the codec parameter as its default value, but in the body of the function, the first thing you do with codec is check to see if it's "None", and if so, set it to "'null'". It's not a big deal, but I can see an argument for just using "'null'" as the default value and skipping the check to see if it's "None". You lose the ability to check if the parameter is unset, but you aren't using that ability right now. To be honest, more of a question than a recommendation. Otherwise, send it on in!
        Hide
        Philip Zeyliger added a comment -

        You're totally right about the default value. I've been burned by leaving mutable things as default arguments in python (because they bind once), so I tend to use None, but strings are immutable in python, so, voila. I've fixed it.

        I neglected to answer:

        Line 285: comment on what the argument of "-15" means to zlib.decompress

        It turns zlib into "raw" mode (that's negative) and indicates the window size (that's the maximum, 15).

        (from zlib.h on my system)
        windowBits can also be -8..-15 for raw inflate. In this case, -windowBits
        determines the window size. inflate() will then process raw deflate data, not looking for a zlib or gzip header, not generating a check value, and not
        looking for any check values for comparison at the end of the stream. This
        is for use with other formats that use the deflate compressed data format such as zip. Those formats provide their own check values. If a custom
        format is developed using the raw deflate format for compressed data, it is
        recommended that a check value such as an adler32 or a crc32 be applied to the uncompressed data as is done in the zlib, gzip, and zip formats. For
        most applications, the zlib format should be used as is. Note that comments
        above on the use in deflateInit2() applies to the magnitude of windowBits.

        Show
        Philip Zeyliger added a comment - You're totally right about the default value. I've been burned by leaving mutable things as default arguments in python (because they bind once), so I tend to use None, but strings are immutable in python, so, voila. I've fixed it. I neglected to answer: Line 285: comment on what the argument of "-15" means to zlib.decompress It turns zlib into "raw" mode (that's negative) and indicates the window size (that's the maximum, 15). (from zlib.h on my system) windowBits can also be -8..-15 for raw inflate. In this case, -windowBits determines the window size. inflate() will then process raw deflate data, not looking for a zlib or gzip header, not generating a check value, and not looking for any check values for comparison at the end of the stream. This is for use with other formats that use the deflate compressed data format such as zip. Those formats provide their own check values. If a custom format is developed using the raw deflate format for compressed data, it is recommended that a check value such as an adler32 or a crc32 be applied to the uncompressed data as is done in the zlib, gzip, and zip formats. For most applications, the zlib format should be used as is. Note that comments above on the use in deflateInit2() applies to the magnitude of windowBits.

          People

          • Assignee:
            Philip Zeyliger
            Reporter:
            Philip Zeyliger
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development