Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.2
    • Component/s: c
    • Labels:

      Description

      We (Experian Hitwise) have added codec support to avro-c (as per the avro spec), with preliminary deflate/inflate support (as well as null codec support).

      This changes the way blocks are written to the file, with the block data being passed through the codec interface, before being written to file.

      This also changes the way blocks are read from a file, with the block data first being read into a buffer, before being passed through the codec interface, then a memory reader is set to the decoded data, which is read by the user calls avro_file_reader_read_value() and avro_file_reader_read()

      Please feel free to make changes, as although I did try to emulate the coding style of the rest of the avro c library, there may be things I've done "differently"

      Code is available from Github: https://github.com/hitwise/avro (branch: codec)

      1. AVRO-957.patch
        101 kB
        Michael Cooper
      2. 0001-Use-builtin-CMake-logic-to-find-zlib-and-lzma.patch
        3 kB
        Douglas Creager

        Issue Links

          Activity

          Hide
          Michael Cooper added a comment -

          I have done some further work on the codecs:

          • Added LZMA codec
          • Removed debug code that was causing compiler warnings.
          • Codecs are now not mandatory and only get compiled in if the system has the library.
          • Updated quickstop example to write two blocks to cover encoding/decoding multiple blocks.

          The work was done in the codec-lzma branch on the hitwise/avro fork. https://github.com/hitwise/avro/tree/codec-lzma

          Show
          Michael Cooper added a comment - I have done some further work on the codecs: Added LZMA codec Removed debug code that was causing compiler warnings. Codecs are now not mandatory and only get compiled in if the system has the library. Updated quickstop example to write two blocks to cover encoding/decoding multiple blocks. The work was done in the codec-lzma branch on the hitwise/avro fork. https://github.com/hitwise/avro/tree/codec-lzma
          Hide
          Michael Cooper added a comment -

          More work:

          • Decoding of blocks now has no hardcoded buffer size. It now grows the buffer until it can decode the whole block, then keeps reusing that block size.
          • Fixed errors in the cmake configuration (the cache caught me out).
          Show
          Michael Cooper added a comment - More work: Decoding of blocks now has no hardcoded buffer size. It now grows the buffer until it can decode the whole block, then keeps reusing that block size. Fixed errors in the cmake configuration (the cache caught me out).
          Hide
          Michael Cooper added a comment -

          I just added a way to set the block size (datum_buffer size). Its a new parameter on the end of the avro_file_writer_create_with_codec function.
          The default is still 16KB.

          I also fixed some error messages in quickstop to include a newline in some of the error messages.

          The branch has now been moved to a branch called "codec".
          https://github.com/hitwise/avro/tree/codec

          Show
          Michael Cooper added a comment - I just added a way to set the block size (datum_buffer size). Its a new parameter on the end of the avro_file_writer_create_with_codec function. The default is still 16KB. I also fixed some error messages in quickstop to include a newline in some of the error messages. The branch has now been moved to a branch called "codec". https://github.com/hitwise/avro/tree/codec
          Hide
          Douglas Creager added a comment -

          I've had a chance to look through this, and I think it's very good. I only have a couple of questions/comments, and then I can commit this.

          First, there are a couple of places where it looks like you're calling avro_realloc once for each block in a file. (In codec.c:encode_deflate and datafile.c:file_read_block_count.) You should first check that the buffer isn't already big enough to hold the requested space. (avro_realloc doesn't do this for you; it does a realloc each time you call it.) That will save some memory allocation overhead when reading from huge files.

          Second, in codec.c:avro_codec, can you add an avro_set_error call in the switch's default block? Something like "Unknown codec %s".

          Third, CMake actually has some built-in logic for finding zlib, and we can use pkg-config to find liblzma. I've created a patch that does this, which I'll attach after I finish typing in this comment.

          Lastly, there are some nit-picky whitespace inconsistencies; since you're using git, you can use the --color option to "git diff" or "git show" to see some of them. (It'll show trailing whitespace in super-bright red.) And there are some places where things aren't lining up when you've got multiple lines inside of some parentheses; make sure you're using real tabs, and that your tab stops are at 8 spaces, when you line things up.

          Other than that, this looks great. As we mentioned on the dev mailing list, we'll need to have an actual patch file uploaded to this ticket, marked that you grant permission to the ASF to include it in the codebase, before we can merge the changes into SVN. (We can't use the git branch or github Pull Request directly — the patches on this JIRA ticket are how ASF keeps track of the copyright assignment for legal purposes.) It looks like the best way to do this will be to change your github Pull Request to use the codec branch. Then you can use the https://github.com/apache/avro/pull/5.patch link to obtain a patch file that you can upload.

          Show
          Douglas Creager added a comment - I've had a chance to look through this, and I think it's very good. I only have a couple of questions/comments, and then I can commit this. First, there are a couple of places where it looks like you're calling avro_realloc once for each block in a file. (In codec.c:encode_deflate and datafile.c:file_read_block_count.) You should first check that the buffer isn't already big enough to hold the requested space. (avro_realloc doesn't do this for you; it does a realloc each time you call it.) That will save some memory allocation overhead when reading from huge files. Second, in codec.c:avro_codec, can you add an avro_set_error call in the switch's default block? Something like "Unknown codec %s". Third, CMake actually has some built-in logic for finding zlib, and we can use pkg-config to find liblzma. I've created a patch that does this, which I'll attach after I finish typing in this comment. Lastly, there are some nit-picky whitespace inconsistencies; since you're using git, you can use the --color option to "git diff" or "git show" to see some of them. (It'll show trailing whitespace in super-bright red.) And there are some places where things aren't lining up when you've got multiple lines inside of some parentheses; make sure you're using real tabs, and that your tab stops are at 8 spaces, when you line things up. Other than that, this looks great. As we mentioned on the dev mailing list, we'll need to have an actual patch file uploaded to this ticket, marked that you grant permission to the ASF to include it in the codebase, before we can merge the changes into SVN. (We can't use the git branch or github Pull Request directly — the patches on this JIRA ticket are how ASF keeps track of the copyright assignment for legal purposes.) It looks like the best way to do this will be to change your github Pull Request to use the codec branch. Then you can use the https://github.com/apache/avro/pull/5.patch link to obtain a patch file that you can upload.
          Hide
          Douglas Creager added a comment -

          A patch on top of the latest codec branch that uses CMake's built-in support for finding zlib, and uses pkg-config to find lzma.

          Once we get the patch file for the codec branch uploaded and committed, I'll commit this on top of it before closing the ticket.

          Show
          Douglas Creager added a comment - A patch on top of the latest codec branch that uses CMake's built-in support for finding zlib, and uses pkg-config to find lzma. Once we get the patch file for the codec branch uploaded and committed, I'll commit this on top of it before closing the ticket.
          Hide
          Michael Cooper added a comment -

          Hi Douglas,

          I have just fixed up those things.

          • It now checks if the buffer length is already big enough before using realloc.
          • Improved error messages.
            • Set the error message when deflate breaks
            • Set the error message when codec is not known
            • Return 1(error) when avro_codec_* cannot determine the codec.
            • Show the error messages in quickstop for debugging.
          • The cmake changes you sent did not work right out of the box for me, the package names were different (case sensitive?).
            • zlib -> ZLIB
            • pkgconfig -> PkgConfig
          • The CODEC_LIB variable was renamed, but it was still used elsewhere. Turns out that its not needed now that you added target_link_libraries.
          • Fixed whitespace issues

          One thing I'm not sure about is in encode_deflate, where apparently deflate() resizes the buffer?

          I have closed and re-opened the pull request with the new branch.
          I will attach the patch shortly.

          Show
          Michael Cooper added a comment - Hi Douglas, I have just fixed up those things. It now checks if the buffer length is already big enough before using realloc. Improved error messages. Set the error message when deflate breaks Set the error message when codec is not known Return 1(error) when avro_codec_* cannot determine the codec. Show the error messages in quickstop for debugging. The cmake changes you sent did not work right out of the box for me, the package names were different (case sensitive?). zlib -> ZLIB pkgconfig -> PkgConfig The CODEC_LIB variable was renamed, but it was still used elsewhere. Turns out that its not needed now that you added target_link_libraries. Fixed whitespace issues One thing I'm not sure about is in encode_deflate, where apparently deflate() resizes the buffer? I have closed and re-opened the pull request with the new branch. I will attach the patch shortly.
          Hide
          Michael Cooper added a comment -

          Attaching patch file AVRO-957.patch:

          We (Hitwise Pty Ltd) hereby assign all rights to the code contained within this patch over to the Apache Software Foundation (ASF) under the Apache Licence.

          Show
          Michael Cooper added a comment - Attaching patch file AVRO-957 .patch: We (Hitwise Pty Ltd) hereby assign all rights to the code contained within this patch over to the Apache Software Foundation (ASF) under the Apache Licence.
          Hide
          Douglas Creager added a comment -

          Committed to SVN

          Show
          Douglas Creager added a comment - Committed to SVN

            People

            • Assignee:
              Unassigned
              Reporter:
              Lucas Martin-King
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development