Avro / AVRO-684

Java tool for altering the codec of an Avro data file stream.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: java
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      An example is worth a thousand words:

      cat infile.avro | avro-tools recodec deflate - - > outfile.avro

      The above example would create a new file, "outfile.avro", with the same contents as "infile.avro". However, the codec of "outfile.avro" would be "deflate", regardless of the codec of "infile.avro".

      Proposed features:

      • The tool should preserve any metadata present in the input file.
      • Supported codecs will be "deflate" and "null".
      • Optionally add support for specifying the deflation level, perhaps with syntax as follows: "deflate:N" where N is the deflation level, e.g. "deflate:4".
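The "deflate:N" syntax proposed above could be parsed roughly as follows. This is only a sketch: `CodecSpec` is a hypothetical helper, not part of Avro, and the accepted level range (0 to 9, the bounds exposed by `java.util.zip.Deflater`) is an assumption rather than something this issue specifies.

```java
import java.util.zip.Deflater;

// Hypothetical helper (not an Avro class): parses "null", "deflate" or "deflate:N".
final class CodecSpec {
    final String name;
    final int level; // Deflater.DEFAULT_COMPRESSION when no level was given

    private CodecSpec(String name, int level) {
        this.name = name;
        this.level = level;
    }

    static CodecSpec parse(String spec) {
        String[] parts = spec.split(":", 2);
        String name = parts[0];
        if (!name.equals("deflate") && !name.equals("null")) {
            throw new IllegalArgumentException("Unsupported codec: " + name);
        }
        if (parts.length == 1) {
            return new CodecSpec(name, Deflater.DEFAULT_COMPRESSION);
        }
        if (name.equals("null")) {
            throw new IllegalArgumentException("The null codec takes no level");
        }
        int level;
        try {
            level = Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Bad deflation level: " + parts[1]);
        }
        // Assumed range: the zlib levels that java.util.zip.Deflater accepts.
        if (level < Deflater.NO_COMPRESSION || level > Deflater.BEST_COMPRESSION) {
            throw new IllegalArgumentException("Level out of range: " + level);
        }
        return new CodecSpec(name, level);
    }

    public static void main(String[] args) {
        CodecSpec spec = CodecSpec.parse(args.length > 0 ? args[0] : "deflate:4");
        System.out.println(spec.name + " level=" + spec.level);
    }
}
```

Rejecting a level on "null" is a design choice made here for illustration; a real tool might simply ignore it.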

      Does this proposal sound reasonable?

      1. AVRO-684.patch
        11 kB
        Patrick Linehan
      2. AVRO-684.patch
        6 kB
        Patrick Linehan

        Activity

        Doug Cutting added a comment -

        I just committed this. Thanks, Patrick!

        Patrick Linehan added a comment -

        Added a test and metadata preservation. Please let me know if the test is up to snuff.

        Doug Cutting added a comment -

        This looks good to me.

        A test is required before this can be committed.

        Concatenation can and probably should be done as a separate follow-on issue.

        Metadata would be nice to have from the start, but could also be done as a separate issue later.

        Patrick Linehan added a comment -

        I've finished a first draft. Still to be done:

        • Write the test.
        • Preserve file metadata.
        • Implement the concatenation described by Scott Carey.

        I'm assuming that for concatenation, the following would be considered reasonable behavior:

        • Only the metadata from the first input file is written to the output file.
        • The schema from the first input file becomes the schema of the output file. The remaining input file schemas only need to resolve with said schema, not be identical.

        Anyway, the first draft is here in case anyone gets the urge to finish it for me. Otherwise, I hope to finish it in the next few weeks.

        Scott Carey added a comment -

        Yes this would be useful.

        Most of the machinery for this is already in the DataFileWriter class. It is not exposed in a command-line tool though.

        I currently use this machinery to take a large list of small avro files and merge them into one larger avro file with a set compression type and level.

        In addition to the compression level, there is the concept of forcing a re-encode. By default, the current code will not re-encode unless required, so it won't re-encode deflate:1 to deflate:3 unless the flag to force a re-encode is passed. It will, however, decode deflate to null or encode null to deflate by default. If a block is already compatible, it just copies the raw bytes of the block, which is very fast.
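The decision described in the paragraph above can be sketched as a small function. This is an illustration only: `RecodePlan` and its names are hypothetical and do not reproduce the DataFileWriter internals. It assumes compatibility is judged by codec name alone, which matches the statement that deflate:1 is not re-encoded to deflate:3 unless forced.

```java
// Hypothetical sketch (not Avro code): decide per block whether to copy or re-encode.
final class RecodePlan {
    enum Action { COPY_RAW_BLOCK, RECODE_BLOCK }

    // inputCodec and outputCodec are codec names without a level, e.g. "deflate" or "null".
    static Action plan(String inputCodec, String outputCodec, boolean forceRecode) {
        if (forceRecode) {
            return Action.RECODE_BLOCK; // caller explicitly asked for a re-encode
        }
        // Same codec name: the block is already compatible, so its raw bytes can be copied.
        return inputCodec.equals(outputCodec) ? Action.COPY_RAW_BLOCK : Action.RECODE_BLOCK;
    }

    public static void main(String[] args) {
        System.out.println(plan("deflate", "deflate", false));
        System.out.println(plan("deflate", "null", false));
        System.out.println(plan("deflate", "deflate", true));
    }
}
```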

        This tool should also support concatenation of files and creation of one larger file from a collection of smaller ones (of the same schema) with the requested encoding. Maybe something like this:

        $ avro-tools append_to -f outfile.avro -c deflate:5 infile.avro [infile2.avro, . . .]
        

        Which would create outfile.avro with codec deflate:5 from multiple source files.

        Doug Cutting added a comment -

        The codec and deflation level might be specified with '-codec' and '-level', so that the command syntax might be:

        recodec [-codec codec] [-level level] [infile [outfile]]
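
One way the suggested syntax might be parsed is sketched below. `RecodecArgs` is a hypothetical class, not the committed tool. The defaults are assumptions: "null" for the codec, and "-" for the input and output files (standing for stdin and stdout, as in the example in the description).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for: recodec [-codec codec] [-level level] [infile [outfile]]
final class RecodecArgs {
    String codec = "null"; // assumed default codec
    Integer level = null;  // null means "no level given"
    String infile = "-";   // "-" stands for stdin, as in the issue's example
    String outfile = "-";  // "-" stands for stdout

    static RecodecArgs parse(String[] argv) {
        RecodecArgs parsed = new RecodecArgs();
        List<String> positional = new ArrayList<>();
        for (int i = 0; i < argv.length; i++) {
            String arg = argv[i];
            if (arg.equals("-codec") || arg.equals("-level")) {
                if (i + 1 >= argv.length) {
                    throw new IllegalArgumentException("Missing value for " + arg);
                }
                String value = argv[++i];
                if (arg.equals("-codec")) {
                    parsed.codec = value;
                } else {
                    parsed.level = Integer.parseInt(value);
                }
            } else {
                positional.add(arg); // includes a bare "-"
            }
        }
        if (positional.size() > 2) {
            throw new IllegalArgumentException("Expected at most infile and outfile");
        }
        if (positional.size() >= 1) {
            parsed.infile = positional.get(0);
        }
        if (positional.size() == 2) {
            parsed.outfile = positional.get(1);
        }
        return parsed;
    }

    public static void main(String[] args) {
        RecodecArgs parsed = parse(args);
        System.out.println(parsed.codec + " " + parsed.level + " " + parsed.infile + " " + parsed.outfile);
    }
}
```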


          People

          • Assignee: Patrick Linehan
          • Reporter: Patrick Linehan