  1. Avro
  2. AVRO-567

add tools for text file import and export

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: java
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      It would be good to have command-line tools to convert between newline-delimited text and Avro data files.

      1. AVRO-567.patch.txt
        10 kB
        Patrick Wendell
      2. AVRO-567.patch.txt.v2
        11 kB
        Patrick Wendell
      3. AVRO-567.patch.txt.v3
        11 kB
        Patrick Wendell
      4. AVRO-567.patch.v4.txt
        11 kB
        Patrick Wendell

          Activity

          cutting Doug Cutting added a comment -

          For this issue, I'm imagining a java tool. A similar C tool would also be tremendously useful.

          Some potential details:

          • The Avro schema used would simply be "bytes".
          • Compression would be enabled by default.
          • Input and output could be from files named on the command line or standard in and out.
          • Hadoop URIs should be accepted as input and output, so that one can use this to, e.g., pipe output to a compressed, splittable file in HDFS.
          philip Philip Zeyliger added a comment -

          Would the HDFS integration introduce a dependency on Hadoop Common's FileSystem?

          cutting Doug Cutting added a comment -

          Yes. Avro already uses the Hadoop 0.20 APIs elsewhere, which are intended to be stable for some time.

          pwendell Patrick Wendell added a comment -

          This patch provides conversion from text files to Avro data files and back. It supports HDFS, local files, and piping.

          cutting Doug Cutting added a comment -

          Looks good! A few nits:

          • I prefer naming these "totext" and "fromtext", or perhaps "tolines" and "fromlines", and classes named ToTextTool and FromTextTool. Avro's implicit.
          • System.getProperty("line.separator").getBytes() should be stored in a constant
          • inStream and outStream should be buffered for performance so that every call to read or write doesn't result in a system call. DataFileStream doesn't automatically add buffering. Hadoop streams are always buffered, and DataFileWriter adds buffering, but adding it redundantly shouldn't cause problems either.
          • compressionLevel should be optional, no? You can use withOptionalArg, e.g.
            OptionSpec<Integer> level = p.accepts("level", "compression level")
              .withOptionalArg().ofType(Integer.class);
            OptionSet opts = p.parse(...);
            if (opts.hasArgument(level))
              compressionLevel = level.value(opts);
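
The line-separator and buffering nits above can be illustrated with a minimal JDK-only sketch. It is not the code from the patch: the class name LineCopySketch and the method copyLines are hypothetical, it copies text to text rather than to an Avro data file, it splits on '\n' only, and it encodes the separator as UTF-8 rather than using the platform default charset.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the AVRO-567 patch.
class LineCopySketch {
  // Computed once and reused, instead of calling
  // System.getProperty("line.separator").getBytes() for every line.
  static final byte[] LINE_SEPARATOR =
      System.getProperty("line.separator").getBytes(StandardCharsets.UTF_8);

  // Copies '\n'-delimited lines from rawIn to rawOut and returns the line count.
  static long copyLines(InputStream rawIn, OutputStream rawOut) throws IOException {
    // Wrap both ends so single-byte reads and small writes hit an
    // in-memory buffer rather than the underlying stream each time.
    InputStream in = new BufferedInputStream(rawIn);
    OutputStream out = new BufferedOutputStream(rawOut);
    ByteArrayOutputStream line = new ByteArrayOutputStream();
    long count = 0;
    int b;
    while ((b = in.read()) != -1) {
      if (b == '\n') {
        out.write(line.toByteArray());
        out.write(LINE_SEPARATOR);
        line.reset();
        count++;
      } else {
        line.write(b);
      }
    }
    if (line.size() > 0) { // last line had no trailing newline
      out.write(line.toByteArray());
      out.write(LINE_SEPARATOR);
      count++;
    }
    out.flush(); // push buffered bytes through without closing the caller's stream
    return count;
  }
}
```

Wrapping a stream that is already buffered only adds a second in-memory copy, which matches the remark that redundant buffering shouldn't cause problems.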
          pwendell Patrick Wendell added a comment -

          This update addresses those changes... thanks!

          cutting Doug Cutting added a comment -

          I think the default compression level should be 1: fast, but compressed.

          Also, where do we document that '-' means standard in or standard out? The Util class is package-private, so that doesn't count. Perhaps we should add it to the help string?

          Other than that, +1.
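
As a rough illustration of the '-' convention discussed above, here is a JDK-only sketch. The class and method names (StdStreamSketch, openInput, openOutput) are hypothetical and this is not Avro's package-private Util class; it handles only local paths, so the Hadoop URI support mentioned earlier in the thread is out of scope.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical illustration only; not Avro's Util class.
class StdStreamSketch {
  // A file name of "-" selects standard in or standard out.
  static final String STD_MARKER = "-";

  // Opens the named local file for reading, or standard in for "-".
  static InputStream openInput(String name) throws IOException {
    if (STD_MARKER.equals(name)) {
      return new BufferedInputStream(System.in);
    }
    return new BufferedInputStream(Files.newInputStream(Paths.get(name)));
  }

  // Opens the named local file for writing, or standard out for "-".
  // Note: closing the stream returned for "-" also closes System.out.
  static OutputStream openOutput(String name) throws IOException {
    if (STD_MARKER.equals(name)) {
      return new BufferedOutputStream(System.out);
    }
    return new BufferedOutputStream(Files.newOutputStream(Paths.get(name)));
  }
}
```

Because the convention lives in a helper like this rather than in anything a user sees, stating it in each tool's help string, as suggested above, is what makes it discoverable.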

          pwendell Patrick Wendell added a comment -

          I agree, here are those changes...

          cutting Doug Cutting added a comment -

          Checkstyle fails, complaining about tabs when I run 'cd lang/java; ant clean test'.

          cutting Doug Cutting added a comment -

          I think my commit of AVRO-512 may also create conflicts with this patch, so please be sure to run 'svn up' before you re-test and re-submit. Thanks!

          pwendell Patrick Wendell added a comment -

          Okay, this plays nicely with checkstyle. I also made it compatible with your patch, I think!

          cutting Doug Cutting added a comment -

          I just committed this. Thanks, Patrick!


            People

            • Assignee: Patrick Wendell (pwendell)
            • Reporter: Doug Cutting (cutting)