AVRO-567: add tools for text file import and export

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: java
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      It would be good to have command-line tools to convert between newline-delimited text and Avro data files.

      Attachments

      1. AVRO-567.patch.txt (10 kB, Patrick Wendell)
      2. AVRO-567.patch.txt.v2 (11 kB, Patrick Wendell)
      3. AVRO-567.patch.txt.v3 (11 kB, Patrick Wendell)
      4. AVRO-567.patch.v4.txt (11 kB, Patrick Wendell)


          Activity

          Doug Cutting added a comment -

          I just committed this. Thanks, Patrick!

          Patrick Wendell added a comment -

          Okay, this plays nicely with checkstyle. I also made it consistent with your patch, I think!

          Doug Cutting added a comment -

          I think my commit of AVRO-512 may also create conflicts with this patch, so please be sure to run 'svn up' before you re-test and re-submit. Thanks!

          Doug Cutting added a comment -

          Checkstyle fails, complaining about tabs when I run 'cd lang/java; ant clean test'.

          Patrick Wendell added a comment -

          I agree, here are those changes...

          Doug Cutting added a comment -

          I think the default compression level should be 1: fast, but compressed.

          Also, where do we document that '-' means standard in or standard out? The Util class is package-private, so that doesn't count. Perhaps we should add it to the help string?

          Other than that, +1.
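As context for the "level 1: fast, but compressed" suggestion, the JDK's java.util.zip.Deflater shows that even the lowest deflate level shrinks repetitive data substantially. This is an Avro-independent sketch; the class and method names below are invented for illustration, not part of the patch:

```java
import java.util.zip.Deflater;

public class DeflateLevelDemo {
    // Compress the input at the given deflate level and return the output size.
    static int compressedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length * 2 + 64];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] input = new byte[8192]; // highly repetitive: all zeros
        System.out.println("level 1: " + compressedSize(input, 1) + " bytes");
        System.out.println("level 9: " + compressedSize(input, Deflater.BEST_COMPRESSION) + " bytes");
    }
}
```

Both levels compress the 8 kB of zeros down to a few dozen bytes; level 1 simply spends less CPU doing it, which is why it is a reasonable default.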

          Patrick Wendell added a comment -

          This update addresses those changes... thanks!

          Doug Cutting added a comment -

          Looks good! A few nits:

          • I prefer naming these "totext" and "fromtext", or perhaps "tolines" and "fromlines", with classes named ToTextTool and FromTextTool; the "Avro" is implicit.
          • System.getProperty("line.separator").getBytes() should be stored in a constant
          • inStream and outStream should be buffered for performance, so that every call to read or write doesn't result in a system call. DataFileStream doesn't automatically add buffering. Hadoop streams are always buffered, and DataFileWriter adds buffering, but adding it redundantly shouldn't cause problems either.
          • compressionLevel should be optional, no? You can use withOptionalArg, e.g.:
            OptionSpec<Integer> level = p.accepts("level", "compression level")
              .withOptionalArg().ofType(Integer.class);
            OptionSet opts = p.parse(...);
            if (opts.hasArgument(level))
              compressionLevel = level.value(opts);
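The buffering and line-separator nits above might look like the following in a JDK-only sketch. The class and method names here are hypothetical, not the patch's actual code, and the Avro/Hadoop stream wiring is omitted:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class BufferedCopyDemo {
    // Compute the separator bytes once instead of calling getBytes() per record.
    static final byte[] LINE_SEPARATOR =
        System.getProperty("line.separator").getBytes();

    // Copy input to output through buffers, so each read/write
    // hits memory rather than a system call.
    static long copy(InputStream in, OutputStream out) throws IOException {
        InputStream inStream = new BufferedInputStream(in);
        OutputStream outStream = new BufferedOutputStream(out);
        long count = 0;
        int b;
        while ((b = inStream.read()) != -1) {
            outStream.write(b);
            count++;
        }
        outStream.flush();
        return count;
    }
}
```

Even with the byte-at-a-time loop above, the buffers batch the underlying I/O into large reads and writes; without them, each single-byte read would be a separate system call.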
          Patrick Wendell added a comment -

          This patch provides conversion from text files to Avro data files and back. It supports HDFS, local files, and piping.

          Doug Cutting added a comment -

          Yes. Avro already uses the Hadoop 0.20 APIs elsewhere, which are intended to be stable for some time.

          Philip Zeyliger added a comment -

          Would the HDFS integration introduce a dependency on Hadoop Common's FileSystem?

          Doug Cutting added a comment -

          For this issue, I'm imagining a Java tool. A similar C tool would also be tremendously useful.

          Some potential details:

          • The Avro schema used would simply be "bytes".
          • Compression would be enabled by default.
          • Input and output could be from files named on the command line or standard in and out.
          • Hadoop URIs should be accepted as input and output, so that one can use this to, e.g., pipe output to a compressed, splittable file in HDFS.
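The standard-in/standard-out point in the list above is conventionally handled by treating "-" as a special file name. A hypothetical JDK-only helper is sketched below; the patch's actual Util class may differ, and Hadoop URI handling (which needs the FileSystem API) is deliberately left out:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class DashConvention {
    // Open the named file for reading, treating "-" as standard input.
    static InputStream openInput(String name) throws IOException {
        return "-".equals(name) ? System.in : new FileInputStream(name);
    }

    // Open the named file for writing, treating "-" as standard output.
    static OutputStream openOutput(String name) throws IOException {
        return "-".equals(name) ? System.out : new FileOutputStream(name);
    }
}
```

With helpers like these, the same tool can read a local file, write to HDFS (once URI support is added), or sit in the middle of a shell pipeline without any special-casing in the conversion loop.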

            People

            • Assignee: Patrick Wendell
            • Reporter: Doug Cutting
            • Votes: 0
            • Watchers: 4