AVRO-867: Allow tools to read files via hadoop FileSystem class

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.5
    • Fix Version/s: 1.7.5
    • Component/s: java
    • Labels: None
    • Release Note:
      avro-tools can now access Hadoop-supported filesystems when started via hadoop jar.

    Description

      It would be great if I could use the various tools to read/parse files that are in HDFS, S3, etc. via the FileSystem API. We could retain backwards compatibility by assuming that unqualified URLs are "file://" but allow reading of files from fully qualified URLs such as hdfs://. The required APIs are already part of the avro-tools uber jar to support the TetherTool.
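
      To make the proposal concrete, here is a minimal sketch of the intended behavior in terms of Hadoop's FileSystem API (illustrative only; the class and helper names below are hypothetical, not Avro code):

        import java.io.InputStream;
        import java.net.URI;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class OpenAnyUri {
          // A fully qualified URI (hdfs://, s3://, ...) is opened via the
          // FileSystem implementation matching its scheme; an unqualified
          // path falls back to the local filesystem, keeping today's behavior.
          static InputStream open(String arg) throws Exception {
            Configuration conf = new Configuration();
            URI uri = URI.create(arg);
            FileSystem fs = (uri.getScheme() == null)
                ? FileSystem.getLocal(conf)
                : FileSystem.get(uri, conf);
            return fs.open(new Path(arg));
          }
        }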

    Attachments

      1. addedHadoopFileSupport.diff, 18 kB, Vincenz Priesnitz
      2. AVRO-867.patch, 15 kB, Doug Cutting

    Activity

        Doug Cutting added a comment -

        I assume you're proposing to move something like Util#fileOrStdin and #fileOrStdout into another module? That sounds reasonable. These could probably go into the mapred module, since it already depends on HDFS.

        Joe Crobak added a comment -

        > I assume you're proposing to move something like Util#fileOrStdin and #fileOrStdout into another module? That sounds reasonable. These could probably go into the mapred module, since it already depends on HDFS.

        Ah, I hadn't realized that Util#fileOrStdin does exactly this. In that case, this is more about updating all the tools to use #fileOrStdin if that makes sense (e.g. DataFileReader and DataFileGetSchema don't use it).

        Doug Cutting added a comment -

        If DataFileReader were to incorporate this, then the core Avro pom might depend on Hadoop. Some have complained about this before, since Hadoop depends on Avro, creating a circular dependency. (In practice this is not an issue as long as both provide some backwards compatibility. Avro can build against an older, published version of Hadoop and vice-versa.)

        Perhaps this could be implemented using reflection, e.g., something like:

        Class.forName("org.apache.hadoop.fs.FileSystem").getMethod("open").invoke(...)

        That way it'd work if Hadoop is on the classpath, but would not require a dependency on Hadoop.

        As a middle ground, Hadoop could be required for compilation but only used at runtime when an HDFS URI is passed in.

        Alternatively, we might add a UriResolver interface and a base implementation that just works for local files. Then Avro's mapred module could add an implementation that supports HDFS too. The default factory might first look for an org.apache.avro.mapred.FileSystemResolver class and, if that doesn't exist, use the base implementation.
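
        For concreteness, a rough sketch of the reflection route (the helper below is hypothetical; names and error handling are illustrative only, not Avro's actual code):

        import java.io.FileInputStream;
        import java.io.InputStream;
        import java.net.URI;

        public class ReflectiveOpen {
          // Uses Hadoop's FileSystem via reflection when Hadoop is on the
          // classpath; otherwise falls back to plain local-file I/O. There
          // is no compile-time dependency on Hadoop.
          static InputStream open(String path) throws Exception {
            try {
              Class<?> fsClass = Class.forName("org.apache.hadoop.fs.FileSystem");
              Class<?> pathClass = Class.forName("org.apache.hadoop.fs.Path");
              Class<?> confClass = Class.forName("org.apache.hadoop.conf.Configuration");
              Object conf = confClass.getDeclaredConstructor().newInstance();
              Object p = pathClass.getConstructor(String.class).newInstance(path);
              Object fs = fsClass.getMethod("get", URI.class, confClass)
                  .invoke(null, URI.create(path), conf);
              return (InputStream) fsClass.getMethod("open", pathClass).invoke(fs, p);
            } catch (ClassNotFoundException e) {
              return new FileInputStream(path); // Hadoop absent: local files only
            }
          }
        }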

        Joe Crobak added a comment -

        > If DataFileReader were to incorporate this, then the core Avro pom might depend on Hadoop. Some have complained about this before, since Hadoop depends on Avro, creating a circular dependency. (In practice this is not an issue as long as both provide some backwards compatibility. Avro can build against an older, published version of Hadoop and vice-versa.)

        Sorry – when I mentioned DataFileReader I really meant DataFileReaderTool (same goes for DataFileGetSchemaTool). My thought was to modify DataFileReaderTool as follows...

        Rather than:

        GenericDatumReader<Object> reader = new GenericDatumReader<Object>();
        FileReader<Object> fileReader =
              DataFileReader.openReader(new File(args.get(0)), reader);
        ...
        for (Object datum : fileReader) {
          ...
        }
        

        use the DataFileStream like:

        GenericDatumReader<Object> reader = new GenericDatumReader<Object>();
        DataFileStream<Object> streamReader =
              new DataFileStream<Object>(Util.fileOrStdin(args.get(0)), reader);
        ...
        for (Object datum : streamReader) {
         ...
        }
        

        There are a few other tools that could be simplified by using fileOrStdin, too. How does this sound?

        Doug Cutting added a comment -

        That sounds great! +1

        Vincenz Priesnitz added a comment -

        Attached is a patch that changes the Util class to use the Hadoop FileSystem class. It is now possible to use any supported filesystem for input or output files in more of the tools.

        Without any configuration, the tools behave as before:

        # reads from local file system by default
        # supports relative paths
        java -jar avro-tools-1.7.5.jar tojson ~/myDir/myData.avro
        

        If invoked via hadoop jar, the tools support additional filesystems, and different filesystems can be used in a single call. Furthermore, any default filesystem specified in core-site.xml is respected.

        # combines an FTP file and a local file and writes the result file combinedData.avro directly to the default HDFS server
        hadoop jar avro-tools-1.7.5.jar concat ftp://myFtpServer/data1.avro file:///home/user/data2.avro combinedData.avro
        

        It is now possible to take a quick look at remote files, e.g.:

        hadoop jar avro-tools-1.7.5.jar getschema Data_on_hdfs.avro
        hadoop jar avro-tools-1.7.5.jar tojson ftp://server-address/Data_on_ftp.avro
        

        The following tools now use Util for accessing files: concat, fragtojson, fromjson, fromtext, getmeta, getschema, jsontofrag, recodec, tojson, totext.
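
        The heart of such a change is roughly the following pattern (a sketch of the idea as described above, not the exact committed code; the class name is illustrative):

        import java.io.BufferedInputStream;
        import java.io.IOException;
        import java.io.InputStream;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        class UtilSketch {
          // "-" still selects standard input; any other argument is resolved
          // through the FileSystem matching its scheme, so hdfs://, ftp://,
          // file:// and scheme-less (default-filesystem) paths all work.
          static InputStream fileOrStdin(String filename, InputStream stdin)
              throws IOException {
            if ("-".equals(filename)) {
              return stdin;
            }
            Path p = new Path(filename);
            FileSystem fs = p.getFileSystem(new Configuration());
            return new BufferedInputStream(fs.open(p));
          }
        }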

        Doug Cutting added a comment -

        This looks like a good contribution!

        Here's a version of the patch with the following minor modifications:

        • diff is from root, rather than lang/java/tools
        • removed some spurious whitespace changes

        I'll commit this soon unless someone objects.

        Doug Cutting added a comment -

        I committed this. Thanks, Vincenz!

        Hudson added a comment -

        Integrated in AvroJava #363 (See https://builds.apache.org/job/AvroJava/363/)
        AVRO-867. Java: Enable command-line tools to read data files from any Hadoop FileSystem implementation. Contributed by Vincenz Priesnitz. (Revision 1470691)

        Result = SUCCESS
        cutting:
        Files:

        • /avro/trunk/CHANGES.txt
        • /avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/BinaryFragmentToJsonTool.java
        • /avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/DataFileGetMetaTool.java
        • /avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/DataFileGetSchemaTool.java
        • /avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/DataFileReadTool.java
        • /avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/DataFileWriteTool.java
        • /avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/JsonToBinaryFragmentTool.java
        • /avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/RecodecTool.java
        • /avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/ToTextTool.java
        • /avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/Util.java

    People

    • Assignee: Vincenz Priesnitz
    • Reporter: Joe Crobak
    • Votes: 1
    • Watchers: 4
