Crunch / CRUNCH-677

Support passing FileSystem to File Sources and Targets


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: Core
    • Labels: None

      Description

      We'd like to pass a FileSystem instance to File Sources and Targets to fully qualify the Path.  Without the FileSystem, the Pipeline doesn't necessarily have enough information to understand the Path.  In particular, when the Path is an HA HDFS path like "hdfs://my-cluster/data", the Pipeline might not have the configuration to resolve "hdfs://my-cluster".

      In some cases it is possible to seed the Pipeline configuration with all the HDFS properties needed to communicate with every HDFS HA cluster the Pipeline might talk to, but doing so in all cases can be awkward or difficult. We have cases where we'd rather not have to know all of the clusters up front.
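To make the "seed the configuration" burden concrete, here is a minimal sketch of the client-side properties that one logical HA nameservice such as "my-cluster" typically needs before "hdfs://my-cluster" can be resolved. The property names are standard HDFS HA client settings. The helper class, the NameNode IDs (nn1, nn2) and the host names are hypothetical and chosen only for illustration. The sketch builds a plain map rather than a Hadoop Configuration so that it runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Crunch or Hadoop.
final class HaClientProperties {

    // Builds the client-side properties for ONE logical nameservice.
    // nameNodeRpcAddresses maps a NameNode ID (e.g. "nn1") to its host:port.
    static Map<String, String> forNameservice(String nameservice,
                                              Map<String, String> nameNodeRpcAddresses) {
        Map<String, String> props = new LinkedHashMap<>();
        // With several clusters, dfs.nameservices would be a comma-separated list.
        props.put("dfs.nameservices", nameservice);
        props.put("dfs.ha.namenodes." + nameservice,
                String.join(",", nameNodeRpcAddresses.keySet()));
        for (Map.Entry<String, String> e : nameNodeRpcAddresses.entrySet()) {
            props.put("dfs.namenode.rpc-address." + nameservice + "." + e.getKey(),
                    e.getValue());
        }
        props.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> nameNodes = new LinkedHashMap<>();
        nameNodes.put("nn1", "namenode1.example.com:8020"); // assumed host
        nameNodes.put("nn2", "namenode2.example.com:8020"); // assumed host
        forNameservice("my-cluster", nameNodes)
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Every additional HA cluster needs its own set of these entries in the Pipeline configuration, which is the up-front knowledge this issue aims to avoid.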

      With the proposed change, code like the following becomes possible. Here readFileSystem and writeFileSystem are external FileSystems created from Configurations that are completely separate from the one used to construct the Pipeline itself:

      Configuration emptyConfiguration = new Configuration(false);
      Pipeline pipeline = new MRPipeline(getClass(), emptyConfiguration);
      
      FileSystem readFileSystem = ...;
      PCollection<String> data = pipeline.read(From.textFile("hdfs://my-cluster-1/data", readFileSystem));
      
      FileSystem writeFileSystem = ...;
      pipeline.write(data, To.textFile("hdfs://my-cluster-2/output", writeFileSystem));
      

      Note: the hdfs://my-cluster-1 and hdfs://my-cluster-2 parts of the paths would not strictly need to be included as they would be implied by the FileSystem instances passed in the calls. As such the paths could simply be passed as "/data" and "/output" with equivalent behavior.
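The equivalence described in the note can be sketched with plain java.net.URI: a path that lacks a scheme and authority takes them from the FileSystem's URI. This is only a standard-library analogy of path qualification, not Crunch or Hadoop code. The PathQualifier class is hypothetical, and the sketch assumes absolute paths only (it does not model a working directory).

```java
import java.net.URI;

// Hypothetical stand-in for illustration; it models only the scheme/authority
// part of qualifying a path against a file system URI.
final class PathQualifier {

    // Returns the path unchanged if it already carries a scheme; otherwise
    // fills in the scheme and authority of fsUri.
    static URI qualify(URI fsUri, String path) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            return p; // already fully qualified
        }
        if (!p.getPath().startsWith("/")) {
            // Assumption of this sketch: relative paths are out of scope.
            throw new IllegalArgumentException("absolute path expected: " + path);
        }
        return URI.create(fsUri.getScheme() + "://" + fsUri.getAuthority() + p.getPath());
    }

    public static void main(String[] args) {
        URI writeFs = URI.create("hdfs://my-cluster-2");
        URI shortForm = qualify(writeFs, "/output");
        URI longForm = qualify(writeFs, "hdfs://my-cluster-2/output");
        System.out.println(shortForm);
        System.out.println(shortForm.equals(longForm));
    }
}
```

Under these assumptions, both spellings of the path name the same location once the FileSystem supplies the missing prefix.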

              People

              • Assignee: Josh Wills (jwills)
              • Reporter: Ben Roling (ben.roling)
              • Votes: 1
              • Watchers: 5

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 50m