Pig / PIG-5360

Pig sets the working directory of input file systems, which causes an exception to be thrown

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.18.0
    • Component/s: impl
    • Labels:
    • Flags: Patch

      Description

      In the getSplits() method of PigInputFormat, Pig tries to set the working directory of the input file system to jobContext.getWorkingDirectory(). That value is always the default working directory of the default file system (e.g. hdfs://host:port/user/userId in the case of HDFS) unless "mapreduce.job.working.dir" is explicitly set to a non-default value. So if the input path uses a non-default file system, the call fails, because it tries to set the working directory of the non-default file system to an HDFS path.
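The failure mode can be illustrated with a minimal sketch that uses only java.net.URI. It is a model of the mismatch, not Hadoop or Pig code: the class WrongFsSketch, its setWorkingDirectory helper, the scheme-only check, the exception message, and the example URIs (namenode:8020, s3a://bucket) are all assumptions made for illustration.

```java
import java.net.URI;

class WrongFsSketch {
    /**
     * Hypothetical stand-in for a file system accepting a new working directory.
     * It rejects a directory whose scheme differs from the file system's own scheme.
     */
    static void setWorkingDirectory(URI fsUri, URI newDir) {
        String scheme = newDir.getScheme();
        if (scheme != null && !scheme.equalsIgnoreCase(fsUri.getScheme())) {
            throw new IllegalArgumentException(
                "working directory " + newDir + " does not belong to " + fsUri);
        }
    }

    public static void main(String[] args) {
        // Default working directory on the default (HDFS) file system.
        URI jobWorkingDir = URI.create("hdfs://namenode:8020/user/alice");

        // Input on the default file system: the directory is accepted.
        setWorkingDirectory(URI.create("hdfs://namenode:8020"), jobWorkingDir);

        // Input on a non-default file system: the same directory is rejected.
        try {
            setWorkingDirectory(URI.create("s3a://bucket"), jobWorkingDir);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that the working directory handed to the input file system was derived from the default file system, so any input on a different file system receives a path it cannot own.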

      The proposed change is to completely remove the logic that sets the working directory. There are several reasons for doing so.

      Firstly, getSplits() is only supposed to return a list of input splits. It should not have side effects, especially since this one can change the output path. Having an InputFormat change the behavior of the OutputFormat does not make much sense here.

      Secondly, the working directories of the input and output file systems are handled inconsistently. If "mapreduce.job.working.dir" is set to a non-default value, it affects only the output path (and only if that path is relative), because the input path is made fully qualified before this logic runs.

      Thirdly, there is already a "CD" functionality that allows users to change the working directory. However, this logic overrides the effect of "CD" when the input and output paths both use the default file system.

      Lastly, if a user runs a sequence of jobs, changing the working directory may change the input paths of downstream jobs when those input paths are specified as relative paths.
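The effect on relative paths described in the second and last points can be sketched with plain java.net.URI resolution. This is an illustration under stated assumptions, not the Hadoop path-qualification code: the class RelativePathSketch, the qualify helper, and the example URIs are hypothetical.

```java
import java.net.URI;

class RelativePathSketch {
    /**
     * Hypothetical helper: qualify a path against a working directory.
     * A relative path is resolved under the directory; an already
     * qualified path (one that carries a scheme) is returned unchanged.
     */
    static URI qualify(URI workingDir, String path) {
        String base = workingDir.toString();
        // A trailing slash makes resolve() append to the directory
        // instead of replacing its last segment.
        URI dir = URI.create(base.endsWith("/") ? base : base + "/");
        return dir.resolve(path);
    }

    public static void main(String[] args) {
        URI defaultDir = URI.create("hdfs://namenode:8020/user/alice");
        URI changedDir = URI.create("hdfs://namenode:8020/tmp/work");

        // The same relative path lands in different places.
        System.out.println(qualify(defaultDir, "out"));
        System.out.println(qualify(changedDir, "out"));

        // A path that was qualified earlier is not affected by the change.
        System.out.println(qualify(changedDir, "hdfs://namenode:8020/data/in"));
    }
}
```

In this model, changing the working directory silently moves every relative path (an output path, or the relative input path of a downstream job), while a path that was already qualified stays where it was.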

            People

            • Assignee: Unassigned
            • Reporter: Xuzhou Yin (xuzhoyin)
            • Votes: 0
            • Watchers: 2

                Time Tracking

                • Original Estimate: 504h
                • Remaining Estimate: 504h
                • Time Spent: Not Specified