Description
In getSplits() method in PigInputFormat, Pig is trying to set the working directory of input File System to jobContext.getWorkingDirectory(), which is always the default working directory of default file system (eg. hdfs://host:port/user/userId in case of HDFS) unless “mapreduce.job.working.dir” is explicitly set to non-default value. So if the input path uses non-default file system, then it will fail since it is trying to set the working directory of non-default file system to a HDFS path.
The proposed change is to completely remove this logic of setting working directory. There are several reasons for doing so.
Firstly, getSplits() is only supposed to return a list of input splits. It should not have side effects (especially doing so can potentially change the output path). Having InputFormat changes OutputFormat does not make much sense here.
Secondly, there is inconsistency between the working directories of input and output file systems. if "mapreduce.job.working.dir" is set to non-default value, it will affect the output path only (if it is a relative path) because input path will be made qualified even before this logic.
Thirdly, there is already a "CD" functionality that allows customers to change the working directory. However, this logic will overwrite the "CD" functionality if input and output paths both use default file system.
Lastly, if customer has a sequence of jobs, changing the working directory may change the input paths of downstream jobs if the input paths are specified as relative