Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
Many pig scripts are executed as a series of MR jobs. When submitting each MR job to the cluster, pig makes a single fat job.jar that contains pig itself, along with all registered jars (UDFs and their dependencies).
This becomes problematic when pig is upgraded during a long-running job. For example, going from pig-0.9.2+1.jar to pig-0.9.2+2.jar. When creating the next job.jar pig will fail because the expected pig jar is no longer available.
A common case where this happens is deploying a new pig RPM.
Pig should handle the case where its jar is removed while executing a script.
DISCUSSED OPTIONS THAT SEEM PROBLEMATIC:
- Creating a single pig.jar symlink that points at the installed pig version could cause MR jobs to use different pig versions during the same script. This could lead to very difficult to debug issues, and potential correctness issues.
- Extracting pig.jar once for the whole job could be problematic if /tmp is used and something like tmpwatch runs.
POSSIBLE SOLUTION:
Pig could put pig.jar in the distributed cache once at reuse that jar on HDFS for all launched jobs.
WHY ARE YOU DELETING PIG.JAR DURING THE JOB!?!?
Allowing RPM upgrades mid-pig-job means the machine does not need to be drained for maintenance, reducing the impact of upgrades. Having just one pig version installed simplifies packaging and for users to choose the right version. Overall it just keeps things simple, which is a feature itself.