PIG-2239

Pig should use "bin/hadoop jar pig-withouthadoop.jar" in bin/pig instead of forming java command itself

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.1, 0.10.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      With this change it is now possible to run Pig with different versions of Hadoop just by setting HADOOP_HOME to point to the directory where you have installed Hadoop. By default (if you do not set HADOOP_HOME) Pig still runs with the embedded version (0.20.2 currently).

      Description

      This will eliminate many of the classpath, Hadoop version, and path problems that have plagued bin/pig and Pig in general.

      1. PIG-2239-2.patch
        11 kB
        Daniel Dai
      2. PIG-2239-1-0.9.patch
        10 kB
        Daniel Dai
      3. PIG-2239-1.patch
        10 kB
        Daniel Dai
      4. PIG-2239-0.patch
        5 kB
        Daniel Dai

          Activity

          Daniel Dai added a comment -

          Letting hadoop figure out its own dependencies should be the right way. Currently Pig's way of finding the hadoop jar and its dependencies is very cumbersome. This issue will become more apparent once Pig supports both hadoop 20 and 23. The only downside of this approach is that we will rely on the hadoop command line processor to process the Pig command line, which could create backward compatibility issues.

          Daniel Dai added a comment -

          The other backward-incompatible change is that Pig will not be runnable without a local hadoop installation.

          Milind Bhandarkar added a comment -

          +1 !!!

          (I remember an email thread opposing this a couple of years ago. I hope the issues in that thread are resolved. I think the prominent issue was that bundling and shipping the whole pig jar as job.jar for every job will be a problem.)

          Dmitriy V. Ryaboy added a comment -

          We can check if the hadoop command is available, and fall back to the old method with a warning if it is not. That will let us work in places that have the hadoop jar but not the hadoop scripts, etc.
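
A minimal sketch of the availability check described above, not the code that was committed. The function name `choose_launch_mode` and the warning text are hypothetical; the only real mechanism used is the POSIX `command -v` builtin, which reports whether an executable can be found on PATH.

```shell
#!/bin/sh
# Hypothetical helper: decide which launch mode to use.
# Prints "hadoop" if a hadoop executable is found on PATH; otherwise
# writes a warning to stderr and prints "bundled".
choose_launch_mode() {
    if command -v hadoop >/dev/null 2>&1; then
        printf '%s\n' "hadoop"
    else
        echo "WARN: hadoop command not found, falling back to bundled hadoop jars" >&2
        printf '%s\n' "bundled"
    fi
}
```

A launcher script could branch on the printed value, keeping the old java-based invocation for the "bundled" case.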

          For reducing size of the job jar in other similar cases, I've actually taken to keeping a tiny jar around, and invoking hadoop jar on that, while ensuring that my real jar is on the classpath and in libjars. Otherwise all the unnecessary unjarring Hadoop does gets really annoying.
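
The "tiny jar" pattern from this comment might look roughly like the sketch below. It only prints the command rather than running it. The function name and the jar and class names passed to it are placeholders. `HADOOP_CLASSPATH` and `-libjars` are existing Hadoop mechanisms, but `-libjars` is only honored when the main class parses Hadoop generic options (for example through ToolRunner), which is assumed here.

```shell
#!/bin/sh
# Hypothetical helper: print the command for the "tiny launcher jar" pattern.
#   $1 = small launcher jar handed to "hadoop jar"
#   $2 = real (large) jar, put on the client classpath and shipped via -libjars
#   $3 = main class (assumed to parse Hadoop generic options)
#   remaining arguments are passed through to the main class
tiny_jar_command() {
    launcher_jar="$1"
    real_jar="$2"
    main_class="$3"
    shift 3
    printf '%s\n' "HADOOP_CLASSPATH=$real_jar hadoop jar $launcher_jar $main_class -libjars $real_jar $*"
}
```

Because only the small jar is handed to `hadoop jar`, Hadoop unpacks very little on each launch, while the real jar still reaches both the client and the tasks.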

          Daniel Dai added a comment -

          Attached initial patch PIG-2239-0.patch. Some notes on the patch:
          1. There is no change in the Java code; we still produce pig.jar and pig-withouthadoop.jar. However, pig-withouthadoop.jar is now more lightweight (no hadoop dependencies, only Pig dependencies).

          2. If HADOOP_HOME is defined, bin/pig will invoke hadoop runJar to run pig-withouthadoop.jar with org.apache.pig.Main.

          3. If HADOOP_HOME is not defined, fall back to the old way: link to the bundled hadoop 20.2 libraries.

          4. I didn't see any conflicting options between the Pig and Hadoop command lines. The only change is that the pig command line now supports hadoop generic options, which were not supported before.
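
The dispatch described in notes 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the contents of the patch: the function name `pig_launch_command`, the use of a `PIG_HOME` variable, and the jar locations under it are hypothetical, and the function prints the command instead of executing it.

```shell
#!/bin/sh
# Hypothetical helper: print the command a launcher like bin/pig could run.
# PIG_HOME and the jar paths below are assumptions for illustration only.
pig_launch_command() {
    pig_home="${PIG_HOME:-.}"
    if [ -n "${HADOOP_HOME:-}" ]; then
        # HADOOP_HOME is defined: let the hadoop script assemble its own
        # classpath and run the thin jar with Pig's main class.
        printf '%s\n' "$HADOOP_HOME/bin/hadoop jar $pig_home/pig-withouthadoop.jar org.apache.pig.Main $*"
    else
        # HADOOP_HOME is not defined: fall back to a jar that bundles hadoop.
        printf '%s\n' "java -cp $pig_home/pig.jar org.apache.pig.Main $*"
    fi
}
```

With a layout like this, switching Hadoop versions only requires pointing HADOOP_HOME at a different installation before invoking the launcher.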

          Daniel Dai added a comment -

          Actually 4 is not true. We do not go through GenericOptionsParser and hadoop will not take any command line options. So the command line parsing will be totally up to Pig as before.

          Ashutosh Chauhan added a comment -

          @Daniel,

          Is this targeted for 0.9 branch too? I tried to apply the patch, but it failed.

          Daniel Dai added a comment -

          Yes, it is targeted for the 0.9 branch as well, though I have not tried it yet; a different patch may be needed.

          Daniel Dai added a comment -

          PIG-2239-1.patch also changes the release process. When we release, we will include both the fat pig.jar and the thin pig-withouthadoop.jar. If HADOOP_HOME exists, we use the hadoop binary to run pig-withouthadoop.jar. Otherwise, we fall back to pig.jar, which includes hadoop 20.2.

          Daniel Dai added a comment -

          A bug fix to make PIG_CONF_DIR work. Also modified the comments to match the current patch.

          Alan Gates added a comment -

          The debug statements:

          echo "Find hadoop at $HADOOP_BIN"

          and

          echo "Cannot find local hadoop installation, using bundled hadoop 20.2"

          should be changed to print only if $debug is true.

          Other than that, +1
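
One way to apply this suggestion is a small wrapper around echo. The sketch below assumes the script keeps its debug flag in a shell variable named `debug` holding the string "true" or "false", as the comment implies; the function name `debug_echo` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a message only when $debug is "true".
# An unset $debug is treated as "false".
debug_echo() {
    if [ "${debug:-false}" = "true" ]; then
        echo "$@"
    fi
}

# Example calls, mirroring the two messages quoted above:
#   debug_echo "Find hadoop at $HADOOP_BIN"
#   debug_echo "Cannot find local hadoop installation, using bundled hadoop 20.2"
```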

          Daniel Dai added a comment -

          Patch committed to both trunk and 0.9 branch.


            People

            • Assignee:
              Daniel Dai
              Reporter:
              Ashutosh Chauhan
            • Votes:
              0
              Watchers:
              1
