Pig / PIG-2262

AvroStorage dependencies are missing from the release tarball

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: build, piggybank
    • Labels:
      None

      Description

      This makes AvroStorage hard to use, since users have to download the dependencies manually, or build Pig themselves.

      1. PIG-2262.patch
        3 kB
        Tom White
      2. PIG-2262.patch
        2 kB
        Tom White

        Activity

        Tom White created issue -
        Tom White added a comment -

        This patch fixes the problem.

        Tom White made changes -
        Field Original Value New Value
        Attachment PIG-2262.patch [ 12492690 ]
        Tom White made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Daniel Dai added a comment -

        There are a couple of issues with this approach. Most of them are not specific to AvroStorage; they come from how we deal with the jars that UDFs depend on:

        1. Pig doesn't automatically ship all classes in pig-withouthadoop.jar
        We also need to make a code change in JarManager to denote the packages to ship. Putting a jar into pig-withouthadoop.jar alone is equivalent to putting that jar on the classpath. This mechanism is confusing, and we should stop putting more jars into pig-withouthadoop.jar.

        2. Conflict with Hadoop's bundled jars
        Hadoop 20.204 bundles jackson-1.0.1, which is too old for AvroLoader. In the frontend, we can force Hadoop to take our jackson-1.7.3 by setting the flag HADOOP_USER_CLASSPATH_FIRST=true. But in the backend, Hadoop seems to always pick the bundled jackson-1.0.1, which results in a job failure.

        3. Do we need to bundle the jars that piggybank depends on?
        We don't even bundle hbase.jar, though HbaseLoader is a builtin. Further, these jars are not even in the Pig distribution. They are ivy dependencies and are only retrieved during compilation. My thinking is that we need to bundle some popular jars (hbase.jar, avro.jar, etc.) in lib so users know where to find them when needed. But we don't want to ship all those jars to the backend. Ideally Pig should be smart enough to ship jars when needed (as we do for jython.jar).
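
The version clash described in point 2 can be spotted before a job is submitted by comparing jar file names. The following is only a minimal sketch, not part of Pig or Hadoop: the helper names `parse_jar` and `find_conflicts` are hypothetical, the jar file names in the demo are illustrative, and it assumes jars follow the usual `artifact-version.jar` naming.

```python
import re

# Assumption: jars are named "<artifact>-<version>.jar",
# e.g. "jackson-core-asl-1.7.3.jar" -> ("jackson-core-asl", "1.7.3").
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d+(?:\.\d+)*)\.jar$")


def parse_jar(filename):
    """Split a jar file name into (artifact, version), or return None."""
    match = JAR_NAME.match(filename)
    if match is None:
        return None
    return match.group("artifact"), match.group("version")


def find_conflicts(hadoop_jars, pig_jars):
    """Return {artifact: (hadoop_version, pig_version)} where versions differ."""
    hadoop_versions = dict(filter(None, map(parse_jar, hadoop_jars)))
    conflicts = {}
    for name in pig_jars:
        parsed = parse_jar(name)
        if parsed is None:
            continue  # Unversioned jars such as pig-withouthadoop.jar are skipped.
        artifact, version = parsed
        bundled = hadoop_versions.get(artifact)
        if bundled is not None and bundled != version:
            conflicts[artifact] = (bundled, version)
    return conflicts


if __name__ == "__main__":
    # Illustrative file names only; real distributions may name jars differently.
    hadoop = ["jackson-core-asl-1.0.1.jar", "commons-logging-1.1.1.jar"]
    pig = ["jackson-core-asl-1.7.3.jar", "avro-1.5.3.jar"]
    print(find_conflicts(hadoop, pig))
```

A check like this only reports the mismatch; it does not change which jar Hadoop picks in the backend.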

        Tom White added a comment -

        Thanks for the review, Daniel.

        > 1. Pig doesn't automatically ship all classes in pig-withouthadoop.jar

        Ah, I didn't realize this. So the original patch is not the correct fix.

        > Further, these jars are not even in the Pig distribution. They are ivy dependencies and are only retrieved during compilation. My thinking is that we need to bundle some popular jars (hbase.jar, avro.jar, etc.) in lib so users know where to find them when needed.

        I've attached a new patch to do this for AvroStorage, so users don't need to find the JARs themselves (this was the problem I was trying to solve).

        > Ideally Pig should be smart enough to ship jars when needed (as we do for jython.jar)

        This would be a nice extension.
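
Once the jars sit in a known lib directory, a script still has to register them explicitly, since Pig does not ship them automatically. As a rough sketch of that step, the snippet below builds Pig `REGISTER` statements for the jars found in a directory. The function name `register_statements`, the directory, and the name prefixes are all hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def register_statements(lib_dir, prefixes):
    """Build one Pig REGISTER statement per jar in lib_dir matching a name prefix."""
    jars = sorted(
        path for path in Path(lib_dir).glob("*.jar")
        if path.name.startswith(tuple(prefixes))
    )
    return [f"REGISTER {jar.as_posix()};" for jar in jars]


if __name__ == "__main__":
    # Demo against a throwaway directory; the jar names are placeholders.
    with tempfile.TemporaryDirectory() as lib:
        for name in ["avro-1.5.3.jar", "jackson-core-asl-1.7.3.jar", "hbase-0.90.0.jar"]:
            (Path(lib) / name).touch()
        for line in register_statements(lib, ["avro", "jackson"]):
            print(line)
```

The generated lines would be pasted at the top of a Pig script that uses AvroStorage; which jars are actually required depends on the piggybank build.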

        Tom White made changes -
        Attachment PIG-2262.patch [ 12496310 ]
        Dmitriy V. Ryaboy added a comment -

        AvroStorage is currently in piggybank, so one would think that bundling piggybank dependencies should happen in piggybank.

        I don't really want to push a bunch more unnecessary jars into the main jar when they aren't even required by anything in Pig proper.

        I know, I know, HBaseStorage. That was a mistake.

        Dmitriy V. Ryaboy added a comment -

        Canceling patch to clear the review queue; let's solve this at the piggybank level.

        Dmitriy V. Ryaboy made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Tom White added a comment -

        Thanks Dmitriy. Fixing in Piggybank sounds like the right thing to do. I'm unassigning myself since I'm not working on this at the moment.

        Tom White made changes -
        Assignee Tom White [ tomwhite ]
        Transition                 Time In Source Status   Execution Times   Last Executer        Last Execution Date
        Open -> Patch Available    12d 19h 39m             1                 Tom White            14/Sep/11 18:35
        Patch Available -> Open    380d 1h 22m             1                 Dmitriy V. Ryaboy    28/Sep/12 19:58

          People

          • Assignee: Unassigned
          • Reporter: Tom White
          • Votes: 0
          • Watchers: 2
