Pig / PIG-924

Make Pig work with multiple versions of Hadoop

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels:
      None

      Description

      The current Pig build scripts package hadoop and other dependencies into the pig.jar file.
      This means that if users upgrade Hadoop, they also need to upgrade Pig.

      Pig has relatively few dependencies on Hadoop interfaces that changed between 18, 19, and 20. It is possible to write a dynamic shim that allows Pig to use the correct calls for any of the above versions of Hadoop. Unfortunately, the build process prevents us from doing this at runtime, and forces an unnecessary Pig rebuild even if dynamic shims are created.
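
      For illustration only, a minimal sketch of the shim idea (the class names, the package, and the example method below are hypothetical, not taken from the attached patches): a small interface hides the calls that differ between Hadoop 18, 19, and 20, and a factory picks the matching implementation at runtime from the string returned by VersionInfo.getVersion().

          import org.apache.hadoop.util.VersionInfo;

          // Hypothetical shim interface: one implementation per supported Hadoop version,
          // each compiled against the matching hadoop jar.
          public interface HadoopShim {
              // Example of a call that differs across 18/19/20 (fileSystemDeleteOnExit is one
              // of the shim methods named later in this discussion).
              void fileSystemDeleteOnExit(org.apache.hadoop.fs.FileSystem fs,
                                          org.apache.hadoop.fs.Path path) throws java.io.IOException;
          }

          // Hypothetical factory that picks the shim implementation at runtime.
          class HadoopShimFactory {
              static HadoopShim create() throws Exception {
                  String v = VersionInfo.getVersion();        // e.g. "0.18.3" or "0.20.0"
                  String major;
                  if (v.startsWith("0.18"))      major = "18";
                  else if (v.startsWith("0.19")) major = "19";
                  else                           major = "20"; // last-resort default
                  Class<?> cls = Class.forName("org.example.pig.shims.HadoopShim" + major);
                  return (HadoopShim) cls.newInstance();
              }
          }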

      1. pig_924.patch
        43 kB
        Dmitriy V. Ryaboy
      2. pig_924.2.patch
        42 kB
        Dmitriy V. Ryaboy
      3. pig_924.3.patch
        39 kB
        Dmitriy V. Ryaboy

          Activity

          Olga Natkovich added a comment -

          Closing since we have a plan of action. I just uploaded the needed code to PIG-660 to work with Hadoop 20 till we have an official release.

          Santhosh Srinivasan added a comment -

          Hadoop has promised "APIs in stone" forever and has not delivered on that promise yet. Higher layers in the stack have to learn how to cope with an ever-changing lower layer. How this change is managed is a matter of convenience to the owners of the higher layer. I really like the Shims approach, which avoids the cost of branching out Pig every time we make a compatible release. The cost of creating a branch for each version of hadoop seems too high compared to the cost of the Shims approach.

          Of course, there are pros and cons to each approach. The question here is when will Hadoop set its APIs in stone and how many more releases will we have before this happens. If the answer to the question is 12 months and 2 more releases, then we should go with the Shims approach. If the answer is 3-6 months and one more release then we should stick with our current approach and pay the small penalty of patches supplied to work with the specific release of Hadoop.

          Summary: Use the shims patch if APIs are not set in stone within a quarter or two and if there is more than one release of Hadoop.

          Todd Lipcon added a comment -

          Hey guys,

          As we understood it, Pig 0.5 wasn't due for quite some time. If it's the case that 0.5 is a small release on top of 0.4 and it should be out in a few weeks, this seems a lot more reasonable.

          Most likely we'll end up applying this patch to the 0.4 release for our distribution, even if there are multiple branches made in SVN. That's fine, though - we've got a process developed for this and are happy to support users on both versions for the next several months as people transition to 0.20 and the new APIs.

          Feel free to resolve as wontfix
          -Todd

          Dmitriy V. Ryaboy added a comment -

          Arun – it wouldn't suffice for those who want to use pig-0.4 with hadoop 0.19* or 0.20*

          Pig 0.5 isn't due out for 4 to 6 months, which is behind the curve for adoption of 20. Putting in this patch will make compatibility an issue of a compile-time flag. Putting in this patch and restructuring the ant tasks somewhat will make this completely transparent.

          Waiting until 0.5 means that users wind up with instructions like this: http://behemoth.strlen.net/~alex/hadoop20-pig-howto.txt for half a year.

          Olga Natkovich added a comment -

          Todd and Dmitry, I understand your intention. I am wondering if in the current situation, the following might not be the best course of action:

          (1) Release Pig 0.4.0. I think we resolved all the blockers and can start the process
          (2) Wait till Hadoop 20.1 is released and release Pig 0.5.0.

          Owen promised that Hadoop 20.1 will go out for a vote next week. This means that Pig 0.4.0 and 0.5.0 will be just a couple of weeks apart, which should not be a big issue for users. Meanwhile they can apply PIG-660 to the code bundled with Pig 0.4.0 or the trunk. I am currently working with release engineering to get an official hadoop20.jar that Pig can be built with. I expect to have it in the next couple of days.

          The concern with applying the patch is the code complexity it introduces. Also, if there are patches that are version specific, they will not be easy to apply. Multiple branches are something we understand and know how to work with better. We also don't want to set a precedent of supporting pig releases on multiple versions of Hadoop because it is not clear that this is something we will be able to maintain going forward.

          Arun C Murthy added a comment -

          The fact is, though, that there are a significant number of people running 0.18.x who would like to use Pig 0.4.0, and supporting them out of the box seems worth it. Given that the API is still changing for 0.21, and Pig hasn't adopted the "new" MR APIs yet, it seems like it's premature to leave 18 in the cold.

          I believe the plan is for 0.4.0 to work with hadoop-0.18.* anyway... wouldn't that suffice?

          Arun C Murthy added a comment -

          I agree with Owen.

          One conceivable option is for the Pig project to maintain separate branches (per Pig release) to support the various Hadoop versions... several projects are run this way. Clearly it adds to the cost for pushing out a release for the Pig committers and it is their call.

          Dmitriy V. Ryaboy added a comment -

          Owen – I may not have made the intent clear; the idea is that when Pig is rewritten to use the future-proofed APIs, the shims will go away (presumably for 0.5). Right now, Pig is not using the new APIs; even the 20 patch posted by Olga uses the deprecated mapred calls.

          This is only to make life easier in the transitional period while Pig is using the old, mutating APIs.

          Check out the pig user list archives for motivation of why these shims are needed.

          Todd Lipcon added a comment -

          I think this is a bad idea and is totally unmaintainable. In particular, the HadoopShim interface is very specific to the changes in those particular versions. We are trying to stabilize the FileSystem and Map/Reduce interfaces to avoid these problems and that is a much better solution.

          Agreed that this is not a long term solution. Like you said, the long term solution is stabilized cross-version APIs so this isn't necessary. The fact is, though, that there are a significant number of people running 0.18.x who would like to use Pig 0.4.0, and supporting them out of the box seems worth it. This patch is pretty small and easily verifiable both by eye and by tests. Given that the API is still changing for 0.21, and Pig hasn't adopted the "new" MR APIs yet, it seems like it's premature to leave 18 in the cold.

          Do you have an objection to committing this only on the 0.4.0 branch and not planning to maintain it in trunk/0.5?

          Daniel Dai added a comment -

          Wrapping hadoop functionality adds extra maintenance cost to adopting new features of hadoop. We still need to figure out the balance point between usability and maintenance cost. I don't think this issue is a blocker for 0.4.

          Owen O'Malley added a comment -

          -1

          I think this is a bad idea and is totally unmaintainable. In particular, the HadoopShim interface is very specific to the changes in those particular versions. We are trying to stabilize the FileSystem and Map/Reduce interfaces to avoid these problems and that is a much better solution.

          Dmitriy V. Ryaboy added a comment -

          Linking this with the Pig for Hadoop 20 issue.

          Todd Lipcon added a comment -

          If existing deployments need a single pig.jar without a hadoop dependency, it might be possible to create a new target (pig-all) that would create a statically bundled jar; but I think the default behavior should be to not bundle, build all the shims, and use whatever hadoop is on the path.

          +1 for making the default to not bundle hadoop inside pig.jar, and adding another non-default target for those people who might want it.

          The current patch is written as is so that it can be applied to trunk, enabling people to compile statically, and only require a change to the ant build files to switch to a dynamic compile later on (after 0.4, probably)

          From the packager's perspective, I'd love if this change could get in for 0.4. If it doesn't, we'll end up applying the patch ourselves for packaging purposes - we need to have the hadoop dependency be on the user's installed hadoop, not on whatever happened to get bundled into pig.jar.

          Dmitriy V. Ryaboy added a comment -

          Daniel, you've hit the nail on the head.

          This patch is specifically written to enable us to compile against all the versions of hadoop, and let the user pick which one he wants at runtime (by virtue of including the right hadoop on the path – no flags needed). In fact the default ant task in the shims directory compiles all the shims at once.

          The version string hack is safe, as long as hadoop is built correctly (the zebra version is not, as it returns "Unknown", hence the last-resort hack of defaulting to 20).
          If hadoop came from its own jar I could use reflection to get the jar name, and use that as a fallback for an Unknown version – but in pig, hadoop comes from the pig.jar!

          Ideally, Pig would compile all the versions of shims into its jars, and the pig jar would not include hadoop. Then the user would include the right hadoop on the path (or bin/pig would do it for him), and everything would happen automagically.

          By bundling hadoop into the jar, however, switching hadoop versions on the fly is next to impossible (or at least I don't know how) – we have multiple jars on the classpath, and the classloader will use whatever is the latest (or is it earliest?). Finding the right resource becomes fraught with peril.

          If existing deployments need a single pig.jar without a hadoop dependency, it might be possible to create a new target (pig-all) that would create a statically bundled jar; but I think the default behavior should be to not bundle, build all the shims, and use whatever hadoop is on the path.

          The current patch is written as is so that it can be applied to trunk, enabling people to compile statically, and only require a change to the ant build files to switch to a dynamic compile later on (after 0.4, probably)
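
          To make the jar-name fallback mentioned above concrete, here is a rough sketch of what it could look like (hypothetical code, not part of the attached patch; it only helps once hadoop is un-bundled, so that a core Hadoop class resolves to a hadoop-*-core.jar rather than to pig.jar):

              import java.security.CodeSource;
              import org.apache.hadoop.fs.FileSystem;

              class HadoopVersionProbe {
                  // Fallback for when VersionInfo.getVersion() returns "Unknown": inspect the
                  // name of the jar a core Hadoop class was loaded from, e.g. hadoop-0.18.3-core.jar.
                  static String guessMajorVersionFromJar() {
                      CodeSource src = FileSystem.class.getProtectionDomain().getCodeSource();
                      if (src == null) {
                          return "20";   // same last-resort default as the version-string hack
                      }
                      String jar = src.getLocation().getPath();
                      if (jar.contains("0.18")) return "18";
                      if (jar.contains("0.19")) return "19";
                      return "20";
                  }
              }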

          Daniel Dai added a comment -

          From your latest patch, shims work this way:
          1. The version of shims Pig compiles is controlled by the "hadoop.version" property in build.xml
          2. The version of shims Pig uses is determined dynamically by hacking the string returned by VersionInfo.getVersion

          As your code comment notes, the version string hack is not safe. My thinking is that Pig should only use the bundled hadoop unless overridden:
          1. Pig compiles all versions of shims. There is no conflict between the different versions of shims, so why not compile them all? Then the user does not need to recompile the code to use a different external hadoop.
          2. Pig bundles a default hadoop, which is specified by hadoop.version in build.xml. Pig uses this version of shims by default.
          3. If the user wants to use an external hadoop, he/she needs to override the default hadoop version explicitly, e.g. "-Dhadoop_version" on the command line.

          Dmitriy V. Ryaboy added a comment -

          Regarding deprecation – I tried setting it back to off, and adding @SuppressWarnings("deprecation") to the shims for 20, but ant complained about deprecation nonetheless. Not sure what its deal is.

          Adding something like this to the main build.xml works. Does this seem like a reasonable solution?

              <!-- set deprecation off if hadoop version greater or equals 20 -->
              <target name="set_deprecation">
                <condition property="hadoop_is20">
                  <equals arg1="${hadoop.version}" arg2="20"/>
                </condition>
                <antcall target="if_hadoop_is20"/>
                <antcall target="if_hadoop_not20"/>
              </target>
              <target name="if_hadoop_is20" if="hadoop_is20">
                <property name="javac.deprecation" value="off" />
              </target>
              <target name="if_hadoop_not20" unless="hadoop_is20">
                <property name="javac.deprecation" value="on" />
              </target>
          
          
              <target name="init" depends="set_deprecation">
            [....]
          
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12416945/pig_924.3.patch
          against trunk revision 804406.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/173/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/173/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/173/console

          This message is automatically generated.

          Dmitriy V. Ryaboy added a comment -

          Removed unused methods from the shims.

          I don't know what reasoning led to turning off deprecation in 20, or to turning off the hbase test – so just following suit on this one.

          Todd Lipcon added a comment -

          Few more comments I missed on the first pass through:

          • A few of the shim methods appear unused:
            • fileSystemDeleteOnExit
            • inputFormatValidateInput
            • setTmpFiles
          • Is the inner MiniDFSCluster class used? I think this is replaced by the MiniDFSClusterShim if I understand it correctly.
          • Still seems to be some unrelated changes to build.xml - the javac.deprecation for example
          • If we are now excluding TestHBaseStorage on all platforms, we should get rid of the two lines above it that exclude it only on windows - it's redundant and confusing.

          Thanks
          -Todd

          Dmitriy V. Ryaboy added a comment -

          This patch addresses the reviewer comments.
          I put the factor of 0.9 into the 18 shim to restore old behavior (not sure what the motivation was for changing this for 20).

          I set the default hadoop version to 18, so that we can verify correctness by running automated tests.

          The existing unit tests are sufficient verification of this patch (at least as far as 18 is concerned).

          Daniel Dai added a comment -

          Hi, Dmitriy,
          Generally the patch is good. Just like Todd said, we don't want to change anything else besides the shims layer. In addition to Todd's comment, Main.java contains the change for "pig.logfile", which you address in Pig-923. Would you please clear things up and resubmit?

          Thanks

          Todd Lipcon added a comment -

          Gotcha, thanks for explaining. Aside from the nits, patch looks good to me.

          Dmitriy V. Ryaboy added a comment -

          Thanks for looking, Todd – most of those changes, like the factor of 0.9, deprecation, excluding the HBase test, etc., are consistent with the 0.20 patch posted to PIG-660.
          Moving junit.hadoop.conf is critical – there are comments about this in 660 – without it, resetting hadoop.version doesn't actually work, as some of the information from a previous build sticks around.

          I'll fix the whitespace; this wasn't a final patch, more of a proof of concept. The point is that this approach could work, but right now it can't, because Hadoop is bundled in the jar. I am looking for comments from the core developer team regarding the possibility of un-bundling.

          Todd Lipcon added a comment -

          Couple notes on the patch:

          • you've turned javac.deprecation from "on" to "off" - seems unwise. perhaps you should just do this for the one javac task where you want that behavior
          • src.shims.dir.com in the build.xml has a "REMOVE" mark on it - is this still needed? it looks like it is, but perhaps is better named .common instead of .com
          • you've moved junit.hadoop.conf into basedir instead of ${user.home} - this seems reasonable but is orthogonal to this patch. should be a separate JIRA

          • why are we now excluding HBase storage test?
          • some spurious whitespace changes (eg TypeCheckingVisitor.java)
          • in MRCompiler, a factor of 0.9 seems to have disappeared. the commented-out line should be removed
          • some tab characters seem to be introduced
          • in MiniCluster, also some commented-out code which should be cleaned up
          Daniel Dai added a comment -

          I am reviewing the patch.

          Todd Lipcon added a comment -

          Oops, apparently it is Monday and my brain is scrambled. Above should read "pretty important that a single build of Pig will work...", of course.

          Todd Lipcon added a comment -

          Hey guys,

          Any word on this? From the packaging perspective it's pretty important that a single build of Hive will work with both Hadoop 18 and Hadoop 20. Obviously packaging isn't the Yahoo team's highest priority, but I think it is very important for community adoption, etc. If we require separate builds for 18 and 20 it's one more thing that can cause confusion for new users.

          As I understand it from Dmitriy, for this to work we just need to stop packing the Hadoop JAR into the pig JAR. Instead, the wrapper script just needs to specify the hadoop JAR on the classpath. Is there some barrier to doing this that I'm unaware of?

          Dmitriy V. Ryaboy added a comment -

          The attached patch includes dynamic shims that could be used with Pig if it didn't bundle its hadoop classes.


            People

            • Assignee: Dmitriy V. Ryaboy
            • Reporter: Dmitriy V. Ryaboy
            • Votes: 0
            • Watchers: 8
