Pig
  1. Pig
  2. PIG-2578

Multiple Store-commands mess up mapred.output.dir.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.1, 0.9.2
    • Fix Version/s: 0.10.0, 0.11
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      When one runs a pig-script with multiple storers, one sees the following:
      1. When run as a script, Pig launches a single job.
      2. PigOutputCommitter::setupJob() calls the underlyingOutputCommitter::setupJob(), once for each storer. But the mapred.output.dir is the same for both calls, even though the storers write to different locations.

      This was originally seen in HCATALOG-276, when HCatalog's end-to-end tests are run against Pig.
      (https://issues.apache.org/jira/browse/HCATALOG-276)

      Sample pig-script (near identical to HCatalog's Pig_Checkin_4 test):

      a = load 'keyvals' using org.apache.hcatalog.pig.HCatLoader();
      split a into b if key<200, c if key >=200;
      store b into 'keyvals_lt200' using org.apache.hcatalog.pig.HCatStorer();
      store c into 'keyvals_ge200' using org.apache.hcatalog.pig.HCatStorer();

      I've suggested a workaround in HCat for the time being, but I think this might be something that needs fixing in Pig.

      Thanks.

        Issue Links

          Activity

          Mithun Radhakrishnan created issue -
          Mithun Radhakrishnan made changes -
          Field Original Value New Value
          Link This issue blocks HCATALOG-276 [ HCATALOG-276 ]
          Hide
          Ashutosh Chauhan added a comment -

          I am finding this a bit surprising. If this were to be true, multi-query cannot work effectively, since then both stores using FileOutputFormat will effectively write in same directory, messing up the outputs of each other. I suspect problem may exist in HCatalog. A test-case independent of HCatalog demonstrating Pig bug will be highly appreciated.

          Show
          Ashutosh Chauhan added a comment - I am finding this a bit surprising. If this were to be true, multi-query cannot work effectively, since then both stores using FileOutputFormat will effectively write in same directory, messing up the outputs of each other. I suspect problem may exist in HCatalog. A test-case independent of HCatalog demonstrating Pig bug will be highly appreciated.
          Hide
          Daniel Dai added a comment -

          There two folds of this issue:
          1. I found one bug in Pig which we didn't make a copy of hadoop configuration before invoking various StoreFunc hooks. This is more obvious under hadoop 23 for HCat when hadoop need "mapred.output.dir" in OutputCommitter to move promote output.
          2. There is also a fix on HCat side. setStoreLocation suppose to set up the right "mapred.output.dir"

          Show
          Daniel Dai added a comment - There two folds of this issue: 1. I found one bug in Pig which we didn't make a copy of hadoop configuration before invoking various StoreFunc hooks. This is more obvious under hadoop 23 for HCat when hadoop need "mapred.output.dir" in OutputCommitter to move promote output. 2. There is also a fix on HCat side. setStoreLocation suppose to set up the right "mapred.output.dir"
          Hide
          Daniel Dai added a comment -

          Attach the patch in Pig side.

          Show
          Daniel Dai added a comment - Attach the patch in Pig side.
          Daniel Dai made changes -
          Attachment PIG-2578-1.patch [ 12522606 ]
          Daniel Dai made changes -
          Assignee Daniel Dai [ daijy ]
          Fix Version/s 0.10.0 [ 12316246 ]
          Fix Version/s 0.11 [ 12318878 ]
          Hide
          Daniel Dai added a comment -

          HCat side fix is part of HCATALOG-375.

          Show
          Daniel Dai added a comment - HCat side fix is part of HCATALOG-375 .
          Hide
          Thejas M Nair added a comment -

          +1

          Show
          Thejas M Nair added a comment - +1
          Hide
          Daniel Dai added a comment -

          Patch committed to 0.10/trunk.

          Show
          Daniel Dai added a comment - Patch committed to 0.10/trunk.
          Daniel Dai made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Resolution Fixed [ 1 ]
          Daniel Dai made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hide
          Bill Graham added a comment -

          This patch has caused some issues, see PIG-2870.

          Show
          Bill Graham added a comment - This patch has caused some issues, see PIG-2870 .
          Rohini Palaniswamy made changes -
          Link This issue relates to PIG-2821 [ PIG-2821 ]
          Rohini Palaniswamy made changes -
          Link This issue is related to PIG-2870 [ PIG-2870 ]
          Hide
          Rohini Palaniswamy added a comment -

          PIG-2821 is another issue which has problems because of this patch.

          We should revert this patch.

          With PIG-2578, setting config or credentials in job is not actually being passed to the actual job launched. The testcase in MultiQueryLocal is not valid as the condition (job.getConfiguration().get(key)==null) is always true. It was always using the suffix initialized in the constructor and never from the job. If a StoreFunc adds config to Job then it should be its responsibility to handle multiple instances of it. HCATALOG-314 handles it for hcatalog and fixes the original issue reported. We should not remove the ability for a StoreFunc to set a config on the job.

          Show
          Rohini Palaniswamy added a comment - PIG-2821 is another issue which has problems because of this patch. We should revert this patch. With PIG-2578 , setting config or credentials in job is not actually being passed to the actual job launched. The testcase in MultiQueryLocal is not valid as the condition (job.getConfiguration().get(key)==null) is always true. It was always using the suffix initialized in the constructor and never from the job. If a StoreFunc adds config to Job then it should be its responsibility to handle multiple instances of it. HCATALOG-314 handles it for hcatalog and fixes the original issue reported. We should not remove the ability for a StoreFunc to set a config on the job.
          Hide
          Rohini Palaniswamy added a comment -

          Spoke with Daniel. He said it was intentional to make the JobConf read-only so that each store does not override another. But the problem with that is it does not allow addition of Credentials and setting of JT specific config like Distributed cache configuration on the Job. We need a more cleaner solution to solve it and prevent StoreFunc implementations to not put something in JobConf that will not work correctly with multiple stores. We got rid of the multiple stores messing up problem in hcat by putting the properties in UDFContext instead of Job. But cannot expect all StoreFunc implementations to do that unless forced to which was the intention of this JIRA.

          Show
          Rohini Palaniswamy added a comment - Spoke with Daniel. He said it was intentional to make the JobConf read-only so that each store does not override another. But the problem with that is it does not allow addition of Credentials and setting of JT specific config like Distributed cache configuration on the Job. We need a more cleaner solution to solve it and prevent StoreFunc implementations to not put something in JobConf that will not work correctly with multiple stores. We got rid of the multiple stores messing up problem in hcat by putting the properties in UDFContext instead of Job. But cannot expect all StoreFunc implementations to do that unless forced to which was the intention of this JIRA.
          Hide
          Dmitriy V. Ryaboy added a comment -

          I am aware of many StoreFunc implementations that rely on being able to mess with the JobConf. This is an undocumented and backwards incompatible change.. I can see why we need it, but the proper way to do this would be to document it, provide explicit instructions on using UDFContext (and how/where/when to get it), and migrate piggybank and builtin storefuncs that rely on mutable jobconfs. Further, to make the contract clear, rather than passing in a dummy new jobConf that gets GC'd immediately, we should pass in a wrapper job conf which throws exceptions on any set() call, to prevent surprises.

          Show
          Dmitriy V. Ryaboy added a comment - I am aware of many StoreFunc implementations that rely on being able to mess with the JobConf. This is an undocumented and backwards incompatible change.. I can see why we need it, but the proper way to do this would be to document it, provide explicit instructions on using UDFContext (and how/where/when to get it), and migrate piggybank and builtin storefuncs that rely on mutable jobconfs. Further, to make the contract clear, rather than passing in a dummy new jobConf that gets GC'd immediately, we should pass in a wrapper job conf which throws exceptions on any set() call, to prevent surprises.
          Hide
          Bill Graham added a comment -

          Regarding the wrapper job conf, in some cases I'm sure it's justified to set a conf. What if we throw an exception if a value set attempt occurs where a different value already exists? We could include messaging about how UDFContext if probably what they want. This approach would be backward compatible with jobs that use conf properly with a single-store job, for example.

          Show
          Bill Graham added a comment - Regarding the wrapper job conf, in some cases I'm sure it's justified to set a conf. What if we throw an exception if a value set attempt occurs where a different value already exists? We could include messaging about how UDFContext if probably what they want. This approach would be backward compatible with jobs that use conf properly with a single-store job, for example.
          Rohini Palaniswamy made changes -
          Link This issue relates to PIG-2821 [ PIG-2821 ]
          Hide
          Rohini Palaniswamy added a comment -

          Did some debugging with and without PIG-2578. Multiple storage using PigStorage worked fine in both cases. This is because before every getOutputFormat call, there is a setLocation with a copy of JobContext or TaskAttemptContext and that copy was passed to getOutputCommitter(), getRecordWriter() or checkOutputSpecs() calls. So the output format actually runs with the correct configuration. So multiple store commands don't always get messed up. The corner case problem I see is that, if one instance of the store set a configuration to a specific value and another instance of the store does not set any value at all for that config it will still get the config with the value set from the copy of the job put by the first instance(without PIG-2578).

          The actual problem was with the hcat code when this jira was filed. It set the mapred.output.dir and lot of other properties in front end but not in the backened during setStoreLocation.
          http://svn.apache.org/viewvc/incubator/hcatalog/branches/branch-0.4/src/java/org/apache/hcatalog/pig/HCatStorer.java?revision=1325867&view=markup
          If it had set the mapred.output.dir in the backend also, it would have worked fine. It was later fixed to do so.

          Show
          Rohini Palaniswamy added a comment - Did some debugging with and without PIG-2578 . Multiple storage using PigStorage worked fine in both cases. This is because before every getOutputFormat call, there is a setLocation with a copy of JobContext or TaskAttemptContext and that copy was passed to getOutputCommitter(), getRecordWriter() or checkOutputSpecs() calls. So the output format actually runs with the correct configuration. So multiple store commands don't always get messed up. The corner case problem I see is that, if one instance of the store set a configuration to a specific value and another instance of the store does not set any value at all for that config it will still get the config with the value set from the copy of the job put by the first instance(without PIG-2578 ). The actual problem was with the hcat code when this jira was filed. It set the mapred.output.dir and lot of other properties in front end but not in the backened during setStoreLocation. http://svn.apache.org/viewvc/incubator/hcatalog/branches/branch-0.4/src/java/org/apache/hcatalog/pig/HCatStorer.java?revision=1325867&view=markup If it had set the mapred.output.dir in the backend also, it would have worked fine. It was later fixed to do so.
          Hide
          Raghu Angadi added a comment -

          Thanks for the analysis Rohini. +1 for reverting this patch.

          For the larger issue, I think Pig should clearly define the contract for job/conf passed setLocation() and setStoreLocation() so the user's StoreFunc can be implemented properly. I would suggest resisting the temptation to say "this method might be called any number of times" (a variant of this appears multiple places in Pig interface). While this made UDF implementors think twice about what they are doing, it allowed Pig to implement work arounds rather than proper fixes (i.e. why is "setStoreLocation()" called so many places?).

          Show
          Raghu Angadi added a comment - Thanks for the analysis Rohini. +1 for reverting this patch. For the larger issue, I think Pig should clearly define the contract for job/conf passed setLocation() and setStoreLocation() so the user's StoreFunc can be implemented properly. I would suggest resisting the temptation to say "this method might be called any number of times" (a variant of this appears multiple places in Pig interface). While this made UDF implementors think twice about what they are doing, it allowed Pig to implement work arounds rather than proper fixes (i.e. why is "setStoreLocation()" called so many places?).
          Hide
          Bill Graham added a comment -

          I think the problem is not that as much that setStoreLocation can get called multiple times, but that from the Javadocs it's not clear what the effects (or side-effects) will occur when setStoreLocation sets values in the Config.

          +1 on reverting this patch and adding better javadocs for starters since the build is currently broken for a number of common use cases. We can then add examples and safeguards to illustrate proper usage in a multi-store environment.

          Show
          Bill Graham added a comment - I think the problem is not that as much that setStoreLocation can get called multiple times, but that from the Javadocs it's not clear what the effects (or side-effects) will occur when setStoreLocation sets values in the Config. +1 on reverting this patch and adding better javadocs for starters since the build is currently broken for a number of common use cases. We can then add examples and safeguards to illustrate proper usage in a multi-store environment.
          Hide
          Daniel Dai added a comment -

          I am fine with reverting the patch. The underlying problem is setStoreLocation is the only hook for StoreFunc for multiple purpose. In the javadoc, we shall make it clear:
          1. Need to distinguish frontend/backend (using UDFContext.isFrontend()), user can setup global configuration in the frontend, but can only setup store only configuration in the backend
          2. When setting up global configuration, need to bear in mind there could be multiple store, so config entries can overwrite each other.

          Show
          Daniel Dai added a comment - I am fine with reverting the patch. The underlying problem is setStoreLocation is the only hook for StoreFunc for multiple purpose. In the javadoc, we shall make it clear: 1. Need to distinguish frontend/backend (using UDFContext.isFrontend()), user can setup global configuration in the frontend, but can only setup store only configuration in the backend 2. When setting up global configuration, need to bear in mind there could be multiple store, so config entries can overwrite each other.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Reverted in PIG-2890. I don't see a way to reopen this jira and change it to won't fix..

          Show
          Dmitriy V. Ryaboy added a comment - Reverted in PIG-2890 . I don't see a way to reopen this jira and change it to won't fix..

            People

            • Assignee:
              Daniel Dai
              Reporter:
              Mithun Radhakrishnan
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development