PIG-2872

StoreFuncInterface.setStoreLocation gets a copy of a Configuration object

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11
    • Fix Version/s: None
    • Component/s: impl
    • Labels: None
    • Environment: Pig trunk, Hadoop 0.20.205 with Kerberos, ElasticSearch trunk, Wonderdog trunk

      Description

      When an implementation of StoreFuncInterface.setStoreLocation is called from JobControlCompiler.getJob, it is passed a copy of the Configuration that will be used for the Job that will be submitted:

      JobControlCompiler.java
      sFunc.setStoreLocation(st.getSFile().getFileName(), new org.apache.hadoop.mapreduce.Job(nwJob.getConfiguration()));
      

      When a new org.apache.hadoop.mapreduce.Job is created it creates a copy of the Configuration object, as far as I know. Thus anything added to the Configuration object in the implementation of setStoreLocation will not be included in the Configuration of nwJob in JobControlCompiler.getJob.
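
      The effect described above can be reproduced without Hadoop on the classpath. The sketch below is only a model: Conf and Job are hypothetical stand-ins for Hadoop's Configuration and org.apache.hadoop.mapreduce.Job, written on the assumption (as the report suggests) that constructing a Job copies the Configuration it is given. The cache file path is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfCopyDemo {

    /** Hypothetical stand-in for Hadoop's Configuration: a string map with a copy constructor. */
    static final class Conf {
        private final Map<String, String> props = new HashMap<>();

        Conf() {}

        // Copy constructor: the new object gets its own, independent map.
        Conf(Conf other) {
            props.putAll(other.props);
        }

        void set(String key, String value) { props.put(key, value); }

        String get(String key) { return props.get(key); }
    }

    /** Hypothetical stand-in for Job, ASSUMED to copy the Conf passed to its constructor. */
    static final class Job {
        private final Conf conf;

        Job(Conf conf) { this.conf = new Conf(conf); }

        Conf getConfiguration() { return conf; }
    }

    /** Mimics a StoreFunc that registers a cache file inside setStoreLocation. */
    static void setStoreLocation(String location, Job job) {
        // Illustrative path only; not taken from the report.
        job.getConfiguration().set("mapred.cache.files", "hdfs:///tmp/elasticsearch.yml");
    }

    /** Mirrors the call pattern quoted from JobControlCompiler.getJob. */
    static String cacheFilesAfterStoreSetup() {
        Job nwJob = new Job(new Conf());
        // The store function receives a fresh Job wrapping a copy of nwJob's configuration.
        setStoreLocation("/tmp/out", new Job(nwJob.getConfiguration()));
        // Whatever the store function set lives only in the copy.
        return nwJob.getConfiguration().get("mapred.cache.files");
    }

    public static void main(String[] args) {
        // prints "mapred.cache.files on nwJob: null"
        System.out.println("mapred.cache.files on nwJob: " + cacheFilesAfterStoreSetup());
    }
}
```

      In this model the setting is written to the throwaway copy, so the job that would actually be submitted never sees it.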

      I notice this goes wrong in Wonderdog, which needs to include the Elasticsearch configuration file in the DistributedCache. It is added to mapred.cache.files through setStoreLocation, but this setting doesn't make it back into the Job returned by JobControlCompiler.getJob, and is therefore never localized.

      This might be intentional semantics within Pig, but I'm not familiar enough with StoreFuncs to know whether it is.

        Issue Links

          Activity

          Bill Graham added a comment -

          I ran into issues with this as well. See PIG-2870 and the attached patch. Does Wonderdog work if you apply this patch?

          Evert Lammerts added a comment -

          The patch gets rid of this exception, thanks! Will you ship it into 0.11?

          Cheolsoo Park added a comment -

          In PIG-2821, it looks like Rohini is proposing to revert PIG-2578, which introduced this issue in the first place. Don't we need a coordinated fix for the regression from PIG-2578?

          Thanks!

          Bill Graham added a comment -

          The patch in PIG-2870 works with PIG-2578, but I think we still need to think it through some more, since the patch is basically forking logic depending on whether it's a single or multi-store job. Let's continue the discussion at PIG-2870.

          And yes, we will ship a fix to this problem one way or the other in Pig 0.11.

          Dmitriy V. Ryaboy added a comment -

          Reverted PIG-2578. Is this still a problem, or does reverting that patch fix this issue?

          William Watson added a comment -

          We're seeing this issue (from PIG-2578) in Pig 0.12 with DBStorage. Is there a new thread for a fix, or is this the current one?

          Rohini Palaniswamy added a comment -

          PIG-2578 has been reverted and is not in 0.12, so you should not be facing the issue because of that. If you still see issues, can you give more details: your analysis, stack traces, etc.? Does it work with 0.11?

          Taylor Finnell added a comment -

          I was working with William Watson when we experienced the issue. Our script is roughly as follows...

          A = LOAD '...' USING CSVLoader ...;
          STORE A INTO '/tmp/A-unused' USING DBStorage (org.postgresql.Driver, ..., INSERT INTO ....);
          B = FOREACH A GENERATE X, Y, CONCAT(X, Y) as Z;
          STORE B INTO '/tmp/B-unused' USING DBStorage (org.postgresql.Driver, ..., INSERT INTO ....);
          

          Both DBStorage calls insert into different tables in the same database.

          When the script is run, both A and B are stored into their /tmp/ locations. However, the data never makes it into the database. We found two ways to get the data into the database. The first was to add a DUMP B command after the assignment of B. The second was to execute the script with the -M flag.

          Rohini Palaniswamy added a comment -

          Can you open a separate JIRA for this? If it works with multiquery off or after adding a DUMP statement, then the problem is with multiquery mode. With multiquery off, Pig stops parsing when a STORE is encountered and executes the script up to that point. DUMP essentially does the same, executing the script up to the DUMP in both multiquery on and off modes. So basically it fails in multiquery mode.


            People

            • Assignee: Unassigned
            • Reporter: Evert Lammerts
            • Votes: 0
            • Watchers: 7