Pig
  1. Pig
  2. PIG-2553

Pig shouldn't allow attempts to write multiple relations into same directory

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available
    • Release Note:
      To enforce strict checking of output location, set "pig.location.check.strict=true". By default, it is false.

      Description

      We've seen multiple occasions where users accidentally try to store 2 or more different relations to the same destination directory. Currently, this passes the Pig planner and fails on MR side due to concurrent attempts to create the same part file on the reducer. This is extremely confusing to the user, and hard to debug.
      We should instead fail their scripts before they are even submitted, since we can identify the erroneous condition from the beginning.

      1. PIG-2553.patch
        2 kB
        Prashant Kommireddi
      2. PIG-2553_1.patch
        8 kB
        Prashant Kommireddi
      3. PIG-2553_2.patch
        9 kB
        Prashant Kommireddi

        Activity

        Hide
        Prashant Kommireddi added a comment -

        Looks like we can place a check in

         Graph.postProcess() 

        to make sure a script does not contain multiple stores pointing to the same output location. Does that seem like a fair approach? Any corner cases to watch out for?

        Show
        Prashant Kommireddi added a comment - Looks like we can place a check in Graph.postProcess() to make sure a script does not contain multiple stores pointing to the same output location. Does that seem like a fair approach? Any corner cases to watch out for?
        Hide
        Prashant Kommireddi added a comment -

        First attempt at this. Reviewers, please take a look. Basically checking for the LOStore(s) in a LogicalPlan to ensure no 2 output locations are the same.

        Show
        Prashant Kommireddi added a comment - First attempt at this. Reviewers, please take a look. Basically checking for the LOStore(s) in a LogicalPlan to ensure no 2 output locations are the same.
        Hide
        Cheolsoo Park added a comment -

        Hi Prashant,

        Thank you very much for the patch! I tested it in local and mr mode, and it works fine. Can you please add a unit test probably in TestPigServer?

        Show
        Cheolsoo Park added a comment - Hi Prashant, Thank you very much for the patch! I tested it in local and mr mode, and it works fine. Can you please add a unit test probably in TestPigServer?
        Hide
        Rohini Palaniswamy added a comment -

        This might break some existing custom StoreFuncs. Currently you can write a custom storer which allows to write into same directory, but with different file names. We have users who have written that kind of Storers.

        Show
        Rohini Palaniswamy added a comment - This might break some existing custom StoreFuncs. Currently you can write a custom storer which allows to write into same directory, but with different file names. We have users who have written that kind of Storers.
        Hide
        Prashant Kommireddi added a comment -

        Thanks for the feedback Rohini and Cheolsoo.

        Rohini, what does such a StoreFunc look like? May be the following?

        STORE alias1 INTO 'output' using MyStoreFunc('filename1');
        STORE alias2 INTO 'output' using MyStoreFunc('filename2');
        
        Show
        Prashant Kommireddi added a comment - Thanks for the feedback Rohini and Cheolsoo. Rohini, what does such a StoreFunc look like? May be the following? STORE alias1 INTO 'output' using MyStoreFunc('filename1'); STORE alias2 INTO 'output' using MyStoreFunc('filename2');
        Hide
        Rohini Palaniswamy added a comment -

        Yes. That would be a simple one.

        Show
        Rohini Palaniswamy added a comment - Yes. That would be a simple one.
        Hide
        Prashant Kommireddi added a comment -

        StoreFuncs using setStoreLocation/relToAbsPathForStoreLocation to append filenames make it difficult to handle this. Any other ideas since the earlier approach is not safe?

        Show
        Prashant Kommireddi added a comment - StoreFuncs using setStoreLocation/relToAbsPathForStoreLocation to append filenames make it difficult to handle this. Any other ideas since the earlier approach is not safe?
        Hide
        Rohini Palaniswamy added a comment -

        PiggyBank also has a MultiStorage - http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/MultiStorage.java?revision=1145447&view=markup.

        One thing we could do is print a warning message instead of throwing a error. I don't see a way to correctly determine and throw an error. And in cases like MultiStorage the filename is not even static. The output dir name/file name is dynamic and depends on the value of a field in the record.

        Show
        Rohini Palaniswamy added a comment - PiggyBank also has a MultiStorage - http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/MultiStorage.java?revision=1145447&view=markup . One thing we could do is print a warning message instead of throwing a error. I don't see a way to correctly determine and throw an error. And in cases like MultiStorage the filename is not even static. The output dir name/file name is dynamic and depends on the value of a field in the record.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Rohini, what if we hide this behind a property, something like "pig.location.check.strict"?
        I see accidental writes to the same output path causing problems all the time.. would love to have this feature.

        Prashant – I haven't looked at the patch yet, but just something to check: it does allow writes of multiple relations to, say, the same HBase table?

        Show
        Dmitriy V. Ryaboy added a comment - Rohini, what if we hide this behind a property, something like "pig.location.check.strict"? I see accidental writes to the same output path causing problems all the time.. would love to have this feature. Prashant – I haven't looked at the patch yet, but just something to check: it does allow writes of multiple relations to, say, the same HBase table?
        Hide
        Rohini Palaniswamy added a comment -

        Thanks for bringing up hbase Dmitriy. HCat table is another case.

        Would that property be true by default? I am fine making the property true by default as long as it checks only for filesystem locations. HCat and HBase tables are going to be more common going forward, and it would not be nice to ask the users to launch pig every time with that property set to false.

        Show
        Rohini Palaniswamy added a comment - Thanks for bringing up hbase Dmitriy. HCat table is another case. Would that property be true by default? I am fine making the property true by default as long as it checks only for filesystem locations. HCat and HBase tables are going to be more common going forward, and it would not be nice to ask the users to launch pig every time with that property set to false.
        Hide
        Prashant Kommireddi added a comment -

        That's a good point Dmitriy. The patch does not handle multiple relations being written to hbase. Is it sufficient to check for the schema (hdfs://, hbase://, file://,...) ?

        Rohini, you are right. Any implementation of StoreFunc similar to Hadoop MultipleOutputFormat would break this. As Dmitriy suggested, I think it makes sense to provide an option to users, in addition to logging a warning message?

        Show
        Prashant Kommireddi added a comment - That's a good point Dmitriy. The patch does not handle multiple relations being written to hbase. Is it sufficient to check for the schema (hdfs://, hbase://, file:// ,...) ? Rohini, you are right. Any implementation of StoreFunc similar to Hadoop MultipleOutputFormat would break this. As Dmitriy suggested, I think it makes sense to provide an option to users, in addition to logging a warning message?
        Hide
        Dmitriy V. Ryaboy added a comment -

        I think it should be set to false by default so that current scripts that use fancy storers, etc, can keep running without change (the scripts that have bugs, which we are trying to address with this, don't run correctly at all, so we don't have to worry about being backwards compatible with them). Individual pig admins / script authors can decide to turn it on by default if they notice this happening a lot.

        We are piling up quite the list of exceptions, though. Between hcat, hbase, unknown other schemas, and hdfs/s3/kfs/mapr cases, I'm getting concerned that maybe this wasn't such a well thought out feature wish on my part!

        What do you guys think?

        Show
        Dmitriy V. Ryaboy added a comment - I think it should be set to false by default so that current scripts that use fancy storers, etc, can keep running without change (the scripts that have bugs, which we are trying to address with this, don't run correctly at all, so we don't have to worry about being backwards compatible with them). Individual pig admins / script authors can decide to turn it on by default if they notice this happening a lot. We are piling up quite the list of exceptions, though. Between hcat, hbase, unknown other schemas, and hdfs/s3/kfs/mapr cases, I'm getting concerned that maybe this wasn't such a well thought out feature wish on my part! What do you guys think?
        Hide
        Prashant Kommireddi added a comment -

        I feel like we could treat this as file-based vs non-file based storage locations, similar to PIG-2924. The patch there uses a default FileBasedOutputSizeReader to determine output size and "pig.stats.output.size.reader" to compute size based on a different implementation.

        For this JIRA, can we also use a similar idea and handle file-based schemes with UriUtil.isHDFSFileOrLocalOrS3N(String uri)? For all other schemes (hbase, hcat, ...) we can allow multiple relations writing to same location.

        1. Check if pig.location.check.strict is set
        2. If not set, just log a warning if scheme is file-based
        3. If set, check if scheme is file-based and report an error
        4. If set but not a file-based scheme, continue without any warning/error message

        Thoughts?

        Show
        Prashant Kommireddi added a comment - I feel like we could treat this as file-based vs non-file based storage locations, similar to PIG-2924 . The patch there uses a default FileBasedOutputSizeReader to determine output size and "pig.stats.output.size.reader" to compute size based on a different implementation. For this JIRA, can we also use a similar idea and handle file-based schemes with UriUtil.isHDFSFileOrLocalOrS3N(String uri)? For all other schemes (hbase, hcat, ...) we can allow multiple relations writing to same location. 1. Check if pig.location.check.strict is set 2. If not set, just log a warning if scheme is file-based 3. If set, check if scheme is file-based and report an error 4. If set but not a file-based scheme, continue without any warning/error message Thoughts?
        Hide
        Prashant Kommireddi added a comment -

        Also, we could add kfs and maprfs to the list of file-based schemes. I agree the list could keep growing, but I suspect most users facing this issue to be hdfs:// and file:// users. What do you guys think of this for a start?

        Show
        Prashant Kommireddi added a comment - Also, we could add kfs and maprfs to the list of file-based schemes. I agree the list could keep growing, but I suspect most users facing this issue to be hdfs:// and file:// users. What do you guys think of this for a start?
        Hide
        Prashant Kommireddi added a comment -

        Ping!

        Show
        Prashant Kommireddi added a comment - Ping!
        Hide
        Rohini Palaniswamy added a comment -

        Sounds good to me Prashant. I can't think of any objections to having pig.location.check.strict default as false and the 4 point proposal you have outlined.

        Show
        Rohini Palaniswamy added a comment - Sounds good to me Prashant. I can't think of any objections to having pig.location.check.strict default as false and the 4 point proposal you have outlined.
        Hide
        Prashant Kommireddi added a comment -

        Thanks Rohini. Dmitriy, you ok with the approach?

        Show
        Prashant Kommireddi added a comment - Thanks Rohini. Dmitriy, you ok with the approach?
        Hide
        Dmitriy V. Ryaboy added a comment -

        I wouldn't even log a warning in 2.
        The approach sounds good overall.
        Patch should include changes to pig docs that instruct admins to set this unless they know something creative is happening.

        Show
        Dmitriy V. Ryaboy added a comment - I wouldn't even log a warning in 2. The approach sounds good overall. Patch should include changes to pig docs that instruct admins to set this unless they know something creative is happening.
        Hide
        Prashant Kommireddi added a comment -

        Should we add support for kfs and maprfs?

        Show
        Prashant Kommireddi added a comment - Should we add support for kfs and maprfs?
        Hide
        Prashant Kommireddi added a comment -

        Rohini, Dmitriy - adding a new patch with the discussed changes.

        Show
        Prashant Kommireddi added a comment - Rohini, Dmitriy - adding a new patch with the discussed changes.
        Hide
        Prashant Kommireddi added a comment -

        Ping!

        Show
        Prashant Kommireddi added a comment - Ping!
        Hide
        Cheolsoo Park added a comment -

        Sorry for the delay. Here are some comments. Please let me know what you think.

        • Wouldn't it make more sense to make it public since this property is a public property, and this variable may be reused somewhere else in the future? Do you agree?
          private static final String PIG_LOCATION_CHECK_STRICT = "pig.location.check.strict";
          
        • Can you check whether PIG_LOCATION_CHECK_STRICT is enabled before calling getStoreLocIfInvalid(storeOps) since then we can avoid calling it when unnecessary?
          LOStore invalidStore = getStoreLocIfInvalid(storeOps);
          if (invalidStore != null && "true".equals(pigContext.getProperties().getProperty(PIG_LOCATION_CHECK_STRICT))) {
              throw new RuntimeException("Script contains 2 or more STORE statements writing to same location : "+ invalidStore.getFileSpec().getFileName());
          }
          
        • Wouldn't it make more sense for getStoreLocIfInvalid() to return the filename as String instead of LOStore? LOStore seems unnecessary to me.
        • I am not sure if creating the admin section in the docs makes sense. Even if admin sets this property, users always can override it running Pig with -Dpig.location.check.strict=false. So I don't think that this property is different from any other user properties. Can we document it in conf/pig.property like we did for other properties? Do you agree?
        Show
        Cheolsoo Park added a comment - Sorry for the delay. Here are some comments. Please let me know what you think. Wouldn't it make more sense to make it public since this property is a public property, and this variable may be reused somewhere else in the future? Do you agree? private static final String PIG_LOCATION_CHECK_STRICT = "pig.location.check.strict" ; Can you check whether PIG_LOCATION_CHECK_STRICT is enabled before calling getStoreLocIfInvalid(storeOps) since then we can avoid calling it when unnecessary? LOStore invalidStore = getStoreLocIfInvalid(storeOps); if (invalidStore != null && " true " .equals(pigContext.getProperties().getProperty(PIG_LOCATION_CHECK_STRICT))) { throw new RuntimeException( "Script contains 2 or more STORE statements writing to same location : " + invalidStore.getFileSpec().getFileName()); } Wouldn't it make more sense for getStoreLocIfInvalid() to return the filename as String instead of LOStore ? LOStore seems unnecessary to me. I am not sure if creating the admin section in the docs makes sense. Even if admin sets this property, users always can override it running Pig with -Dpig.location.check.strict=false . So I don't think that this property is different from any other user properties. Can we document it in conf/pig.property like we did for other properties? Do you agree?
        Hide
        Prashant Kommireddi added a comment -

        Hi Cheolsoo, please see my comments below

        1. Should we hold off on making this variable public until when needed? One could always modify scope in the future.
        2. Good point. Will do
        3. Returning String makes sense.
        4. I feel like the new section would be a useful place for admins to go to, and we could keep adding properties that admins could/should be aware of. If its an overkill, I am fine with documenting pig.properties only. Let me know.

        Again, thanks for reviewing.

        Show
        Prashant Kommireddi added a comment - Hi Cheolsoo, please see my comments below 1. Should we hold off on making this variable public until when needed? One could always modify scope in the future. 2. Good point. Will do 3. Returning String makes sense. 4. I feel like the new section would be a useful place for admins to go to, and we could keep adding properties that admins could/should be aware of. If its an overkill, I am fine with documenting pig.properties only. Let me know. Again, thanks for reviewing.
        Hide
        Cheolsoo Park added a comment -

        Hi Prashant,

        Thanks for your responses:

        1. Agreed.
        2. Thanks.
        3. On a second thought, how about simplifying it even further?
          if ("true".equals(pigContext.getProperties().getProperty(PIG_LOCATION_CHECK_STRICT))) {
              checkDuplicateStoreLoc(storeOps);
          }
          ...
          /**
           * This method checks whether the multiple sinks (STORE) use the same
           * "file-based" location. If yes, throws a runtime exception.
           * 
           * @param storeOps
           */
          private void checkDuplicateStoreLoc(Set<LOStore> storeOps) {
              Set<String> uniqueStoreLoc = new HashSet<String>();
              for(LOStore store : storeOps) {
                  String filename = store.getFileSpec().getFileName();
                  if(!uniqueStoreLoc.add(filename) && UriUtil.isHDFSFileOrLocalOrS3N(filename))
                      throw new RuntimeException("Script contains 2 or more STORE statements writing to same location : "+ filename);
              }
          }
          
        4. Sure. That sounds reasonable. But can you add the new property to pig.properties as well? I like to have a single place where all properties are listed. As far as I know, pig.properties is only such a place as of now.
        5. I can't build admin.xml. I get the following error when running ant docs:
               [exec] /home/cheolsoo/workspace/pig/src/docs/src/documentation/content/xdocs/admin.xml:33:66: Element type "b" must be declared.
               [exec] /home/cheolsoo/workspace/pig/src/docs/src/documentation/content/xdocs/admin.xml:33:194: The content of element type "p" must match "(strong|em|code|sub|sup|br|img|icon|acronym|map|xi:include|a)"
          

          Replacing <b></b> with <strong></strong> works for me. Also, it would be nice if you could avoid using tabs for indentation.

        Show
        Cheolsoo Park added a comment - Hi Prashant, Thanks for your responses: Agreed. Thanks. On a second thought, how about simplifying it even further? if ( " true " .equals(pigContext.getProperties().getProperty(PIG_LOCATION_CHECK_STRICT))) { checkDuplicateStoreLoc(storeOps); } ... /** * This method checks whether the multiple sinks (STORE) use the same * "file-based" location. If yes, throws a runtime exception. * * @param storeOps */ private void checkDuplicateStoreLoc(Set<LOStore> storeOps) { Set< String > uniqueStoreLoc = new HashSet< String >(); for (LOStore store : storeOps) { String filename = store.getFileSpec().getFileName(); if (!uniqueStoreLoc.add(filename) && UriUtil.isHDFSFileOrLocalOrS3N(filename)) throw new RuntimeException( "Script contains 2 or more STORE statements writing to same location : " + filename); } } Sure. That sounds reasonable. But can you add the new property to pig.properties as well? I like to have a single place where all properties are listed. As far as I know, pig.properties is only such a place as of now. I can't build admin.xml . I get the following error when running ant docs : [exec] /home/cheolsoo/workspace/pig/src/docs/src/documentation/content/xdocs/admin.xml:33:66: Element type "b" must be declared. [exec] /home/cheolsoo/workspace/pig/src/docs/src/documentation/content/xdocs/admin.xml:33:194: The content of element type "p" must match "(strong|em|code|sub|sup|br|img|icon|acronym|map|xi:include|a)" Replacing <b></b> with <strong></strong> works for me. Also, it would be nice if you could avoid using tabs for indentation.
        Hide
        Prashant Kommireddi added a comment -

        Hey Cheolsoo, been out on vacation all this while. Will be getting back to this soon.

        Show
        Prashant Kommireddi added a comment - Hey Cheolsoo, been out on vacation all this while. Will be getting back to this soon.
        Hide
        Prashant Kommireddi added a comment -

        Hey Cheolsoo Park attaching a patch. Please note I have made changes to UriUtil.java in removing unused imports. Rest is as we discussed, except that I did not find broken indentation (other than docs, not sure if that needs to be indented too?). Let me know if the indentation needs to be adjusted.

        Show
        Prashant Kommireddi added a comment - Hey Cheolsoo Park attaching a patch. Please note I have made changes to UriUtil.java in removing unused imports. Rest is as we discussed, except that I did not find broken indentation (other than docs, not sure if that needs to be indented too?). Let me know if the indentation needs to be adjusted.
        Hide
        Prashant Kommireddi added a comment -

        Ping!

        Show
        Prashant Kommireddi added a comment - Ping!
        Hide
        Cheolsoo Park added a comment -

        Prashant Kommireddi, sorry for the delay. I am looking at it now.

        Show
        Cheolsoo Park added a comment - Prashant Kommireddi , sorry for the delay. I am looking at it now.
        Hide
        Cheolsoo Park added a comment -

        Looks good to me. I am running full unit test to make sure nothing is broken.

        Show
        Cheolsoo Park added a comment - Looks good to me. I am running full unit test to make sure nothing is broken.
        Hide
        Prashant Kommireddi added a comment -

        Thanks @cheolsoo

        Show
        Prashant Kommireddi added a comment - Thanks @cheolsoo
        Hide
        Cheolsoo Park added a comment -

        +1. I ran full unit test and found no additional failures.

        There are failing unit test case in trunk, but they're all known issues.

        PIG-2994:
        org.apache.pig.test.TestGrunt.testExplainDot
        org.apache.pig.test.TestGrunt.testExplainOut
        org.apache.pig.test.TestGrunt.testExplainBrief
        org.apache.pig.test.TestGrunt.testExplainScript
        PIG-3138:
        org.apache.pig.test.TestPigRunner.simpleTest
        PIG-3153:
        org.apache.pig.test.TestScriptUDF.testJavascriptExampleScript

        Show
        Cheolsoo Park added a comment - +1. I ran full unit test and found no additional failures. There are failing unit test case in trunk, but they're all known issues. PIG-2994 : org.apache.pig.test.TestGrunt.testExplainDot org.apache.pig.test.TestGrunt.testExplainOut org.apache.pig.test.TestGrunt.testExplainBrief org.apache.pig.test.TestGrunt.testExplainScript PIG-3138 : org.apache.pig.test.TestPigRunner.simpleTest PIG-3153 : org.apache.pig.test.TestScriptUDF.testJavascriptExampleScript
        Hide
        Cheolsoo Park added a comment -

        Committed to trunk. Thanks Prashant!

        Show
        Cheolsoo Park added a comment - Committed to trunk. Thanks Prashant!
        Hide
        Prashant Kommireddi added a comment -

        Thanks for the review/commit, Cheolsoo. Thanks for pitching-in Dmitriy, Rohini.

        Show
        Prashant Kommireddi added a comment - Thanks for the review/commit, Cheolsoo. Thanks for pitching-in Dmitriy, Rohini.

          People

          • Assignee:
            Prashant Kommireddi
            Reporter:
            Dmitriy V. Ryaboy
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development