Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available
    • Hadoop Flags:
      Reviewed

      Description

      we have users who are asking for a flag to overwrite existing directory

      1. PIG-259.9.patch
        11 kB
        Nezih Yigitbasi
      2. PIG-259.8.patch
        16 kB
        Nezih Yigitbasi
      3. PIG-259.7.patch
        15 kB
        Nezih Yigitbasi
      4. PIG-259.6.patch
        15 kB
        Nezih Yigitbasi
      5. PIG-259.5.patch
        15 kB
        Nezih Yigitbasi
      6. Pig_259.patch
        7 kB
        Jeff Zhang
      7. Pig_259_4.patch
        12 kB
        Jeff Zhang
      8. Pig_259_3.patch
        11 kB
        Jeff Zhang
      9. Pig_259_2.patch
        12 kB
        Jeff Zhang

        Activity

        Hide
        Pi Song added a comment -

        Can we just add "FORCE_WRITE" keyword? Is that appropriate?

        STORE X INTO '/tmp/data1' FORCE_WRITE ?
        
        Show
        Pi Song added a comment - Can we just add "FORCE_WRITE" keyword? Is that appropriate? STORE X INTO '/tmp/data1' FORCE_WRITE ?
        Hide
        Olga Natkovich added a comment -

        I agree that it should be on per store basis so it should be attached to store command. Whether it should be FORCE_WRITE or OVERWRITE or another keyword, I don't have a strong opinion on.

        Show
        Olga Natkovich added a comment - I agree that it should be on per store basis so it should be attached to store command. Whether it should be FORCE_WRITE or OVERWRITE or another keyword, I don't have a strong opinion on.
        Hide
        Jeff Zhang added a comment -

        I choose the keyword "overwrite" to indicate user want to overwrite the file.

        The following is the implementation details:
        1. Add an variable isOverWrite in LOStore
        2. In the InputOutputFileValidator, delete the destination file first if you use the "overwrite" keyword.

        Show
        Jeff Zhang added a comment - I choose the keyword "overwrite" to indicate user want to overwrite the file. The following is the implementation details: 1. Add an variable isOverWrite in LOStore 2. In the InputOutputFileValidator, delete the destination file first if you use the "overwrite" keyword.
        Hide
        Alan Gates added a comment -

        A few comments and questions on this:

        1) We should make this work against the load/store branch instead of trunk. We're hoping to merge load/store into trunk in a week or two, so it makes more sense to put it there. This will also have implications for load/store. One, it will need to communicate to the new validate function that it's ok if the file (or whatever is being overwritten) exists. Two, load implementations will need to handle removing the file (or whatever) if necessary. For example, PigStorage will need to handle removing the file so MR doesn't complain.

        2) Should we have overwrite be a keyword (as originally proposed and in the patch) or should it be string, like hints in join? I don't have a strong opinion one way or another but I think it's worth considering which we want.

        3) Is the semantic of overwrite that it saves whether the file is there or not, or that it's an error if the file is not there to write? Write whether there or not makes more sense to me, but I wanted to make sure we all agree on it.

        4) What happens when a user requests overwrite and the job fails before it runs? In the current implementation the file will be removed up front, so any planning errors will still result in the file being removed. Also, the file will be removed up front, even if the job remains in Hadoop's queue for a long time waiting to run. At the very least, I think Pig should delay removing the file until it is ready to launch the job so that type checking errors or whatever don't result in the file being removed when the job is not run.

        Show
        Alan Gates added a comment - A few comments and questions on this: 1) We should make this work against the load/store branch instead of trunk. We're hoping to merge load/store into trunk in a week or two, so it makes more sense to put it there. This will also have implications for load/store. One, it will need to communicate to the new validate function that it's ok if the file (or whatever is being overwritten) exists. Two, load implementations will need to handle removing the file (or whatever) if necessary. For example, PigStorage will need to handle removing the file so MR doesn't complain. 2) Should we have overwrite be a keyword (as originally proposed and in the patch) or should it be string, like hints in join? I don't have a strong opinion one way or another but I think it's worth considering which we want. 3) Is the semantic of overwrite that it saves whether the file is there or not, or that it's an error if the file is not there to write? Write whether there or not makes more sense to me, but I wanted to make sure we all agree on it. 4) What happens when a user requests overwrite and the job fails before it runs? In the current implementation the file will be removed up front, so any planning errors will still result in the file being removed. Also, the file will be removed up front, even if the job remains in Hadoop's queue for a long time waiting to run. At the very least, I think Pig should delay removing the file until it is ready to launch the job so that type checking errors or whatever don't result in the file being removed when the job is not run.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12434801/Pig_259.patch
        against trunk revision 906326.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/192/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/192/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/192/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12434801/Pig_259.patch against trunk revision 906326. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/192/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/192/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/192/console This message is automatically generated.
        Hide
        Jeff Zhang added a comment -

        Response to Alan regarding his comments,

        1. I put the logic of deleting output file in JobControlCompiler, then it is easy for me to delay the deletion util the dependent job is done.

        2. I prefer using keywords rather than string, because if using string, the following statement:

         store a into 'output' 'overwrite'; 

        has two consecutive string, it looks a little weird in my opinion.

        3. I think the semantic of overwrite is the same as it is in file system. In file system, when we overwrite file using java api, it won't complain even the file does not exist

        Show
        Jeff Zhang added a comment - Response to Alan regarding his comments, 1. I put the logic of deleting output file in JobControlCompiler, then it is easy for me to delay the deletion util the dependent job is done. 2. I prefer using keywords rather than string, because if using string, the following statement: store a into 'output' 'overwrite'; has two consecutive string, it looks a little weird in my opinion. 3. I think the semantic of overwrite is the same as it is in file system. In file system, when we overwrite file using java api, it won't complain even the file does not exist
        Hide
        Jeff Zhang added a comment -

        Alan,

        Should I create a new sub task under Pig-966 ? or is there any way to move this task under Pig-966 ?

        Show
        Jeff Zhang added a comment - Alan, Should I create a new sub task under Pig-966 ? or is there any way to move this task under Pig-966 ?
        Hide
        Alan Gates added a comment -

        Changes look good. This patch is still against trunk. Was your plan to apply this to trunk and then provide a different patch for load/store redesign or just to take this patch and change it to work against load/store? I think the latter is better, as it avoids the need to undo this patch when we merge in load/store. If you want to be able to apply the patch now to trunk (without checking it in) we can still leave your existing patch on this JIRA and only commit a patch against load/store redesign.

        As for attaching this to load/store, you can click on the edit link on the left side and this will take you to a page where you can make it a sub-task of PIG-966. (You may not have had this edit link before, as I had forgotten to add you to the list of committers on JIRA. I fixed that, so you should see the link now.)

        Show
        Alan Gates added a comment - Changes look good. This patch is still against trunk. Was your plan to apply this to trunk and then provide a different patch for load/store redesign or just to take this patch and change it to work against load/store? I think the latter is better, as it avoids the need to undo this patch when we merge in load/store. If you want to be able to apply the patch now to trunk (without checking it in) we can still leave your existing patch on this JIRA and only commit a patch against load/store redesign. As for attaching this to load/store, you can click on the edit link on the left side and this will take you to a page where you can make it a sub-task of PIG-966 . (You may not have had this edit link before, as I had forgotten to add you to the list of committers on JIRA. I fixed that, so you should see the link now.)
        Hide
        Jeff Zhang added a comment -

        Thanks, Alan. I have moved it under Pig-966. And the new patch is for load-store-redesign branch not for trunk.

        Show
        Jeff Zhang added a comment - Thanks, Alan. I have moved it under Pig-966. And the new patch is for load-store-redesign branch not for trunk.
        Hide
        Alan Gates added a comment -

        Sorry, I missed that it was already for load-store redesign.

        In the new load/store scheme we can't assume that outputs are files. So putting a DFS remove statement into JobControlCompiler won't be the right way to do this. What does overwrite mean when we're storing to HBase instead of DFS? It seems that this needs to be defined on the per store function basis, which implies it needs to somehow be part of the store function interface. This could be done by adding a boolean to validate or by adding a new method to the StoreFunc interface to overwrite.

        Show
        Alan Gates added a comment - Sorry, I missed that it was already for load-store redesign. In the new load/store scheme we can't assume that outputs are files. So putting a DFS remove statement into JobControlCompiler won't be the right way to do this. What does overwrite mean when we're storing to HBase instead of DFS? It seems that this needs to be defined on the per store function basis, which implies it needs to somehow be part of the store function interface. This could be done by adding a boolean to validate or by adding a new method to the StoreFunc interface to overwrite.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Doesn't the StoreFunc take care of resource creation/validation in LSR?
        If so, no new method is needed (since as you say, what exactly it means is indeterminate). StoreFuncs that want to can use initialization variables to determine their behavior (eg: store into '/my/path' using PigStorage('\t', '-overwrite:true, -writeHeader:true, -writeSchema:false');

        Show
        Dmitriy V. Ryaboy added a comment - Doesn't the StoreFunc take care of resource creation/validation in LSR? If so, no new method is needed (since as you say, what exactly it means is indeterminate). StoreFuncs that want to can use initialization variables to determine their behavior (eg: store into '/my/path' using PigStorage('\t', '-overwrite:true, -writeHeader:true, -writeSchema:false');
        Hide
        Alan Gates added a comment -

        If we make overwrite part of the language (as the JIRA proposes and Jeff's patch implements) then we need to make it part of the interface, at least to inform the store func that overwrite was set. Are you suggesting that individual store funcs that want to support overwrite instead support it as an argument and not make it a part of the language?

        Show
        Alan Gates added a comment - If we make overwrite part of the language (as the JIRA proposes and Jeff's patch implements) then we need to make it part of the interface, at least to inform the store func that overwrite was set. Are you suggesting that individual store funcs that want to support overwrite instead support it as an argument and not make it a part of the language?
        Hide
        Dmitriy V. Ryaboy added a comment -

        Yeah I think it makes more sense on that level.
        I think it's worth our while to add it to PigStorage and any other built-in engines though.
        Otherwise it's not clear what the overwrite keyword means. If it's HBase, does it mean drop the table and overwrite? Or delete old values when there is an old confilct? Does that mean we don't delete them otherwise? What about appends?

        This request made more sense when Pig maintained strong control over file creation and such, but now that it's pushed into IOFormats, I think it's ok to push this functionality there as well.

        It's just a lot of stuff I think we don't want to be in the business of creating interfaces for. Different storage engines have different options for what they can and can't do.

        Show
        Dmitriy V. Ryaboy added a comment - Yeah I think it makes more sense on that level. I think it's worth our while to add it to PigStorage and any other built-in engines though. Otherwise it's not clear what the overwrite keyword means. If it's HBase, does it mean drop the table and overwrite? Or delete old values when there is an old confilct? Does that mean we don't delete them otherwise? What about appends? This request made more sense when Pig maintained strong control over file creation and such, but now that it's pushed into IOFormats, I think it's ok to push this functionality there as well. It's just a lot of stuff I think we don't want to be in the business of creating interfaces for. Different storage engines have different options for what they can and can't do.
        Hide
        Jeff Zhang added a comment -

        Response to Alan,

        I agree that it makes more sense to do the overwrite in StoreFunc, and I notice that there's a JIAR PIG-1216 which is related with this.

        Show
        Jeff Zhang added a comment - Response to Alan, I agree that it makes more sense to do the overwrite in StoreFunc, and I notice that there's a JIAR PIG-1216 which is related with this.
        Hide
        Jeff Zhang added a comment -

        Response to Dmitriy,

        Thanks for your suggestion of implementing overwrite on the StoreFunc level rather than on language level. I can bug in this. AndI think another advantage of putting it in StoreFunc is that it's more flexible than putting it in language. We have more control on StoreFunc than pig latin.

        Show
        Jeff Zhang added a comment - Response to Dmitriy, Thanks for your suggestion of implementing overwrite on the StoreFunc level rather than on language level. I can bug in this. AndI think another advantage of putting it in StoreFunc is that it's more flexible than putting it in language. We have more control on StoreFunc than pig latin.
        Hide
        Jeff Zhang added a comment -

        Sorry, I mean I can buy in your suggestion.

        Show
        Jeff Zhang added a comment - Sorry, I mean I can buy in your suggestion.
        Hide
        Olga Natkovich added a comment -

        +1 on passing the information in the constructor. Since we need the store function to to the validation, we don't have control over the semantics and it is better not to have constructs in the language whose semantics are not well defined.

        One thing we need to provide to the store function writer is guidence on when the information they get in the constructor can be acted on.

        Show
        Olga Natkovich added a comment - +1 on passing the information in the constructor. Since we need the store function to to the validation, we don't have control over the semantics and it is better not to have constructs in the language whose semantics are not well defined. One thing we need to provide to the store function writer is guidence on when the information they get in the constructor can be acted on.
        Hide
        Jeff Zhang added a comment -

        This patch is for trunk (since LSRD has already been merged into trunk).

        I've put the overwrite logic into StoreFuncInterface. Add two methods

        public boolean isOverWrite();
            
        public void deleteOldOutput(POStore store, PigContext context) throws IOException;
        

        The second parameter of PigStorage is for overwrite option. It can be 'overwrite' or 'force_write'. Both is OK.

        Show
        Jeff Zhang added a comment - This patch is for trunk (since LSRD has already been merged into trunk). I've put the overwrite logic into StoreFuncInterface. Add two methods public boolean isOverWrite(); public void deleteOldOutput(POStore store, PigContext context) throws IOException; The second parameter of PigStorage is for overwrite option. It can be 'overwrite' or 'force_write'. Both is OK.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12436962/Pig_259_3.patch
        against trunk revision 916065.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/221/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/221/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/221/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12436962/Pig_259_3.patch against trunk revision 916065. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/221/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/221/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/221/console This message is automatically generated.
        Hide
        Jeff Zhang added a comment -

        Refine the patch.
        1. resolve the warning of javadoc and findbug
        2. method refactoring, refactor method "deleteOldOutput" to "cleanupOutput" in StoreFuncInterface

        Show
        Jeff Zhang added a comment - Refine the patch. 1. resolve the warning of javadoc and findbug 2. method refactoring, refactor method "deleteOldOutput" to "cleanupOutput" in StoreFuncInterface
        Hide
        Alan Gates added a comment -

        Based on comments from Dmitry, Olga, and you above I thought there was agreement that this should be store function specific, and not part of the interface. So changes to PigStorage to allow a second constructor arg that means overwrite would be good. But your latest patch returns to changes to the interface. Did you decide against doing it inside specific store functions? If so, why?

        Show
        Alan Gates added a comment - Based on comments from Dmitry, Olga, and you above I thought there was agreement that this should be store function specific, and not part of the interface. So changes to PigStorage to allow a second constructor arg that means overwrite would be good. But your latest patch returns to changes to the interface. Did you decide against doing it inside specific store functions? If so, why?
        Hide
        Jeff Zhang added a comment -

        Alan,

        In the patch I created a PigStorage constructor with a second argument indicate whether to overwrite. And the reason I change the StoreFuncInterface is that I'd like the users who want to create customized StoreFunc has the option to support overwrite.

        Besides I need the overwrite information in InputOutputFileVisitor and JobControlCompiler, But in these two classes I only can get StoreFuncInterface. One way to work around is to check the class whether it is PigStorage, and then cast it to PigStorage, but it seems to me not so friendly and extensible.

        Show
        Jeff Zhang added a comment - Alan, In the patch I created a PigStorage constructor with a second argument indicate whether to overwrite. And the reason I change the StoreFuncInterface is that I'd like the users who want to create customized StoreFunc has the option to support overwrite. Besides I need the overwrite information in InputOutputFileVisitor and JobControlCompiler, But in these two classes I only can get StoreFuncInterface. One way to work around is to check the class whether it is PigStorage, and then cast it to PigStorage, but it seems to me not so friendly and extensible.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12437000/Pig_259_4.patch
        against trunk revision 916793.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/216/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12437000/Pig_259_4.patch against trunk revision 916793. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/216/console This message is automatically generated.
        Hide
        Alan Gates added a comment -

        Thanks for the explanation, now I understand what you're doing. I'd like to make one more suggestion. Could we do this in a separate interface that store functions could choose to implement? One of our goals with PIG-966 was to make the easy case of store functions easy, and put less commonly used features in separate interfaces. This way implementing a new store function is less of a burden. So could we add an interface like:

        interface OverwritingLoadFunc {
        
            
            /**
             * whether the implementation support overwrite
             * 
             * @return
             */
            public boolean isOverWrite();
            
            /**
             * cleanup the old output if you want to overwrite
             * 
             * @param store, the store information you would like to delete
             * @param context, used for deletion operation
             * @throws IOException
             */
            public void cleanupOutput(POStore store, PigContext context) throws IOException;
        }
        

        Also, I'm guessing your cleanupOutput method will need a Job and a store location (like the recently added cleanupOnFailure call, see PIG-1265).

        Show
        Alan Gates added a comment - Thanks for the explanation, now I understand what you're doing. I'd like to make one more suggestion. Could we do this in a separate interface that store functions could choose to implement? One of our goals with PIG-966 was to make the easy case of store functions easy, and put less commonly used features in separate interfaces. This way implementing a new store function is less of a burden. So could we add an interface like: interface OverwritingLoadFunc { /** * whether the implementation support overwrite * * @ return */ public boolean isOverWrite(); /** * cleanup the old output if you want to overwrite * * @param store, the store information you would like to delete * @param context, used for deletion operation * @ throws IOException */ public void cleanupOutput(POStore store, PigContext context) throws IOException; } Also, I'm guessing your cleanupOutput method will need a Job and a store location (like the recently added cleanupOnFailure call, see PIG-1265 ).
        Hide
        Olga Natkovich added a comment -

        I am wondering if this is an issue we need to resolve in Pig at all. Openning new interfaces is costly and something we would need to maintain for a long time. Are the use cases strong enough to pay the cost?

        Show
        Olga Natkovich added a comment - I am wondering if this is an issue we need to resolve in Pig at all. Openning new interfaces is costly and something we would need to maintain for a long time. Are the use cases strong enough to pay the cost?
        Hide
        Jeff Zhang added a comment -

        I also think it make sense to have the StoreFuncInterface contain the overwriting interface. For users it is easy to understand the interface and easy to maintenance.

        Show
        Jeff Zhang added a comment - I also think it make sense to have the StoreFuncInterface contain the overwriting interface. For users it is easy to understand the interface and easy to maintenance.
        Hide
        Jeff Zhang added a comment -

        Actually I have different thoughts about the class hierarchy structure of LoadFunc and StoreFunc, I will attach some diagram to illustrate my idea int Pig-966 later.

        Show
        Jeff Zhang added a comment - Actually I have different thoughts about the class hierarchy structure of LoadFunc and StoreFunc, I will attach some diagram to illustrate my idea int Pig-966 later.
        Hide
        Olga Natkovich added a comment -

        Jeff, are you still planning to work on this issue for 0.8.0 release or should we unlink it?

        Show
        Olga Natkovich added a comment - Jeff, are you still planning to work on this issue for 0.8.0 release or should we unlink it?
        Hide
        Olga Natkovich added a comment -

        Unlinking since there is no activity since early may. Jeff, please, feel free to link in if you still planning to work on it for 0.8 release

        Show
        Olga Natkovich added a comment - Unlinking since there is no activity since early may. Jeff, please, feel free to link in if you still planning to work on it for 0.8 release
        Hide
        Jeff Zhang added a comment -

        Olga, Sorry, I am afraid I have no time to commit this issue before the release of 0.8. I move it to 0.9

        Show
        Jeff Zhang added a comment - Olga, Sorry, I am afraid I have no time to commit this issue before the release of 0.8. I move it to 0.9
        Hide
        Olga Natkovich added a comment -

        Jeff, are you still planning to work on this for 0.9?

        Show
        Olga Natkovich added a comment - Jeff, are you still planning to work on this for 0.9?
        Hide
        Jeff Zhang added a comment -

        Yes, I will attach the patch for the latest trunk asap

        Show
        Jeff Zhang added a comment - Yes, I will attach the patch for the latest trunk asap
        Hide
        Olga Natkovich added a comment -

        Hi Jeff, It has been a couple of weeks so I just wanted to check with you on this. We would really like to get all new features on in the next couple of weeks to leave sufficient time for testing before the release. Let me know if you still want to keep it attached to 0.9 release

        Show
        Olga Natkovich added a comment - Hi Jeff, It has been a couple of weeks so I just wanted to check with you on this. We would really like to get all new features on in the next couple of weeks to leave sufficient time for testing before the release. Let me know if you still want to keep it attached to 0.9 release
        Hide
        Olga Natkovich added a comment -

        Unlinking from the release since there is no activities and we are moving to the stabilization part of the release

        Show
        Olga Natkovich added a comment - Unlinking from the release since there is no activities and we are moving to the stabilization part of the release
        Hide
        Gianmarco De Francisci Morales added a comment -

        Canceling because the patch is stale and there has been no activity in 1 year.

        Show
        Gianmarco De Francisci Morales added a comment - Canceling because the patch is stale and there has been no activity in 1 year.
        Hide
        Nezih Yigitbasi added a comment -

        Seems like no one is working on this, so I assigned this task to myself and gave it a shot. Following Alan's comments I introduced a new OverwritingStoreFunc interface, added additional unit tests and all tests pass now. Can you guys please review?

        Show
        Nezih Yigitbasi added a comment - Seems like no one is working on this, so I assigned this task to myself and gave it a shot. Following Alan's comments I introduced a new OverwritingStoreFunc interface, added additional unit tests and all tests pass now. Can you guys please review?
        Hide
        Daniel Dai added a comment -

        Nezih Yigitbasi, thanks for the patch. Couple of comments:
        1. Formatting: please use 4 space instead of tab
        2. TestPigStorage.java: PigStorage(',', '-overwrite true') => PigStorage(',', '--overwrite true'), -long_option only works in older version of common-cli
        3. InputOutputFileValidator.java: I get a org.apache.hadoop.fs.FileAlreadyExistsException instead of org.apache.hadoop.mapred.FileAlreadyExistsException in Hadoop 2
        4. PigOutputFormat.java: Seems we already did the cleanup in JobControlCompiler, any reason do it again here?

        Show
        Daniel Dai added a comment - Nezih Yigitbasi , thanks for the patch. Couple of comments: 1. Formatting: please use 4 space instead of tab 2. TestPigStorage.java: PigStorage(',', '-overwrite true') => PigStorage(',', '--overwrite true'), -long_option only works in older version of common-cli 3. InputOutputFileValidator.java: I get a org.apache.hadoop.fs.FileAlreadyExistsException instead of org.apache.hadoop.mapred.FileAlreadyExistsException in Hadoop 2 4. PigOutputFormat.java: Seems we already did the cleanup in JobControlCompiler, any reason do it again here?
        Hide
        Nezih Yigitbasi added a comment -

        Daniel,
        Thanks for the feedback.
        1. Updated formatting.
        2. Updated.
        3. Updated and made sure that unit tests also pass with -Dhadoopversion=23
        4. We do not do cleanup here. What we do is check whether store func is an OverwritingStoreFunc and decide whether we should call checkOutputSpecs. checkOutputSpecs() call fails when the directory already exists, that's why we need to bypass that check for OverwritingStoreFunc.

        Show
        Nezih Yigitbasi added a comment - Daniel, Thanks for the feedback. 1. Updated formatting. 2. Updated. 3. Updated and made sure that unit tests also pass with -Dhadoopversion=23 4. We do not do cleanup here. What we do is check whether store func is an OverwritingStoreFunc and decide whether we should call checkOutputSpecs. checkOutputSpecs() call fails when the directory already exists, that's why we need to bypass that check for OverwritingStoreFunc.
        Hide
        Daniel Dai added a comment -

        Thanks for the update. Another several comments:
        1. PigOutputFormat.java: Remove "PigStorage ps = (PigStorage) sFunc;", we cannot assume sFunc is PigStorage
        2. InputOutputFileValidator.java: Shall we skip checkOutputSpecs when overwrite happens? There is nothing wrong to capture FileAlreadyExistsException exception, but since you skip checkOutputSpecs in PigOutputFormat, it seems better to do it consistently
        3. Another tab in PigStorage.java: "protected ResourceSchema schema"

        Show
        Daniel Dai added a comment - Thanks for the update. Another several comments: 1. PigOutputFormat.java: Remove "PigStorage ps = (PigStorage) sFunc;", we cannot assume sFunc is PigStorage 2. InputOutputFileValidator.java: Shall we skip checkOutputSpecs when overwrite happens? There is nothing wrong to capture FileAlreadyExistsException exception, but since you skip checkOutputSpecs in PigOutputFormat, it seems better to do it consistently 3. Another tab in PigStorage.java: "protected ResourceSchema schema"
        Hide
        Nezih Yigitbasi added a comment -

        Daniel,
        Thanks for the comments.
        1. Good catch.
        2. Updated PigOutputFormat to be consistent with InputOutputFileValidator. Both do checks now.
        3. Fixed.

        Show
        Nezih Yigitbasi added a comment - Daniel, Thanks for the comments. 1. Good catch. 2. Updated PigOutputFormat to be consistent with InputOutputFileValidator. Both do checks now. 3. Fixed.
        Hide
        Daniel Dai added a comment -

        Also add some comment to OverwritingStoreFunc. +1.

        Patch committed to trunk. Thanks Nezih!

        Show
        Daniel Dai added a comment - Also add some comment to OverwritingStoreFunc. +1. Patch committed to trunk. Thanks Nezih!
        Hide
        Nezih Yigitbasi added a comment -

        Daniel, one question. Do you think the isOverwrite() method in the OverwritingStoreFunc interface necessary? If a store func. implements this interface it is very likely that it will return true in isOverwrite(). Maybe we should remove that method, what do you think?

        Show
        Nezih Yigitbasi added a comment - Daniel, one question. Do you think the isOverwrite() method in the OverwritingStoreFunc interface necessary? If a store func. implements this interface it is very likely that it will return true in isOverwrite(). Maybe we should remove that method, what do you think?
        Hide
        Daniel Dai added a comment -

        Yes, good point. Let's remove the method to keep interface simpler.

        Show
        Daniel Dai added a comment - Yes, good point. Let's remove the method to keep interface simpler.
        Hide
        Nezih Yigitbasi added a comment -

        When I removed the isOverwritable method tests failed. Because, the user tells us whether to overwrite or not, and using that flag we determine to catch file not found problems during validation. That is, implementing the interface is not enough to catch file not found problems during validation (user says "-overwrite false" but we only check whether PigStorage implements the OverwritingStoreFunc and ignore his input), so we need a flag that tells us user's input. To make the intent clearer I changed the name of OverwritingStoreFunc to OverwritableStoreFunc and changed the name of the method from "isOverwrite" to "shouldOverwrite", also added some javadoc.

        Show
        Nezih Yigitbasi added a comment - When I removed the isOverwritable method tests failed. Because, the user tells us whether to overwrite or not, and using that flag we determine to catch file not found problems during validation. That is, implementing the interface is not enough to catch file not found problems during validation (user says "-overwrite false" but we only check whether PigStorage implements the OverwritingStoreFunc and ignore his input), so we need a flag that tells us user's input. To make the intent clearer I changed the name of OverwritingStoreFunc to OverwritableStoreFunc and changed the name of the method from "isOverwrite" to "shouldOverwrite", also added some javadoc.
        Hide
        Daniel Dai added a comment -

        Sounds good. Since the patch is committed, can you upload the delta patch?

        Show
        Daniel Dai added a comment - Sounds good. Since the patch is committed, can you upload the delta patch?
        Hide
        Nezih Yigitbasi added a comment -

        Daniel, delta patch added. Thanks for the review.

        Show
        Nezih Yigitbasi added a comment - Daniel, delta patch added. Thanks for the review.
        Hide
        Daniel Dai added a comment -

        PIG-259.9.patch committed. Thanks Nezih!

        Show
        Daniel Dai added a comment - PIG-259 .9.patch committed. Thanks Nezih!

          People

          • Assignee:
            Nezih Yigitbasi
            Reporter:
            Olga Natkovich
          • Votes:
            3 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development