Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      A new API (org.apache.pig.PigRunner) is added that allows users to execute a Pig script within a Java program and get a PigStats object as a return value.
    • Tags:
      documentation

      Description

      It would be nice to make Pig friendlier for applications, such as workflow systems, that execute Pig scripts on a user's behalf.

      Currently, they would have to use the Pig command line to execute the code; however, this limits the kind of output that can be delivered. For instance, it is hard to produce error information that can be consumed programmatically, or to collect statistics.

      The proposal is to create a class that mimics the behavior of Main but gives users a status object back. The main code of Pig would then look something like:

      public static void main(String[] args) {
          PigStatus ps = PigMain.exec(args);
          System.exit(ps.rc);
      }

      We need to define the following:

      • Content of PigStatus. It should at least include:
        • return code
        • error string
        • exception
        • statistics
      • A way to propagate the status class through Pig code
      Attachments

      1. PIG-1333_1.patch
        184 kB
        Richard Ding
      2. PIG-1333_2.patch
        184 kB
        Richard Ding
      3. PIG-1333_3.patch
        186 kB
        Richard Ding
      4. PIG-1333.patch
        292 kB
        Richard Ding

        Activity

        rding Richard Ding added a comment -

        I propose Pig add a new class PigRunner that has a run method that returns a PigStats object:

        package org.apache.pig;
        
        public abstract class PigRunner {
            public static PigStats run(String[] args) {...}
        }
        

        The PigStats class will include the following methods:

        boolean isSuccessful() 
        
        int getReturnCode() // a list of return codes will be defined in PigRunner
        
        String getErrorMessage()
        
        int getErrorCode() // PigException's error code
        
        int getNumberJobs() // number of MR jobs for this invocation
        
        JobPlan getJobPlan() // DAG of MR jobs (a.k.a. an OperatorPlan)
        
        List<String> getOutputLocations() 
        
        long getNumberRecords(String location) // number of records in the given output location
        
        long getNumberBytes(String location)  // number of bytes in the given location
        
        ... ... // a few more
        

        A job in the JobPlan will include these methods:

        
        String getAlias() // the alias associated with this job
        
        String getFeature()  // the Pig feature associated with this job
        
        int getNumberMaps() 
        
        int getNumberReduces() 
        
        ... ... // a few more methods on job statistics retrieved from Hadoop
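
        A minimal usage sketch of the proposed API (the script path, the surrounding class, and the package of PigStats are assumptions for illustration, not part of the proposal):

        import org.apache.pig.PigRunner;
        import org.apache.pig.tools.pigstats.PigStats;   // assumed package

        public class WorkflowLauncher {
            public static void main(String[] args) {
                // illustrative only: run a script and inspect the returned stats
                PigStats stats = PigRunner.run(new String[] { "/path/to/script.pig" });
                if (!stats.isSuccessful()) {
                    System.err.println("Pig failed: " + stats.getErrorMessage());
                    System.exit(stats.getReturnCode());
                }
                for (String location : stats.getOutputLocations()) {
                    System.out.println(location + ": "
                            + stats.getNumberRecords(location) + " records, "
                            + stats.getNumberBytes(location) + " bytes");
                }
            }
        }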
        
        
        rding Richard Ding added a comment -

        This patch implements the proposed API.

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12445727/PIG-1333.patch
        against trunk revision 949057.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 99 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/14/console

        This message is automatically generated.

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12445786/PIG-1333.patch
        against trunk revision 949057.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 99 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 147 javac compiler warnings (more than the trunk's current 139 warnings).

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        -1 release audit. The applied patch generated 395 release audit warnings (more than the trunk's current 385 warnings).

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/16/testReport/
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/16/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/16/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/16/console

        This message is automatically generated.

        rding Richard Ding added a comment -

        The release audit warnings are all HTML related and can be ignored. The javac warnings are caused by using deprecated classes (such as Hadoop Counters) and can't be suppressed.

        olgan Olga Natkovich added a comment -

        Patch looks good. A few comments and questions:

        (1) General comment. This patch is very large and combines multiple different issues that could have been separated into multiple patches to make it easier to review and test
        (2) We are missing script-level feature collection. (I see the one at the job level.) For each script, we want to collect overall script features such as the operators used (join, order by, etc.), whether it is a multiquery, and whether it has UDFs. Also, we would want to know whether the combiner was used and whether the script spilled, but maybe both of those can be at the job level.
        (3) We need to add separate comment to the JIRA marked as documentation that describes PigRunner since it is a new interface that we need to include in 0.8.0 documentation.
        (4) MapReduceLauncher. Why was exception handling and temp store handling code removed?
        (5) OutputStats assumes that location is a path which might not be true for non-file stores.
        (6) ScriptState: There are maps/hashes optimized for enums (http://java.sun.com/j2se/1.5.0/docs/api/java/util/EnumMap.html)
        (7) Why is JobStats derived from an operator?
        (8) Why did JOB_NAME_PREFIX get removed from PigContext?
        (9) Why do we need to synchronize getTemporaryFile?

        dvryaboy Dmitriy V. Ryaboy added a comment -

        That's a heck of a patch. I am really looking forward to having this available.

        Not sure you need the map in the PIG_FEATURE enum. You can get an enum by offset using PIG_FEATURE.values(), so no need for the constructor; you can get a string representation using pigFeature.name() or pigFeature.toString(), so no need for getString(); and you can get the ordinal using pigFeature.ordinal(). Granted, the ordinals are 0-based, but you can just throw in a dummy value for the 0th spot to preserve all the offsets as they are.
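
        For illustration, a hypothetical feature enum showing the built-in enum methods mentioned above (the values here are made up, not the actual PIG_FEATURE entries):

        public class EnumDemo {
            enum PigFeature {
                UNKNOWN,      // dummy 0th value so the remaining ordinals keep their offsets
                HASH_JOIN,
                ORDER_BY,
                MULTI_QUERY
            }

            public static void main(String[] args) {
                PigFeature f = PigFeature.values()[2];   // lookup by offset, no constructor map needed
                System.out.println(f.name());            // "ORDER_BY", no getString() needed
                System.out.println(f.ordinal());         // 2, no stored ordinal field needed
            }
        }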

        I see that you explicitly pull out the known and enumerated Pig counters. Any reason not to make all other job counters available as well via the same interface?

        rding Richard Ding added a comment -

        Thanks for reviewing the patch. My comments are inline.

        (1) General comment. This patch is very large and combines multiple different issues that could have been separated into multiple patches to make it easier to review and test

        In this patch, JobStats and OutputStats replace HJob as the return value of executing a Pig script. This results in several deprecated/removed classes. I also went through the unit tests to replace most of the deprecated methods used there, which required modifying many test classes.

        (2) We are missing script-level feature collection. (I see the one at the job level.) For each script, we want to collect overall script features such as the operators used (join, order by, etc.), whether it is a multiquery, and whether it has UDFs. Also, we would want to know whether the combiner was used and whether the script spilled, but maybe both of those can be at the job level.

        I'll add script-level features in addition to the job-level feature. The value of the job-level feature (pig.job.feature is the key in the job XML) will be a string, while the value of the script-level feature (pig.script.features) will be a bit set represented by a long (see the sketch at the end of this comment).

        (3) We need to add separate comment to the JIRA marked as documentation that describes PigRunner since it is a new interface that we need to include in 0.8.0 documentation.

        Will do.

        (4) MapReduceLauncher. Why was exception handling and temp store handling code removed?

        Exception handling is still there. Handling of temp stores for failed jobs is moved to JobStats.

        (5) OutputStats assumes that location is a path which might not be true for non-file stores.

        It isn't necessary to use path in OutputStats (just trying to get a more readable short name for location). I'll move it.

        (6) ScriptState: There are maps/hashes optimized for enums (http://java.sun.com/j2se/1.5.0/docs/api/java/util/EnumMap.html)

        Dmitriy also commented on this. I'll modify the usage of enums based on the comments.

        (7) Why is JobStats derived from an operator?

        The JobGraph of the script is derived from BaseOperatorPlan (in the experimental package), whose elements are Operators. JobStats is derived from Operator in order to use the methods on the base class.

        (8) Why did JOB_NAME_PREFIX get removed from PigContext?

        JOB_NAME_PREFIX is still there. What is removed is the private member 'jobName', which isn't used in the class and whose getter and setter were removed a long time ago.

        (9) Why do we need to synchronize getTemporaryFile?

        I'll remove the synchronization from this method. It was copied from the corresponding deprecated method.
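
        A sketch of the bit-set encoding mentioned in (2), assuming a feature enum whose ordinals index bits of a long (the enum values and helper names are illustrative, not the actual implementation):

        public class FeatureBits {
            enum PigFeature { MULTI_QUERY, HASH_JOIN, ORDER_BY, UDF, COMBINER }

            // set one bit per feature present in the script
            static long encode(Iterable<PigFeature> features) {
                long bits = 0L;
                for (PigFeature f : features) {
                    bits |= 1L << f.ordinal();
                }
                return bits;
            }

            // test whether a given feature's bit is set
            static boolean has(long bits, PigFeature f) {
                return (bits & (1L << f.ordinal())) != 0;
            }

            public static void main(String[] args) {
                long bits = encode(java.util.Arrays.asList(
                        PigFeature.MULTI_QUERY, PigFeature.UDF));
                System.out.println(has(bits, PigFeature.UDF));       // true
                System.out.println(has(bits, PigFeature.ORDER_BY));  // false
            }
        }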

        rding Richard Ding added a comment -

        Thanks Dmitriy for the suggestion on enums. It will simplify the code.

        Right now the job counters we retrieve are

        MAP_INPUT_RECORDS
        MAP_OUTPUT_RECORDS
        REDUCE_INPUT_RECORDS
        REDUCE_OUTPUT_RECORDS
        HDFS_BYTES_WRITTEN
        SPILLABLE_MEMORY_MANAGER_SPILL_COUNT
        PROACTIVE_SPILL_COUNT
        

        plus Pig MultiStore counters.

        I'm not sure we should make all Hadoop counters available through the new API. How useful will it be to the users? I'm open to suggestions.

        rding Richard Ding added a comment -

        Now I also think this patch is too large. I'll break it into two: one patch for this JIRA that concentrates on implementing the new API; the other patch will address the code cleanup in a different JIRA.

        olgan Olga Natkovich added a comment -

        +1 to that!

        dvryaboy Dmitriy V. Ryaboy added a comment -

        I'm not sure we should make all Hadoop counters available through the new API. How useful will it be to the users? I'm open to suggestions.

        Can't speak for other users, but we use counters quite a bit with Elephant Bird and some internal code for keeping track of timed out service requests, unparsable records, and more. The @MonitoredUDF annotation I proposed in PIG-1427 uses counters to report on runaway udfs that get killed.

        I think the question isn't so much why would you expose them, as why wouldn't you expose them...

        rding Richard Ding added a comment -

        Hi Dmitriy,

        Let me try to clarify the requirement: what you are asking for is to have the new API expose the Hadoop counters, such as

        public class JobStats {
        
            public Counters getCounters();
        
            ......
        
        }
        

        so users can get the counter they are interested in.

        dvryaboy Dmitriy V. Ryaboy added a comment -

        Yup.

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12447048/PIG-1333_1.patch
        against trunk revision 953798.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 11 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        -1 release audit. The applied patch generated 387 release audit warnings (more than the trunk's current 383 warnings).

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/329/testReport/
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/329/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/329/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/329/console

        This message is automatically generated.

        rding Richard Ding added a comment -

        Hi Dmitriy,

        I added a new method to the JobStats class:

        public class JobStats {
        
            public Counters getHadoopCounters();
        
            ......
        }
        

        Can you please review the patch?
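
        For reference, a caller could read an arbitrary Hadoop counter from a JobStats like this (the counter group and name, and the package of JobStats, are assumptions for illustration):

        import org.apache.hadoop.mapred.Counters;
        import org.apache.pig.tools.pigstats.JobStats;   // assumed package

        public class CounterDemo {
            static long unparsableRecords(JobStats jobStats) {
                Counters counters = jobStats.getHadoopCounters();
                return counters.getGroup("MyApp")              // user-defined counter group
                               .getCounter("UNPARSABLE_RECORDS");
            }
        }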

        dvryaboy Dmitriy V. Ryaboy added a comment -

        My apologies for the delayed review.
        Overall, looks great.

        A few comments:

        You should add the new visibility annotations to new classes (JobStats should be Public and Evolving, for example).

        Why is MapReduceLauncher suppressing deprecation warning?

        It is currently possible to have cogroup, groupBy, and hashJoin all be true in MapReduceOper as they are independent variables. I don't see that happening in the code, but making the three states (plus a fourth state, "none") an enum and storing the enum value in a single variable is more defensive and might save us from potential bugs down the line. This change can be confined to MapReduceOper since you would just have to change the accessor methods slightly.

        Why is MAX_SCRIPT_SIZE so short? Is it an arbitrary number? We produce significantly longer scripts all the time. I assume it's to save on space; perhaps you can make it controllable by some property?

        In ScriptState, the error message "LOG.warn("unable to get job predecessors", e);" should be modified to include information about the job to assist debugging.

        Same class, getAlias method – perhaps it makes sense to set the alias to a default if you encounter a VisitorException?

        I am not very familiar with the visitors, but it looks like normally, PhyPlanVisitor spawns a walker for the internal plans of Filter, CollectedGroup, and so on; this behavior appears to be gone from AliasVisitor. Should it be reproduced in AliasVisitor?

        Just a style thing, but I prefer writing setter methods that return self instead of being void – that way you can chain them together.
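
        As an aside, the chained-setter style mentioned here would look roughly like this (the class and fields are hypothetical, purely for illustration):

        public class JobConfig {
            private String alias;
            private int numMaps;

            public JobConfig setAlias(String alias) {
                this.alias = alias;
                return this;      // returning 'this' lets callers chain setters
            }

            public JobConfig setNumMaps(int numMaps) {
                this.numMaps = numMaps;
                return this;
            }

            public static void main(String[] args) {
                JobConfig c = new JobConfig().setAlias("A").setNumMaps(10);
                System.out.println(c.alias + " " + c.numMaps);
            }
        }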

        Everything else looks good.

        rding Richard Ding added a comment -

        New patch to address the review comments.

        rding Richard Ding added a comment -

        A few comments about the comments:

        Why is MAX_SCRIPT_SIZE so short? Is it an arbitrary number? We produce significantly longer scripts all the time. I assume it's to save on space; perhaps you can make it controllable by some property?

        This is a compromise. The next release of Hadoop will completely fix this problem (no limit on the length of the scripts). Until then, we won't allow users to change this setting and inadvertently affect Hadoop performance.

        I am not very familiar with the visitors, but it looks like normally, PhyPlanVisitor spawns a walker for the internal plans of Filter, CollectedGroup, and so on; this behavior appears to be gone from AliasVisitor. Should it be reproduced in AliasVisitor?

        I think the top-level aliases are enough to identify the operators. No need to use alias in the inner plans.

        Just a style thing, but I prefer writing setter methods that return self instead of being void - that way you can chain them together.

        I agree with you on this, but I also want to be consistent with the style used throughout Pig, so I didn't change the setters.

        dvryaboy Dmitriy V. Ryaboy added a comment -

        +1

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12447767/PIG-1333_3.patch
        against trunk revision 957046.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 11 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 140 javac compiler warnings (more than the trunk's current 138 warnings).

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        -1 release audit. The applied patch generated 391 release audit warnings (more than the trunk's current 387 warnings).

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/347/testReport/
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/347/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/347/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/347/console

        This message is automatically generated.

        rding Richard Ding added a comment -

        The unit test failures are not related to the patch (I ran the core tests locally and they pass). The release audit warnings are all HTML related, and the additional javac warnings are due to using a couple of deprecated Hadoop classes.

        I'm going to commit the patch shortly.


          People

          • Assignee: rding Richard Ding
          • Reporter: olgan Olga Natkovich
          • Votes: 0
          • Watchers: 7
