Pig / PIG-1064

Behaviour of COGROUP with and without schema when using "*" operator

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.6.0
    • Component/s: impl
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      I have two tab-separated files, "1.txt" and "2.txt":

      $ cat 1.txt
      1 2
      2 3
      ====================
      $ cat 2.txt
      1 2
      2 3

      I use the COGROUP feature of Pig in the following way:

      $ java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main

      grunt> A = load '1.txt';            
      grunt> B = load '2.txt' as (b0, b1);
      grunt> C = cogroup A by *, B by *;  
      

      2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1012: Each COGroup input has to have the same number of inner plans
      Details at logfile: pig_1256845224752.log
      ==========================================================

      If I reverse the order of the schemas:

      grunt> A = load '1.txt' as (a0, a1);
      grunt> B = load '2.txt';            
      grunt> C = cogroup A by *, B by *;  
      

      2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1013: Grouping attributes can either be star or a list of expressions, but not both.
      Details at logfile: pig_1256845224752.log

      ==========================================================
      Now running without a schema on either input:

      grunt> A = load '1.txt';            
      grunt> B = load '2.txt';            
      grunt> C = cogroup A by *, B by *;
      grunt> dump C; 
      

      2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/tmp/temp-319926700/tmp-1990275961"
      2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 2
      2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 154
      2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
      2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!

      ((1,2),{(1,2)},{(1,2)})
      ((2,3),{(2,3)},{(2,3)})
      ==========================================================
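
      For reference, the result of the schema-less run is what one would expect if each whole tuple were used as the cogroup key. The following is a minimal Python sketch of that behaviour, not Pig's implementation; the function name cogroup_by_star is hypothetical, and the rows are the contents of 1.txt and 2.txt.

```python
from collections import defaultdict


def cogroup_by_star(a_rows, b_rows):
    """Cogroup two relations, using each whole tuple as the grouping key."""
    # For every key keep one bag (list) per input relation.
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[tuple(row)][0].append(tuple(row))
    for row in b_rows:
        groups[tuple(row)][1].append(tuple(row))
    # Emit (key, bag_from_A, bag_from_B), ordered by key for readability.
    return [(key, bag_a, bag_b) for key, (bag_a, bag_b) in sorted(groups.items())]


a = [(1, 2), (2, 3)]  # rows of 1.txt
b = [(1, 2), (2, 3)]  # rows of 2.txt
result = cogroup_by_star(a, b)
for key, bag_a, bag_b in result:
    print(key, bag_a, bag_b)
```

      This model always produces one bag per input for every key. It does not model the front-end checks that reject the two mixed-schema cases above.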

      Is this a bug or a feature?

      Viraj

      1. PIG-1064-5.patch
        14 kB
        Pradeep Kamath
      2. PIG-1064-4.patch
        21 kB
        Daniel Dai
      3. PIG-1064-3.patch
        21 kB
        Pradeep Kamath
      4. PIG-1064-2.patch
        20 kB
        Pradeep Kamath
      5. PIG-1064.patch
        9 kB
        Pradeep Kamath

        Activity

        Pradeep Kamath added a comment -

        A proposal to fix this is to catch the situation wherein the user specifies '*' as the cogrouping key and does not have a schema for the corresponding input to the cogroup. In these situations we would issue an error message - "Cogroup by * is only allowed if the input has a schema" and error out.

        Alan Gates added a comment -

        Why is cogrouping on * without a schema causing trouble? Because we can't guarantee that inputs have the same number of fields?

        Why would anyone ever want to cogroup on *? Do we need to spend any effort fixing this?

        Pradeep Kamath added a comment -

        Cogroup needs the same arity for the grouping key from both inputs. If there is a cogroup by *, the '*' needs to be expanded so we know the arity. This is done in ProjectStarTranslator - the current code leaves the '*' as is when there is no schema. This causes problems in the backend - hence the proposed fix to catch this and error out.

        If we feel that users should not cogroup on '' we should prevent it in the parser. The proposed fix is easy enough that I don't think we need to restrict the use of ''.
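
        The expansion and arity check described in this comment can be sketched as follows. This is an illustrative Python model only, not the code in ProjectStarTranslator; expand_star, check_cogroup_keys, and CogroupStarError are hypothetical names, and a schema is modeled as a list of field names, or None when the input was loaded without one.

```python
class CogroupStarError(Exception):
    """Raised when cogroup keys cannot be validated (hypothetical name)."""


STAR = "*"


def expand_star(keys, schema):
    """Replace '*' with the schema's field names so the key arity is known."""
    expanded = []
    for key in keys:
        if key == STAR:
            if schema is None:
                # Without a schema the arity of '*' is unknown: fail early.
                raise CogroupStarError(
                    "Cogroup by * is only allowed if the input has a schema"
                )
            expanded.extend(schema)
        else:
            expanded.append(key)
    return expanded


def check_cogroup_keys(inputs):
    """inputs is a list of (keys, schema) pairs, one per cogroup input."""
    expanded = [expand_star(keys, schema) for keys, schema in inputs]
    # Every input must contribute a grouping key of the same arity.
    if len({len(keys) for keys in expanded}) > 1:
        raise CogroupStarError("cogroup keys must have the same arity on every input")
    return expanded


# Both inputs declare a schema, so '*' expands to two fields on each side.
print(check_cogroup_keys([(["*"], ["a0", "a1"]), (["*"], ["b0", "b1"])]))
```

        Under this model, passing None as the schema of either input raises the error at validation time instead of leaving the '*' unexpanded for the backend.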

        Pradeep Kamath added a comment -

        The last paragraph in my previous comment should read:
        If we feel that users should not cogroup on star we should prevent it in the parser. The proposed fix is easy enough that I don't think we need to restrict the use of star.

        Pradeep Kamath added a comment -

        The patch implements the proposal to catch the situation wherein the user specifies '*' as the cogrouping key and does not have a schema for the corresponding input to the cogroup. In these situations we would issue an error message - "Cogroup by * is only allowed if the input has a schema" and error out.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12424676/PIG-1064.patch
        against trunk revision 835005.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/149/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/149/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/149/console

        This message is automatically generated.

        Pradeep Kamath added a comment -

        The attached patch addresses the unit test failures. The failures were in other tests in which cogroup by * without a schema was accepted by the front end; with the changes in this patch, that is no longer the case. I have removed these test cases, except for one that I retained because it tests with different loadfuncs.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12424755/PIG-1064-2.patch
        against trunk revision 835499.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/153/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/153/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/153/console

        This message is automatically generated.

        Pradeep Kamath added a comment -

        A couple of new tests added by a recent patch (PIG-1038) used group by star and broke with this patch. The attached patch fixes those tests.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12424792/PIG-1064-3.patch
        against trunk revision 835499.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 15 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/154/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/154/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/154/console

        This message is automatically generated.

        Daniel Dai added a comment -

        Attached a patch to fix the TestSecondarySort unit test failure.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12424878/PIG-1064-4.patch
        against trunk revision 835499.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 15 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/155/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/155/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/155/console

        This message is automatically generated.

        Pradeep Kamath added a comment -

        I can't make out what is wrong with the unit tests from the report above. I am running them all on my local box and will update with the results.

        Daniel Dai added a comment -

        With this patch, "group by *" without a schema no longer works. I think there could be valid use cases for it, e.g., people may want to use it to count the occurrences of each distinct value with a statement such as "group by *; foreach generate group, COUNT;". It is much safer to keep "group by *" working and to disallow only "cogroup by *".
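
        The use case mentioned here, counting the occurrences of each distinct tuple, can be sketched in Python as below. This is only a model of the idea, not Pig code; group_by_star_count is a hypothetical name. With a single input the whole tuple can serve as the key whatever its arity, which is presumably why group by * does not need a schema while cogroup by * does.

```python
from collections import Counter


def group_by_star_count(rows):
    """Group rows by the whole tuple and count each distinct tuple."""
    counts = Counter(tuple(row) for row in rows)
    # Return (group, count) pairs ordered by group for stable output.
    return sorted(counts.items())


# Rows may differ in arity; no schema is needed to use the tuple as the key.
rows = [(1, 2), (2, 3), (1, 2), (7,)]
counts = group_by_star_count(rows)
for group, count in counts:
    print(group, count)
```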

        Pradeep Kamath added a comment -

        Attached a patch to ensure group by star without a schema still works.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12425360/PIG-1064-5.patch
        against trunk revision 881008.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/160/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/160/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/160/console

        This message is automatically generated.

        Daniel Dai added a comment -

        +1

        Pradeep Kamath added a comment -

        Patch committed to trunk.


          People

          • Assignee: Pradeep Kamath
          • Reporter: Viraj Bhat