Pig
  1. Pig
  2. PIG-1420

Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: impl
    • Labels:
    • Patch Info:
      Patch Available
    • Release Note:
      Hide
      CONCAT handles all fields in the supplied tuple, instead of just the first two. This is backwards compatible unless you were relying on it only using the first two fields, which seems unlikely. DataByteArray now has an append() method.

      Example use before:

      B = FOREACH A GENERATE CONCAT(CONCAT(first_name, ' '), last_name);

      Example extended use now:

      D = FOREACH C GENERATE CONCAT(first_name, ' ', last_name);
      Show
      CONCAT handles all fields in the supplied tuple, instead of just the first two. This is backwards compatible unless you were relying on it only using the first two fields, which seems unlikely. DataByteArray now has an append() method. Example use before: B = FOREACH A GENERATE CONCAT(CONCAT(first_name, ' '), last_name); Example extended use now: D = FOREACH C GENERATE CONCAT(first_name, ' ', last_name);
    • Tags:
      pig builtin concat all fields of a tuple

      Description

      org.apache.pig.builtin.CONCAT (which acts on DataByteArray's internally) and org.apache.pig.builtin.StringConcat (which acts on Strings internally), both act on the first two fields of a tuple. This results in ugly nested CONCAT calls like:

      CONCAT(CONCAT(A, ' '), B)

      The more desirable form is:

      CONCAT(A, ' ', B)

      This change will be backwards compatible, provided that no one was relying on the fact that CONCAT ignores fields after the first two in a tuple. This seems a reasonable assumption to make, or at least a small break in compatibility for a sizable improvement.

      1. PIG-1420.2.patch
        1 kB
        Dmitriy V. Ryaboy
      2. addconcat2.patch
        8 kB
        Russell Jurney

        Issue Links

          Activity

          Hide
          Russell Jurney added a comment -

          Patch that adds:

          1) CONCAT handles all fields in the supplied tuple, instead of just the first two.
          2) StringConcat handles all fields in the supplied tuple, instead of just the first two.
          3) DataByteArray gets an append() to make the implementation of 1 & 2 clean (I think).
          4) Unit Tests for CONCAT and StringCONCAT in TestBuiltin
          5) Unit Tests for DataByteArray.append() in TestDataModel

          Show
          Russell Jurney added a comment - Patch that adds: 1) CONCAT handles all fields in the supplied tuple, instead of just the first two. 2) StringConcat handles all fields in the supplied tuple, instead of just the first two. 3) DataByteArray gets an append() to make the implementation of 1 & 2 clean (I think). 4) Unit Tests for CONCAT and StringCONCAT in TestBuiltin 5) Unit Tests for DataByteArray.append() in TestDataModel
          Hide
          Russell Jurney added a comment -

          Passes all tests for me. I like Asparagus.

          Show
          Russell Jurney added a comment - Passes all tests for me. I like Asparagus.
          Hide
          Russell Jurney added a comment -

          Redoing CONCAT of DataByteArrays using java.nio.ByteBuffer

          Show
          Russell Jurney added a comment - Redoing CONCAT of DataByteArrays using java.nio.ByteBuffer
          Hide
          Russell Jurney added a comment -

          Re-submitting original, java.nio.ByteBuffer isn't very helpful.

          Show
          Russell Jurney added a comment - Re-submitting original, java.nio.ByteBuffer isn't very helpful.
          Hide
          Russell Jurney added a comment -

          I don't know what resume progress does, but I'm about to find out.

          Show
          Russell Jurney added a comment - I don't know what resume progress does, but I'm about to find out.
          Hide
          Dmitriy V. Ryaboy added a comment -

          I intend to look at the patch today.

          Show
          Dmitriy V. Ryaboy added a comment - I intend to look at the patch today.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Russell,
          Reading through the code it looks good, but the patch doesn't apply. I think you are supposed to to 'git diff -p' to generate something that the patch command understands?

          Show
          Dmitriy V. Ryaboy added a comment - Russell, Reading through the code it looks good, but the patch doesn't apply. I think you are supposed to to 'git diff -p' to generate something that the patch command understands?
          Hide
          Russell Jurney added a comment -

          Dmitriy, it applies with -p1

          Show
          Russell Jurney added a comment - Dmitriy, it applies with -p1
          Hide
          Russell Jurney added a comment -

          New, working patch made with git diff --no-prefix, applies with -p0

          Show
          Russell Jurney added a comment - New, working patch made with git diff --no-prefix, applies with -p0
          Hide
          Russell Jurney added a comment -

          Fixed bad comment re: copying bytes.

          Show
          Russell Jurney added a comment - Fixed bad comment re: copying bytes.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Patch looks good, tests pass. I would prefer to append all the DataByteArrays at once instead of performing multiple copies, but that's an edge optimization. Is Hudson on vacation? Haven't heard from it in a while..

          Show
          Dmitriy V. Ryaboy added a comment - Patch looks good, tests pass. I would prefer to append all the DataByteArrays at once instead of performing multiple copies, but that's an edge optimization. Is Hudson on vacation? Haven't heard from it in a while..
          Hide
          Alan Gates added a comment -

          Hudson is in the hospital rather than on vacation. Several of the machines have crashed and we haven't managed to resuscitate them yet. We've been running tests manually here. There's a test-patch target in ant that will run the doc, release audit, etc. checks. Then a run of ant test (takes about 5 hours ) to confirm all the tests pass.

          Show
          Alan Gates added a comment - Hudson is in the hospital rather than on vacation. Several of the machines have crashed and we haven't managed to resuscitate them yet. We've been running tests manually here. There's a test-patch target in ant that will run the doc, release audit, etc. checks. Then a run of ant test (takes about 5 hours ) to confirm all the tests pass.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Sigh. That requires java5 for forrest.
          I'll run ant test, can one of you guys who are set up for that run test-patch?
          Maybe I should send Hudson some hypo-allergenic flowers.

          Show
          Dmitriy V. Ryaboy added a comment - Sigh. That requires java5 for forrest. I'll run ant test, can one of you guys who are set up for that run test-patch? Maybe I should send Hudson some hypo-allergenic flowers.
          Hide
          Alan Gates added a comment -

          Output of ant test-patch

               [exec] -1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     -1 release audit.  The applied patch generated 528 release audit warnings (more than the trunk's current 527 warnings).
               [exec]
               [exec]
               [exec]
          

          The output of the release audit warning is:

          262d261
          <      [java]  !????? /home/gates/src/pig/PIG-1420/trunk/build/pig-0.8.0-dev/docs/jdiff/changes/org.apache.pig.data.DataByteArray.html
          

          I think this can be safely ignored, so we're good.

          I'll pass the hat here to get donations to buy Dmitriy a linux box so he can have more than one version of Java.

          Show
          Alan Gates added a comment - Output of ant test-patch [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 528 release audit warnings (more than the trunk's current 527 warnings). [exec] [exec] [exec] The output of the release audit warning is: 262d261 < [java] !????? /home/gates/src/pig/PIG-1420/trunk/build/pig-0.8.0-dev/docs/jdiff/changes/org.apache.pig.data.DataByteArray.html I think this can be safely ignored, so we're good. I'll pass the hat here to get donations to buy Dmitriy a linux box so he can have more than one version of Java.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Thanks Alan,
          Your concern is touching

          I'll commit this patch.

          Show
          Dmitriy V. Ryaboy added a comment - Thanks Alan, Your concern is touching I'll commit this patch.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Patch committed. Thanks Russell!

          Show
          Dmitriy V. Ryaboy added a comment - Patch committed. Thanks Russell!
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12444886/addconcat2.patch
          against trunk revision 948526.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/5/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12444886/addconcat2.patch against trunk revision 948526. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/5/console This message is automatically generated.
          Hide
          Olga Natkovich added a comment -

          I could not figure out how to re-open this issue. However, the code does not work in pig script. The main reason is that the code that selects which function to use does not deal yet with non-fixed number of arguments.

          grunt> A = load 'studentab10k' as (name: chararray, age: chararray, gpa: chararray);
          grunt> B = foreach A generate CONCAT(name, age, gpa);
          grunt> C = limit B 10;
          grunt> dump C;
          2010-08-17 12:17:41,635 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast.
          Details at logfile: /homes/olgan/pig_1282072550328.log
          grunt>

          Show
          Olga Natkovich added a comment - I could not figure out how to re-open this issue. However, the code does not work in pig script. The main reason is that the code that selects which function to use does not deal yet with non-fixed number of arguments. grunt> A = load 'studentab10k' as (name: chararray, age: chararray, gpa: chararray); grunt> B = foreach A generate CONCAT(name, age, gpa); grunt> C = limit B 10; grunt> dump C; 2010-08-17 12:17:41,635 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast. Details at logfile: /homes/olgan/pig_1282072550328.log grunt>
          Hide
          Dmitriy V. Ryaboy added a comment -

          This should fix the problem . LMK if you'd like me to commit this.

          Show
          Dmitriy V. Ryaboy added a comment - This should fix the problem . LMK if you'd like me to commit this.
          Hide
          Olga Natkovich added a comment -

          This will make it work with bytearrays but not strings

          Show
          Olga Natkovich added a comment - This will make it work with bytearrays but not strings
          Hide
          Dmitriy V. Ryaboy added a comment -

          Right.. i forgot people don't call StringConcat directly.
          I don't know how one specifies a vararg schema. Hints?

          Show
          Dmitriy V. Ryaboy added a comment - Right.. i forgot people don't call StringConcat directly. I don't know how one specifies a vararg schema. Hints?
          Hide
          Olga Natkovich added a comment -

          The only way I know is to actually make the code deal with var number of arguments but I think it is too late for 0.8. Perhaps we can revisit this for 0.9?

          Show
          Olga Natkovich added a comment - The only way I know is to actually make the code deal with var number of arguments but I think it is too late for 0.8. Perhaps we can revisit this for 0.9?
          Hide
          Dmitriy V. Ryaboy added a comment -

          Yeah, let's plan to add a way to specify a vararg in the schema in 0.9.

          In the meantime, what do we do with concat? Option 1: leave broken (only works for 2 arguments). Option 2: take out arg2func mapping, and have people who want to concat strings use StringConcat explicitly.

          Actually, there is an option 3, which makes more sense than option 2: make CONCAT actually do what StringConcat does, and introduce BinConcat (since it seems unlikely people are actually concatting bytearrays...).

          Show
          Dmitriy V. Ryaboy added a comment - Yeah, let's plan to add a way to specify a vararg in the schema in 0.9. In the meantime, what do we do with concat? Option 1: leave broken (only works for 2 arguments). Option 2: take out arg2func mapping, and have people who want to concat strings use StringConcat explicitly. Actually, there is an option 3, which makes more sense than option 2: make CONCAT actually do what StringConcat does, and introduce BinConcat (since it seems unlikely people are actually concatting bytearrays...).
          Hide
          Olga Natkovich added a comment -

          2 and 3 are backward incompatible with 0.7 and we really don't want to break compatibility in this release. So I would propose option 1 and proper fix in 0.9

          Show
          Olga Natkovich added a comment - 2 and 3 are backward incompatible with 0.7 and we really don't want to break compatibility in this release. So I would propose option 1 and proper fix in 0.9
          Hide
          Ashutosh Chauhan added a comment -

          > I could not figure out how to re-open this issue.

          Issues marked as resolved cannot be reopened. Once the patch is committed, commiter should mark issue as resolved, since resolved issues can be reopened before release is rolled out. When the release is rolled out, resolved issues should be marked as closed, since there is no point in reopening an issue which has already been released. If more work needs to be done on that issue new jira should be created for it for future releases.

          Show
          Ashutosh Chauhan added a comment - > I could not figure out how to re-open this issue. Issues marked as resolved cannot be reopened. Once the patch is committed, commiter should mark issue as resolved, since resolved issues can be reopened before release is rolled out. When the release is rolled out, resolved issues should be marked as closed, since there is no point in reopening an issue which has already been released. If more work needs to be done on that issue new jira should be created for it for future releases.
          Hide
          Russell Jurney added a comment -

          I just ran into this. I assume it is still broken? Man, I suck.

          grunt> metrics = foreach enrons generate CONCAT('ISODate(', datetime, ')');
          2012-05-16 22:24:02,276 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
          <line 4, column 34> Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast.
          2012-05-16 22:24:02,277 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1059:
          <line 4, column 10> Problem while reconciling output schema of ForEach

          Show
          Russell Jurney added a comment - I just ran into this. I assume it is still broken? Man, I suck. grunt> metrics = foreach enrons generate CONCAT('ISODate(', datetime, ')'); 2012-05-16 22:24:02,276 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: <line 4, column 34> Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast. 2012-05-16 22:24:02,277 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1059: <line 4, column 10> Problem while reconciling output schema of ForEach
          Hide
          Daniel Dai added a comment -

          Yes it is broken. getArgToFuncMapping still require two input variables. We need to open a new Jira.

          Show
          Daniel Dai added a comment - Yes it is broken. getArgToFuncMapping still require two input variables. We need to open a new Jira.
          Hide
          Russell Jurney added a comment -

          Can someone advise on fixing this? I see getArgToFuncMapping in EvalFunc class, not sure where to apply the fix.

          Show
          Russell Jurney added a comment - Can someone advise on fixing this? I see getArgToFuncMapping in EvalFunc class, not sure where to apply the fix.
          Hide
          Jonathan Coveney added a comment -

          We need to just get down to it and allow Schemas to specify a variable number of arguments, or perhaps a special FieldSchema that is a VarArgFieldSchema or something. We should make a JIRA for that and link it to this. We'll never be able to have a meaningful general CONCAT without it (and this annoys me to). I introduced something like this in some other patch floating in not-yet-reviewed land (will get on that soon).

          Show
          Jonathan Coveney added a comment - We need to just get down to it and allow Schemas to specify a variable number of arguments, or perhaps a special FieldSchema that is a VarArgFieldSchema or something. We should make a JIRA for that and link it to this. We'll never be able to have a meaningful general CONCAT without it (and this annoys me to). I introduced something like this in some other patch floating in not-yet-reviewed land (will get on that soon).
          Hide
          Russell Jurney added a comment -

          Looks like a new FuncSpec type needs to be implemented?

          Show
          Russell Jurney added a comment - Looks like a new FuncSpec type needs to be implemented?
          Hide
          Jonathan Coveney added a comment -

          Correction: the thing I added does not apply here, but tackles a similar issue in other parts of the code. We'll have to tackle this fresh.

          Show
          Jonathan Coveney added a comment - Correction: the thing I added does not apply here, but tackles a similar issue in other parts of the code. We'll have to tackle this fresh.
          Hide
          Russell Jurney added a comment -

          Ok, re: your patch, so I should look elsewhere for low hanging fruit? Direct me

          Show
          Russell Jurney added a comment - Ok, re: your patch, so I should look elsewhere for low hanging fruit? Direct me
          Hide
          Russell Jurney added a comment -

          Ok, it looks like... FuncSpec is where the work goes. Reading...

          Show
          Russell Jurney added a comment - Ok, it looks like... FuncSpec is where the work goes. Reading...
          Hide
          Prashant Kommireddi added a comment -

          What is the goal here? Are we trying to still have a Tuple of bytearrays be handled by CONCAT and Tuple of chararrays be handled by StringConcat? If that's the case, wouldn't fixing getArgToFuncMapping() do the job here?

          Show
          Prashant Kommireddi added a comment - What is the goal here? Are we trying to still have a Tuple of bytearrays be handled by CONCAT and Tuple of chararrays be handled by StringConcat? If that's the case, wouldn't fixing getArgToFuncMapping() do the job here?
          Hide
          Prashant Kommireddi added a comment -

          Also, would this change be backward compatible? I am seeing this in javadocs and am afraid users might have their scripts written based on the following assumption.

          /**
           * Generates the concatenation of the first two arguments.  It can be
           * used with two bytearrays or two chararrays (but not a mixture of the two).
           */
          

          And the Pig docs http://pig.apache.org/docs/r0.10.0/func.html#concat

          Show
          Prashant Kommireddi added a comment - Also, would this change be backward compatible? I am seeing this in javadocs and am afraid users might have their scripts written based on the following assumption. /** * Generates the concatenation of the first two arguments. It can be * used with two bytearrays or two chararrays (but not a mixture of the two). */ And the Pig docs http://pig.apache.org/docs/r0.10.0/func.html#concat
          Hide
          Russell Jurney added a comment -

          There was never a time to add a third argument to CONCAT, so it seems backwards compatible... it would not have worked.

          Show
          Russell Jurney added a comment - There was never a time to add a third argument to CONCAT, so it seems backwards compatible... it would not have worked.
          Hide
          Prashant Kommireddi added a comment -

          Sounds right. In that case, the patch should include changes to docs to reflect your work. Thanks Russell Jurney

          Show
          Prashant Kommireddi added a comment - Sounds right. In that case, the patch should include changes to docs to reflect your work. Thanks Russell Jurney
          Hide
          Ido Hadanny added a comment -

          So, is this fixed or not? I couldn't get it to work in 0.10.0, and I can't understand where can I find the new patch from 2013...

          Show
          Ido Hadanny added a comment - So, is this fixed or not? I couldn't get it to work in 0.10.0, and I can't understand where can I find the new patch from 2013...
          Hide
          Russell Jurney added a comment -

          This JIRA is not fixed. I don't know how to re-open it, however.

          Show
          Russell Jurney added a comment - This JIRA is not fixed. I don't know how to re-open it, however.
          Hide
          Daniel Dai added a comment -

          Created PIG-3444 to continue work on it.

          Show
          Daniel Dai added a comment - Created PIG-3444 to continue work on it.

            People

            • Assignee:
              Russell Jurney
              Reporter:
              Russell Jurney
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 24h
                24h
                Remaining:
                Remaining Estimate - 24h
                24h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development