Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11
    • Fix Version/s: 0.12.0
    • Component/s: internal-udfs, parser
    • Labels:
      None
    • Release Note:
      Pig now supports the CASE expression. It can be used anywhere an expression is allowed. For example:
      bar = FOREACH foo GENERATE (
        CASE i % 3
           WHEN 0 THEN '3n'
           WHEN 1 THEN '3n+1'
           ELSE '3n+2'
        END
      );

      Note that CASE is now a reserved keyword, so it can no longer be used as a column or field name.

      Description

      Currently, Pig has no support for a case statement. To mimic one, users often nest bincond operators, but that quickly becomes unreadable when there are multiple levels of nesting.

      For example,

      a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
      b = FOREACH a GENERATE (
          i % 3 == 0 ? '3n' : (i % 3 == 1 ? '3n + 1' : '3n + 2')
      );
      

      This can be rewritten much more readably with a case statement:

      a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
      b = FOREACH a GENERATE (
          CASE i % 3
              WHEN 0 THEN '3n'
              WHEN 1 THEN '3n + 1'
              ELSE        '3n + 2'
          END
      );
      

      I propose that we implement the case statement in the following manner:

      • Add built-in UDFs that take expressions as args. For the case statement above, for example, we can define a UDF call such as builtInUdf(i % 3, 0, '3n', 1, '3n + 1', '3n + 2').
      • Add syntactic sugar for these built-in UDFs.

      In fact, I borrowed this idea from HIVE-164.

      One downside of this approach is that all possible argument schemas of these UDFs must be pre-computed. Specifically, we need to populate the full list of possible argument schemas in EvalFunc.getArgToFuncMapping.

      In particular, since we obviously cannot support infinitely long argument lists, it is necessary to impose a limit on the number of when branches. For now, I arbitrarily chose 50, but it can easily be changed.
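      To make the flat-argument form concrete, here is a minimal Python sketch of how a call like builtInUdf(i % 3, 0, '3n', 1, '3n + 1', '3n + 2') could be evaluated. The name case_udf is hypothetical, Python stands in for the Java UDF, and the sketch assumes the last argument is always the ELSE value. It illustrates the semantics only and is not the code in the attached patch.

```python
def case_udf(value, *args):
    """Evaluate flat CASE arguments: when1, then1, ..., whenN, thenN, default.

    Hypothetical helper: assumes a trailing default (ELSE) value is always given.
    """
    if len(args) < 3 or len(args) % 2 == 0:
        raise ValueError("expected (when, then) pairs followed by one default")
    *pairs, default = args
    # Walk the (when, then) pairs in order and return the first match.
    for when, then in zip(pairs[0::2], pairs[1::2]):
        if value == when:
            return then
    return default


# Mirrors: CASE i % 3 WHEN 0 THEN '3n' WHEN 1 THEN '3n + 1' ELSE '3n + 2' END
labels = [case_udf(i % 3, 0, '3n', 1, '3n + 1', '3n + 2') for i in range(4)]
```

      Because every extra when branch adds two arguments, a fixed branch limit translates directly into a fixed set of argument counts that would have to be enumerated up front.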

      1. PIG-3268-7.patch
        17 kB
        Cheolsoo Park
      2. PIG-3268-6.patch
        17 kB
        Cheolsoo Park
      3. PIG-3268-5.patch
        17 kB
        Cheolsoo Park
      4. PIG-3268-4.patch
        15 kB
        Cheolsoo Park
      5. PIG-3268-3.patch
        15 kB
        Cheolsoo Park
      6. PIG-3268-2.patch
        47 kB
        Cheolsoo Park
      7. PIG-3268.patch
        25 kB
        Cheolsoo Park

          Activity

          Cheolsoo Park added a comment -

          Committed to trunk. Thank you Aniket for reviewing the patch!

          Cheolsoo Park added a comment -

          All unit tests pass. I rebased the patch and added WHEN, THEN, ELSE, and END to the whitelist. I am updating the patch for the record.

          Cheolsoo Park added a comment -

          The problem was that one LogicalExpression object (case expr) was shared by multiple BinCondExpression objects (when exprs).

          To fix it, I clone the first expression following the CASE token and insert a copy before every when expression in QueryParser. Then, I construct a new LogicalExpression object per BinCondExpression in LogicalPlanGenerator.

          In short,

          CASE e1
            WHEN e2 THEN e3
            WHEN e4 THEN e5
            ELSE e6
          END
          

          =>

          ^(CASE e1, e2, e3, e1, e4, e5, e6) // Note that there are two e1's.
          

          =>

          e1 == e4 ? e5 : (e1 == e2 ? e3 : e6)
          

          I updated the unit tests. I also verified that the explain output of a case statement is identical to that of hand-written nested bincond expressions.

          Thanks!
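          The rewrite described in this comment can be sketched in a few lines of Python. This is an illustration only: desugar_case is a hypothetical name, expressions are plain strings rather than LogicalExpression objects, and an ELSE branch is assumed to be present. The fold reproduces the nesting shown above, where each later when branch wraps the earlier ones.

```python
def desugar_case(children):
    """Turn a flat CASE child list into a nested bincond string.

    children is [e1, when1, then1, e1, when2, then2, ..., else_expr],
    i.e. the case expression is repeated before every when expression.
    Hypothetical sketch: assumes an ELSE branch is always present.
    """
    if len(children) < 4 or len(children) % 3 != 1:
        raise ValueError("expected (case, when, then) triples plus one else")
    *triples, else_expr = children
    result = else_expr
    for k in range(0, len(triples), 3):
        case_expr, when_expr, then_expr = triples[k:k + 3]
        # Parenthesize the accumulated expression once it is itself a bincond.
        tail = result if k == 0 else f"({result})"
        result = f"{case_expr} == {when_expr} ? {then_expr} : {tail}"
    return result


nested = desugar_case(["e1", "e2", "e3", "e1", "e4", "e5", "e6"])
```

          Repeating e1 in the child list means each generated comparison can own its own copy of the case expression, which is the point of the fix described above.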

          Cheolsoo Park added a comment -

          Aniket Mokashi, thank you for your comment, but that's not my issue. In your example, I build (c=c2 ? e2 : (c=c1 ? e1 : e3)).

          Let me explain my issue with an example.

          A = LOAD '1.txt' USING PigStorage(',') AS (i:int);
          B = FOREACH A GENERATE i, ( -- Note I have an extra column "i" besides CASE expression
              CASE (i % 3)
                  WHEN 0 THEN '3n'
                  WHEN 1 THEN '3n+1'
                  ELSE        '3n+2'
              END
          );
          

          This fails with the following error:

          org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [Mod (Name: Mod[int] - scope-9 Operator Key: scope-9) children: [[POProject (Name: Project[int][*] - scope-7 Operator Key: scope-7) children: null at []], [ConstantExpression (Name: Constant(3) - scope-8 Operator Key: scope-8) children: null at []]] at []]: java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to java.lang.Number
          

          When I compare the explain output of CASE against that of manually written nested bincond operators, the only difference I can see is the following:

          CASE
          |   |---Equal To[boolean] - scope-11
          |   |   |
          |   |   |---Mod[int] - scope-9
          |   |   |   |
          |   |   |   |---Project[int][*] - scope-7 // this line
          |   |   |   |
          |   |   |   |---Constant(3) - scope-8
          
          Bincond
          |   |---Equal To[boolean] - scope-11
          |   |   |
          |   |   |---Mod[int] - scope-9
          |   |   |   |
          |   |   |   |---Project[int][1] - scope-7 // this line
          |   |   |   |
          |   |   |   |---Constant(3) - scope-8
          

          I am puzzled why "i" in "(i % 3)" is translated to "Project[int][*]" in CASE, whereas it is "Project[int][1]" in nested bincond operators.

          Thanks!

          Aniket Mokashi added a comment -

          Cheolsoo Park, I think you need to construct your tree from left to right.
          case c when c1 then e1 when c2 then e2 else e3 should be translated to (c=c1 ? e1 : (c=c2 ? e2 : e3)) instead of (c=c2 ? (c=c1 ? e1 : e3) : e2). CASE is evaluated from left to right (that is, top to bottom).
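          As a rough illustration of why the shape of the tree matters, the two translations in this comment can be written as Python conditional expressions and compared directly. The function names are hypothetical and the values are arbitrary.

```python
def left_to_right(c, c1, e1, c2, e2, e3):
    # (c=c1 ? e1 : (c=c2 ? e2 : e3)): branches are tested top to bottom.
    return e1 if c == c1 else (e2 if c == c2 else e3)


def misnested(c, c1, e1, c2, e2, e3):
    # (c=c2 ? (c=c1 ? e1 : e3) : e2): e2 becomes the fallback by mistake.
    return (e1 if c == c1 else e3) if c == c2 else e2


# CASE c WHEN 1 THEN 'a' WHEN 2 THEN 'b' ELSE 'z' END, with c = 5
expected = left_to_right(5, 1, 'a', 2, 'b', 'z')
actual = misnested(5, 1, 'a', 2, 'b', 'z')
```

          With no matching branch, the left-to-right tree falls through to the ELSE value, while the mis-nested tree returns the second branch's value instead.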

          Cheolsoo Park added a comment -

          I found an interesting bug in my patch. When I project a case expression alongside column references, the case expression returns an incorrect value. I am debugging it now.

          Cheolsoo Park added a comment -

          Uploading the patch.

          I took a completely new approach after a discussion with Aniket. Instead of using built-in UDFs, I now convert the CASE statement to nested BinCondExpression objects in LogicalPlanGenerator, so there is no longer a limit on the number of when branches.

          Thanks Aniket for the suggestion!

          Aniket Mokashi added a comment -

          I spoke to Cheolsoo offline about the changes required. Cheolsoo, can you upload the new patch when you are ready?

          Cheolsoo Park added a comment -

          Added unit tests.

          ReviewBoard request: https://reviews.apache.org/r/10341/

          Cheolsoo Park added a comment -

          Attached is a patch that implements my proposal. I haven't added unit tests yet.

          Please let me know if anyone has an opinion. Thanks!


            People

            • Assignee:
              Cheolsoo Park
              Reporter:
              Cheolsoo Park
            • Votes:
              2
              Watchers:
              2
