PIG-3268: Case statement support


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11
    • Fix Version/s: 0.12.0
    • Component/s: internal-udfs, parser
    • Labels: None
    • Release Note:
      Pig now supports the CASE expression. It can be used in place of any expression. For example,
      bar = FOREACH foo GENERATE (
        CASE i % 3
           WHEN 0 THEN '3n'
           WHEN 1 THEN '3n+1'
           ELSE '3n+2'
        END
      );

      Note that CASE is now a reserved keyword and can therefore no longer be used as a column or field name.

    Description

      Currently, Pig has no support for a case statement. To mimic one, users often nest bincond operators. However, that quickly becomes unreadable when there are multiple levels of nesting.

      For example,

      a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
      b = FOREACH a GENERATE (
          i % 3 == 0 ? '3n' : (i % 3 == 1 ? '3n + 1' : '3n + 2')
      );
      

      This can be rewritten much more readably using a case statement, as follows:

      a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
      b = FOREACH a GENERATE (
          CASE i % 3
              WHEN 0 THEN '3n'
              WHEN 1 THEN '3n + 1'
              ELSE        '3n + 2'
          END
      );
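
To make the equivalence between the two forms concrete, here is a minimal sketch that renders a CASE expression as the nested bincond it replaces. It is only an illustration in Python, not how Pig implements CASE; the function name to_nested_bincond and its parameters are hypothetical.

```python
def to_nested_bincond(case_expr, branches, default):
    """Render a CASE expression as an equivalent nested bincond string.

    case_expr: the expression after CASE, e.g. "i % 3"
    branches:  list of (when_value, then_value) pairs, already in Pig literal form
    default:   the ELSE value, already in Pig literal form
    (All names here are hypothetical and exist only for this illustration.)
    """
    result = default
    # Build from the innermost (last) branch outward.
    for when_value, then_value in reversed(branches):
        result = f"({case_expr} == {when_value} ? {then_value} : {result})"
    return result


expr = to_nested_bincond("i % 3", [("0", "'3n'"), ("1", "'3n + 1'")], "'3n + 2'")
print(expr)
```

Every additional WHEN branch adds one more level of parentheses to the bincond form, which is the readability problem described above.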
      

      I propose that we implement the case statement in the following manner:

      • Add built-in UDFs that take expressions as args. For the aforementioned case statement, for example, we can define a UDF call such as builtInUdf(i % 3, 0, '3n', 1, '3n + 1', '3n + 2').
      • Add syntactical sugar for these built-in UDFs.
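
The flat argument list in the builtInUdf example can be modeled with a short sketch. The layout is an assumption inferred from that one example call (case value first, then WHEN/THEN pairs, then an optional ELSE), not taken from the patch. The name case_udf is hypothetical, and returning None when nothing matches and there is no ELSE is likewise an assumption standing in for a Pig null.

```python
def case_udf(*args):
    """Model of a CASE-style UDF call with a flat argument list.

    Assumed layout (inferred from the example call, not from the patch):
      args[0]            the value being switched on
      args[1], args[2]   first WHEN value and its THEN result
      ...                further WHEN/THEN pairs
      args[-1]           ELSE result, present only if the total count is even
    """
    if len(args) < 3:
        raise ValueError("need a case value and at least one WHEN/THEN pair")
    value, rest = args[0], args[1:]
    has_else = len(rest) % 2 == 1
    pairs = rest[:-1] if has_else else rest
    for when_value, then_value in zip(pairs[0::2], pairs[1::2]):
        if value == when_value:
            return then_value
    # No branch matched: fall back to ELSE, or None (assumed stand-in for null).
    return rest[-1] if has_else else None


labels = [case_udf(i % 3, 0, '3n', 1, '3n + 1', '3n + 2') for i in range(6)]
print(labels)
```

Under this model the CASE syntax only needs to be rewritten by the parser into such a call, which is what the second bullet describes as syntactic sugar.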

      In fact, I borrowed this idea from HIVE-164.

      One downside of this approach is that every possible argument schema of these UDFs must be pre-computed. Specifically, we need to populate the full list of possible argument schemas in EvalFunc.getArgToFuncMapping.

      In particular, since we obviously cannot support arbitrarily long argument lists, it is necessary to impose a limit on the number of WHEN branches. For now, I arbitrarily chose 50, but this can easily be changed.
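
As a rough illustration of why a limit is needed, the sketch below enumerates the argument counts that would have to be pre-computed, assuming the flat layout of the builtInUdf example (one case value, then WHEN/THEN pairs, then an optional ELSE). It models argument counts only and ignores the type combinations a real schema list would also have to cover. The names MAX_WHEN_BRANCHES and allowed_arities are hypothetical.

```python
MAX_WHEN_BRANCHES = 50  # the limit mentioned above; a tunable constant in this sketch


def allowed_arities(max_branches=MAX_WHEN_BRANCHES):
    """List every argument count accepted for 1..max_branches WHEN branches."""
    arities = []
    for n in range(1, max_branches + 1):
        without_else = 1 + 2 * n          # case value + n WHEN/THEN pairs
        arities.append(without_else)
        arities.append(without_else + 1)  # same, plus a trailing ELSE
    return arities


arities = allowed_arities()
print(len(arities), arities[0], arities[-1])
```

The example call builtInUdf(i % 3, 0, '3n', 1, '3n + 1', '3n + 2') has six arguments, which corresponds to two WHEN branches plus an ELSE in this enumeration.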

      Attachments

        1. PIG-3268.patch
          25 kB
          Cheolsoo Park
        2. PIG-3268-2.patch
          47 kB
          Cheolsoo Park
        3. PIG-3268-3.patch
          15 kB
          Cheolsoo Park
        4. PIG-3268-4.patch
          15 kB
          Cheolsoo Park
        5. PIG-3268-5.patch
          17 kB
          Cheolsoo Park
        6. PIG-3268-6.patch
          17 kB
          Cheolsoo Park
        7. PIG-3268-7.patch
          17 kB
          Cheolsoo Park


            People

              Assignee: Cheolsoo Park
              Reporter: Cheolsoo Park
              Votes: 2
              Watchers: 6
