Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11
    • Fix Version/s: 0.12.0
    • Component/s: internal-udfs, parser
    • Labels:
      None
    • Release Note:
      Pig now supports the IN operator, which can be used in any conditional expression. For example:
      bar = FILTER foo BY i IN ('a', 'b', 'c');

      Description

      This is another language improvement using the same approach as in PIG-3268.

      Currently, Pig has no support for an IN operator. To mimic it, users often have to chain several equality checks with OR.

      For example,

      a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
      b = FILTER a BY 
         (i == 1) OR
         (i == 22) OR
         (i == 333) OR
         (i == 4444) OR
         (i == 55555);
      

      But this can be rewritten more compactly using the IN operator as follows:

      a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
      b = FILTER a BY i IN (1,22,333,4444,55555);
      

      I propose that we implement the IN operator in the following manner:

      • Add built-in UDFs that take expressions as arguments. For the IN operator above, for example, we can define a UDF such as builtInUdf(i, 1, 22, 333, 4444, 55555).
      • Add syntactic sugar for these built-in UDFs.
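
The two steps of the proposal can be sketched in Python. This is only an illustration of the idea, not the code in the attached patches: `in_udf` and `desugar_in` are hypothetical names, and `builtInUdf` is the placeholder name used in the proposal itself. The sketch assumes plain equality between the left-hand value and each candidate and does not model Pig's null handling.

```python
def in_udf(lhs, *candidates):
    """Hypothetical stand-in for the proposed built-in UDF.

    Accepts any number of candidates and always returns a boolean:
    True when lhs equals at least one candidate.
    """
    return any(lhs == c for c in candidates)


def desugar_in(field, values):
    """Rewrite `field IN (v1, v2, ...)` into the text of the UDF call."""
    args = ", ".join([field] + [repr(v) for v in values])
    return f"builtInUdf({args})"


# Filter a small column of integers, as in the FILTER example above.
rows = [1, 2, 22, 100, 333, 55555]
kept = [i for i in rows if in_udf(i, 1, 22, 333, 4444, 55555)]

# The sugar step only changes the surface syntax of the expression.
call_text = desugar_in("i", [1, 22, 333, 4444, 55555])
```

Because the function always returns a boolean, nothing in the sketch depends on how many candidates are passed.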

      Attachments

      1. PIG-3269.patch (7 kB, Cheolsoo Park)
      2. PIG-3269-2.patch (6 kB, Cheolsoo Park)
      3. PIG-3269-3.patch (17 kB, Cheolsoo Park)
      4. PIG-3269-4.patch (17 kB, Cheolsoo Park)
      5. PIG-3269-5.patch (16 kB, Cheolsoo Park)

        Issue Links

          Activity

          Cheolsoo Park added a comment -

          Attached is a patch that implements my proposal. I will add unit tests shortly.

          Please let me know if anyone has an opinion. Thanks!

          Cheolsoo Park added a comment -

          I realized that I don't need a limit on the number of operands. Since the return type of my UDF is always Boolean, I don't have to implement getArgToFuncMapping, unlike for the CASE statement.

          Updating the patch.

          Cheolsoo Park added a comment -

          ReviewBoard request: https://reviews.apache.org/r/10337/

          Cheolsoo Park added a comment -

          Incorporated Aniket's comments in RB.

          Cheolsoo Park added a comment -

          The current patch breaks TestQueryParser and TestMacroExpansion because it introduces a new reserved keyword "IN". But this backward incompatibility can be avoided by PIG-3122, so I am going to wait until PIG-3122 is committed before I commit my patch.

          Cheolsoo Park added a comment -

          All unit tests pass after adding IN to the whitelist. I am attaching the updated patch for the record.

          Cheolsoo Park added a comment -

          I committed to trunk given +1 from Aniket in RB. Thank you Aniket for reviewing the patch!

          Aniket Mokashi added a comment -

          The UDF approach to the IN operator is not friendly with ProjectionPushdown. We need to change this to a list of OR expressions. Should I open another jira for this?

          Cheolsoo Park added a comment -

          Aniket Mokashi, yes, let's open another jira to convert it to or-expressions. Feel free to own it or assign it to me. Either way is fine with me.

          Aniket Mokashi added a comment -

          Cheolsoo Park, I have opened PIG-3336 for this (and assigned it to you)

          Russell Jurney added a comment -

          Does this work for searching IN a bag of tuples?

          Cheolsoo Park added a comment -

          No, it doesn't.

          If you write "( tuple ) IN ( { bag } )", it will error out because the types of the lhs and rhs are not compatible, i.e. a tuple cannot be compared with a bag. IN is internally converted to concatenated OR expressions, i.e. (lhs == rhs_1) OR (lhs == rhs_2), so the lhs and rhs must have the same type.
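
The expansion into OR expressions, and why a tuple-versus-bag comparison fails, can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not Pig's implementation: `expand_in` and `evaluate_in` are hypothetical names, a bag is modelled as a Python list of tuples, and the type check is an exact Python type match rather than Pig's real type-checking rules.

```python
def expand_in(lhs, operands):
    """Render `lhs IN (operands...)` as a chain of OR-ed equality tests."""
    return " OR ".join(f"({lhs} == {rhs})" for rhs in operands)


def evaluate_in(lhs, operands):
    """Evaluate the OR chain, rejecting operands whose type differs from lhs.

    Assumption: "same type" means an exact Python type match.
    """
    for rhs in operands:
        if type(rhs) is not type(lhs):
            raise TypeError(
                f"cannot compare {type(lhs).__name__} with {type(rhs).__name__}"
            )
    return any(lhs == rhs for rhs in operands)


expanded = expand_in("i", [1, 22, 333])

# A tuple on the left and a bag (a list of tuples here) on the right is
# rejected, because each equality test compares lhs with the whole operand.
try:
    evaluate_in((1, "a"), [[(1, "a"), (2, "b")]])
    bag_search_ok = True
except TypeError:
    bag_search_ok = False
```

In this model the bag is a single right-hand operand, so the comparison never looks inside it; that is why searching a bag of tuples is not something the expansion can express.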


            People

            • Assignee:
              Cheolsoo Park
              Reporter:
              Cheolsoo Park
            • Votes:
              1
              Watchers:
              4
