Pig
  1. Pig
  2. PIG-3021

Split results missing records when there is null values in the column comparison

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels:
      None

      Description

      Suppose a(x, y)

      split a into b if x==y, c otherwise;

      One will expect the union of b and c will be a. However, if x or y is null, the record won't appear in either b or c.

      To workaround this, I have to change to the following:
      split a into b if x is not null and y is not null and x==y, c otherwise;

      1. PIG-3021-3.patch
        14 kB
        Cheolsoo Park
      2. PIG-3021-2.patch
        9 kB
        Cheolsoo Park
      3. PIG-3021.patch
        9 kB
        Cheolsoo Park

        Activity

        Hide
        Cheolsoo Park added a comment -

        Hi Chang,

        Thank you very much for raising an issue.

        I understand your argument, but that's an expected behavior. Given your example, here are the plans built by Pig:

        EXPLAIN b
        b: (Name: LOStore Schema: x#9:int,y#10:int)
        |
        |---b: (Name: LOSplitOutput Schema: x#9:int,y#10:int)
            |   |
            |   (Name: Equal Type: boolean Uid: 8)
            |   |
            |   |---x:(Name: Project Type: int Uid: 3 Input: 0 Column: 0)
            |   |
            |   |---y:(Name: Project Type: int Uid: 4 Input: 0 Column: 1)
            |
            |---1-4: (Name: LOSplit Schema: x#3:int,y#4:int)
        ...
        
        EXPLAIN c
        c: (Name: LOStore Schema: x#21:int,y#22:int)
        |
        |---c: (Name: LOSplitOutput Schema: x#21:int,y#22:int)
            |   |
            |   (Name: Not Type: boolean Uid: 20)
            |   |
            |   |---(Name: Equal Type: boolean Uid: 19)
            |       |
            |       |---x:(Name: Project Type: int Uid: 13 Input: 0 Column: 0)
            |       |
            |       |---y:(Name: Project Type: int Uid: 14 Input: 0 Column: 1)
            |
            |---1-7: (Name: LOSplit Schema: x#13:int,y#14:int)b: (Name: LOStore Schema: x#9:int,y#10:int)
        ...
        

        As can be seen, b and c are filtered by expressions == and !=, and the Pig manual says the following [1]:

        • Comparison operators: If either subexpression is null, the result is null.
        • FILTER operator: If a filter expression results in null value, the filter does not pass them through

        So if either x or y is null, they will be dropped.

        In addition, the Pig manual also says [2]:

        • SPLIT operator: A tuple may not be assigned to any relation.

        Does this make sense?

        [1] http://pig.apache.org/docs/r0.10.0/basic.html#nulls
        [2] http://pig.apache.org/docs/r0.10.0/basic.html#SPLIT

        Show
        Cheolsoo Park added a comment - Hi Chang, Thank you very much for raising an issue. I understand your argument, but that's an expected behavior. Given your example, here are the plans built by Pig: EXPLAIN b b: (Name: LOStore Schema: x#9: int ,y#10: int ) | |---b: (Name: LOSplitOutput Schema: x#9: int ,y#10: int ) | | | (Name: Equal Type: boolean Uid: 8) | | | |---x:(Name: Project Type: int Uid: 3 Input: 0 Column: 0) | | | |---y:(Name: Project Type: int Uid: 4 Input: 0 Column: 1) | |---1-4: (Name: LOSplit Schema: x#3: int ,y#4: int ) ... EXPLAIN c c: (Name: LOStore Schema: x#21: int ,y#22: int ) | |---c: (Name: LOSplitOutput Schema: x#21: int ,y#22: int ) | | | (Name: Not Type: boolean Uid: 20) | | | |---(Name: Equal Type: boolean Uid: 19) | | | |---x:(Name: Project Type: int Uid: 13 Input: 0 Column: 0) | | | |---y:(Name: Project Type: int Uid: 14 Input: 0 Column: 1) | |---1-7: (Name: LOSplit Schema: x#13: int ,y#14: int )b: (Name: LOStore Schema: x#9: int ,y#10: int ) ... As can be seen, b and c are filtered by expressions == and !=, and the Pig manual says the following [1] : Comparison operators: If either subexpression is null, the result is null. FILTER operator: If a filter expression results in null value, the filter does not pass them through So if either x or y is null, they will be dropped. In addition, the Pig manual also says [2] : SPLIT operator: A tuple may not be assigned to any relation. Does this make sense? [1] http://pig.apache.org/docs/r0.10.0/basic.html#nulls [2] http://pig.apache.org/docs/r0.10.0/basic.html#SPLIT
        Hide
        Chang Luo added a comment -

        Hi Cheolsoo,
        Thanks for the quick response. NULL is very tricky as in the SQL world. I can understand your arguments for expected behavior. However, I don't think this semantic is intuitive. When a user use OTHERWISE to define default collection, he expects the default collection is a "catch all" collection.

        In the above example, if x or y is null, it will fail the condition x==y and users expect it goes to the default collection defined by OTHERWISE.

        Show
        Chang Luo added a comment - Hi Cheolsoo, Thanks for the quick response. NULL is very tricky as in the SQL world. I can understand your arguments for expected behavior. However, I don't think this semantic is intuitive. When a user use OTHERWISE to define default collection, he expects the default collection is a "catch all" collection. In the above example, if x or y is null, it will fail the condition x==y and users expect it goes to the default collection defined by OTHERWISE.
        Hide
        Cheolsoo Park added a comment -

        Hi Chang,

        I think that the current behavior of OTHERWISE is from that we translate it to a "FILTER BY NOT (expr)", and I am open to discussion about whether this is good or bad.

        But it is tricky to to change this because it is backward incompatible. Some people may rely on the current behavior, so if we change semantics, their scripts have to be modified. Therefore, we should change it only if there are more benefits than inconveniences by introducing a backward incompatibility.

        Nevertheless, I am happy to listen to other opinions.

        Thanks!

        Show
        Cheolsoo Park added a comment - Hi Chang, I think that the current behavior of OTHERWISE is from that we translate it to a "FILTER BY NOT (expr)", and I am open to discussion about whether this is good or bad. But it is tricky to to change this because it is backward incompatible. Some people may rely on the current behavior, so if we change semantics, their scripts have to be modified. Therefore, we should change it only if there are more benefits than inconveniences by introducing a backward incompatibility. Nevertheless, I am happy to listen to other opinions. Thanks!
        Hide
        Cheolsoo Park added a comment -

        Given the discussion on the user mailing list, I don't think that we're going to change the current behavior:
        http://search-hadoop.com/m/5KtFWXN4me1/CONCAT%2528null%252C+%2522something%2522%2529+%253D%253D+NULL+%253F&subj=CONCAT+null+something+NULL+

        I am going to close this jira as a 'won't fix'. Please let me if anyone has objections.

        Show
        Cheolsoo Park added a comment - Given the discussion on the user mailing list, I don't think that we're going to change the current behavior: http://search-hadoop.com/m/5KtFWXN4me1/CONCAT%2528null%252C+%2522something%2522%2529+%253D%253D+NULL+%253F&subj=CONCAT+null+something+NULL+ I am going to close this jira as a 'won't fix'. Please let me if anyone has objections.
        Hide
        Chang Luo added a comment -

        I agree with the discussion about NULL. I also agreed CONCAT(null, something) = null. My issue is on the OTHERWISE keyword not NULL. Maybe it only has issues when NULL is involved but maybe there might be other cases where the same confusion occurs.

        I understand it's breaking change and ok with won't-fix. But if I were to redesign SPLIT/OTHERWISE from beginning, I would make OTHERWISE a catch-all default collection.

        One suggestion is to give an explicit example of combination of NULL and OTHERWISE keywords in the SPLIT documentation.

        I can't post to the users mailing list. Could you please forward my comments to that list? Thanks!

        Show
        Chang Luo added a comment - I agree with the discussion about NULL. I also agreed CONCAT(null, something) = null. My issue is on the OTHERWISE keyword not NULL. Maybe it only has issues when NULL is involved but maybe there might be other cases where the same confusion occurs. I understand it's breaking change and ok with won't-fix. But if I were to redesign SPLIT/OTHERWISE from beginning, I would make OTHERWISE a catch-all default collection. One suggestion is to give an explicit example of combination of NULL and OTHERWISE keywords in the SPLIT documentation. I can't post to the users mailing list. Could you please forward my comments to that list? Thanks!
        Hide
        Cheolsoo Park added a comment -

        Hi Chang,

        I won't close this jira if you disagree. We can definitely add an example to the SPLIT doc, and you're welcome to submit a patch for that if you like to. I will be happy to review/commit it.

        I can't post to the users mailing list. Could you please forward my comments to that list?

        You can subscribe user@pig.apache.org. You're welcome to open a discussion on this matter on the mailing list.
        http://pig.apache.org/mailing_lists.html

        Thanks!

        Show
        Cheolsoo Park added a comment - Hi Chang, I won't close this jira if you disagree. We can definitely add an example to the SPLIT doc, and you're welcome to submit a patch for that if you like to. I will be happy to review/commit it. I can't post to the users mailing list. Could you please forward my comments to that list? You can subscribe user@pig.apache.org. You're welcome to open a discussion on this matter on the mailing list. http://pig.apache.org/mailing_lists.html Thanks!
        Hide
        Ville Weijo added a comment -

        Hi,

        I strongly suggest either fixing the implementation or at least changing documentation of OTHERWISE. At the moment what is documented (http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT) and what is implemented does not follow the principle of least surprise, and without knowing the implementation details of 'OTHERWISE', the documentation is even outright wrong.

        Please excuse my perhaps strong wording, but I spent some hours trying to figure out what on earth is happening.

        Show
        Ville Weijo added a comment - Hi, I strongly suggest either fixing the implementation or at least changing documentation of OTHERWISE. At the moment what is documented ( http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT ) and what is implemented does not follow the principle of least surprise, and without knowing the implementation details of 'OTHERWISE', the documentation is even outright wrong. Please excuse my perhaps strong wording, but I spent some hours trying to figure out what on earth is happening.
        Hide
        Cheolsoo Park added a comment -

        OK, I will implement the "catch-all" semantics for OTHERWISE. I will add an "IS NULL" expression to the otherwise branch, so null values can be caught instead of being discarded.

        Please let me know if anyone has objection with the change.

        Show
        Cheolsoo Park added a comment - OK, I will implement the "catch-all" semantics for OTHERWISE. I will add an "IS NULL" expression to the otherwise branch, so null values can be caught instead of being discarded. Please let me know if anyone has objection with the change.
        Hide
        Cheolsoo Park added a comment -

        Attached is a patch that passes through null values in the otherwise branch of SPLIT.

        I also re-wrote TestSplit.java using mock.Storage while updating it.

        I am going to run full unit tests to ensure that I am not breaking any test cases.

        Show
        Cheolsoo Park added a comment - Attached is a patch that passes through null values in the otherwise branch of SPLIT. I also re-wrote TestSplit.java using mock.Storage while updating it. I am going to run full unit tests to ensure that I am not breaking any test cases.
        Hide
        Cheolsoo Park added a comment -

        All unit tests pass. I made some minor change in a new patch.

        Btw, if we really don't want to break backward compatibility, we can introduce an optional keyword like NULLABLE. For example,

        SPLIT foo INTO x IF <cond>, y OTHERWISE [ NULLABLE ];
        

        If NULLABLE is specified, nulls will be stored in y; otherwise, nulls will be discarded.

        Currently, I didn't implement this idea, but please let me know if you like it more. It should be straightforward to implement it.

        Show
        Cheolsoo Park added a comment - All unit tests pass. I made some minor change in a new patch. Btw, if we really don't want to break backward compatibility, we can introduce an optional keyword like NULLABLE. For example, SPLIT foo INTO x IF <cond>, y OTHERWISE [ NULLABLE ]; If NULLABLE is specified, nulls will be stored in y; otherwise, nulls will be discarded. Currently, I didn't implement this idea, but please let me know if you like it more. It should be straightforward to implement it.
        Hide
        Rohini Palaniswamy added a comment -

        Looks good. But the concern is it breaks backward compatibility and users will see different results without any warning. Should we do the NULLABLE idea or have it as a pig configuration that can be turned on in deployments?

        Show
        Rohini Palaniswamy added a comment - Looks good. But the concern is it breaks backward compatibility and users will see different results without any warning. Should we do the NULLABLE idea or have it as a pig configuration that can be turned on in deployments?
        Hide
        Cheolsoo Park added a comment -

        Rohini, Sure. I totally agree that we shouldn't break backward compatibility. Let me update the patch.

        Personally, I like adding a keyword better than adding a property. But please let me know anyone thinks otherwise. Also let me know if anyone has a better name than NULLABLE.

        Thanks!

        Show
        Cheolsoo Park added a comment - Rohini, Sure. I totally agree that we shouldn't break backward compatibility. Let me update the patch. Personally, I like adding a keyword better than adding a property. But please let me know anyone thinks otherwise. Also let me know if anyone has a better name than NULLABLE. Thanks!
        Hide
        Ville Weijo added a comment -

        Hi,

        I think adding a keyword is ok, as long as documentation is updated to make it clear what is the difference between the plain OTHERWISE and OTHERWISE NULLABLE. Of course, that should be done in any case.

        About the name of the optional keyword, something like plain ALL might also do. That way it would be clear from a pig script that the rest of the records will go to the specified place and perhaps notify the reader that it is a good idea to check the documentation on what is going on, although NULLABLE should raise a question on exact mechanism of SPLIT, too.

        Show
        Ville Weijo added a comment - Hi, I think adding a keyword is ok, as long as documentation is updated to make it clear what is the difference between the plain OTHERWISE and OTHERWISE NULLABLE. Of course, that should be done in any case. About the name of the optional keyword, something like plain ALL might also do. That way it would be clear from a pig script that the rest of the records will go to the specified place and perhaps notify the reader that it is a good idea to check the documentation on what is going on, although NULLABLE should raise a question on exact mechanism of SPLIT, too.
        Hide
        Cheolsoo Park added a comment -

        Attached is a patch that introduces the ALL keyword into the split otherwise. If ALL is specified, nulls are stored in the default relation. By default, nulls are discarded (i.e. backward compatible).

        I also updated the document and added a test case to TestSplit.

        ReviewBoard:
        https://reviews.apache.org/r/12321/

        Show
        Cheolsoo Park added a comment - Attached is a patch that introduces the ALL keyword into the split otherwise. If ALL is specified, nulls are stored in the default relation. By default, nulls are discarded (i.e. backward compatible). I also updated the document and added a test case to TestSplit. ReviewBoard: https://reviews.apache.org/r/12321/
        Hide
        Cheolsoo Park added a comment -

        Moving it to 0.13 since the 0.12 branch is already cut.

        Show
        Cheolsoo Park added a comment - Moving it to 0.13 since the 0.12 branch is already cut.
        Hide
        Cheolsoo Park added a comment -

        Canceling the patch until I have time to answer Aniket's comment in the RB.

        Show
        Cheolsoo Park added a comment - Canceling the patch until I have time to answer Aniket's comment in the RB.

          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Chang Luo
          • Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:

              Development