Pig / PIG-4662

New optimizer rule: filter nulls before inner joins

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:

      Description

      As stated in the docs, rewriting an inner join and filtering nulls from inputs can be a big performance gain: http://pig.apache.org/docs/r0.14.0/perf.html#nulls

      We would like to add an optimizer rule that detects inner joins and filters out null join keys in all inputs, equivalent to rewriting the script as:
      A = filter A by t is not null;
      B = filter B by x is not null;
      C = join A by t, B by x;
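To illustrate why the rewrite is safe and where the gain comes from, here is a minimal sketch of a reduce-side inner join. It is a toy model, not Pig code: `shuffle_join`, `drop_null_keys`, and the sample rows are hypothetical, and it assumes the behaviour described in the linked docs, where rows with null keys are not joined across inputs but are still sent through the shuffle.

```python
from collections import defaultdict


def shuffle_join(left, right, left_key, right_key):
    """Toy reduce-side inner join.

    Returns (joined_rows, shuffled_record_count).
    """
    groups = defaultdict(lambda: ([], []))
    shuffled = 0
    for side, rows, key in ((0, left, left_key), (1, right, right_key)):
        for row in rows:
            # Every input row crosses the shuffle, null key or not.
            groups[row[key]][side].append(row)
            shuffled += 1
    joined = []
    for key, (left_rows, right_rows) in groups.items():
        if key is None:
            continue  # null keys are not joined across inputs
        joined.extend((l, r) for l in left_rows for r in right_rows)
    return joined, shuffled


def drop_null_keys(rows, key):
    """Equivalent of `filter ... by key is not null`."""
    return [row for row in rows if row[key] is not None]


A = [{"t": 1, "a": "x"}, {"t": None, "a": "y"}, {"t": 2, "a": "z"}]
B = [{"x": 1, "b": "p"}, {"x": None, "b": "q"}, {"x": 3, "b": "r"}]

plain_rows, plain_shuffled = shuffle_join(A, B, "t", "x")
filtered_rows, filtered_shuffled = shuffle_join(
    drop_null_keys(A, "t"), drop_null_keys(B, "x"), "t", "x"
)
```

Both calls produce the same joined rows; only the number of records that cross the shuffle differs.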

      see also: http://stackoverflow.com/questions/32088389/is-the-pig-optimizer-filtering-nulls-before-joining

      1. PIG-4662-1.patch
        17 kB
        Satish Subhashrao Saley

        Activity

        rohini Rohini Palaniswamy added a comment -

        This is a good one. We should definitely do it.

        rohini Rohini Palaniswamy added a comment -

        Instead of an optimizer rule, this can probably be added to POLocalRearrange so that it skips null keys.

        rohini Rohini Palaniswamy added a comment -

        I have already done this in POBuildBloomRearrangeTez. You can refer to that.

        daijy Daniel Dai added a comment -

        I prefer to do it in the optimizer; it seems clearer.

        rohini Rohini Palaniswamy added a comment -

        Quoting Daniel Dai: "I prefer to do it in the optimizer; it seems clearer."

        As I mentioned for the bloom join before, that is not a good approach, for two reasons:
        1) Adding a separate filter operator carries a performance penalty, when the same check takes just 3 lines of code in POLocalRearrange. It would only add unnecessary verbosity.
        2) The key is extracted in POLocalRearrange. Adding a filter operator after that step to drop nulls is not easy, because much of the code assumes that POLocalRearrange is the leaf of a map operator.
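The trade-off in the two points above can be sketched as follows. This is a toy model under stated assumptions, not Pig's implementation: `KeyExtractor`, `filter_then_rearrange`, and `rearrange_skipping_nulls` are hypothetical names, and the only "cost" counted is the number of key extractions. It says nothing about real operator overhead in Pig.

```python
class KeyExtractor:
    """Hypothetical key extractor that counts how often it is called."""

    def __init__(self, index):
        self.index = index
        self.calls = 0

    def __call__(self, row):
        self.calls += 1
        return row[self.index]


def filter_then_rearrange(rows, extract):
    """Separate filter step followed by a rearrange step.

    The key is evaluated once in the filter and again in the rearrange.
    """
    kept = (row for row in rows if extract(row) is not None)
    return [(extract(row), row) for row in kept]


def rearrange_skipping_nulls(rows, extract):
    """Single rearrange step that drops null keys as it extracts them."""
    out = []
    for row in rows:
        key = extract(row)
        if key is not None:
            out.append((key, row))
    return out


rows = [(1, "a"), (None, "b"), (2, "c"), (3, "d")]
two_step, fused = KeyExtractor(0), KeyExtractor(0)
out_two_step = filter_then_rearrange(rows, two_step)
out_fused = rearrange_skipping_nulls(rows, fused)
```

Both variants emit the same (key, row) pairs; in this model the fused variant simply touches each key once.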

        daijy Daniel Dai added a comment -

        I don't think it would make a noticeable performance difference either way. I'd like to see a modular design rather than intermingling different concepts. Also, I don't feel it is hard to find the join key in the logical optimizer and add a filter on it.
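As a rough illustration of the optimizer-rule approach, the sketch below rewrites a toy logical plan so that every input of an inner join is wrapped in a not-null filter on its join key. The `Load`, `Filter`, and `Join` classes and `add_null_filters` are hypothetical stand-ins invented for this example; they are not Pig's logical plan classes or its rule API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Load:
    alias: str


@dataclass
class Filter:
    source: object
    not_null_field: str


@dataclass
class Join:
    inputs: List[object]
    keys: List[str]
    kind: str = "inner"


def add_null_filters(node):
    """Wrap each inner-join input in a not-null filter on its join key."""
    if isinstance(node, Join):
        inputs = [add_null_filters(child) for child in node.inputs]
        if node.kind == "inner":
            inputs = [
                # Skip inputs that already carry the same filter.
                child
                if isinstance(child, Filter) and child.not_null_field == key
                else Filter(child, key)
                for child, key in zip(inputs, node.keys)
            ]
        return Join(inputs, node.keys, node.kind)
    if isinstance(node, Filter):
        return Filter(add_null_filters(node.source), node.not_null_field)
    return node


plan = Join([Load("A"), Load("B")], ["t", "x"])
rewritten = add_null_filters(plan)
outer = Join([Load("A"), Load("B")], ["t", "x"], kind="left")
```

The rule leaves outer joins untouched, since their null-keyed rows must survive, and applying it twice does not stack filters.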

        rohini Rohini Palaniswamy added a comment -

        Quoting Daniel Dai: "I'd like to see a modular design rather than intermingling different concepts."

        In that case we can extend POLocalRearrange and add a POJoinLocalRearrange that handles join-specific conditions like this one. That would keep the join logic out of POLocalRearrange.

        Quoting Daniel Dai: "I don't feel it is hard to find the join key in the logical optimizer and add a filter on it."

        Adding an extra filter operator for a 3-line check will definitely impact performance when we are dealing with billions of records. We recently had a user who added "is null" bincond checks for many columns in a foreach that processed 10+ billion records, and it took an extra 40+ minutes. Filter and foreach are what we are trying to optimize in PIG-3764 with bytecode generation. Since we are trying to improve performance everywhere and save milliseconds, we should not add a separate operator unless the change is major or complicated, in which case keeping it separate would be cleaner.
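The subclassing idea from this comment might look roughly like the sketch below. It is a toy model only: `LocalRearrange` and `JoinLocalRearrange` are hypothetical Python classes that borrow the names from the discussion and do not reflect the real operators' interfaces. The assumption is that the generic step must keep null keys (for example, for grouping or outer joins), while a join-specific subclass can drop them.

```python
class LocalRearrange:
    """Generic key extraction step; keeps every row, including null keys."""

    def __init__(self, key_index):
        self.key_index = key_index

    def accept(self, key):
        # Hook for subclasses; the generic step accepts all keys.
        return True

    def process(self, rows):
        for row in rows:
            key = row[self.key_index]
            if self.accept(key):
                yield key, row


class JoinLocalRearrange(LocalRearrange):
    """Inner-join variant: rows with a null key are skipped."""

    def accept(self, key):
        return key is not None


rows = [(1, "a"), (None, "b"), (2, "c")]
generic_out = list(LocalRearrange(0).process(rows))
join_out = list(JoinLocalRearrange(0).process(rows))
```

The join-specific check stays in one small override, so the generic step is not changed.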


          People

          • Assignee: Satish Subhashrao Saley (satishsaley)
          • Reporter: Ido Hadanny (ihadanny)
          • Votes: 1
          • Watchers: 3
