Pig / PIG-466

PERFORMANCE: dropping the columns as soon as possible

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None

      Description

      Currently, each operator carries every column of its input until a foreach is encountered, even when later operators never use those columns. This can cause significant performance degradation.

        Activity

        Olga Natkovich added a comment -

        This is part of new optimizer work

        Scott Carey added a comment -

        This is both a performance and usability issue.

        If the optimizer could automatically push projections up to the earliest possible time, it would also unclutter large scripts that manually project 'early and often' for performance reasons.

        I have reason to believe that some of these extra projection lines interfere with certain other performance optimizations as well (on 0.5, multi-query optimization sometimes fails due to extra projections in between, and some forms of projection also break combiner use).

        Dmitriy V. Ryaboy added a comment -

        This was done as PIG-922

        Daniel Dai added a comment -

        PIG-922 partially solves this issue by pushing column pruning down to the loader. However, we can go beyond that. For example:

        a = load '1.txt' as (a0, a1, a2, a3);
        b = filter a by a2==1;
        c = order b by a1;
        d = foreach c generate a0, a1;
        

        PIG-922 is able to figure out that a3 is not needed in the script and does not load it. Going one step further, we can figure out that a2 is no longer needed after b, so we can add a foreach that drops a2 right after b. This is not covered by PIG-922 and is part of the new optimizer work.
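
        For illustration only, the rewrite described above corresponds roughly to the hand-pruned script below. The alias b1 is hypothetical and not taken from this ticket; it stands for the foreach the optimizer would insert so that the order operator no longer has to carry a2.

```pig
a = load '1.txt' as (a0, a1, a2, a3);
-- a2 is still needed here, as the filter condition
b = filter a by a2 == 1;
-- hypothetical inserted projection: a2 is dead after the filter, so drop it
b1 = foreach b generate a0, a1;
-- the sort now moves only the two columns that are actually used
c = order b1 by a1;
d = foreach c generate a0, a1;
```

        This is the kind of manual 'early and often' projection that the optimizer would make unnecessary; the script output is the same as in the original example.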

        Olga Natkovich added a comment -

        This is already resolved as part of PIG-1178


          People

          • Assignee: Daniel Dai
          • Reporter: Olga Natkovich
          • Votes: 0
          • Watchers: 0
