  1. Pig
  2. PIG-2747

Support more predicate pushdown to a data source by pulling up multiple predicates from branches using the same data source

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate

    Description

      Consider the following example:

      T = load ... ;
      T1 = filter T by col == 'hello';
      T2 = filter T by col == 'world';

      Currently the Pig optimizer does not combine the two predicates, so it cannot push them down to the data source (via LoadMetadata). The data source therefore cannot do any filtering, and a full table/file scan is required.

      A more efficient workaround today is to rewrite the script by hand into the following equivalent one:

      T = load ...;
      T = filter T by col == 'hello' or col == 'world' ;
      T1 = filter T by col == 'hello';
      T2 = filter T by col == 'world';

      The rewritten script lets Pig push the predicate (col == 'hello' or col == 'world') down to the data source, which can then use available partitions/indexes for potentially much more efficient processing.

      This JIRA requests that the Pig optimizer perform this type of optimization automatically.
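
      To make the requested rewrite concrete, here is a minimal Python sketch of the idea. It is not Pig code and does not use any Pig API: the names pull_up_predicates and run_branches, and the in-memory table, are hypothetical and exist only for illustration. It ORs the predicates of all filter branches that read the same source, applies the combined predicate at load time, and checks that every branch still produces the same output while fewer rows are loaded.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, str]
Predicate = Callable[[Row], bool]


def pull_up_predicates(branch_predicates: List[Predicate]) -> Predicate:
    """Build the disjunction (p1 OR p2 OR ...) of all branch predicates.

    A row that fails this combined predicate fails every branch, so it
    is safe to drop it at the data source.
    """
    def combined(row: Row) -> bool:
        return any(pred(row) for pred in branch_predicates)
    return combined


def run_branches(
    rows: List[Row],
    branch_predicates: List[Predicate],
    pushed_down: Optional[Predicate] = None,
) -> Tuple[List[List[Row]], int]:
    """Simulate a load followed by one filter per branch.

    Returns the per-branch outputs and the number of rows the
    (hypothetical) data source had to hand over to the script.
    """
    if pushed_down is None:
        loaded = list(rows)  # full scan: every row is loaded
    else:
        loaded = [row for row in rows if pushed_down(row)]  # source-side filter
    outputs = [[row for row in loaded if pred(row)] for pred in branch_predicates]
    return outputs, len(loaded)


# Stand-in for the loaded relation T, with a single column "col".
table = [{"col": "hello"}, {"col": "world"}, {"col": "other"}, {"col": "hello"}]

# The two branch filters from the example: T1 and T2.
branches = [
    lambda r: r["col"] == "hello",
    lambda r: r["col"] == "world",
]

plain_outputs, plain_loaded = run_branches(table, branches)
opt_outputs, opt_loaded = run_branches(table, branches, pull_up_predicates(branches))
```

      In this sketch both runs yield identical T1 and T2 outputs, but the run with the pulled-up predicate loads only the rows that at least one branch can use. A real implementation would additionally have to express the combined predicate in a form the loader accepts.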

          People

            Assignee: Unassigned
            Reporter: Yu Xu (yuxu)