PIG-3620 (sub-task of PIG-3446: Umbrella jira for Pig on Tez)

TezCompiler adds duplicate predecessors of blocking operators to TezPlan

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: tez-branch
    • Fix Version/s: tez-branch
    • Component/s: tez
    • Labels:
      None

      Description

      Here is the simplest example that reproduces the issue:

      test.pig
      a = LOAD 'foo' AS (x:int, y:chararray);
      b = GROUP a BY x;
      c = FOREACH b GENERATE a.x;
      STORE c INTO 'c';
      d = FOREACH b GENERATE a.y;
      STORE d INTO 'd';
      

      If you run pig -x tez_local -e 'explain -script test.pig', you will see two vertices that contain the following sub-plan:

      Tez vertex scope-27
      # Plan on vertex
      b: Local Rearrange[tuple]{int}(false) - scope-10
      |   |
      |   Project[int][0] - scope-11
      |
      |---a: New For Each(false,false)[bag] - scope-7
          |   |
          |   Cast[int] - scope-2
          |   |
          |   |---Project[bytearray][0] - scope-1
          |   |
          |   Cast[chararray] - scope-5
          |   |
          |   |---Project[bytearray][1] - scope-4
          |
          |---a: Load(file:///Users/cheolsoop/workspace/pig/foo:org.apache.pig.builtin.PigStorage) - scope-0
      

      What's happening is that, because there are two stores (and thus two data flows, i.e. a=>c and a=>d), Pig generates two physical plans. TezCompiler then compiles them into a single Tez plan but adds the same sub-plan twice.

      This is an issue with any blocking operator (join, union, etc.) followed by a split.
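The duplication described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Pig's TezCompiler code: the `Op` class, the `predecessor_subplans` function, and the operator name "b: Group" are hypothetical, and each operator is assumed to have a single-input chain upstream. It walks up from every store and, at each blocking operator, records the sub-plan that feeds it. Without tracking which blocking operators were already visited, the same sub-plan is recorded once per store.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so operators can go in a set
class Op:
    """Hypothetical stand-in for a physical operator (not a Pig class)."""
    name: str
    inputs: list = field(default_factory=list)
    blocking: bool = False


def predecessor_subplans(stores, dedupe=True):
    """Collect the sub-plan feeding each blocking operator, walking up from every store."""
    added = []    # sub-plans added to the toy plan, each a list of operator names
    seen = set()  # blocking operators whose predecessors were already added

    def chain(op):
        # Names from the source down to op, stopping at the next blocking operator.
        # Simplification: only the first input of each operator is followed.
        names = []
        while op is not None and not op.blocking:
            names.append(op.name)
            op = op.inputs[0] if op.inputs else None
        return names[::-1]

    def walk(op):
        if op.blocking:
            if dedupe and op in seen:
                return  # predecessors of this operator are already in the plan
            seen.add(op)
            for pred in op.inputs:
                added.append(chain(pred))
        for pred in op.inputs:
            walk(pred)

    for store in stores:
        walk(store)
    return added


# The script from the description: one GROUP (blocking) feeding two stores.
load = Op("a: Load")
foreach_a = Op("a: New For Each", [load])
rearrange = Op("b: Local Rearrange", [foreach_a])
group = Op("b: Group", [rearrange], blocking=True)
store_c = Op("c: Store", [Op("c: For Each", [group])])
store_d = Op("d: Store", [Op("d: For Each", [group])])

duplicated = predecessor_subplans([store_c, store_d], dedupe=False)
deduped = predecessor_subplans([store_c, store_d], dedupe=True)
print(len(duplicated), len(deduped))
```

With `dedupe=False` the Load / For Each / Local Rearrange sub-plan is recorded once for each store, which mirrors the two identical vertices in the explain output; with `dedupe=True` it is recorded once.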

      Attachments

      1. PIG-3620-1.patch (49 kB, Rohini Palaniswamy)

        Activity

        Cheolsoo Park created issue -
        Rohini Palaniswamy made changes -
        Assignee: (none) -> Rohini Palaniswamy [ rohini ]
        Rohini Palaniswamy added a comment -

        https://reviews.apache.org/r/16272/

        • Removed the duplicate operators in case of split
        • Fixed multiple levels of nested splits to work
        • Added some enhancements to plan printing for easy debugging
        • Print the connectivity between the vertices in a DAG
        • Print which Tez vertex a POLocalRearrange connects to.
        • Changed TestTezCompiler to also include the combiner optimizer to verify the combiner plan as well.

        Testing:

        • Added tests to TestTezCompiler
        • Will add the e2e tests for Split with PIG-3626. MR multi-query is also broken now. Need to fix that as well for e2e to work.
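As an illustration of the "print the connectivity between the vertices" item, the following sketch formats the edges of a small DAG as text. It is not the code from the patch and does not reproduce Pig's actual explain output: the function name `format_connectivity`, the output format, and the vertex name scope-30 are assumptions made for this example (only scope-27 appears in the description).

```python
def format_connectivity(edges):
    """Return one line per vertex listing its successors.

    edges: dict mapping a vertex name to a list of successor vertex names.
    The output format is made up for this sketch.
    """
    lines = []
    for vertex in sorted(edges):
        successors = sorted(edges[vertex])
        if successors:
            lines.append(f"{vertex} -> {', '.join(successors)}")
        else:
            lines.append(f"{vertex} (no successors)")
    return lines


# Hypothetical two-vertex DAG: scope-27 feeds scope-30.
dag = {"scope-27": ["scope-30"], "scope-30": []}
connectivity = format_connectivity(dag)
for line in connectivity:
    print(line)
```

A listing like this makes a duplicated predecessor easy to spot, since two vertices with the same sub-plan would show up as separate entries pointing at the same successor.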
        Rohini Palaniswamy made changes -
        Attachment: PIG-3620-1.patch [ 12618823 ]
        Rohini Palaniswamy made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Cheolsoo Park added a comment -

        +1.

        Rohini Palaniswamy added a comment -

        Committed to Tez branch. Thanks for the review Cheolsoo.

        Rohini Palaniswamy made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Resolution: Fixed [ 1 ]
        Praveen Rachabattuni made changes -
        Description: corrected the typo tex_local to tez_local in the pig command; the text is otherwise identical to the Description above.
        Daniel Dai made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]
        Transition                    Time In Source Status   Execution Times   Last Executer        Last Execution Date
        Open -> Patch Available       3d 2h                   1                 Rohini Palaniswamy   15/Dec/13 20:21
        Patch Available -> Resolved   21h 25m                 1                 Rohini Palaniswamy   16/Dec/13 17:46
        Resolved -> Closed            339d 12h 12m            1                 Daniel Dai           21/Nov/14 05:58

          People

          • Assignee: Rohini Palaniswamy
          • Reporter: Cheolsoo Park
          • Votes: 0
          • Watchers: 2
