Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: tez-branch
    • Fix Version/s: tez-branch
    • Component/s: tez
    • Labels:
      None

      Description

      Unlike MR, union needs to be implemented as multiple vertices because Tez doesn't allow multiple inputs in root vertex.

      I propose that we implement union by connecting load vertices to union vertex with broadcasting edges.

      1. PIG-3585-1.patch
        23 kB
        Cheolsoo Park
      2. PIG-3585-2.patch
        22 kB
        Cheolsoo Park

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Patch Available Patch Available Open Open
        3d 18h 50m 1 Cheolsoo Park 26/Nov/13 18:52
        Open Open Patch Available Patch Available
        4d 12h 46m 2 Cheolsoo Park 01/Dec/13 07:02
        Patch Available Patch Available Resolved Resolved
        16h 9m 1 Cheolsoo Park 01/Dec/13 23:11
        Resolved Resolved Closed Closed
        354d 6h 47m 1 Daniel Dai 21/Nov/14 05:59
        Daniel Dai made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Cheolsoo Park made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Cheolsoo Park added a comment -

        Committed to tez branch.

        Show
        Cheolsoo Park added a comment - Committed to tez branch.
        Cheolsoo Park made changes -
        Attachment PIG-3585-2.patch [ 12616490 ]
        Hide
        Cheolsoo Park added a comment - - edited

        One minor clean-up. I made getPackage() take the number of inputs as parameter and removed getUnion() in TezCompiler.

        Show
        Cheolsoo Park added a comment - - edited One minor clean-up. I made getPackage() take the number of inputs as parameter and removed getUnion() in TezCompiler.
        Hide
        Cheolsoo Park added a comment -

        Yes, it should be possible. Alternatively, I am thinking that we could attach multiple file inputs directly to the root vertex, and each task will pull the inputs in order. But only one of them will actually return records depending on which split the task lands on. The Tez error message implies Tez will support multiple inputs for the root vertex eventually.

        Anyway, we can discuss further later. Thank you again for the review!

        Show
        Cheolsoo Park added a comment - Yes, it should be possible. Alternatively, I am thinking that we could attach multiple file inputs directly to the root vertex, and each task will pull the inputs in order. But only one of them will actually return records depending on which split the task lands on. The Tez error message implies Tez will support multiple inputs for the root vertex eventually. Anyway, we can discuss further later. Thank you again for the review!
        Hide
        Rohini Palaniswamy added a comment -

        +1. The code is good if we have union after some processing. With this patch a simple union takes 3 vertices (2 load and 1 union) instead of 1 vertex. We should make it optimal to handle in a single vertex.

        Tez doesn't allow multiple inputs for the root vertex. We probably can't implement UNION with a single vertex.

        With MR, we just load both inputs as two different PigSplit in PigInputFormat. That should be possible here too right?

        Show
        Rohini Palaniswamy added a comment - +1. The code is good if we have union after some processing. With this patch a simple union takes 3 vertices (2 load and 1 union) instead of 1 vertex. We should make it optimal to handle in a single vertex. Tez doesn't allow multiple inputs for the root vertex. We probably can't implement UNION with a single vertex. With MR, we just load both inputs as two different PigSplit in PigInputFormat. That should be possible here too right?
        Cheolsoo Park made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Cheolsoo Park made changes -
        Attachment PIG-3585-1.patch [ 12616463 ]
        Hide
        Cheolsoo Park added a comment -

        Attaching a patch. RB link-
        https://reviews.apache.org/r/15931/

        Show
        Cheolsoo Park added a comment - Attaching a patch. RB link- https://reviews.apache.org/r/15931/
        Cheolsoo Park made changes -
        Comment [ Adds two test cases for sample and union. ]
        Cheolsoo Park made changes -
        Attachment PIG-3585-1.patch [ 12615414 ]
        Cheolsoo Park made changes -
        Summary Add e2e tests for SAMPLE and UNION Make union work with tez
        Description Unlike MR, union needs to be implemented as multiple vertices because Tez doesn't allow multiple inputs in root vertex.

        I propose that we implement union by connecting load vertices to union vertex with broadcasting edges.
        Hide
        Cheolsoo Park added a comment -

        As for UNION, Tez doesn't allow multiple inputs for the root vertex. So when adding more than one input to the vertex for UNION, TezDagBuilder fails with the following error-

        Caused by: java.lang.IllegalStateException: For now, only a single Root Input can be added to a Vertex
            at org.apache.tez.dag.api.Vertex.addInput(Vertex.java:156)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.newVertex(TezDagBuilder.java:415)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.visitTezOp(TezDagBuilder.java:136)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezOperator.visit(TezOperator.java:117)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezOperator.visit(TezOperator.java:37)
            at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:70)
            at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezJobControlCompiler.buildDAG(TezJobControlCompiler.java:56)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezJobControlCompiler.getJob(TezJobControlCompiler.java:103)
            ... 19 more
        

        We probably can't implement UNION with a single vertex.

        Show
        Cheolsoo Park added a comment - As for UNION, Tez doesn't allow multiple inputs for the root vertex. So when adding more than one input to the vertex for UNION, TezDagBuilder fails with the following error- Caused by: java.lang.IllegalStateException: For now, only a single Root Input can be added to a Vertex at org.apache.tez.dag.api.Vertex.addInput(Vertex.java:156) at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.newVertex(TezDagBuilder.java:415) at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.visitTezOp(TezDagBuilder.java:136) at org.apache.pig.backend.hadoop.executionengine.tez.TezOperator.visit(TezOperator.java:117) at org.apache.pig.backend.hadoop.executionengine.tez.TezOperator.visit(TezOperator.java:37) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:70) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46) at org.apache.pig.backend.hadoop.executionengine.tez.TezJobControlCompiler.buildDAG(TezJobControlCompiler.java:56) at org.apache.pig.backend.hadoop.executionengine.tez.TezJobControlCompiler.getJob(TezJobControlCompiler.java:103) ... 19 more We probably can't implement UNION with a single vertex.
        Hide
        Cheolsoo Park added a comment -

        The reason why UNION duplicates records in tez is because of the following code:

        PigProcessor.java
        Tuple inputTuple = input.getCurrentTuple();
        
        for (PhysicalOperator root : roots) {
            root.attachInput(inputTuple);
        }
        
        runPipeline(leaf);
        

        Each task loads a tuple from an input split and attaches to every root of the pipeline. Since UNION has multiple roots, it ends up adding the same tuple multiple times.

        Looks like PIG-3527 is fixing this problem, so I will wait until PIG-3527 gets committed.

        Show
        Cheolsoo Park added a comment - The reason why UNION duplicates records in tez is because of the following code: PigProcessor.java Tuple inputTuple = input.getCurrentTuple(); for (PhysicalOperator root : roots) { root.attachInput(inputTuple); } runPipeline(leaf); Each task loads a tuple from an input split and attaches to every root of the pipeline. Since UNION has multiple roots, it ends up adding the same tuple multiple times. Looks like PIG-3527 is fixing this problem, so I will wait until PIG-3527 gets committed.
        Cheolsoo Park made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hide
        Cheolsoo Park added a comment -

        I actually realized UINON in tez duplicates records. Let me debug it.

        Show
        Cheolsoo Park added a comment - I actually realized UINON in tez duplicates records. Let me debug it.
        Cheolsoo Park made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Cheolsoo Park made changes -
        Field Original Value New Value
        Attachment PIG-3585-1.patch [ 12615414 ]
        Cheolsoo Park created issue -

          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Cheolsoo Park
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development