Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: tez-branch
    • Fix Version/s: tez-branch
    • Component/s: tez
    • Labels:
      None

      Description

      Unlike MR, union needs to be implemented as multiple vertices because Tez doesn't allow multiple inputs in root vertex.

      I propose that we implement union by connecting load vertices to union vertex with broadcasting edges.

      1. PIG-3585-1.patch
        23 kB
        Cheolsoo Park
      2. PIG-3585-2.patch
        22 kB
        Cheolsoo Park

        Activity

        Hide
        Cheolsoo Park added a comment -

        Committed to tez branch.

        Show
        Cheolsoo Park added a comment - Committed to tez branch.
        Hide
        Cheolsoo Park added a comment - - edited

        One minor clean-up. I made getPackage() take the number of inputs as parameter and removed getUnion() in TezCompiler.

        Show
        Cheolsoo Park added a comment - - edited One minor clean-up. I made getPackage() take the number of inputs as parameter and removed getUnion() in TezCompiler.
        Hide
        Cheolsoo Park added a comment -

        Yes, it should be possible. Alternatively, I am thinking that we could attach multiple file inputs directly to the root vertex, and each task will pull the inputs in order. But only one of them will actually return records depending on which split the task lands on. The Tez error message implies Tez will support multiple inputs for the root vertex eventually.

        Anyway, we can discuss further later. Thank you again for the review!

        Show
        Cheolsoo Park added a comment - Yes, it should be possible. Alternatively, I am thinking that we could attach multiple file inputs directly to the root vertex, and each task will pull the inputs in order. But only one of them will actually return records depending on which split the task lands on. The Tez error message implies Tez will support multiple inputs for the root vertex eventually. Anyway, we can discuss further later. Thank you again for the review!
        Hide
        Rohini Palaniswamy added a comment -

        +1. The code is good if we have union after some processing. With this patch a simple union takes 3 vertices (2 load and 1 union) instead of 1 vertex. We should make it optimal to handle in a single vertex.

        Tez doesn't allow multiple inputs for the root vertex. We probably can't implement UNION with a single vertex.

        With MR, we just load both inputs as two different PigSplit in PigInputFormat. That should be possible here too right?

        Show
        Rohini Palaniswamy added a comment - +1. The code is good if we have union after some processing. With this patch a simple union takes 3 vertices (2 load and 1 union) instead of 1 vertex. We should make it optimal to handle in a single vertex. Tez doesn't allow multiple inputs for the root vertex. We probably can't implement UNION with a single vertex. With MR, we just load both inputs as two different PigSplit in PigInputFormat. That should be possible here too right?
        Hide
        Cheolsoo Park added a comment -

        Attaching a patch. RB link-
        https://reviews.apache.org/r/15931/

        Show
        Cheolsoo Park added a comment - Attaching a patch. RB link- https://reviews.apache.org/r/15931/
        Hide
        Cheolsoo Park added a comment -

        As for UNION, Tez doesn't allow multiple inputs for the root vertex. So when adding more than one input to the vertex for UNION, TezDagBuilder fails with the following error-

        Caused by: java.lang.IllegalStateException: For now, only a single Root Input can be added to a Vertex
            at org.apache.tez.dag.api.Vertex.addInput(Vertex.java:156)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.newVertex(TezDagBuilder.java:415)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.visitTezOp(TezDagBuilder.java:136)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezOperator.visit(TezOperator.java:117)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezOperator.visit(TezOperator.java:37)
            at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:70)
            at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezJobControlCompiler.buildDAG(TezJobControlCompiler.java:56)
            at org.apache.pig.backend.hadoop.executionengine.tez.TezJobControlCompiler.getJob(TezJobControlCompiler.java:103)
            ... 19 more
        

        We probably can't implement UNION with a single vertex.

        Show
        Cheolsoo Park added a comment - As for UNION, Tez doesn't allow multiple inputs for the root vertex. So when adding more than one input to the vertex for UNION, TezDagBuilder fails with the following error- Caused by: java.lang.IllegalStateException: For now, only a single Root Input can be added to a Vertex at org.apache.tez.dag.api.Vertex.addInput(Vertex.java:156) at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.newVertex(TezDagBuilder.java:415) at org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder.visitTezOp(TezDagBuilder.java:136) at org.apache.pig.backend.hadoop.executionengine.tez.TezOperator.visit(TezOperator.java:117) at org.apache.pig.backend.hadoop.executionengine.tez.TezOperator.visit(TezOperator.java:37) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:70) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46) at org.apache.pig.backend.hadoop.executionengine.tez.TezJobControlCompiler.buildDAG(TezJobControlCompiler.java:56) at org.apache.pig.backend.hadoop.executionengine.tez.TezJobControlCompiler.getJob(TezJobControlCompiler.java:103) ... 19 more We probably can't implement UNION with a single vertex.
        Hide
        Cheolsoo Park added a comment -

        The reason why UNION duplicates records in tez is because of the following code:

        PigProcessor.java
        Tuple inputTuple = input.getCurrentTuple();
        
        for (PhysicalOperator root : roots) {
            root.attachInput(inputTuple);
        }
        
        runPipeline(leaf);
        

        Each task loads a tuple from an input split and attaches to every root of the pipeline. Since UNION has multiple roots, it ends up adding the same tuple multiple times.

        Looks like PIG-3527 is fixing this problem, so I will wait until PIG-3527 gets committed.

        Show
        Cheolsoo Park added a comment - The reason why UNION duplicates records in tez is because of the following code: PigProcessor.java Tuple inputTuple = input.getCurrentTuple(); for (PhysicalOperator root : roots) { root.attachInput(inputTuple); } runPipeline(leaf); Each task loads a tuple from an input split and attaches to every root of the pipeline. Since UNION has multiple roots, it ends up adding the same tuple multiple times. Looks like PIG-3527 is fixing this problem, so I will wait until PIG-3527 gets committed.
        Hide
        Cheolsoo Park added a comment -

        I actually realized UINON in tez duplicates records. Let me debug it.

        Show
        Cheolsoo Park added a comment - I actually realized UINON in tez duplicates records. Let me debug it.

          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Cheolsoo Park
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development