PIG-3957

Refactor out resetting input key in TezDagBuilder

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: tez
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      TezDagBuilder resets the input keys of every POPackage, POValueInputTez and POIdentityInOutTez operator, which is very confusing. We should refactor this logic out.

      1. PIG-3957-1.patch
        10 kB
        Rohini Palaniswamy

          Activity

          Rohini Palaniswamy added a comment -

          This causes some queries to fail with one of the errors below if a vertex takes input from both a union and a replicated join table. The input keys are overwritten in TezDagBuilder, and both of them end up pointing to the same input.

          Caused by: java.io.IOException: Please check if you are invoking next() even after it returned false. For usage, please refer to KeyValueReader javadocs
          	at org.apache.tez.runtime.library.api.KeyValueReader.hasCompletedProcessing(KeyValueReader.java:77)
          	at org.apache.tez.runtime.library.common.readers.UnorderedKVReader.moveToNextInput(UnorderedKVReader.java:190)
          	at org.apache.tez.runtime.library.common.readers.UnorderedKVReader.next(UnorderedKVReader.java:118)
          	at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POValueInputTez.getNextTuple(POValueInputTez.java:124)
          

          or

          Caused by: java.lang.ClassCastException: org.apache.pig.impl.io.NullableTuple cannot be cast to org.apache.pig.data.Tuple
          	at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POValueInputTez.getNextTuple(POValueInputTez.java:126)
          	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
          	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:252)
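
As a rough illustration of the failure mode described above (not the actual Pig or Tez code), the sketch below models a vertex's inputs as a map from input key to a single-use iterator. Every name in it (`InputKeyCollision`, `Reader`, `readBoth`, and the keys "union" and "replicated") is hypothetical. When both readers end up with the same key, the second one finds its input already consumed, which loosely mirrors the first error; a reader handed another input's records would correspond to the ClassCastException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class InputKeyCollision {
    /** Hypothetical stand-in for an operator that reads one named input. */
    static final class Reader {
        final String inputKey;

        Reader(String inputKey) {
            this.inputKey = inputKey;
        }

        /** Reads every remaining record from the input this reader's key points to. */
        List<Object> drain(Map<String, Iterator<Object>> inputs) {
            List<Object> out = new ArrayList<>();
            Iterator<Object> it = inputs.get(inputKey);
            while (it.hasNext()) {
                out.add(it.next());
            }
            return out;
        }
    }

    /** Builds two single-use inputs, as a vertex fed by a union and a replicated table might have. */
    static Map<String, Iterator<Object>> newInputs() {
        Map<String, Iterator<Object>> inputs = new HashMap<>();
        inputs.put("union", Arrays.<Object>asList("u1", "u2").iterator());
        inputs.put("replicated", Arrays.<Object>asList("r1").iterator());
        return inputs;
    }

    /** Drains two readers in turn and returns what each one saw. */
    static List<List<Object>> readBoth(String keyA, String keyB) {
        Map<String, Iterator<Object>> inputs = newInputs();
        List<List<Object>> seen = new ArrayList<>();
        seen.add(new Reader(keyA).drain(inputs));
        seen.add(new Reader(keyB).drain(inputs));
        return seen;
    }

    public static void main(String[] args) {
        // Distinct keys: each reader sees its own records.
        System.out.println(readBoth("union", "replicated"));
        // Overwritten keys: the second reader finds the shared input exhausted.
        System.out.println(readBoth("union", "union"));
    }
}
```

In this toy model the replicated input is never read at all once the keys collide, which is why keeping each operator's own input key intact matters.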
          
          Rohini Palaniswamy added a comment - - edited

          The only difference between testFRJoinOut8 and testFRJoinOut9 was 'replicated' vs 'repl', so I repurposed testFRJoinOut9 for the new test.

          The code in TezDagBuilder that was put in for scalars is no longer needed, so I removed it entirely. TestScalarAlias, Scalar and CastScalar e2e tests pass. Running the full e2e suite now.

          Additionally, I optimized the scalar more-than-one-row check by performing it while writing the output. MultiQueryOptimizer and UnionOptimizer still do not handle it; created PIG-4692 for that. The check in ReadScalarsTez is still there and cannot be removed, because it is needed when multiple source tasks write 0 or 1 record each. In that case nothing fails while writing, but the read fails because there is more than 1 record in total.
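
To make the write-side versus read-side distinction concrete, here is a minimal sketch under stated assumptions; it is not the ReadScalarsTez implementation. `ScalarRowCheck`, `TaskWriter`, `readScalar` and `outcome` are hypothetical names. Each task's writer rejects a second record immediately, while the reader still has to count records across all source tasks.

```java
import java.util.ArrayList;
import java.util.List;

class ScalarRowCheck {
    /** Hypothetical per-task writer: rejects a second record as soon as it is written. */
    static final class TaskWriter {
        private final List<Object> records = new ArrayList<>();

        void write(Object record) {
            if (!records.isEmpty()) {
                throw new IllegalStateException("more than one row written by a single task");
            }
            records.add(record);
        }

        List<Object> output() {
            return records;
        }
    }

    /** Hypothetical read-side check over the combined output of all source tasks. */
    static Object readScalar(List<TaskWriter> sourceTasks) {
        Object value = null;
        int seen = 0;
        for (TaskWriter task : sourceTasks) {
            for (Object record : task.output()) {
                seen++;
                if (seen > 1) {
                    throw new IllegalStateException("more than one row across source tasks");
                }
                value = record;
            }
        }
        return value;
    }

    /** Runs the write phase, then the read phase, and reports where (if anywhere) it failed. */
    static String outcome(Object[][] perTaskRecords) {
        List<TaskWriter> tasks = new ArrayList<>();
        try {
            for (Object[] records : perTaskRecords) {
                TaskWriter writer = new TaskWriter();
                for (Object record : records) {
                    writer.write(record);
                }
                tasks.add(writer);
            }
        } catch (IllegalStateException e) {
            return "write-failure";
        }
        try {
            readScalar(tasks);
        } catch (IllegalStateException e) {
            return "read-failure";
        }
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(outcome(new Object[][] {{"a"}, {}}));    // one record in total
        System.out.println(outcome(new Object[][] {{"a", "b"}}));   // one task writes two records
        System.out.println(outcome(new Object[][] {{"a"}, {"b"}})); // two tasks write one record each
    }
}
```

The third case is the one the comment calls out: every task passes its own write check, so only the reader can detect the extra row.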

          Daniel Dai added a comment -

          +1

          Rohini Palaniswamy added a comment -

          Committed to trunk. Thanks for the review Daniel.


            People

            • Assignee: Rohini Palaniswamy
            • Reporter: Daniel Dai
            • Votes: 0
            • Watchers: 2
