Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-9976

Possible race condition in DynamicPartitionPruner for <200ms tasks

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.2.0
    • Component/s: Tez
    • Labels:
      None

      Description

      Race condition in the DynamicPartitionPruner between DynamicPartitionPruner::processVertex() and DynamicPartitionpruner::addEvent() for tasks which respond with both the result and success in a single heartbeat sequence.

      2015-03-16 07:05:01,589 ERROR [InputInitializer [Map 1] #0] tez.DynamicPartitionPruner: Expecting: 1, received: 0
      2015-03-16 07:05:01,590 ERROR [Dispatcher thread: Central] impl.VertexImpl: Vertex Input: store_sales initializer failed, vertex=vertex_1424502260528_1113_4_04 [Map 1]
      org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.hadoop.hive.ql.metadata.HiveException: Incorrect event count in dynamic parition pruning
      

      All 4 upstream vertices of Map 1 need to finish within ~200ms to trigger this, which seems to be consistently happening with LLAP.

      1. HIVE-9976.1.patch
        35 kB
        Siddharth Seth
      2. HIVE-9976.2.patch
        52 kB
        Siddharth Seth
      3. llap_vertex_200ms.png
        14 kB
        Gopal V

        Activity

        Hide
        hagleitn Gunther Hagleitner added a comment -

        Siddharth Seth I thought this can't happen. The Tez API was supposed to guarantee delivery of events before completion. Should I open a Tez issue?

        Show
        hagleitn Gunther Hagleitner added a comment - Siddharth Seth I thought this can't happen. The Tez API was supposed to guarantee delivery of events before completion. Should I open a Tez issue?
        Hide
        sseth Siddharth Seth added a comment -

        I'll take a look. Assuming this was run with Tez 0.7 snapshot ?

        Show
        sseth Siddharth Seth added a comment - I'll take a look. Assuming this was run with Tez 0.7 snapshot ?
        Hide
        hagleitn Gunther Hagleitner added a comment -

        Gopal V said that it was on tez 0.7.

        Show
        hagleitn Gunther Hagleitner added a comment - Gopal V said that it was on tez 0.7.
        Hide
        sseth Siddharth Seth added a comment -

        This is not limited to LLAP. Assigning to myself - to change the handling of vertex success / init events.

        Show
        sseth Siddharth Seth added a comment - This is not limited to LLAP. Assigning to myself - to change the handling of vertex success / init events.
        Hide
        sseth Siddharth Seth added a comment -

        Patch to handle out of order events. Also initializes the pruner during Input construction - so that events don't show up before the pruner is initialized. Adds a bunch of tests.

        Gunther Hagleitner, Vikram Dixit K - please review.

        Show
        sseth Siddharth Seth added a comment - Patch to handle out of order events. Also initializes the pruner during Input construction - so that events don't show up before the pruner is initialized. Adds a bunch of tests. Gunther Hagleitner , Vikram Dixit K - please review.
        Hide
        hiveqa Hive QA added a comment -

        Overall: -1 at least one tests failed

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12707029/HIVE-9976.1.patch

        ERROR: -1 due to 1 failed/errored test(s), 8347 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_schemeAuthority
        

        Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3139/testReport
        Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3139/console
        Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3139/

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests exited with: TestsFailedException: 1 tests failed
        

        This message is automatically generated.

        ATTACHMENT ID: 12707029 - PreCommit-HIVE-TRUNK-Build

        Show
        hiveqa Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12707029/HIVE-9976.1.patch ERROR: -1 due to 1 failed/errored test(s), 8347 tests executed Failed tests: org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_schemeAuthority Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3139/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3139/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3139/ Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 1 tests failed This message is automatically generated. ATTACHMENT ID: 12707029 - PreCommit-HIVE-TRUNK-Build
        Hide
        hagleitn Gunther Hagleitner added a comment -

        Not your fault - but there are 2 paths through HiveSplitGenerator. The class is used once without calling init and once being properly init'd. The reason is that some other code needs to use the "group splits" method. Since you've moved init to the constr now, this has gotten even uglier. Could you move the split grouper methods to a separate util class (static) and leave the pruner to just prune.

        Also - I think you've moved the initialization of the dynamic pruner to the constr of the input initializer, in order to not miss any events. Can you add a comment to the code explaining this?

        Very cool to see a real unit test thanks.

        Show
        hagleitn Gunther Hagleitner added a comment - Not your fault - but there are 2 paths through HiveSplitGenerator. The class is used once without calling init and once being properly init'd. The reason is that some other code needs to use the "group splits" method. Since you've moved init to the constr now, this has gotten even uglier. Could you move the split grouper methods to a separate util class (static) and leave the pruner to just prune. Also - I think you've moved the initialization of the dynamic pruner to the constr of the input initializer, in order to not miss any events. Can you add a comment to the code explaining this? Very cool to see a real unit test thanks.
        Hide
        hagleitn Gunther Hagleitner added a comment -

        i see you've fixed calling close consistently on the data input stream. maybe use try{}finally there?

        Show
        hagleitn Gunther Hagleitner added a comment - i see you've fixed calling close consistently on the data input stream. maybe use try{}finally there?
        Hide
        hagleitn Gunther Hagleitner added a comment -

        it seems you're setting numexpectedevents to 0 first and then turn around and call decrement. Why not just set to -1? Also - why atomic integers? as far as i can tell all access to these maps is synchronized.

        Show
        hagleitn Gunther Hagleitner added a comment - it seems you're setting numexpectedevents to 0 first and then turn around and call decrement. Why not just set to -1? Also - why atomic integers? as far as i can tell all access to these maps is synchronized.
        Hide
        hagleitn Gunther Hagleitner added a comment -

        does it make sense to make initialize in the pruner private now? (can't be used to init anymore - only from the constr). Also, the parameters aren't used anymore, right?

        Show
        hagleitn Gunther Hagleitner added a comment - does it make sense to make initialize in the pruner private now? (can't be used to init anymore - only from the constr). Also, the parameters aren't used anymore, right?
        Hide
        hagleitn Gunther Hagleitner added a comment -

        other than the above looks good to me. like the extra comments and conditions you've put in!

        Show
        hagleitn Gunther Hagleitner added a comment - other than the above looks good to me. like the extra comments and conditions you've put in!
        Hide
        sseth Siddharth Seth added a comment -

        Thanks for the review. Updated patch with comments addressed, and some more changes.

        Not your fault - but there are 2 paths through HiveSplitGenerator.

        Moved the methods into SplitGrouper. There's a static cache in there which seems a little strange. Will create a follow up jira to investigate this. For now I've changed that to a ConcurrentMap since split generation can run in parallel.

        i see you've fixed calling close consistently on the data input stream. maybe use try{}finally there?

        Fixed. There was a bug with some of the other conditions which I'd changed. Fixed that as well.

        it seems you're setting numexpectedevents to 0 first and then turn around and call decrement. Why not just set to -1? Also - why atomic integers? as far as i can tell all access to these maps is synchronized.

        numExpectedEvents is decremented for each column for which a source will send events. That's used to track total number of expected events from that source. Added a comment for this.
        Moved from AtomicIntegers to MutableInt - this was just to avoid re-inserting the Integer into the map, and not for thread safety.

        does it make sense to make initialize in the pruner private now? (can't be used to init anymore - only from the constr). Also, the parameters aren't used anymore, right?

        Done, along with some other methods.

        Show
        sseth Siddharth Seth added a comment - Thanks for the review. Updated patch with comments addressed, and some more changes. Not your fault - but there are 2 paths through HiveSplitGenerator. Moved the methods into SplitGrouper. There's a static cache in there which seems a little strange. Will create a follow up jira to investigate this. For now I've changed that to a ConcurrentMap since split generation can run in parallel. i see you've fixed calling close consistently on the data input stream. maybe use try{}finally there? Fixed. There was a bug with some of the other conditions which I'd changed. Fixed that as well. it seems you're setting numexpectedevents to 0 first and then turn around and call decrement. Why not just set to -1? Also - why atomic integers? as far as i can tell all access to these maps is synchronized. numExpectedEvents is decremented for each column for which a source will send events. That's used to track total number of expected events from that source. Added a comment for this. Moved from AtomicIntegers to MutableInt - this was just to avoid re-inserting the Integer into the map, and not for thread safety. does it make sense to make initialize in the pruner private now? (can't be used to init anymore - only from the constr). Also, the parameters aren't used anymore, right? Done, along with some other methods.
        Hide
        hagleitn Gunther Hagleitner added a comment -

        +1

        Show
        hagleitn Gunther Hagleitner added a comment - +1
        Hide
        hiveqa Hive QA added a comment -

        Overall: +1 all checks pass

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12707303/HIVE-9976.2.patch

        SUCCESS: +1 8347 tests passed

        Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3159/testReport
        Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3159/console
        Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3159/

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        

        This message is automatically generated.

        ATTACHMENT ID: 12707303 - PreCommit-HIVE-TRUNK-Build

        Show
        hiveqa Hive QA added a comment - Overall : +1 all checks pass Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12707303/HIVE-9976.2.patch SUCCESS: +1 8347 tests passed Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3159/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3159/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3159/ Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase This message is automatically generated. ATTACHMENT ID: 12707303 - PreCommit-HIVE-TRUNK-Build
        Hide
        hagleitn Gunther Hagleitner added a comment -

        Committed to trunk. Thanks Siddharth Seth!

        Show
        hagleitn Gunther Hagleitner added a comment - Committed to trunk. Thanks Siddharth Seth !
        Hide
        sushanth Sushanth Sowmyan added a comment -

        This issue has been fixed and released as part of the 1.2.0 release. If you find an issue which seems to be related to this one, please create a new jira and link this one with new jira.

        Show
        sushanth Sushanth Sowmyan added a comment - This issue has been fixed and released as part of the 1.2.0 release. If you find an issue which seems to be related to this one, please create a new jira and link this one with new jira.

          People

          • Assignee:
            sseth Siddharth Seth
            Reporter:
            gopalv Gopal V
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development