Hadoop Map/Reduce / MAPREDUCE-2504

MR 279: race in JobHistoryEventHandler stop

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: mrv2
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The condition to stop the eventHandling thread currently requires it to be 'stopped' AND interrupted. If an interrupt arrives after eventQueue.take() returns but before handleEvent is called, the interrupt status ends up being handled by hadoop.util.Shell.runCommand(), which ignores it (and in the process resets the flag).
      The eventHandling thread subsequently hangs on the next eventQueue.take().
      This currently causes random unit test failures and can hang MR AMs.
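
The lost-interrupt sequence described above can be reproduced in isolation. The following is a minimal sketch, not the JobHistoryEventHandler code: the class and method names are hypothetical, and `calleeThatSwallowsInterrupt()` is only a stand-in for whatever code on the handleEvent path clears the flag (the description names hadoop.util.Shell.runCommand()). It compares the AND exit condition with an OR exit condition after the flag has been reset.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class LostInterruptDemo {
    // Hypothetical stand-in for the handler's 'stopped' flag.
    static final AtomicBoolean stopped = new AtomicBoolean(false);

    // Hypothetical stand-in for a callee that consumes the interrupt status.
    static void calleeThatSwallowsInterrupt() {
        Thread.interrupted(); // returns the interrupt status and clears it
    }

    // Exit condition as described in the report: stopped AND interrupted.
    static boolean shouldExitAnd() {
        return stopped.get() && Thread.currentThread().isInterrupted();
    }

    // Alternative exit condition: stopped OR interrupted.
    static boolean shouldExitOr() {
        return stopped.get() || Thread.currentThread().isInterrupted();
    }

    // Replays the sequence on the calling thread and returns
    // { result of the AND check, result of the OR check }.
    static boolean[] simulate() {
        stopped.set(true);                  // stop() marks the handler as stopped
        Thread.currentThread().interrupt(); // ...and interrupts, just after take() returned
        calleeThatSwallowsInterrupt();      // the callee resets the flag
        return new boolean[] { shouldExitAnd(), shouldExitOr() };
    }

    public static void main(String[] args) {
        boolean[] r = simulate();
        System.out.println("AND exits: " + r[0] + ", OR exits: " + r[1]);
        // prints "AND exits: false, OR exits: true"
    }
}
```

With the AND condition the loop would go back to a blocking take() that nothing will ever interrupt again, which matches the hang reported here.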

      1. MR2504_3.patch
        4 kB
        Siddharth Seth
      2. MR2504_2.patch
        4 kB
        Siddharth Seth
      3. MR2504.patch
        4 kB
        Siddharth Seth

        Activity

        Siddharth Seth added a comment -

        Patch. Changes the stop condition to 'stopped' OR 'interrupted'.
        Clears the interrupted status, if it is set, before calling handleEvent.
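
The shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the contents of the attached patch: `EventLoopSketch`, `put`, and `handledCount` are hypothetical names, events are plain strings, and draining the queue inside `stop()` is an assumption about how remaining events get handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EventLoopSketch {
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private final List<String> handled = new ArrayList<>(); // guarded by "this"
    private volatile boolean stopped = false;
    private Thread eventHandlingThread;

    public void start() {
        eventHandlingThread = new Thread(() -> {
            // Exit when stopped OR interrupted: either signal alone is enough.
            while (!stopped && !Thread.currentThread().isInterrupted()) {
                String event;
                try {
                    event = eventQueue.take();
                } catch (InterruptedException e) {
                    return; // interrupted while blocked, no event was removed
                }
                // Clear a pending interrupt so handleEvent does not observe it.
                // 'stopped' is already true in that case, so the loop still exits.
                Thread.interrupted();
                handleEvent(event);
            }
        });
        eventHandlingThread.start();
    }

    public void put(String event) {
        eventQueue.add(event);
    }

    public void stop() {
        stopped = true;                  // set the flag before interrupting
        eventHandlingThread.interrupt(); // wake the thread if it is blocked in take()
        try {
            eventHandlingThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
        }
        // Assumption: events still queued are handled by the stopping thread.
        String event;
        while ((event = eventQueue.poll()) != null) {
            handleEvent(event);
        }
    }

    private synchronized void handleEvent(String event) {
        handled.add(event);
    }

    public synchronized int handledCount() {
        return handled.size();
    }
}
```

Setting `stopped` before calling `interrupt()` is what makes clearing the flag safe in this sketch: whenever the handler thread consumes the interrupt, the flag it checks next is already set.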

        Luke Lu added a comment -

        The code looks correct but is a bit too tricky for application (vs. library) code. It needs a few more comments about ensuring that all remaining events are handled when the stop method interrupts the thread. I think we should probably refactor the event queue into the common event package and add some unit tests for it.

        Siddharth Seth added a comment -

        Added some comments about the interrupt status reset.

        Siddharth Seth added a comment -

        Removed part of the interrupted check + set which went in accidentally.

        Mahadev konar added a comment -

        I just committed this to MR-279. Thanks, Sid!


          People

          • Assignee: Siddharth Seth
          • Reporter: Siddharth Seth
          • Votes: 0
          • Watchers: 2
