Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.204.0
    • Fix Version/s: 1.1.0
    • Component/s: tasktracker
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      MR-2413 doesn't have any test coverage that eg tests that the TT can survive disk failure.

      1. MR2850.v1.patch
        9 kB
        Ravi Gummadi
      2. MR2850.v1.3.patch
        10 kB
        Eli Collins
      3. MR2850.v1.2.patch
        11 kB
        Ravi Gummadi
      4. MR2850.v1.1.patch
        9 kB
        Ravi Gummadi
      5. MR2850.v0.patch
        9 kB
        Ravi Gummadi

        Issue Links

          Activity

          Eli Collins created issue -
          Eli Collins made changes -
          Field Original Value New Value
          Link This issue relates to MAPREDUCE-2413 [ MAPREDUCE-2413 ]
          Eli Collins made changes -
          Parent MAPREDUCE-2657 [ 12497097 ]
          Issue Type Test [ 6 ] Sub-task [ 7 ]
          Ravi Gummadi made changes -
          Assignee Ravi Gummadi [ ravidotg ]
          Hide
          Ravi Gummadi added a comment -

          Attaching first-cut patch that adds unit test for MR-2413.
          The unit test checks if good mapred local dirs of TaskTracker are getting updated or not when disks fail. To mimic disk failure, I replaced the specific mapred-local-directory with a regular-file of the same name — thus leading to DiskChecker.checkDir() to fail.

          Please review the patch and provide your comments.

          Show
          Ravi Gummadi added a comment - Attaching first-cut patch that adds unit test for MR-2413. The unit test checks if good mapred local dirs of TaskTracker are getting updated or not when disks fail. To mimic disk failure, I replaced the specific mapred-local-directory with a regular-file of the same name — thus leading to DiskChecker.checkDir() to fail. Please review the patch and provide your comments.
          Ravi Gummadi made changes -
          Attachment MR2850.v0.patch [ 12494437 ]
          Hide
          Matt Foley added a comment -

          Eli, would you like to code review these?

          Show
          Matt Foley added a comment - Eli, would you like to code review these?
          Hide
          Eli Collins added a comment -

          Took a quick look:

          In prepareDirToFail it says the file is set to perms "000" but File#createNewFile uses the default perms (eg 644 with umask 022), so it should still be accessible right?

          If you want not always have waitForDiskHealthCheck wait for 10s at a time seems like you can lower the DISK_HEALTH_CHECK_INTERVAL to eg 1s.

          Would also be good to test startup with a failed directory. Feel free to punt this to MAPREDUCE-2921.

          Otherwise looks good.

          Show
          Eli Collins added a comment - Took a quick look: In prepareDirToFail it says the file is set to perms "000" but File#createNewFile uses the default perms (eg 644 with umask 022), so it should still be accessible right? If you want not always have waitForDiskHealthCheck wait for 10s at a time seems like you can lower the DISK_HEALTH_CHECK_INTERVAL to eg 1s. Would also be good to test startup with a failed directory. Feel free to punt this to MAPREDUCE-2921 . Otherwise looks good.
          Matt Foley made changes -
          Fix Version/s 0.20.205.0 [ 12316391 ]
          Target Version/s 0.20.206.0 [ 12317960 ]
          Hide
          Ravi Gummadi added a comment -

          >> In prepareDirToFail it says the file is set to perms "000" but File#createNewFile uses the default perms (eg 644 with umask 022), so it should still be accessible right?

          No. The comment was wrong. I replace the directory with a file so that DiskChecker.checkDirs() will fail because it tries to do mkdir with the same name and this will be reported as a disk failure. Updating the comment accordingly.

          >> If you want not always have waitForDiskHealthCheck wait for 10s at a time seems like you can lower the DISK_HEALTH_CHECK_INTERVAL to eg 1s.

          OK. Changing to 1 sec.

          >> Would also be good to test startup with a failed directory. Feel free to punt this to MAPREDUCE-2921.

          This change would need a handle into the code of MiniMRCluster.TaskTrackerRunner() and needs some refactoring and exposing some api in MiniMRCluster. So not doing it as part of current JIRA.

          Show
          Ravi Gummadi added a comment - >> In prepareDirToFail it says the file is set to perms "000" but File#createNewFile uses the default perms (eg 644 with umask 022), so it should still be accessible right? No. The comment was wrong. I replace the directory with a file so that DiskChecker.checkDirs() will fail because it tries to do mkdir with the same name and this will be reported as a disk failure. Updating the comment accordingly. >> If you want not always have waitForDiskHealthCheck wait for 10s at a time seems like you can lower the DISK_HEALTH_CHECK_INTERVAL to eg 1s. OK. Changing to 1 sec. >> Would also be good to test startup with a failed directory. Feel free to punt this to MAPREDUCE-2921 . This change would need a handle into the code of MiniMRCluster.TaskTrackerRunner() and needs some refactoring and exposing some api in MiniMRCluster. So not doing it as part of current JIRA.
          Hide
          Ravi Gummadi added a comment -

          Attaching new patch incorporating Eli's review comments.

          Show
          Ravi Gummadi added a comment - Attaching new patch incorporating Eli's review comments.
          Ravi Gummadi made changes -
          Attachment MR2850.v1.patch [ 12498733 ]
          Hide
          Eli Collins added a comment -

          +1 lgtm

          Nit: you can remove the isDead=true in MiniMRCluster line 219 now that you unconditionally set it later.

          Show
          Eli Collins added a comment - +1 lgtm Nit: you can remove the isDead=true in MiniMRCluster line 219 now that you unconditionally set it later.
          Hide
          Ravi Gummadi added a comment -

          Attaching new patch by removing the unnecessary line "isDead=true".

          test-patch passed on my local machine.

          Show
          Ravi Gummadi added a comment - Attaching new patch by removing the unnecessary line "isDead=true". test-patch passed on my local machine.
          Ravi Gummadi made changes -
          Attachment MR2850.v1.1.patch [ 12500162 ]
          Hide
          Eli Collins added a comment -

          +1

          Show
          Eli Collins added a comment - +1
          Hide
          Ravi Gummadi added a comment -

          I saw ConcurrentModificationException in LocalStorage.checkDirs() in the logs of the new testcase because we are removing item from list while iterating over the list.

          Attaching new patch here by fixing that code also as part of this JIRA.

          Please review and provide your comments.

          Show
          Ravi Gummadi added a comment - I saw ConcurrentModificationException in LocalStorage.checkDirs() in the logs of the new testcase because we are removing item from list while iterating over the list. Attaching new patch here by fixing that code also as part of this JIRA. Please review and provide your comments.
          Ravi Gummadi made changes -
          Attachment MR2850.v1.2.patch [ 12500405 ]
          Hide
          Eli Collins added a comment -

          Good catch wrt CME. How about using an iterator instead? See attached, also removes some of the changes from another patch that were inadvertently added.

          Show
          Eli Collins added a comment - Good catch wrt CME. How about using an iterator instead? See attached, also removes some of the changes from another patch that were inadvertently added.
          Eli Collins made changes -
          Attachment MR2850.v1.3.patch [ 12500574 ]
          Hide
          Ravi Gummadi added a comment -

          test-patch passed on my local machine.

          I just committed this to branch-0.20-security. Thanks Eli.

          Show
          Ravi Gummadi added a comment - test-patch passed on my local machine. I just committed this to branch-0.20-security. Thanks Eli.
          Eli Collins made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Resolution Fixed [ 1 ]
          Eli Collins made changes -
          Summary TaskTracker disk failure handling (MR-2413) has no test coverage Add test for TaskTracker disk failure handling (MR-2413)
          Eli Collins made changes -
          Link This issue is related to MAPREDUCE-3419 [ MAPREDUCE-3419 ]
          Matt Foley made changes -
          Fix Version/s 1.1.0 [ 12317960 ]
          Hide
          Matt Foley added a comment -

          Closed upon release of Hadoop-1.1.0.

          Show
          Matt Foley added a comment - Closed upon release of Hadoop-1.1.0.
          Matt Foley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Resolved Resolved
          77d 21h 38m 1 Eli Collins 03/Nov/11 20:15
          Resolved Resolved Closed Closed
          348d 22h 11m 1 Matt Foley 17/Oct/12 19:27

            People

            • Assignee:
              Ravi Gummadi
              Reporter:
              Eli Collins
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development