Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.204.0
    • Fix Version/s: 1.1.0
    • Component/s: tasktracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Target Version/s:

      Description

      MR-2413 doesn't have any test coverage that eg tests that the TT can survive disk failure.

      1. MR2850.v1.patch
        9 kB
        Ravi Gummadi
      2. MR2850.v1.3.patch
        10 kB
        Eli Collins
      3. MR2850.v1.2.patch
        11 kB
        Ravi Gummadi
      4. MR2850.v1.1.patch
        9 kB
        Ravi Gummadi
      5. MR2850.v0.patch
        9 kB
        Ravi Gummadi

        Issue Links

          Activity

          Hide
          Ravi Gummadi added a comment -

          Attaching first-cut patch that adds unit test for MR-2413.
          The unit test checks if good mapred local dirs of TaskTracker are getting updated or not when disks fail. To mimic disk failure, I replaced the specific mapred-local-directory with a regular-file of the same name — thus leading to DiskChecker.checkDir() to fail.

          Please review the patch and provide your comments.

          Show
          Ravi Gummadi added a comment - Attaching first-cut patch that adds unit test for MR-2413. The unit test checks if good mapred local dirs of TaskTracker are getting updated or not when disks fail. To mimic disk failure, I replaced the specific mapred-local-directory with a regular-file of the same name — thus leading to DiskChecker.checkDir() to fail. Please review the patch and provide your comments.
          Hide
          Matt Foley added a comment -

          Eli, would you like to code review these?

          Show
          Matt Foley added a comment - Eli, would you like to code review these?
          Hide
          Eli Collins added a comment -

          Took a quick look:

          In prepareDirToFail it says the file is set to perms "000" but File#createNewFile uses the default perms (eg 644 with umask 022), so it should still be accessible right?

          If you want not always have waitForDiskHealthCheck wait for 10s at a time seems like you can lower the DISK_HEALTH_CHECK_INTERVAL to eg 1s.

          Would also be good to test startup with a failed directory. Feel free to punt this to MAPREDUCE-2921.

          Otherwise looks good.

          Show
          Eli Collins added a comment - Took a quick look: In prepareDirToFail it says the file is set to perms "000" but File#createNewFile uses the default perms (eg 644 with umask 022), so it should still be accessible right? If you want not always have waitForDiskHealthCheck wait for 10s at a time seems like you can lower the DISK_HEALTH_CHECK_INTERVAL to eg 1s. Would also be good to test startup with a failed directory. Feel free to punt this to MAPREDUCE-2921 . Otherwise looks good.
          Hide
          Ravi Gummadi added a comment -

          >> In prepareDirToFail it says the file is set to perms "000" but File#createNewFile uses the default perms (eg 644 with umask 022), so it should still be accessible right?

          No. The comment was wrong. I replace the directory with a file so that DiskChecker.checkDirs() will fail because it tries to do mkdir with the same name and this will be reported as a disk failure. Updating the comment accordingly.

          >> If you want not always have waitForDiskHealthCheck wait for 10s at a time seems like you can lower the DISK_HEALTH_CHECK_INTERVAL to eg 1s.

          OK. Changing to 1 sec.

          >> Would also be good to test startup with a failed directory. Feel free to punt this to MAPREDUCE-2921.

          This change would need a handle into the code of MiniMRCluster.TaskTrackerRunner() and needs some refactoring and exposing some api in MiniMRCluster. So not doing it as part of current JIRA.

          Show
          Ravi Gummadi added a comment - >> In prepareDirToFail it says the file is set to perms "000" but File#createNewFile uses the default perms (eg 644 with umask 022), so it should still be accessible right? No. The comment was wrong. I replace the directory with a file so that DiskChecker.checkDirs() will fail because it tries to do mkdir with the same name and this will be reported as a disk failure. Updating the comment accordingly. >> If you want not always have waitForDiskHealthCheck wait for 10s at a time seems like you can lower the DISK_HEALTH_CHECK_INTERVAL to eg 1s. OK. Changing to 1 sec. >> Would also be good to test startup with a failed directory. Feel free to punt this to MAPREDUCE-2921 . This change would need a handle into the code of MiniMRCluster.TaskTrackerRunner() and needs some refactoring and exposing some api in MiniMRCluster. So not doing it as part of current JIRA.
          Hide
          Ravi Gummadi added a comment -

          Attaching new patch incorporating Eli's review comments.

          Show
          Ravi Gummadi added a comment - Attaching new patch incorporating Eli's review comments.
          Hide
          Eli Collins added a comment -

          +1 lgtm

          Nit: you can remove the isDead=true in MiniMRCluster line 219 now that you unconditionally set it later.

          Show
          Eli Collins added a comment - +1 lgtm Nit: you can remove the isDead=true in MiniMRCluster line 219 now that you unconditionally set it later.
          Hide
          Ravi Gummadi added a comment -

          Attaching new patch by removing the unnecessary line "isDead=true".

          test-patch passed on my local machine.

          Show
          Ravi Gummadi added a comment - Attaching new patch by removing the unnecessary line "isDead=true". test-patch passed on my local machine.
          Hide
          Eli Collins added a comment -

          +1

          Show
          Eli Collins added a comment - +1
          Hide
          Ravi Gummadi added a comment -

          I saw ConcurrentModificationException in LocalStorage.checkDirs() in the logs of the new testcase because we are removing item from list while iterating over the list.

          Attaching new patch here by fixing that code also as part of this JIRA.

          Please review and provide your comments.

          Show
          Ravi Gummadi added a comment - I saw ConcurrentModificationException in LocalStorage.checkDirs() in the logs of the new testcase because we are removing item from list while iterating over the list. Attaching new patch here by fixing that code also as part of this JIRA. Please review and provide your comments.
          Hide
          Eli Collins added a comment -

          Good catch wrt CME. How about using an iterator instead? See attached, also removes some of the changes from another patch that were inadvertently added.

          Show
          Eli Collins added a comment - Good catch wrt CME. How about using an iterator instead? See attached, also removes some of the changes from another patch that were inadvertently added.
          Hide
          Ravi Gummadi added a comment -

          test-patch passed on my local machine.

          I just committed this to branch-0.20-security. Thanks Eli.

          Show
          Ravi Gummadi added a comment - test-patch passed on my local machine. I just committed this to branch-0.20-security. Thanks Eli.
          Hide
          Matt Foley added a comment -

          Closed upon release of Hadoop-1.1.0.

          Show
          Matt Foley added a comment - Closed upon release of Hadoop-1.1.0.

            People

            • Assignee:
              Ravi Gummadi
              Reporter:
              Eli Collins
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development