Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.1
    • Fix Version/s: 1.1.2
    • Component/s: None
    • Labels: None
    • Target Version/s:

      Description

      Looks like the tests are extremely flaky and just hang.

      1. MAPREDUCE-4859.patch
        8 kB
        Tom White
      2. MAPREDUCE-4859.patch
        7 kB
        Tom White
      3. MAPREDUCE-4859.patch
        8 kB
        Arun C Murthy

        Activity

        acmurthy Arun C Murthy added a comment -

        Sigh, I give up.

        TestRecoveryManager is hopeless, mainly in the sense that it uses the confounded UtilsForTests, which is broken.

        testJobTrackerRestartsWithMissingJobFile & testJobTrackerRestartWithBadJobs hang on both Linux and MacOSX.
        testJobResubmission works on MacOSX and hangs on Linux, similar to the other two.

        I managed to track and fix one bug in testJobTrackerInfoCreation.

        I'll ignore them for 1.1.2 (sad to have a stable release with unit-test failures due to flaky test code) so we can revisit them.

        acmurthy Arun C Murthy added a comment -

        Fixed testJobTrackerInfoCreation, ignored others after converting them to junit4.

        mattf Matt Foley added a comment -

        +1. Please commit to branch-1 and branch-1.1. Thanks, Arun!

        acmurthy Arun C Murthy added a comment -

        I just committed this. Thanks for the quick check Matt!

        tomwhite Tom White added a comment -

        I had a look at the failing tests. A couple of the tests were hanging because they don't wait for the jobs to complete, so the tasktrackers never exit and mini cluster shutdown waits forever. testJobResubmission was failing due to a race where the old TIP gets removed while the recovered TIP is running so the TT thinks it has never completed. I also made the output directories unique since there were occasional clashes between tests despite the test directory being deleted each time.

        Tests pass for me on Mac and Linux.

        Matt/Arun - can you see if the patch works for you please?
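
A minimal sketch of the unique-output-directory idea described above, written in plain Java rather than taken from the patch. The class and method names (UniqueTestDirs, newOutputDir) are hypothetical, and the actual patch may do this differently. The path is deliberately not created, on the assumption that the job creates its own output directory.

```java
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper, not part of Hadoop or of the attached patch.
final class UniqueTestDirs {
    private UniqueTestDirs() {}

    // Returns a path under testRoot that no other call will return,
    // so two tests cannot clash on the same output directory.
    // The directory itself is not created here.
    static Path newOutputDir(Path testRoot, String testName) {
        return testRoot.resolve(testName + "-" + UUID.randomUUID());
    }
}
```

A test would call something like UniqueTestDirs.newOutputDir(root, "testJobResubmission") once per job and pass the result as the job's output path.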

        tucu00 Alejandro Abdelnur added a comment -

        Tom,

        The changes in the patch look good. I'm able to run the test successfully on OS X; on CentOS 6, however, it fails with:

        Testcase: testJobTrackerRestartsWithMissingJobFile took 99.749 sec
        Testcase: testJobResubmission took 16.062 sec
        Testcase: testJobTrackerRestartWithBadJobs took 64.081 sec
        Testcase: testRestartCount took 5.711 sec
        Testcase: testJobTrackerInfoCreation took 2.726 sec
        	Caused an ERROR
        null
        java.lang.NullPointerException
        	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:426)
        	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:284)
        	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:124)
        	at org.apache.hadoop.mapred.TestRecoveryManager.testJobTrackerInfoCreation(TestRecoveryManager.java:454)
        
        tucu00 Alejandro Abdelnur added a comment -

        Another thing, though I'd say it's for another JIRA: we should add a timeout of a few minutes to the while (!jip...) loops (5 of them) and fail, so we don't have to wait until the build timeout kicks in.
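
The timeout suggestion above could look roughly like the following. This is only a sketch under assumptions: TimedWait and waitFor are hypothetical names, the condition is passed in as a generic BooleanSupplier instead of the actual loop condition from the test, and it uses Java 8 lambdas, which branch-1 may not have had available.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Hadoop or of the attached patch.
final class TimedWait {
    private TimedWait() {}

    // Polls the condition until it holds or the timeout expires.
    // Returns true if the condition became true, false on timeout
    // or if the waiting thread was interrupted.
    static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the caller should fail the test
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }
}
```

A test would then fail explicitly when waitFor returns false, instead of looping until the build timeout.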

        tomwhite Tom White added a comment -

        Alejandro, does testJobTrackerInfoCreation fail without the patch on Centos 6? I ran it on Centos 5.5 and it passed.

        Regarding the timeouts I've created a new patch with per-test timeouts.

        tucu00 Alejandro Abdelnur added a comment -

        Tom, HEAD of branch-1 fails on Centos6 with the same error. So it seems to be a problem either in my Centos6 setup or in Centos6 in general.

        +1

        vicaya Luke Lu added a comment -

        Alejandro, make sure your umask is 022, otherwise (say umask 002) you'll see this dfs cluster NPE due to lack of "valid" data stores. This is obviously a workaround that needs to be fixed in a separate JIRA.
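
For context on the umask remark: with umask 022 a newly created directory typically ends up as rwxr-xr-x (777 masked by 022), while umask 002 yields rwxrwxr-x. The sketch below shows one way a test setup could detect the mismatch up front. It assumes the data directory check expects rwxr-xr-x, which this thread does not state, and DataDirPermissionCheck is a hypothetical name.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper, not part of Hadoop.
final class DataDirPermissionCheck {
    private DataDirPermissionCheck() {}

    // Compares a directory's actual permissions with an expected
    // "rwxr-xr-x"-style string. The expected value is an assumption
    // supplied by the caller, not something read from Hadoop.
    static boolean hasExpectedPermissions(Set<PosixFilePermission> actual, String expected) {
        return actual.equals(PosixFilePermissions.fromString(expected));
    }
}
```

On a POSIX file system the actual set can be read with java.nio.file.Files.getPosixFilePermissions(dir) after creating a scratch directory.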

        tucu00 Alejandro Abdelnur added a comment - - edited

        Yep, the umask did it, now the test passes on HEAD and with the patch. Thx Luke.

        asanjar Amir Sanjar added a comment -

        Is there a patch for release 1.1.1? Thanks.

        tomwhite Tom White added a comment -

        I committed the updated fix to branch-1 and branch-1.1.

        asanjar Amir Sanjar added a comment -

        thanks Tom

        mattf Matt Foley added a comment -

        Closed upon successful release of 1.1.2.


          People

          • Assignee:
            acmurthy Arun C Murthy
            Reporter:
            acmurthy Arun C Murthy
          • Votes:
            0
            Watchers:
            7
