Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.1
    • Fix Version/s: 1.1.2
    • Component/s: None
    • Labels:
      None
    • Target Version/s:

      Description

      Looks like the tests are extremely flaky and just hang.

      1. MAPREDUCE-4859.patch
        8 kB
        Tom White
      2. MAPREDUCE-4859.patch
        7 kB
        Tom White
      3. MAPREDUCE-4859.patch
        8 kB
        Arun C Murthy

        Activity

        Hide
        Matt Foley added a comment -

        Closed upon successful release of 1.1.2.

        Show
        Matt Foley added a comment - Closed upon successful release of 1.1.2.
        Hide
        Amir Sanjar added a comment -

        thanks Tom

        Show
        Amir Sanjar added a comment - thanks Tom
        Hide
        Tom White added a comment -

        I committed the updated fix to branch-1 and branch-1.1.

        Show
        Tom White added a comment - I committed the updated fix to branch-1 and branch-1.1.
        Hide
        Amir Sanjar added a comment -

        is there a patch for release 1.1.1? .. thanks

        Show
        Amir Sanjar added a comment - is there a patch for release 1.1.1? .. thanks
        Hide
        Alejandro Abdelnur added a comment - - edited

        Yep, the umask did it, now the test passes on HEAD and with the patch. Thx Luke.

        Show
        Alejandro Abdelnur added a comment - - edited Yep, the umask did it, now the test passes on HEAD and with the patch. Thx Luke.
        Hide
        Luke Lu added a comment -

        Alejandro, make sure your umask is 022, otherwise (say umask 002) you'll see this dfs cluster NPE due to lack of "valid" data stores. This is obviously a workaround that needs to be fixed in a separate JIRA.

        Show
        Luke Lu added a comment - Alejandro, make sure your umask is 022, otherwise (say umask 002) you'll see this dfs cluster NPE due to lack of "valid" data stores. This is obviously a workaround that needs to be fixed in a separate JIRA.
        Hide
        Alejandro Abdelnur added a comment -

        Tom, HEAD of branch-1 fails on Centos6 with the same error. So it seems either a problem in my Centos6 setup on in Centos6 in general.

        +1

        Show
        Alejandro Abdelnur added a comment - Tom, HEAD of branch-1 fails on Centos6 with the same error. So it seems either a problem in my Centos6 setup on in Centos6 in general. +1
        Hide
        Tom White added a comment -

        Alejandro, does testJobTrackerInfoCreation fail without the patch on Centos 6? I ran it on Centos 5.5 and it passed.

        Regarding the timeouts I've created a new patch with per-test timeouts.

        Show
        Tom White added a comment - Alejandro, does testJobTrackerInfoCreation fail without the patch on Centos 6? I ran it on Centos 5.5 and it passed. Regarding the timeouts I've created a new patch with per-test timeouts.
        Hide
        Alejandro Abdelnur added a comment -

        Another thing, but I'd say for another JIRA, the while (!jip...) loops (5 of them), we should add a timeout there of a few mins and fail, not to have to wait until the build time out kicks in.

        Show
        Alejandro Abdelnur added a comment - Another thing, but I'd say for another JIRA, the while (!jip...) loops (5 of them), we should add a timeout there of a few mins and fail, not to have to wait until the build time out kicks in.
        Hide
        Alejandro Abdelnur added a comment -

        Tom,

        Changes in the patch look good. I'm able to run the test successfully in OSX, however, in Centos6 fails with

        Testcase: testJobTrackerRestartsWithMissingJobFile took 99.749 sec
        Testcase: testJobResubmission took 16.062 sec
        Testcase: testJobTrackerRestartWithBadJobs took 64.081 sec
        Testcase: testRestartCount took 5.711 sec
        Testcase: testJobTrackerInfoCreation took 2.726 sec
        	Caused an ERROR
        null
        java.lang.NullPointerException
        	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:426)
        	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:284)
        	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:124)
        	at org.apache.hadoop.mapred.TestRecoveryManager.testJobTrackerInfoCreation(TestRecoveryManager.java:454)
        
        Show
        Alejandro Abdelnur added a comment - Tom, Changes in the patch look good. I'm able to run the test successfully in OSX, however, in Centos6 fails with Testcase: testJobTrackerRestartsWithMissingJobFile took 99.749 sec Testcase: testJobResubmission took 16.062 sec Testcase: testJobTrackerRestartWithBadJobs took 64.081 sec Testcase: testRestartCount took 5.711 sec Testcase: testJobTrackerInfoCreation took 2.726 sec Caused an ERROR null java.lang.NullPointerException at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:426) at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:284) at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:124) at org.apache.hadoop.mapred.TestRecoveryManager.testJobTrackerInfoCreation(TestRecoveryManager.java:454)
        Hide
        Tom White added a comment -

        I had a look at the failing tests. A couple of the tests were hanging because they don't wait for the jobs to complete, so the tasktrackers never exit and mini cluster shutdown waits forever. testJobResubmission was failing due to a race where the old TIP gets removed while the recovered TIP is running so the TT thinks it has never completed. I also made the output directories unique since there were occasional clashes between tests despite the test directory being deleted each time.

        Tests pass for me on Mac and Linux.

        Matt/Arun - can you see if the patch works for you please?

        Show
        Tom White added a comment - I had a look at the failing tests. A couple of the tests were hanging because they don't wait for the jobs to complete, so the tasktrackers never exit and mini cluster shutdown waits forever. testJobResubmission was failing due to a race where the old TIP gets removed while the recovered TIP is running so the TT thinks it has never completed. I also made the output directories unique since there were occasional clashes between tests despite the test directory being deleted each time. Tests pass for me on Mac and Linux. Matt/Arun - can you see if the patch works for you please?
        Hide
        Arun C Murthy added a comment -

        I just committed this. Thanks for the quick check Matt!

        Show
        Arun C Murthy added a comment - I just committed this. Thanks for the quick check Matt!
        Hide
        Matt Foley added a comment -

        +1. Please commit to branch-1 and branch-1.1. Thanks, Arun!

        Show
        Matt Foley added a comment - +1. Please commit to branch-1 and branch-1.1. Thanks, Arun!
        Hide
        Arun C Murthy added a comment -

        Fixed testJobTrackerInfoCreation, ignored others after converting them to junit4.

        Show
        Arun C Murthy added a comment - Fixed testJobTrackerInfoCreation, ignored others after converting them to junit4.
        Hide
        Arun C Murthy added a comment -

        Sigh, I give up.

        TestRecoveryManager is hopeless. Mainly in the sense that it uses the confounded UtilsForTests which are broken.

        testJobTrackerRestartsWithMissingJobFile & testJobTrackerRestartWithBadJobs hang on both Linux and MacOSX.
        testJobResubmission works on MacOSX and hangs on Linux similar to the other two.

        I managed to track and fix one bug in testJobTrackerInfoCreation.

        I'll ignore them for 1.1.2 (sad to have a stable release with unit-test failures due to flaky test code) so we can revisit them.

        Show
        Arun C Murthy added a comment - Sigh, I give up. TestRecoveryManager is hopeless. Mainly in the sense that it uses the confounded UtilsForTests which are broken. testJobTrackerRestartsWithMissingJobFile & testJobTrackerRestartWithBadJobs hang on both Linux and MacOSX. testJobResubmission works on MacOSX and hangs on Linux similar to the other two. I managed to track and fix one bug in testJobTrackerInfoCreation. I'll ignore them for 1.1.2 (sad to have a stable release with unit-test failures due to flaky test code) so we can revisit them.

          People

          • Assignee:
            Arun C Murthy
            Reporter:
            Arun C Murthy
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development