Hadoop HDFS / HDFS-1878

TestHDFSServerPorts unit test failure - race condition in FSNamesystem.close() causes NullPointerException without serious consequence

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.20.204.0
    • Fix Version/s: 0.20.204.0
    • Component/s: namenode
    • Labels:
      None

      Description

      In 0.20.204, TestHDFSServerPorts was observed to intermittently throw a NullPointerException. The exception occurs only when FSNamesystem.close() is called, which happens at Namenode shutdown, so it is not a serious bug for .204. TestHDFSServerPorts is more likely than normal execution to trigger the race because it runs two Namenodes in the same JVM, which produces more thread interleaving.

      The race is in FSNamesystem.close(). At line 566 we have:
      if (replthread != null) replthread.interrupt();
      if (replmon != null) replmon = null;

      Because close() does not wait for the interrupted replthread to exit, replmon can be set to null while replthread is still running. replthread references replmon in computeDatanodeWork(), which is where the NullPointerException occurs.
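The interleaving described above can be reproduced deterministically in a small sketch. This is not the FSNamesystem code: the class RacyNamesystem, the latch, and the sawNullPointer flag are hypothetical scaffolding, and only the names replthread, replmon, close() and computeDatanodeWork() come from this report. The latch stands in for "the worker is still mid-iteration when close() runs".

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for FSNamesystem, reduced to the two fields involved.
class RacyNamesystem {
    private volatile Object replmon = new Object();
    private final CountDownLatch closed = new CountDownLatch(1);
    private final Thread replthread;
    volatile boolean sawNullPointer = false;

    RacyNamesystem() {
        replthread = new Thread(() -> {
            // Model a worker that is interrupted but has not exited yet:
            // it swallows the interrupt and keeps going until close() is done.
            while (closed.getCount() > 0) {
                try {
                    closed.await();
                } catch (InterruptedException e) {
                    // interrupt observed; the thread is still running
                }
            }
            try {
                computeDatanodeWork();
            } catch (NullPointerException e) {
                sawNullPointer = true;
            }
        });
    }

    private int computeDatanodeWork() {
        return replmon.hashCode(); // dereferences the field that close() nulled
    }

    void start() {
        replthread.start();
    }

    void close() {
        // The two statements quoted in the report:
        if (replthread != null) replthread.interrupt();
        if (replmon != null) replmon = null;

        // Scaffolding only: release the worker, then wait so callers can inspect the flag.
        closed.countDown();
        try {
            replthread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling start() and then close() on this sketch leaves sawNullPointer set, because interrupt() only requests that the worker stop; it does not guarantee the worker has stopped before the next statement runs.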

      The solution is either to wait on replthread before nulling replmon, or simply not to null replmon. The latter is preferred, since close() does not wait on any of the sibling Namenode processing threads either.
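Both options can be sketched as follows. This is an illustration under assumptions, not the attached patch: SaferNamesystem, closeAndWait(), workerExited() and the sawNullPointer flag are hypothetical names introduced here. close() shows the preferred option (interrupt, leave replmon alone), and closeAndWait() shows the alternative (interrupt, then join before releasing anything).

```java
// Hypothetical stand-in for FSNamesystem showing the two fixes discussed above.
class SaferNamesystem {
    private final Object replmon = new Object(); // never nulled, so it can be final
    private final Thread replthread;
    volatile boolean sawNullPointer = false;

    SaferNamesystem() {
        replthread = new Thread(() -> {
            // Worker loop: keep computing until interrupted.
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    computeDatanodeWork();
                } catch (NullPointerException e) {
                    sawNullPointer = true;
                    return;
                }
            }
        });
    }

    private int computeDatanodeWork() {
        return replmon.hashCode();
    }

    void start() {
        replthread.start();
    }

    // Preferred option: interrupt the worker and do not touch replmon.
    void close() {
        if (replthread != null) replthread.interrupt();
    }

    // Alternative: wait for the worker to exit; only after join() returns
    // would it be safe to release state the worker dereferences.
    void closeAndWait() {
        replthread.interrupt();
        workerExited(5000);
    }

    // Returns true if the worker has terminated within the given timeout.
    boolean workerExited(long timeoutMillis) {
        try {
            replthread.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !replthread.isAlive();
    }
}
```

The first variant matches how the report describes the sibling threads being handled in close(): nothing is waited on, so nothing the worker might still touch is torn down.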

      I'll attach a patch for .205.

      1. 1878-1.patch
        0.7 kB
        Matt Foley

        Activity

        Matt Foley created issue -
        Matt Foley made changes -
        Field Original Value New Value
        Description: added the leading "In 20.204," to the first sentence; the rest of the text is unchanged.
        Matt Foley added a comment -

        Not turning on "Submit Patch" since patch is for 20.205.

        Matt Foley made changes -
        Attachment 1878-1.patch [ 12478020 ]
        Matt Foley made changes -
        Summary: changed from "race condition in FSNamesystem.close() causes NullPointerException without serious consequence - TestHDFSServerPorts unit test failure" to "TestHDFSServerPorts unit test failure - race condition in FSNamesystem.close() causes NullPointerException without serious consequence"
        Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

        Eli Collins added a comment -

        Does this affect trunk?

        Matt Foley added a comment -

        No, this doesn't affect trunk. I didn't port the QueueProcessingStatistics code that this bug relates to into v22, because after some experience with it I concluded it was better to just use simple logs.

        Matt Foley added a comment -

        Committed to 0.20-security and 0.20-security-205.

        Matt Foley made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Owen O'Malley made changes -
        Fix Version/s 0.20.204.0 [ 12316319 ]
        Owen O'Malley made changes -
        Fix Version/s 0.20.205.0 [ 12316392 ]
        Owen O'Malley added a comment -

        Hadoop 0.20.204.0 was released.

        Owen O'Malley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Matt Foley
            Reporter:
            Matt Foley
          • Votes:
            0
            Watchers:
            2
