Hadoop Map/Reduce
MAPREDUCE-437

JobTracker must ask for a new FS instance and close it when terminated.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.20.1, 0.21.0, 0.22.0
    • Fix Version/s: 0.22.1
    • Component/s: jobtracker
    • Labels: None

      Description

      This is something I've run into while experimenting with HADOOP-3268; I'm not sure what the right action is here.

      -currently, the JobTracker does not close() its filesystem when it is shut down. This will cause it to leak filesystem references if JobTrackers are started and stopped in the same process.

      -The TestMRServerPorts test explicitly closes the filesystem:
      jt.fs.close();
      jt.stopTracker();

      -If you move the close() operation into the stopTracker()/terminate logic, the filesystem gets cleaned up, but
      TestRackAwareTaskPlacement and TestMultipleLevelCaching fail with a FilesystemClosed error (stack traces to follow)

      Should the JobTracker close its filesystem whenever it is terminated? If so, there are some tests that need to be reworked slightly to not expect the filesystem to be live after the jobtracker is taken down.
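
      To make the caching behaviour concrete, here is a minimal sketch (not from any patch; it assumes the default filesystem is a running HDFS instance) of why closing a filesystem obtained from FileSystem.get() is risky:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class FsCacheSketch {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fsA = FileSystem.get(conf);  // cached, JVM-wide shared instance
          FileSystem fsB = FileSystem.get(conf);  // same object as fsA
          fsA.close();                            // closes the single shared instance
          fsB.exists(new Path("/"));              // throws IOException: "Filesystem closed"
        }
      }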


          Activity

          Steve Loughran added a comment -

          implicitly fixed post YARN as the JT is dynamically created in its own process

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12470964/MAPREDUCE-437.patch
          against trunk revision 1075422.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/99//console

          This message is automatically generated.

          Tom White added a comment -

          This looks good, except the patch seems to be reversed. Also, there's a comment in the test saying "repeat the call", which looks incomplete or redundant.

          Steve Loughran added a comment -

          This patch
          1. creates a new private FS instance in the job tracker when needed.
           2. closes and sets this instance to null when the JT is shut down.
          3. tests both operations.
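
           Roughly, the change amounts to the following (a sketch only; the field and method names here are illustrative, not necessarily those in the actual patch):

           // inside JobTracker; 'fs' is the tracker's private filesystem reference
           private FileSystem fs;

           private synchronized FileSystem getFs(Configuration conf) throws IOException {
             if (fs == null) {
               // a private, uncached instance: closing it cannot break other FS users in the JVM
               fs = FileSystem.newInstance(FileSystem.getDefaultUri(conf), conf);
             }
             return fs;
           }

           synchronized void stopTracker() throws IOException {
             // ... existing shutdown logic ...
             if (fs != null) {
               fs.close();  // safe, as nothing else holds this instance
               fs = null;
             }
           }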

          Steve Loughran added a comment -

          Patch including tests that
           -check the FS instance is not shared
          -check the FS is closed and set to null on shutdown.
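
           In outline, the checks are along these lines (a sketch only; it assumes a running JobTracker handle jt, its Configuration conf, and package-private access to jt.fs, as TestMRServerPorts already uses):

           FileSystem cached = FileSystem.get(conf);      // the JVM-wide cached instance
           assertNotSame("JT must not use the shared cached FS", cached, jt.fs);

           jt.stopTracker();                              // shut the tracker down
           assertNull("JT should null its FS reference on shutdown", jt.fs);
           assertTrue("the cached FS must still be usable", cached.exists(new Path("/")));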

          Steve Loughran added a comment -

          changing title, marking versions it affects. Leaving as minor as this will not show up in production if you start the JT in its own VM

          Steve Loughran added a comment -

          Reviewing the code in trunk, the problem is a bit more serious and relates to what happens when a cached FS instance is closed: everyone who has a reference to that instance cannot use the filesystem.

          this does not normally surface in production as the JT runs in its own VM. It does exist in MiniMR clusters, in testing, but hasn't shown up because nobody other than me has tried to shut down an FS instance while the JT is still live.

          Proposed actions
          1-rename this issue to be more explicit: JT must ask for a new FS instance and close it when terminated.
          2-add a test to verify that a miniMR cluster will fail if you get the same instance and close it
          3-have the JT get a new instance on startup/going live and verify that test 2 now passes
          4-have the JT close its filesystem on shutdown, set its local reference to null
          I can't think of an easy way to test #4 unless there is a method to get the JT filesystem reference

          Steve Loughran added a comment -

          The HDFS-925 patch is what I used to see who else was closing the shared instance

          Vinod Kumar Vavilapalli added a comment -

           +1 for solution #2. I think this should be THE general pattern in all the services, but I guess fixing that is a superset of this issue...

          steve_l added a comment -

           The test failures seen when the JT is set to close its filesystem on shutdown are triggered by the FS caching.

          Two solutions

           1. load these tests using the HADOOP-6231 trick, with a config that sets fs.hdfs.impl.disable.cache=true. This fixes the tests but leaves the problem lurking around
          2. have the JT connect to the filesystem using FileSystem.newInstance() instead of FileSystem.get()

           I would favour #2, as I can't see any reason why you'd want to share the JT's filesystem reference with anything else running in the same VM, including JUnit tests; sharing it only changes system behaviour. With a move over to newInstance() the JT can close the filesystem on termination without any concern about adverse consequences.
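
           Both options in code, for reference (a sketch; the property name is the one introduced by HADOOP-6231):

           // Option 1: keep FileSystem.get(), but disable hdfs:// caching in the test config
           Configuration conf = new Configuration();
           conf.setBoolean("fs.hdfs.impl.disable.cache", true);
           FileSystem fs1 = FileSystem.get(conf);   // uncached with the flag set; fixes the tests only

           // Option 2 (preferred): the JT asks for its own instance explicitly
           FileSystem fs2 = FileSystem.newInstance(FileSystem.getDefaultUri(conf), conf);
           fs2.close();                             // safe: nothing else holds fs2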

          steve_l added a comment -

           Both these failures are triggered by the same event: inside launchJobAndTestCounters the jobtracker gets terminated. If it is set to shut down the filesystem client, then in DFSClient.close() the RPC proxy gets closed:

           public synchronized void close() throws IOException {
             checkOpen();
             clientRunning = false;
             leasechecker.close();
             // close connections to the namenode
             RPC.stopProxy(rpcNamenode);
           }

           and with that, filesystem access to that namenode is gone across the entire JVM, which seems a bit of overkill.

          steve_l added a comment -

          stack traces of tests that fail once the JobTracker closes its filesystem when terminated

          TestMultipleLevelCaching testMultiLevelCaching Error Filesystem closed

          java.io.IOException: Filesystem closed
          at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
          at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:537)
          at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:201)
          at org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:116)
          at org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69)

          TestRackAwareTaskPlacement testTaskPlacement Error Filesystem closed

          java.io.IOException: Filesystem closed
          at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
          at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:574)
          at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:400)
          at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:651)
          at org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78)
          at org.apache.hadoop.mapred.TestRackAwareTaskPlacement.testTaskPlacement(TestRackAwareTaskPlacement.java:156)


            People

            • Assignee: Steve Loughran
            • Reporter: Steve Loughran
            • Votes: 0
            • Watchers: 2


                Time Tracking

                • Original Estimate: 1.5h
                • Remaining Estimate: 0h
                • Time Spent: 2h
