  1. Hadoop YARN
  2. YARN-5438

TimelineClientImpl leaking FileSystem instances, causing long-running services like the HiveServer2 daemon to go OOM

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.0, 2.7.3
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: timelineserver
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      In org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl, FileSystem.newInstance is invoked and the resulting instance is never closed. Over time, FileSystem instances accumulate in long-running clients (like HiveServer2), eventually causing them to go OOM.
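
      A minimal sketch of the leak pattern described above, and of the kind of fix discussed in this issue (closing the instance when the client stops). FakeFileSystem, LeakyClient and FixedClient are hypothetical stand-ins, not Hadoop classes; only the create-without-close shape is taken from the description.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Demo entry point: simulates a long-running service that starts and
// stops many short-lived clients.
class LeakDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            LeakyClient c = new LeakyClient();
            c.start();
            c.stop();
        }
        System.out.println("open after leaky clients: " + FakeFileSystem.openCount());

        for (int i = 0; i < 1000; i++) {
            FixedClient c = new FixedClient();
            c.start();
            c.stop();
        }
        System.out.println("open after fixed clients: " + FakeFileSystem.openCount());
    }
}

// Hypothetical stand-in for a file system handle; counts instances that
// have been created but not yet closed.
class FakeFileSystem implements AutoCloseable {
    private static final AtomicInteger OPEN = new AtomicInteger();
    private boolean closed;

    private FakeFileSystem() {
        OPEN.incrementAndGet();
    }

    // Models a newInstance-style factory: every call returns a fresh object.
    static FakeFileSystem newInstance() {
        return new FakeFileSystem();
    }

    static int openCount() {
        return OPEN.get();
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            OPEN.decrementAndGet();
        }
    }
}

// Creates an instance on start and never closes it (the reported pattern).
class LeakyClient {
    private FakeFileSystem fs;

    void start() {
        fs = FakeFileSystem.newInstance();
    }

    void stop() {
        // fs is dropped without close(), so it stays counted as open.
    }
}

// Same client, but stop() closes the instance it created.
class FixedClient {
    private FakeFileSystem fs;

    void start() {
        fs = FakeFileSystem.newInstance();
    }

    void stop() {
        if (fs != null) {
            fs.close();
            fs = null;
        }
    }
}
```

      In this model the open count grows by one for every LeakyClient that is started and stopped, and does not grow for FixedClient.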

      Attachments

        1. YARN-5438.0.patch
          0.9 kB
          Rohith Sharma K S

        Activity

          rohithsharma Rohith Sharma K S added a comment - Updated patch for closing the FileSystem while stopping TimelineClient
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 15s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 6m 39s trunk passed
          +1 compile 0m 26s trunk passed
          +1 checkstyle 0m 18s trunk passed
          +1 mvnsite 0m 30s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 0m 55s trunk passed
          +1 javadoc 0m 28s trunk passed
          +1 mvninstall 0m 28s the patch passed
          +1 compile 0m 23s the patch passed
          +1 javac 0m 23s the patch passed
          +1 checkstyle 0m 15s the patch passed
          +1 mvnsite 0m 26s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 0s the patch passed
          +1 javadoc 0m 26s the patch passed
          +1 unit 2m 16s hadoop-yarn-common in the patch passed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          15m 59s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12820503/YARN-5438.0.patch
          JIRA Issue YARN-5438
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 872a7339fda9 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 54fe17a
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12523/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12523/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.


          jlowe Jason Darrell Lowe added a comment -

          Thanks for the patch, Rohith! This probably works for the HiveServer2 case iff the server never tries to use the filesystem after the timeline client is closed. However, the timeline client is not just used by HS2, and I think this patch will be problematic for any code that could still use the filesystem after the timeline client is closed. Since the filesystem cache will implicitly link what looks like two separate creations of a filesystem to a single instance, closing one will break any subsequent use of the other.

          This makes me think HS2 is missing a closeAllForUGI call somewhere, to make sure that when it's done with a certain user it cleans up all the filesystems associated with that user. It also makes me wonder why we haven't implemented a reference-counting cache for the filesystem by now.
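
          To illustrate the reference-counting idea raised in the comment above, here is a rough sketch in plain Java. It is not something Hadoop provides; RefCountingCache, Handle and the string keys are all hypothetical names chosen for this example. The point is only that each caller closes its own handle, and the shared resource is torn down when the last handle is released.

```java
import java.util.HashMap;
import java.util.Map;

// Demo entry point: two callers share one cached resource.
class RefCountDemo {
    public static void main(String[] args) {
        RefCountingCache cache = new RefCountingCache();
        RefCountingCache.Handle first = cache.acquire("hdfs://nn");
        RefCountingCache.Handle second = cache.acquire("hdfs://nn");

        first.close();
        System.out.println("second usable after first close: " + second.isUsable());

        second.close();
        System.out.println("entries left in cache: " + cache.size());
    }
}

// Hypothetical cache that hands out handles and counts them per key.
class RefCountingCache {
    // Shared state behind all handles for one key.
    private static final class Shared {
        int refs;
        boolean closed;
    }

    private final Map<String, Shared> entries = new HashMap<>();

    // Returns a new handle onto the shared resource for this key,
    // creating the resource on first use.
    synchronized Handle acquire(String key) {
        Shared s = entries.computeIfAbsent(key, k -> new Shared());
        s.refs++;
        return new Handle(key, s);
    }

    // Drops one reference; the resource is closed and evicted only when
    // the count reaches zero.
    private synchronized void release(String key, Shared s) {
        s.refs--;
        if (s.refs == 0) {
            s.closed = true;
            entries.remove(key, s);
        }
    }

    synchronized int size() {
        return entries.size();
    }

    // Per-caller handle; closing it never affects other callers' handles.
    final class Handle implements AutoCloseable {
        private final String key;
        private final Shared shared;
        private boolean released;

        private Handle(String key, Shared shared) {
            this.key = key;
            this.shared = shared;
        }

        boolean isUsable() {
            synchronized (RefCountingCache.this) {
                return !released && !shared.closed;
            }
        }

        @Override
        public void close() {
            synchronized (RefCountingCache.this) {
                if (released) {
                    return; // closing twice is a no-op
                }
                released = true;
                release(key, shared);
            }
        }
    }
}
```

          With a scheme like this, closing one handle would not break another holder of the same cached resource, which is the hazard the comment describes for a plain shared cache.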

          rohithsharma Rohith Sharma K S added a comment -

          "Since the filesystem cache will implicitly link what looks like two separate creations of a filesystem to a single instance, closing one will break any subsequent use of the other."

          If the user creates a file system object using the FileSystem#newInstance API within the JVM, a new FS object is always returned. For every newInstance call, the object is created using the combination of URI, Conf and UniqueKey. If the FS object is created using FS#get, the API looks it up in the cache; that API always creates the object with the combination of URI and Conf only. So what mainly matters is how the FS object is created.
          Basically, closing an instance created using FileSystem#newInstance should not affect another FS object created using FS#get. Note also that if two FS objects are created using FS#get, closing one will definitely affect the other.
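
          The distinction drawn in the comment above can be sketched with a toy cache. This is a simplified model, not Hadoop's implementation: ToyFsCache and ToyFs are hypothetical, and the key is reduced to a URI string plus a counter, whereas the real cache key has more components. It shows a get-style lookup sharing one instance and a newInstance-style call always getting its own.

```java
import java.util.HashMap;
import java.util.Map;

// Demo entry point: contrasts shared (get) and private (newInstance) objects.
class CacheDemo {
    public static void main(String[] args) {
        ToyFsCache cache = new ToyFsCache();
        ToyFs a = cache.get("hdfs://nn");
        ToyFs b = cache.get("hdfs://nn");         // same object as a
        ToyFs own = cache.newInstance("hdfs://nn"); // always a fresh object

        own.close();
        System.out.println("a closed after closing own: " + a.isClosed());

        a.close();
        System.out.println("b closed after closing a: " + b.isClosed());
    }
}

// Hypothetical stand-in for a cached file system object.
class ToyFs {
    private final ToyFsCache cache;
    private final String key;
    private boolean closed;

    ToyFs(ToyFsCache cache, String key) {
        this.cache = cache;
        this.key = key;
    }

    boolean isClosed() {
        return closed;
    }

    // Closing marks the object unusable and removes it from the cache.
    void close() {
        if (closed) {
            return;
        }
        closed = true;
        cache.forget(key, this);
    }
}

// Toy cache: get() keys on the URI alone, newInstance() adds a unique suffix.
class ToyFsCache {
    private final Map<String, ToyFs> map = new HashMap<>();
    private long counter; // source of unique suffixes, starts at 0

    // Shared lookup: callers with the same URI receive the same object.
    ToyFs get(String uri) {
        return map.computeIfAbsent(uri + "#0", k -> new ToyFs(this, k));
    }

    // Unique key per call, so the returned object is never shared.
    ToyFs newInstance(String uri) {
        String key = uri + "#" + (++counter);
        ToyFs fs = new ToyFs(this, key);
        map.put(key, fs);
        return fs;
    }

    void forget(String key, ToyFs fs) {
        map.remove(key, fs);
    }

    int size() {
        return map.size();
    }
}
```

          Note that in this model an object from newInstance stays referenced by the cache until it is closed, which is one way an unclosed instance can accumulate in a long-running process.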

          jlowe Jason Darrell Lowe added a comment -

          Ah, thanks Rohith. My bad, I missed that it was creating the filesystem in a way that essentially avoids the cache.

          +1 lgtm. Will commit this tomorrow if there are no objections.

          jlowe Jason Darrell Lowe added a comment - Thanks, Rohith! I committed this to trunk, branch-2, and branch-2.8.
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #10172 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10172/)
          YARN-5438. TimelineClientImpl leaking FileSystem Instances causing Long (jlowe: rev a1890c32c52fed69ec09efad0fccf49ed8c2e21e)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/FileSystemTimelineWriter.java

          People

            Assignee: rohithsharma Rohith Sharma K S
            Reporter: karams Karam Singh