
TimelineClientImpl leaks FileSystem instances, causing long-running services like the HiveServer2 daemon to go OOM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.0, 2.7.3
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: timelineserver
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      In org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl, FileSystem.newInstance is invoked and the returned instance is never closed. Over time, FileSystem instances accumulate in long-running clients (such as HiveServer2) and eventually cause them to run out of memory (OOM).

      1. YARN-5438.0.patch
        0.9 kB
        Rohith Sharma K S

        Activity

        rohithsharma Rohith Sharma K S added a comment -

        Updated patch for closing the FileSystem while stopping TimelineClient
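
        The approach named in this comment, releasing the owned filesystem when the client stops, can be sketched roughly as follows. This is a self-contained illustration and not the contents of YARN-5438.0.patch: TrackedResource and SketchWriter are hypothetical names, and TrackedResource merely stands in for a FileSystem obtained via FileSystem.newInstance.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a FileSystem obtained via FileSystem.newInstance. */
final class TrackedResource implements Closeable {
    /** Number of instances created but not yet closed. */
    static final AtomicInteger OPEN = new AtomicInteger();
    private boolean closed;

    TrackedResource() {
        OPEN.incrementAndGet();
    }

    @Override
    public synchronized void close() {
        if (!closed) {              // idempotent: closing twice is harmless
            closed = true;
            OPEN.decrementAndGet();
        }
    }
}

/** Hypothetical client-side writer that owns one resource for its lifetime. */
final class SketchWriter implements Closeable {
    private final TrackedResource fs = new TrackedResource();

    @Override
    public void close() {
        fs.close();                 // without this call, every writer leaks one instance
    }
}
```

        In a long-running process that creates and stops many such writers, the count of open instances only stays flat if each stop path reaches the close call.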

        hadoopqa Hadoop QA added a comment -
        -1 overall



        | Vote | Subsystem  | Runtime | Comment |
        |  0   | reexec     | 0m 15s  | Docker mode activated. |
        | +1   | @author    | 0m 0s   | The patch does not contain any @author tags. |
        | -1   | test4tests | 0m 0s   | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
        | +1   | mvninstall | 6m 39s  | trunk passed |
        | +1   | compile    | 0m 26s  | trunk passed |
        | +1   | checkstyle | 0m 18s  | trunk passed |
        | +1   | mvnsite    | 0m 30s  | trunk passed |
        | +1   | mvneclipse | 0m 13s  | trunk passed |
        | +1   | findbugs   | 0m 55s  | trunk passed |
        | +1   | javadoc    | 0m 28s  | trunk passed |
        | +1   | mvninstall | 0m 28s  | the patch passed |
        | +1   | compile    | 0m 23s  | the patch passed |
        | +1   | javac      | 0m 23s  | the patch passed |
        | +1   | checkstyle | 0m 15s  | the patch passed |
        | +1   | mvnsite    | 0m 26s  | the patch passed |
        | +1   | mvneclipse | 0m 10s  | the patch passed |
        | +1   | whitespace | 0m 0s   | The patch has no whitespace issues. |
        | +1   | findbugs   | 1m 0s   | the patch passed |
        | +1   | javadoc    | 0m 26s  | the patch passed |
        | +1   | unit       | 2m 16s  | hadoop-yarn-common in the patch passed. |
        | +1   | asflicense | 0m 15s  | The patch does not generate ASF License warnings. |
        |      |            | 15m 59s | |



        | Subsystem      | Report/Notes |
        | Docker Image   | yetus/hadoop:9560f25 |
        | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12820503/YARN-5438.0.patch |
        | JIRA Issue     | YARN-5438 |
        | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
        | uname          | Linux 872a7339fda9 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
        | Build tool     | maven |
        | Personality    | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
        | git revision   | trunk / 54fe17a |
        | Default Java   | 1.8.0_101 |
        | findbugs       | v3.0.0 |
        | Test Results   | https://builds.apache.org/job/PreCommit-YARN-Build/12523/testReport/ |
        | modules        | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common |
        | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/12523/console |
        | Powered by     | Apache Yetus 0.3.0 http://yetus.apache.org |

        This message was automatically generated.

        jlowe Jason Lowe added a comment -

        Thanks for the patch, Rohith! This probably works for the HiveServer2 case iff the server never tries to use the filesystem after the timeline client is closed. However, the timeline client is not just used by HS2, and I think this patch will be problematic for any code that could still use the filesystem after the timeline client is closed. Since the filesystem cache will implicitly link what looks like two separate creations of a filesystem to a single instance, closing one will break any subsequent use of the other.

        This makes me think HS2 is missing a closeAllForUGI call somewhere, so that when it is done with a certain user it cleans up all the filesystems associated with that user. It also makes me wonder why we haven't implemented a reference-counting cache for the filesystem by now.
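
        The reference-counting idea raised here could look something like the sketch below. It is purely illustrative: RefCountingCache, acquire, and release are hypothetical names, nothing like this is claimed to exist in Hadoop, and the factory and disposer callbacks are assumptions made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical cache that disposes a shared value only when its last user releases it. */
final class RefCountingCache<K, V> {
    private static final class Entry<T> {
        final T value;
        int refs;

        Entry(T value) {
            this.value = value;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final Function<K, V> factory;   // creates the value on first acquire
    private final Consumer<V> disposer;     // runs when the count drops to zero

    RefCountingCache(Function<K, V> factory, Consumer<V> disposer) {
        this.factory = factory;
        this.disposer = disposer;
    }

    /** Returns the shared value for the key and records one more user of it. */
    synchronized V acquire(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            entry = new Entry<>(factory.apply(key));
            entries.put(key, entry);
        }
        entry.refs++;
        return entry.value;
    }

    /** Drops one user; the value is disposed only after the last release. */
    synchronized void release(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            throw new IllegalStateException("no outstanding reference for " + key);
        }
        if (--entry.refs == 0) {
            entries.remove(key);
            disposer.accept(entry.value);
        }
    }

    synchronized int size() {
        return entries.size();
    }
}
```

        Under this scheme, two callers that share an instance each call release when done, and the instance stays usable until the second release, which avoids the "closing one breaks the other" hazard described above.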

        rohithsharma Rohith Sharma K S added a comment -

        "Since the filesystem cache will implicitly link what looks like two separate creations of a filesystem to a single instance, closing one will break any subsequent use of the other."

        If the user creates a filesystem object with the FileSystem#newInstance API, a new FS object is always returned, even within the same JVM. For every newInstance call, the object is created from the combination of URI, Conf, and a unique key. If the FS object is created with FS#get, the API looks it up in the cache; that API always uses the combination of URI and Conf only. So what matters is how the FS object is created.
        Closing an instance created with FileSystem#newInstance should therefore not affect another FS object created with FS#get. Note also that if two FS objects are obtained with FS#get, closing one will definitely affect the other.
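
        The distinction drawn in this comment can be illustrated with a toy model. This is a simplified sketch and not Hadoop's implementation: ToyFsCache and ToyFs are hypothetical names, the Conf component of the key is left out, and the assumption that instances created with newInstance stay registered in the cache until they are closed is part of the model rather than something stated in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Toy model of a filesystem cache with shared (get) and unique (newInstance) entries. */
final class ToyFsCache {
    /** Key: the URI plus a unique number; 0 is reserved for the shared entry. */
    private static final class Key {
        final String uri;
        final long unique;

        Key(String uri, long unique) {
            this.uri = uri;
            this.unique = unique;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return unique == other.unique && uri.equals(other.uri);
        }

        @Override
        public int hashCode() {
            return Objects.hash(uri, unique);
        }
    }

    /** Toy filesystem handle; closing it removes its entry from the cache. */
    final class ToyFs {
        private final Key key;
        private boolean closed;

        private ToyFs(Key key) {
            this.key = key;
        }

        boolean isClosed() {
            synchronized (ToyFsCache.this) {
                return closed;
            }
        }

        void close() {
            synchronized (ToyFsCache.this) {
                closed = true;
                map.remove(key, this);
            }
        }
    }

    private final Map<Key, ToyFs> map = new HashMap<>();
    private long nextUnique = 1;

    /** Shared lookup: every caller with the same URI receives the same object. */
    synchronized ToyFs get(String uri) {
        Key key = new Key(uri, 0);
        ToyFs fs = map.get(key);
        if (fs == null) {
            fs = new ToyFs(key);
            map.put(key, fs);
        }
        return fs;
    }

    /** Unique lookup: always a fresh object under a key nobody else can match. */
    synchronized ToyFs newInstance(String uri) {
        Key key = new Key(uri, nextUnique++);
        ToyFs fs = new ToyFs(key);
        map.put(key, fs);
        return fs;
    }

    /** Number of entries currently held by the cache. */
    synchronized int size() {
        return map.size();
    }
}
```

        In this model, closing an object from newInstance leaves the shared object from get untouched, while two get callers hold the same object and so close it for each other. It also shows how unclosed newInstance objects pile up: each one occupies its own cache entry until close is called.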

        jlowe Jason Lowe added a comment -

        Ah, thanks Rohith. My bad, I missed that it was creating the filesystem in a way that essentially avoids the cache.

        +1 lgtm. Will commit this tomorrow if there are no objections.

        jlowe Jason Lowe added a comment -

        Thanks, Rohith! I committed this to trunk, branch-2, and branch-2.8.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #10172 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10172/)
        YARN-5438. TimelineClientImpl leaking FileSystem Instances causing Long (jlowe: rev a1890c32c52fed69ec09efad0fccf49ed8c2e21e)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/FileSystemTimelineWriter.java

          People

          • Assignee: rohithsharma Rohith Sharma K S
          • Reporter: karams Karam Singh
          • Votes: 0
          • Watchers: 9
