FLINK-25023: ClassLoader leak on JM/TM through indirectly-started Hadoop threads out of user code

    Description

      If a Flink job uses HDFS through Flink's filesystem abstraction (on either the JM or TM), the underlying Hadoop code may spawn a few threads, e.g. from static class members:

      • org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner
      • IPC Parameter Sending Thread#*

      These threads are started as soon as the classes are loaded, which may happen in the context of user code. The created threads may then hold references to the context class loader (I did not observe that here) or, as happened in this case, they may inherit thread contexts such as the ProtectionDomain (captured via an AccessController).
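
      A minimal, hypothetical sketch of the mechanism (the nested Statistics class below is a stand-in for Hadoop's FileSystem$Statistics; the real cleaner thread is started the same way, as a side effect of class initialization):

{code:java}
import java.net.URL;
import java.net.URLClassLoader;

public final class ContextLeakDemo {

    // Stand-in for Hadoop's FileSystem$Statistics: the cleaner thread is
    // started as a side effect of class initialization, not by any explicit call.
    static final class Statistics {
        static final Thread CLEANER = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException ignored) {
            }
        }, "StatisticsDataReferenceCleaner");

        static {
            CLEANER.setDaemon(true);
            CLEANER.start();
        }
    }

    public static void main(String[] args) {
        // Stand-in for Flink's ChildFirstClassLoader.
        ClassLoader userLoader =
                new URLClassLoader(new URL[0], ContextLeakDemo.class.getClassLoader());
        Thread.currentThread().setContextClassLoader(userLoader);

        // The first access to the class happens "inside user code": the JVM runs
        // the static initializer now, and the new thread inherits the current
        // context class loader as well as an AccessControlContext referencing
        // the caller's ProtectionDomain.
        Statistics.CLEANER.getName();

        // Prints true: the long-lived cleaner thread now pins the user class loader.
        System.out.println(Statistics.CLEANER.getContextClassLoader() == userLoader);
    }
}
{code}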

      Hence, user contexts and user class loaders leak into long-running threads that live in Flink's (parent) classloader.

      Fortunately, only a single ChildFirstClassLoader seems to be leaked in this concrete example, but that may depend on which code paths each client execution walks.

      A proper solution doesn't seem so simple:

      • We could try to proactively initialize the available file systems, in the hope of starting all threads in the parent classloader with the parent context (see the sketch after this list).
      • We could create a default ProtectionDomain for spawned threads, as discussed at https://dzone.com/articles/javalangoutofmemory-permgen. However, the StatisticsDataReferenceCleaner isn't actively spawned from any callback but via a static variable, and thus with the class loading itself (but maybe this is still possible somehow).
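
      A minimal sketch of the first idea, assuming a hook that runs once during JM/TM startup (the HadoopPreInitializer class and its placement are hypothetical; only the org.apache.hadoop.fs.FileSystem class name is real):

{code:java}
// Hypothetical startup hook: force Hadoop's static initializers to run while
// the context class loader is still Flink's own (parent) class loader, so the
// statically started threads capture that loader instead of a user one.
public final class HadoopPreInitializer {

    public static void preInitialize() {
        ClassLoader flinkLoader = HadoopPreInitializer.class.getClassLoader();
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        try {
            current.setContextClassLoader(flinkLoader);
            // Loading with initialize=true runs the static initializer, which
            // starts the cleaner thread with the parent loader as its context.
            Class.forName("org.apache.hadoop.fs.FileSystem", true, flinkLoader);
        } catch (ClassNotFoundException e) {
            // Hadoop is not on the classpath; nothing to pre-initialize.
        } finally {
            current.setContextClassLoader(previous);
        }
    }
}
{code}

      The second idea could possibly be layered on top by running the same Class.forName call inside AccessController.doPrivileged with a synthetic AccessControlContext, so that the inherited ProtectionDomain is a neutral one as well. Whether that also covers the IPC Parameter Sending Threads is unclear, since those are presumably started lazily on first use rather than during class loading.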


            People

              Assignee: David Morávek (dmvk)
              Reporter: Nico Kruber (nkruber)
              Votes: 1
              Watchers: 10
