Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-24569

LLAP daemon leaks file descriptors/log4j appenders

Log workAgile BoardRank to TopRank to BottomBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      With HIVE-9756 query logs in LLAP are directed to different files (file per query) using a Log4j2 routing appender. Without a purge policy in place, appenders are created dynamically by the routing appender, one for each query, and remain in memory forever. The dynamic appenders write to files so each appender holds to a file descriptor.

      Further work HIVE-14224 has mitigated the issue by introducing a custom purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic appenders (and closes the respective files) when the query is completed (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion).

      However, in the presence of multiple threads appending to the logs there are race conditions. In an internal Hive cluster the number of file descriptors started going up approx one descriptor leaking per query. After some debugging it turns out that one thread (running the QueryTracker#handleLogOnQueryCompletion) signals that the query has finished and thus the purge policy should get rid of the respective appender (and close the file) while another (Task-Executor-0) attempts to append another log message for the same query. The initial appender is closed after the request from the query tracker but a new one is created to accomodate the message from the task executor and the latter is never removed thus creating a leak.

      Similar leaks have been identified and fixed for HS2 with the most similar one being that described here.

      The problem relies on the timing of threads so it may not manifestate in all versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via lsof (or other similar command) with the following output:

      # 1494391 is the PID of the LLAP daemon process
      ls -ltr /proc/1494391/fd
      ...
      lrwx------ 1 hive hadoop 64 Dec 24 12:08 978 -> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log
      lrwx------ 1 hive hadoop 64 Dec 24 12:08 977 -> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log
      lrwx------ 1 hive hadoop 64 Dec 24 12:08 974 -> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log
      lrwx------ 1 hive hadoop 64 Dec 24 12:08 989 -> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log
      lrwx------ 1 hive hadoop 64 Dec 24 12:08 984 -> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log
      lrwx------ 1 hive hadoop 64 Dec 24 12:08 983 -> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log
      lrwx------ 1 hive hadoop 64 Dec 24 12:08 981 -> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log
      lrwx------ 1 hive hadoop 64 Dec 24 12:08 980 -> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121744_fcf37921-4351-4368-95ee-b5be2592d89a-dag_1608659125567_0008_195.log
      lrwx------ 1 hive hadoop 64 Dec 24 12:08 979 -> /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121837_e80c0024-f6bc-4b3c-85ed-5c0c85c55787-dag_1608659125567_0008_199.log
      

      or in the heap dump with many appenders (in my case LlapWrappedAppender) holding indirectly open file descriptors:

      Attachments

        1. llap-appender-gc-roots.png
          102 kB
          Stamatis Zampetakis

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            zabetak Stamatis Zampetakis Assign to me
            zabetak Stamatis Zampetakis
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - Not Specified
              Not Specified
              Remaining:
              Remaining Estimate - 0h
              0h
              Logged:
              Time Spent - 1h 40m
              1h 40m

              Slack

                Issue deployment