Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.3, 2.6.4, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None

      Description

      Problem:
      HistoryFileManager.addIfAbsent produces a large amount of log output when the
      number of cached entries whose age is less than mapreduce.jobhistory.max-age-ms
      far exceeds mapreduce.jobhistory.joblist.cache.size.

      Example:
      Suppose the cache contains 50000 entries in total, 10000 of them newer than
      mapreduce.jobhistory.max-age-ms, and mapreduce.jobhistory.joblist.cache.size
      is 20000. The HistoryFileManager.addIfAbsent method then produces
      50000 - 20000 = 30000 lines of the message "Waiting to remove <key> from
      JobListCache because it is not in done yet".
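
      The arithmetic in the example can be reproduced with a toy model. This is
      only a sketch, not the Hadoop source: the class EvictionLogSketch and the
      method countWarnings are hypothetical names, and the model assumes that every
      entry above the size limit is still waiting to be moved to "done" and yields
      exactly one warning line.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical, simplified model of a cache trim pass; not the Hadoop source. */
class EvictionLogSketch {

    /** Counts the warning lines one trim pass would write under the stated assumptions. */
    static int countWarnings(int totalEntries, int maxSize) {
        Map<Integer, Boolean> cache = new ConcurrentSkipListMap<>();
        for (int i = 0; i < totalEntries; i++) {
            cache.put(i, Boolean.TRUE); // TRUE = move to "done" is still pending
        }

        int warnings = 0;
        int excess = cache.size() - maxSize; // entries above the configured limit
        for (Map.Entry<Integer, Boolean> entry : cache.entrySet()) {
            if (excess <= 0) {
                break;
            }
            if (entry.getValue()) {
                // The entry cannot be evicted yet, so one warning line is written.
                warnings++;
            }
            excess--;
        }
        return warnings;
    }

    public static void main(String[] args) {
        System.out.println(countWarnings(50000, 20000)); // prints 30000
    }
}
```

      Each of those lines costs a string concatenation and a logger call, which is
      why the volume matters when the loop runs while a lock is held.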

      Stack traces are attached.

      Impact:
      In addition to consuming a large amount of disk space, this issue blocks
      JobHistory.getJob for a long time and slows job execution down significantly,
      because getJob is called by RPCs such as
      HistoryClientService.HSClientProtocolHandler.getJobReport.
      The blocking occurs because HistoryFileManager.UserLogDir.scanIfNeeded
      eventually calls HistoryFileManager.addIfAbsent in a synchronized block. When
      multiple threads call scanIfNeeded simultaneously, one of them acquires the
      lock and the others are blocked until the first thread completes the
      long-running HistoryFileManager.addIfAbsent call.

      Solution:

      • Reduce the amount of logging so that HistoryFileManager.addIfAbsent doesn't take too long.
      • Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips
        scanning if another thread is already scanning. This changes the semantics of
        some HistoryFileManager methods (such as getAllFileInfo and getFileInfo)
        because callers that skip the scan may see outdated state.
      • Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls are
        not blocked by a loop over tens of thousands of entries.
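
      The second item could look roughly like the sketch below. It is an
      illustration only: SkipIfScanningSketch and its scanIfNeeded(Runnable) method
      are hypothetical and do not match the signature of the real
      HistoryFileManager.UserLogDir.scanIfNeeded. An AtomicBoolean flag stands in
      for the synchronized block, so a caller that finds a scan in progress returns
      immediately instead of waiting.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of "skip if a scan is already running"; names are illustrative. */
class SkipIfScanningSketch {
    private final AtomicBoolean scanning = new AtomicBoolean(false);

    /**
     * Runs the scan unless one is already in progress.
     * Returns true if this call performed the scan, false if it was skipped.
     */
    boolean scanIfNeeded(Runnable scan) {
        if (!scanning.compareAndSet(false, true)) {
            // A scan is already running; the caller proceeds with possibly stale state.
            return false;
        }
        try {
            scan.run();
            return true;
        } finally {
            scanning.set(false); // always release the flag, even if the scan throws
        }
    }
}
```

      The trade-off is the one named in the list: a caller that skips the scan
      reads whatever state the previous scan left behind.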

      This patch implemented the first item.
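
      One way to reduce the log volume is to emit a single summary line per trim
      pass instead of one line per key. The sketch below shows that idea only; it
      is not the committed patch, and AggregatedWarningSketch, trimMessages, and the
      exact message wording are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of summarizing skipped entries; not the committed patch. */
class AggregatedWarningSketch {

    /** Returns the log lines one trim pass would emit for the given pending keys. */
    static List<String> trimMessages(List<String> pendingKeys) {
        List<String> lines = new ArrayList<>();
        if (!pendingKeys.isEmpty()) {
            // One summary line, with the first key as an example, replaces N per-key lines.
            lines.add("Waiting to remove " + pendingKeys.size()
                + " entries (e.g. " + pendingKeys.get(0)
                + ") from JobListCache because they are not in done yet");
        }
        return lines;
    }
}
```

      With 30000 pending keys this produces one line rather than 30000, so the time
      spent inside the synchronized section no longer grows with the logging cost.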

        Attachments

        1. stacktrace1.txt
          59 kB
          Ryu Kobayashi
        2. stacktrace2.txt
          58 kB
          Ryu Kobayashi
        3. stacktrace3.txt
          60 kB
          Ryu Kobayashi
        4. MAPREDUCE-6436.1.patch
          4 kB
          Ryu Kobayashi
        5. MAPREDUCE-6436.2.patch
          3 kB
          Kai Sasaki
        6. MAPREDUCE-6436.3.patch
          3 kB
          Kai Sasaki
        7. MAPREDUCE-6436.4.patch
          3 kB
          Kai Sasaki

              People

              • Assignee:
                lewuathe Kai Sasaki
                Reporter:
                ryu_kobayashi Ryu Kobayashi
              • Votes:
                0
                Watchers:
                9
