Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.3, 2.6.4, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None

      Description

      Problem:
      HistoryFileManager.addIfAbsent produces a large amount of log output when the
      number of cached entries whose age is less than mapreduce.jobhistory.max-age-ms
      far exceeds mapreduce.jobhistory.joblist.cache.size.

      Example:
      If the cache contains 50000 entries in total, 10000 of them newer than
      mapreduce.jobhistory.max-age-ms, and mapreduce.jobhistory.joblist.cache.size
      is 20000, the HistoryFileManager.addIfAbsent method produces
      50000 - 20000 = 30000 lines of the message "Waiting to remove <key> from
      JobListCache because it is not in done yet".
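
      The logging pattern described above can be illustrated with a small model.
      This is only a sketch under stated assumptions, not the Hadoop source: the
      class NoisyEvictionSketch, the Integer keys, and the Boolean "move pending"
      flag are all hypothetical stand-ins, and the age check against
      mapreduce.jobhistory.max-age-ms is left out. It shows how a loop that tries to
      shrink the cache emits one warning for every entry it cannot evict yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical, simplified model of the eviction loop; not the Hadoop source.
final class NoisyEvictionSketch {
    // Key: a job sequence number (assumption). Value: true while the history
    // file is still waiting to be moved to "done", so it cannot be evicted.
    static List<String> evict(ConcurrentSkipListMap<Integer, Boolean> cache, int maxSize) {
        List<String> warnings = new ArrayList<>();
        for (Integer key : cache.keySet()) {
            if (cache.size() <= maxSize) {
                break; // the cache is back within its limit
            }
            Boolean movePending = cache.get(key);
            if (movePending == null) {
                continue; // entry was removed concurrently
            }
            if (movePending) {
                // One log line per entry that cannot be evicted yet.
                warnings.add("Waiting to remove " + key
                        + " from JobListCache because it is not in done yet");
            } else {
                cache.remove(key);
            }
        }
        return warnings;
    }
}
```

      In this model the number of warnings per call grows with the number of
      entries that are still pending, and the same warnings are repeated on every
      later call until those entries can be removed.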

      Stack traces are attached.

      Impact:
      In addition to consuming a large amount of disk space, this issue blocks
      JobHistory.getJob for a long time and slows job execution down significantly,
      because getJob is called by RPCs such as
      HistoryClientService.HSClientProtocolHandler.getJobReport.
      This happens because HistoryFileManager.UserLogDir.scanIfNeeded eventually
      calls HistoryFileManager.addIfAbsent in a synchronized block. When multiple
      threads call scanIfNeeded simultaneously, one of them acquires the lock and
      the others are blocked until the first thread completes its long-running
      HistoryFileManager.addIfAbsent call.

      Solution:

      • Reduce the amount of logging so that HistoryFileManager.addIfAbsent doesn't take too long.
      • Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips
        scanning if another thread is already scanning. This changes the semantics of
        some HistoryFileManager methods (such as getAllFileInfo and getFileInfo)
        because scanIfNeeded would keep outdated state.
      • Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls are
        not blocked by a loop over tens of thousands of entries.
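
      The second item (skipping the scan while another thread is scanning) could
      look roughly like the sketch below. It is not part of any attached patch. The
      class SkipIfScanningSketch and its methods are hypothetical, and the scan
      itself is passed in as a Runnable. A caller that gets false back continues
      with whatever state the previous scan left, which is the semantic change
      mentioned in that item.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: skip the scan instead of waiting for the lock.
final class SkipIfScanningSketch {
    private final ReentrantLock scanLock = new ReentrantLock();
    private final AtomicInteger completedScans = new AtomicInteger();

    // Returns true if this call ran the scan, false if it was skipped because
    // another thread already holds the lock.
    boolean scanIfNeeded(Runnable scan) {
        if (!scanLock.tryLock()) {
            return false; // do not block; the caller may see outdated state
        }
        try {
            scan.run();
            completedScans.incrementAndGet();
            return true;
        } finally {
            scanLock.unlock();
        }
    }

    int completedScans() {
        return completedScans.get();
    }
}
```

      The trade-off is latency against freshness: RPC threads no longer queue
      behind a long scan, but they may answer from a file list that is missing
      recently finished jobs.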

      This patch implemented the first item.
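
      One way to reduce the logging, shown here only as a sketch and not as the
      content of the attached patch, is to count the entries that cannot be evicted
      yet and emit a single summary line per call. The class QuietEvictionSketch,
      the Integer keys, the Boolean "move pending" flag, and the message wording
      are all assumptions made for illustration.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of aggregated logging; not the attached patch.
final class QuietEvictionSketch {
    // Key: a job sequence number (assumption). Value: true while the history
    // file is still waiting to be moved to "done", so it cannot be evicted.
    // Returns the single summary warning, or empty if nothing is pending.
    static Optional<String> evict(ConcurrentSkipListMap<Integer, Boolean> cache, int maxSize) {
        int pendingCount = 0;
        Integer firstPendingKey = null;
        for (Integer key : cache.keySet()) {
            if (cache.size() <= maxSize) {
                break; // the cache is back within its limit
            }
            Boolean movePending = cache.get(key);
            if (movePending == null) {
                continue; // entry was removed concurrently
            }
            if (movePending) {
                // Remember one example key and count the rest silently.
                if (firstPendingKey == null) {
                    firstPendingKey = key;
                }
                pendingCount++;
            } else {
                cache.remove(key);
            }
        }
        if (pendingCount == 0) {
            return Optional.empty();
        }
        return Optional.of("Waiting to remove " + pendingCount + " entries (e.g. "
                + firstPendingKey + ") from JobListCache because they are not in done yet");
    }
}
```

      With this shape the loop still visits the same entries, but the cost of
      formatting and writing a log line is paid once per call rather than once per
      entry.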

        Attachments

        1. MAPREDUCE-6436.4.patch
          3 kB
          Kai Sasaki
        2. MAPREDUCE-6436.3.patch
          3 kB
          Kai Sasaki
        3. MAPREDUCE-6436.2.patch
          3 kB
          Kai Sasaki
        4. MAPREDUCE-6436.1.patch
          4 kB
          Ryu Kobayashi
        5. stacktrace3.txt
          60 kB
          Ryu Kobayashi
        6. stacktrace2.txt
          58 kB
          Ryu Kobayashi
        7. stacktrace1.txt
          59 kB
          Ryu Kobayashi

              People

              • Assignee:
                lewuathe Kai Sasaki
                Reporter:
                ryu_kobayashi Ryu Kobayashi
              • Votes:
                0
                Watchers:
                10

                Dates

                • Created:
                  Updated:
                  Resolved: