  1. Hadoop Map/Reduce
  2. MAPREDUCE-6436

JobHistory cache issue


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.3, 2.6.4, 3.0.0-alpha1
    • Component/s: None
    • Labels: None

    Description

      Problem:
      HistoryFileManager.addIfAbsent produces a large volume of log output when the
      number of cached entries younger than mapreduce.jobhistory.max-age-ms greatly
      exceeds mapreduce.jobhistory.joblist.cache.size.

      Example:
      If the cache contains 50,000 entries in total, 10,000 of them newer than
      mapreduce.jobhistory.max-age-ms, and mapreduce.jobhistory.joblist.cache.size
      is 20,000, the HistoryFileManager.addIfAbsent method writes
      50,000 - 20,000 = 30,000 lines of the message "Waiting to remove <key> from
      JobListCache because it is not in done yet".
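To illustrate why the log volume scales with the number of entries, here is a minimal sketch of a size-bounded cache whose eviction pass warns once for every entry it is not yet allowed to remove. It is a simplified model under stated assumptions, not the Hadoop source: the class JobListCacheModel, the integer job ids, and the movePending flag are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of a bounded job list cache (not Hadoop code).
class JobListCacheModel {
    // Key: job id (oldest first). Value: true while the entry may not be evicted yet.
    private final TreeMap<Integer, Boolean> cache = new TreeMap<>();
    private final int maxSize;
    // Captured warning lines, standing in for the log file.
    final List<String> warnings = new ArrayList<>();

    JobListCacheModel(int maxSize) {
        this.maxSize = maxSize;
    }

    void addIfAbsent(int jobId, boolean movePending) {
        cache.putIfAbsent(jobId, movePending);
        if (cache.size() <= maxSize) {
            return;
        }
        // Walk entries oldest-first until the cache fits or the keys run out.
        Iterator<Map.Entry<Integer, Boolean>> it = cache.entrySet().iterator();
        while (cache.size() > maxSize && it.hasNext()) {
            Map.Entry<Integer, Boolean> entry = it.next();
            if (entry.getValue()) {
                // One warning per non-evictable entry, on every call.
                warnings.add("Waiting to remove " + entry.getKey()
                    + " from JobListCache because it is not in done yet");
            } else {
                it.remove();
            }
        }
    }

    int size() {
        return cache.size();
    }
}
```

In this model, adding five pending entries to a cache limited to two entries writes 3 + 4 + 5 = 12 warning lines, because every call that finds the cache over its limit walks the pending entries again.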

      Stack traces are attached.

      Impact:
      Besides consuming a large amount of disk space, this issue blocks
      JobHistory.getJob for a long time and slows job execution significantly,
      because getJob is called from RPCs such as
      HistoryClientService.HSClientProtocolHandler.getJobReport.
      The blocking occurs because HistoryFileManager.UserLogDir.scanIfNeeded
      eventually calls HistoryFileManager.addIfAbsent inside a synchronized block.
      When multiple threads call scanIfNeeded simultaneously, one of them acquires
      the lock and the others are blocked until it completes the long-running
      HistoryFileManager.addIfAbsent call.
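The blocking pattern described above can be reproduced in isolation. The following is a minimal sketch, assuming a single shared monitor; ScanLockDemo, slowScan, and rpcCall are hypothetical names that only stand in for the scan and for an RPC handler, and a latch replaces the long-running loop.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo: one thread holds a monitor for a long "scan" while a
// second thread, standing in for an RPC handler, waits for the same monitor.
class ScanLockDemo {
    private final Object lock = new Object();
    private final CountDownLatch scanStarted = new CountDownLatch(1);
    private final CountDownLatch finishScan = new CountDownLatch(1);

    // Stand-in for a scan that keeps the monitor during its whole pass.
    void slowScan() {
        synchronized (lock) {
            scanStarted.countDown();
            awaitQuietly(finishScan); // models the long-running loop
        }
    }

    // Stand-in for an RPC handler that needs the same monitor.
    void rpcCall() {
        synchronized (lock) {
            // a real handler would read the cache here
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if the second caller was observed in the BLOCKED state.
    static boolean secondCallerBlocks() {
        ScanLockDemo demo = new ScanLockDemo();
        Thread scanner = new Thread(demo::slowScan);
        Thread rpc = new Thread(demo::rpcCall);
        scanner.start();
        awaitQuietly(demo.scanStarted); // the scanner now owns the monitor
        rpc.start();
        boolean blocked = false;
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (!blocked && System.nanoTime() < deadline) {
            blocked = rpc.getState() == Thread.State.BLOCKED;
            Thread.yield();
        }
        demo.finishScan.countDown(); // let the scan finish
        try {
            scanner.join();
            rpc.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return blocked;
    }
}
```

The second thread stays blocked for as long as the first one holds the monitor, so any time spent inside the guarded section is added to the latency of every waiting caller.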

      Solution:

      • Reduce the amount of logging so that HistoryFileManager.addIfAbsent does not take too long.
      • Good to have if possible: HistoryFileManager.UserLogDir.scanIfNeeded skips
        scanning if another thread is already scanning. This changes the semantics of
        some HistoryFileManager methods (such as getAllFileInfo and getFileInfo),
        because a caller that skips the scan may see outdated state.
      • Good to have if possible: Make scanIfNeeded asynchronous so that RPC calls are
        not blocked by a loop over tens of thousands of entries.
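The second item, skipping the scan when another thread is already scanning, could be sketched with a non-blocking lock attempt. This is only an illustration under assumptions: SkippingScanner and its members are hypothetical names, and the sketch says nothing about how HistoryFileManager would actually be changed.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: callers that find a scan in progress return at once
// instead of waiting, at the cost of possibly seeing outdated state.
class SkippingScanner {
    private final ReentrantLock scanLock = new ReentrantLock();
    // Number of scans that actually ran.
    final AtomicInteger scans = new AtomicInteger();

    // Returns true if this caller ran the scan, false if it skipped it.
    boolean scanIfNeeded(Runnable scan) {
        if (!scanLock.tryLock()) {
            return false; // another thread is scanning; do not wait
        }
        try {
            scan.run();
            scans.incrementAndGet();
            return true;
        } finally {
            scanLock.unlock();
        }
    }

    // Demo: while one scan is running, a second thread tries to scan.
    // Returns the second thread's result (expected to be false).
    static Boolean concurrentCallerRan() {
        SkippingScanner scanner = new SkippingScanner();
        Boolean[] inner = new Boolean[1];
        scanner.scanIfNeeded(() -> {
            Thread other = new Thread(() -> inner[0] = scanner.scanIfNeeded(() -> { }));
            other.start();
            try {
                other.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        return inner[0];
    }
}
```

The trade-off is the one named in the list: a caller that receives false proceeds with whatever state the previous scan left behind.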

      This patch implements the first item.
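One way to reduce the log volume is to count the entries that cannot be evicted yet and write a single summary line per eviction pass. The sketch below shows that idea on a simplified model; it is not the attached patch, and SummarizingCacheModel, the integer job ids, the movePending flag, and the exact message wording are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model (not Hadoop code): one summary warning per
// eviction pass instead of one warning per non-evictable entry.
class SummarizingCacheModel {
    // Key: job id (oldest first). Value: true while the entry may not be evicted yet.
    private final TreeMap<Integer, Boolean> cache = new TreeMap<>();
    private final int maxSize;
    // Captured warning lines, standing in for the log file.
    final List<String> warnings = new ArrayList<>();

    SummarizingCacheModel(int maxSize) {
        this.maxSize = maxSize;
    }

    void addIfAbsent(int jobId, boolean movePending) {
        cache.putIfAbsent(jobId, movePending);
        if (cache.size() <= maxSize) {
            return;
        }
        Integer firstPending = null; // example key to include in the summary
        int pendingCount = 0;
        Iterator<Map.Entry<Integer, Boolean>> it = cache.entrySet().iterator();
        while (cache.size() > maxSize && it.hasNext()) {
            Map.Entry<Integer, Boolean> entry = it.next();
            if (entry.getValue()) {
                if (firstPending == null) {
                    firstPending = entry.getKey();
                }
                pendingCount++;
            } else {
                it.remove();
            }
        }
        if (pendingCount > 0) {
            warnings.add("Waiting to remove " + pendingCount + " entries (e.g. "
                + firstPending + ") from JobListCache because they are not in done yet");
        }
    }

    int size() {
        return cache.size();
    }
}
```

With five pending entries and a limit of two, this model writes three warning lines (one for each call that finds the cache over its limit) rather than one line per pending entry per call.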

      Attachments

        1. MAPREDUCE-6436.1.patch (4 kB, Ryu Kobayashi)
        2. MAPREDUCE-6436.2.patch (3 kB, Kai)
        3. MAPREDUCE-6436.3.patch (3 kB, Kai)
        4. MAPREDUCE-6436.4.patch (3 kB, Kai)
        5. stacktrace1.txt (59 kB, Ryu Kobayashi)
        6. stacktrace2.txt (58 kB, Ryu Kobayashi)
        7. stacktrace3.txt (60 kB, Ryu Kobayashi)


          People

            Assignee: Kai (lewuathe)
            Reporter: Ryu Kobayashi (ryu_kobayashi)
            Votes: 0
            Watchers: 9
