Hadoop YARN / YARN-8234

Improve RM system metrics publisher's performance by pushing events to timeline server in batch


Details

      When Timeline Service V1 or V1.5 is used, if "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch" is set to true, the ResourceManager sends timeline events in batches. The default value is false. If this functionality is enabled, the maximum number of events published in a batch is configured by "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size". The default value is 1000. The interval between publishing events can be configured by "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds". By default, it is set to 60 seconds.
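      The three properties above would typically be set in yarn-site.xml; a sketch of such a configuration (values are the documented defaults, except enable-batch, which must be switched on explicitly):

```xml
<!-- Enable batched publishing of RM timeline events (default: false) -->
<property>
  <name>yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch</name>
  <value>true</value>
</property>
<!-- Maximum number of events sent per batch (default: 1000) -->
<property>
  <name>yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size</name>
  <value>1000</value>
</property>
<!-- Interval between batch pushes, in seconds (default: 60) -->
<property>
  <name>yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds</name>
  <value>60</value>
</property>
```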

    Description

      When the system metrics publisher is enabled, the RM pushes events to the timeline server via a RESTful API. If the cluster load is heavy, many events are sent to the timeline server and the timeline server's event handler thread gets locked. YARN-7266 discusses the details of this problem. Because of the lock, the timeline server can't receive events as fast as they are generated in the RM, so lots of timeline events pile up in the RM's memory. Eventually, those events consume all of the RM's memory, and the RM starts a full GC (which causes a JVM stop-the-world pause and a timeout from the RM to ZooKeeper) or even hits an OOM.

      The main problem here is that the timeline server can't receive events as fast as the RM generates them. Currently, the RM system metrics publisher puts only one event in each request, so most of the time is spent handling HTTP headers and network connection setup on the timeline server side. Only a small fraction of the time is spent processing the timeline event itself, which is the truly valuable work.

      In this issue, we add a buffer to the system metrics publisher and let the publisher send events to the timeline server in batches, one batch per request. With the batch size set to 1000, in our experiments the rate at which the timeline server receives events improved by 100x. We have deployed this feature in our production environment, which accepts 20,000 applications per hour, and it works fine.
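      The buffering described above can be sketched as follows. This is a minimal illustration, not the actual patch: the class name, the String stand-in for timeline events, and the sentBatches list (which stands in for the single-request REST call) are all hypothetical. A batch is flushed either when it reaches the configured size or when the configured interval elapses, whichever comes first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchingPublisher {
  private final LinkedBlockingQueue<String> buffer = new LinkedBlockingQueue<>();
  private final int batchSize;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  // Stand-in for the REST call that would PUT one whole batch per request.
  final List<List<String>> sentBatches = new ArrayList<>();

  BatchingPublisher(int batchSize, long intervalSeconds) {
    this.batchSize = batchSize;
    // Periodic flush: an event never waits longer than the interval.
    scheduler.scheduleAtFixedRate(this::flush, intervalSeconds,
        intervalSeconds, TimeUnit.SECONDS);
  }

  void putEvent(String event) {
    buffer.add(event);
    if (buffer.size() >= batchSize) {
      flush(); // size threshold reached: send immediately
    }
  }

  synchronized void flush() {
    List<String> batch = new ArrayList<>();
    buffer.drainTo(batch, batchSize); // take at most batchSize events
    if (!batch.isEmpty()) {
      sentBatches.add(batch); // real code would issue one HTTP request here
    }
  }

  void stop() {
    scheduler.shutdown();
    flush(); // drain whatever is left in the buffer
  }

  public static void main(String[] args) {
    BatchingPublisher p = new BatchingPublisher(3, 60);
    for (int i = 0; i < 7; i++) {
      p.putEvent("event-" + i);
    }
    p.stop();
    // 7 events with batch size 3 -> batches of 3, 3, and 1
    System.out.println(p.sentBatches.size());
  }
}
```

      Compared with one-event-per-request publishing, the per-request HTTP and connection overhead is amortized across the whole batch, which is where the speedup comes from.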

      Attachments

        1. YARN-8234.001.patch
          12 kB
          Hu Ziqian
        2. YARN-8234.002.patch
          14 kB
          Hu Ziqian
        3. YARN-8234.003.patch
          15 kB
          Hu Ziqian
        4. YARN-8234.004.patch
          14 kB
          Hu Ziqian
        5. YARN-8234-branch-2.8.3.001.patch
          9 kB
          Hu Ziqian
        6. YARN-8234-branch-2.8.3.002.patch
          12 kB
          Hu Ziqian
        7. YARN-8234-branch-2.8.3.003.patch
          14 kB
          Hu Ziqian
        8. YARN-8234-branch-2.8.3.004.patch
          13 kB
          Hu Ziqian


          People

            Assignee: Ashutosh Gupta (groot)
            Reporter: Hu Ziqian (ziqian hu)
            Votes: 0
            Watchers: 11

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 7.5h
