[YARN-6136] YARN registry service should avoid scanning whole ZK tree for every container/application finish - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Invalid
Affects Version/s: None
Fix Version/s: None
Component/s: api, resourcemanager
Labels: None

Description

In the existing registry service implementation, the purge operation is triggered by the container finish event:

  public void onContainerFinished(ContainerId id) throws IOException {
    LOG.info("Container {} finished, purging container-level records",
        id);
    purgeRecordsAsync("/",
        id.toString(),
        PersistencePolicies.CONTAINER);
  }

Because this runs on every container finish and starts at "/", each event essentially scans all (or almost all) ZK nodes from the root.

We have a cluster with hundreds of ZK nodes for the service registry and 20K+ ZK nodes for other purposes. The existing implementation can generate a massive number of ZK operations, as well as many internal Java objects (RegistryPathStatus). The RM becomes very unstable when container finish events arrive in batches, because of full GC pauses and ZK connection failures.
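To illustrate the cost described above, here is a minimal sketch that counts how many nodes a recursive purge reads when it starts at "/" versus at a narrower subtree. Everything in it is an assumption made for illustration: the class PurgeScopeSketch, its in-memory Node tree, the paths under /registry and /other, and the visited counter (a stand-in for ZK operations). It does not use or reproduce the Hadoop registry API, and it says nothing about how the real purgeRecordsAsync resolves "/".

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory model of a ZK tree; not the Hadoop registry API. */
class PurgeScopeSketch {
    /** One node; recordId is null when the node carries no registry record. */
    static final class Node {
        final String recordId;
        final Map<String, Node> children = new LinkedHashMap<>();
        Node(String recordId) { this.recordId = recordId; }
    }

    final Node root = new Node(null);
    int visited; // nodes read during a purge; stands in for ZK operations

    /** Creates the node at an absolute path, attaching recordId to the leaf only. */
    void add(String path, String recordId) {
        String[] parts = path.substring(1).split("/");
        Node cur = root;
        for (int i = 0; i < parts.length; i++) {
            final String leafId = (i == parts.length - 1) ? recordId : null;
            cur = cur.children.computeIfAbsent(parts[i], k -> new Node(leafId));
        }
    }

    /** Looks up the node at an absolute path, or returns null if it is absent. */
    Node find(String path) {
        Node cur = root;
        if (path.equals("/")) return cur;
        for (String part : path.substring(1).split("/")) {
            cur = cur.children.get(part);
            if (cur == null) return null;
        }
        return cur;
    }

    /** Purges every record matching id below startPath; returns the number removed. */
    int purge(String startPath, String id) {
        Node start = find(startPath);
        return start == null ? 0 : purgeUnder(start, id);
    }

    private int purgeUnder(Node node, String id) {
        int purged = 0;
        Iterator<Node> it = node.children.values().iterator();
        while (it.hasNext()) {
            Node child = it.next();
            visited++; // every child read counts as one operation
            if (id.equals(child.recordId)) {
                it.remove();
                purged++;
            } else {
                purged += purgeUnder(child, id);
            }
        }
        return purged;
    }

    /** Builds one registry record plus unrelated nodes, purges, and returns nodes read. */
    static int visitsFor(String startPath, int unrelatedNodes) {
        PurgeScopeSketch tree = new PurgeScopeSketch();
        tree.add("/registry/users/alice/services/app1/components/container_01",
            "container_01");
        for (int i = 0; i < unrelatedNodes; i++) {
            tree.add("/other/n" + i, null);
        }
        int purged = tree.purge(startPath, "container_01");
        if (purged != 1) throw new IllegalStateException("expected one purged record");
        return tree.visited;
    }

    public static void main(String[] args) {
        System.out.println("from /         : " + visitsFor("/", 1000));
        System.out.println("from /registry : " + visitsFor("/registry", 1000));
    }
}
```

With 1000 unrelated nodes in this toy tree, the purge from "/" reads 1008 nodes while the purge from /registry reads 6. The number of reads grows with the unrelated nodes only when the scan starts at the root, which is the behaviour this issue describes for batches of container finish events.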

Issue Links

is caused by

YARN-2571 RM to support YARN registry

Resolved

People

Assignee: Wangda Tan

Reporter: Wangda Tan

Votes: 0

Watchers: 10

Dates

Created: 31/Jan/17 21:53

Updated: 22/Mar/18 18:20

Resolved: 22/Mar/18 18:20