  Hadoop YARN / YARN-8579

New AM attempt could not retrieve previous attempt component data

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: 3.2.0, 3.1.2

    Description

      Steps:
      1) Launch httpd-docker
      2) Wait for app to be in STABLE state
      3) Run validation for app (takes around 3 minutes)
      4) Stop all ZooKeeper (ZK) servers
      5) Wait 60 sec
      6) Kill AM
      7) Wait 30 sec
      8) Start all ZK servers
      9) Wait for application to finish
      10) Validate expected containers of the app

      Expected behavior:
      A new AM attempt should start, and the Docker containers launched by the 1st attempt should be recovered by the new attempt.

      Actual behavior:
      A new AM attempt starts, but it cannot recover the 1st attempt's Docker containers because it cannot read the component details from ZK.
      As a result, it releases the previous containers and requests new containers for all component instances.

      2018-07-19 22:42:47,595 [main] INFO  service.ServiceScheduler - Registering appattempt_1531977563978_0015_000002, fault-test-zkrm-httpd-docker into registry
      2018-07-19 22:42:47,611 [main] INFO  service.ServiceScheduler - Received 1 containers from previous attempt.
      2018-07-19 22:42:47,642 [main] INFO  service.ServiceScheduler - Could not read component paths: `/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components': No such file or directory: KeeperErrorCode = NoNode for /registry/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components
      2018-07-19 22:42:47,643 [main] INFO  service.ServiceScheduler - Handling container_e08_1531977563978_0015_01_000003 from previous attempt
      2018-07-19 22:42:47,643 [main] INFO  service.ServiceScheduler - Record not found in registry for container container_e08_1531977563978_0015_01_000003 from previous attempt, releasing
      2018-07-19 22:42:47,649 [AMRM Callback Handler Thread] INFO  impl.TimelineV2ClientImpl - Updated timeline service address to xxx:33019
      2018-07-19 22:42:47,651 [main] INFO  service.ServiceScheduler - Triggering initial evaluation of component httpd
      2018-07-19 22:42:47,652 [main] INFO  component.Component - [INIT COMPONENT httpd]: 2 instances.
      2018-07-19 22:42:47,652 [main] INFO  component.Component - [COMPONENT httpd] Requesting for 2 container(s)
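
The log above suggests a simple decision per container handed over from the previous attempt: if a matching record is found in the registry, the container is kept; otherwise it is released and the component requests fresh containers. The following is a minimal sketch of that decision only, not the actual ServiceScheduler code. The class name, method name, and the use of a plain map in place of the ZK-backed registry are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; names do not come from the Hadoop code base.
class PreviousAttemptRecoverySketch {

    /** Holds the outcome for containers received from the previous AM attempt. */
    static final class Result {
        final List<String> recovered = new ArrayList<>();
        final List<String> released = new ArrayList<>();
    }

    /**
     * Splits previous-attempt containers into recovered and released.
     *
     * @param previousContainers container IDs reported from the previous attempt
     * @param registryRecords    container ID to component name, standing in for
     *                           the registry; empty when the component path
     *                           could not be read (the NoNode case in the log)
     */
    static Result recover(List<String> previousContainers,
                          Map<String, String> registryRecords) {
        Result result = new Result();
        for (String containerId : previousContainers) {
            if (registryRecords.containsKey(containerId)) {
                // A record exists, so the container can be re-attached.
                result.recovered.add(containerId);
            } else {
                // "Record not found in registry ... releasing"
                result.released.add(containerId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> previous = Collections.singletonList(
                "container_e08_1531977563978_0015_01_000003");
        // An empty map models the failed read of the components path.
        Result r = recover(previous, new HashMap<String, String>());
        System.out.println("recovered=" + r.recovered + " released=" + r.released);
    }
}
```

With an empty registry view, every previous container ends up in the released list, which matches the behavior reported here: the one container from the 1st attempt is released and the httpd component requests 2 new containers.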

      Attachments

        1. YARN-8579.001.patch
          4 kB
          Gour Saha
        2. YARN-8579.002.patch
          5 kB
          Gour Saha
        3. YARN-8579.003.patch
          5 kB
          Gour Saha
        4. YARN-8579.004.patch
          7 kB
          Gour Saha

            People

              Assignee: Gour Saha (gsaha)
              Reporter: Yesha Vora (yeshavora)