  1. Hadoop YARN
  2. YARN-128 [Umbrella] RM Restart Phase 1: State storage and non-work-preserving recovery
  3. YARN-1210

During RM restart, RM should start a new attempt only when previous attempt exits for real



    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 2.3.0
    • None
    • None
    • Reviewed


      When the RM recovers, it can wait for existing AMs to contact it and then kill them forcefully before starting a new AM. In the worst case, the RM will start a new AppAttempt after waiting for 10 minutes (the expiry interval). This minimizes the chance of multiple AMs racing with each other, which can help avoid issues with downstream components like Pig, Hive and Oozie during RM restart.

      In the meantime, new apps will proceed as usual while existing apps wait for recovery.

      This can continue to be useful after work-preserving restart: AMs that can properly sync back up with the RM can continue to run, and those that cannot are guaranteed to be killed before a new attempt is started.
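
      The gating rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `AttemptRecoveryGate` and its methods are hypothetical names invented for this example, not YARN APIs and not the code in the attached patches. It models three cases: an app submitted after the restart proceeds immediately, a recovered app proceeds once its previous attempt is confirmed to have exited, and otherwise a recovered app waits until the expiry interval has elapsed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of the rule "start a new attempt only when the
 * previous attempt has exited for real". Not actual ResourceManager code.
 */
class AttemptRecoveryGate {
    // Assumed to correspond to the AM expiry interval (10 minutes in the description).
    private final long expiryIntervalMs;
    // Apps loaded from the state store after restart, with their recovery time.
    private final Map<String, Long> recoveredAtMs = new HashMap<>();
    // Recovered apps whose previous attempt is confirmed to be gone.
    private final Set<String> previousAttemptExited = new HashSet<>();

    AttemptRecoveryGate(long expiryIntervalMs) {
        this.expiryIntervalMs = expiryIntervalMs;
    }

    /** Record an application that was recovered from stored state. */
    void onAppRecovered(String appId, long nowMs) {
        recoveredAtMs.put(appId, nowMs);
    }

    /** Record that the old AM contacted the RM, was killed, and has exited. */
    void onPreviousAttemptExited(String appId) {
        previousAttemptExited.add(appId);
    }

    /** Decide whether a new attempt may be started for this application. */
    boolean mayStartNewAttempt(String appId, long nowMs) {
        Long recoveredAt = recoveredAtMs.get(appId);
        if (recoveredAt == null) {
            return true; // new app submitted after restart: proceed as usual
        }
        if (previousAttemptExited.contains(appId)) {
            return true; // old attempt is known to be gone
        }
        // Worst case: the old AM never called back, so wait out the expiry interval.
        return nowMs - recoveredAt >= expiryIntervalMs;
    }
}
```

      Passing the current time in as a parameter, instead of reading a clock inside the class, is a choice made here so the timeout path can be exercised without waiting.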


        1. YARN-1210.7.patch
          62 kB
          Omkar Vinit Joshi
        2. YARN-1210.6.patch
          60 kB
          Omkar Vinit Joshi
        3. YARN-1210.5.patch
          44 kB
          Omkar Vinit Joshi
        4. YARN-1210.4.patch
          54 kB
          Omkar Vinit Joshi
        5. YARN-1210.4.patch
          54 kB
          Omkar Vinit Joshi
        6. YARN-1210.3.patch
          39 kB
          Omkar Vinit Joshi
        7. YARN-1210.2.patch
          46 kB
          Omkar Vinit Joshi
        8. YARN-1210.1.patch
          37 kB
          Omkar Vinit Joshi

              ojoshi Omkar Vinit Joshi
              vinodkv Vinod Kumar Vavilapalli