Ambari / AMBARI-18240

During a Rolling Downgrade Oozie Long Running Jobs Can Fail


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: trunk, 2.4.1
    • Component/s: ambari-server
    • Labels: None

    Description

      • Install HDP-2.3.2.0-2950 with Ambari 2.4.0
      • Begin a long-running job (LRJ) in Oozie
      • Start upgrading to HDP-2.5.0.0-1235
      • Before the finalize step, start downgrading back to HDP-2.3.2.0-2950.

      Sometimes, the LRJ will fail:

      /usr/hdp/current/oozie-client/bin/oozie job -oozie http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie   -info 0000001-160821214718970-oozie-oozi-C@248 
      ID : 0000001-160821214718970-oozie-oozi-C@248
      ------------------------------------------------------------------------------------------------------------------------------------
      Action Number        : 248
      Console URL          : -
      Error Code           : -
      Error Message        : -
      External ID          : 0000030-160822042035608-oozie-oozi-W
      External Status      : -
      Job ID               : 0000001-160821214718970-oozie-oozi-C
      Tracker URI          : -
      Created              : 2016-08-22 00:37 GMT
      Nominal Time         : 2009-01-01 21:35 GMT
      Status               : FAILED
      Last Modified        : 2016-08-22 05:15 GMT
      First Missing Dependency : -
      ------------------------------------------------------------------------------------------------------------------------------------
      [hrt_qa@natr66-grls-dlm10toeriedwngdsec-r6-21 ~]$  /usr/hdp/current/oozie-client/bin/oozie job -oozie http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie   -info 0000030-160822042035608-oozie-oozi-W
      Job ID : 0000030-160822042035608-oozie-oozi-W
      ------------------------------------------------------------------------------------------------------------------------------------
      Workflow Name : wordcount
      App Path      : hdfs://nameservice/user/hrt_qa/test_oozie_long_running
      Status        : FAILED
      Run           : 0
      User          : hrt_qa
      Group         : -
      Created       : 2016-08-22 05:08 GMT
      Started       : 2016-08-22 05:08 GMT
      Last Modified : 2016-08-22 05:15 GMT
      Ended         : 2016-08-22 05:15 GMT
      CoordAction ID: 0000001-160821214718970-oozie-oozi-C@248
      
      Actions
      ------------------------------------------------------------------------------------------------------------------------------------
      ID                                                                            Status    Ext ID                 Ext Status Err Code  
      ------------------------------------------------------------------------------------------------------------------------------------
      0000030-160822042035608-oozie-oozi-W@wc                                       FAILED    job_1471842441396_0002 FAILED     JA017     
      ------------------------------------------------------------------------------------------------------------------------------------
      0000030-160822042035608-oozie-oozi-W@:start:                                  OK        -                      OK         -         
      ------------------------------------------------------------------------------------------------------------------------------------
      

      This is caused by an outage of both NameNodes during the downgrade.

      • We have two NNs at the "Finalize Upgrade" state:
        • nn1 is standby (out of safemode)
        • nn2 is active (out of safemode)
      • A downgrade begins and we restart nn1
        • After the restart, nn1 hasn't come online yet. Our code tries to contact it and can't, so we move on to nn2.
        • nn2 is online, active, and out of safemode (because it hasn't been downgraded yet), so we let the downgrade continue
      • The downgrade continues and we restart nn2
        • However, nn1 is still coming online and isn't even standby yet

      Now we have an nn1 which isn't fully loaded and an nn2 which is restarting and trying to figure out whether to be active or standby. It's during this gap that the tests must be failing.
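
The sequence above can be modelled in a few lines. This is only an illustrative sketch: `flawed_check`, the state constants, and the `states` dictionary are hypothetical names invented for this example and are not taken from the Ambari code or the attached patch.

```python
# Hypothetical model of the restart sequence described above; not Ambari's actual code.
UNREACHABLE, STANDBY, ACTIVE = "unreachable", "standby", "active"


def flawed_check(restarted, states):
    """Return True if the downgrade may continue after restarting `restarted`.

    Mirrors the behaviour described above: if the restarted NameNode cannot
    be reached, fall back to any other NameNode that reports active.
    """
    if states[restarted] in (ACTIVE, STANDBY):
        return True
    return any(s == ACTIVE for nn, s in states.items() if nn != restarted)


# Starting point: nn1 is standby, nn2 is active.
states = {"nn1": STANDBY, "nn2": ACTIVE}

# Step 1: restart nn1; it has not come back online yet.
states["nn1"] = UNREACHABLE
step1_ok = flawed_check("nn1", states)  # passes only because nn2 is active

# Step 2: the downgrade moves on and restarts nn2 while nn1 is still loading.
states["nn2"] = UNREACHABLE
serving = [nn for nn, s in states.items() if s == ACTIVE]

print(step1_ok, serving)
```

In this model the check after restarting nn1 passes, yet once nn2 is restarted no NameNode is left in the active state, which is the gap in which the Oozie job fails.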

      So, it seems like we need to be a little bit smarter about waiting for the namenode to restart; we can't just look at the "active" one and say things are OK because it might be the next one to restart.
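
One way to express "wait for the NameNode that was just restarted" is sketched below. It is an assumption-laden illustration, not the change made in AMBARI-18240.patch: `wait_for_restarted_namenode` and `get_state` are hypothetical names, and `get_state` stands in for whatever probe is used in practice (for example, a wrapper around `hdfs haadmin -getServiceState`).

```python
import time


def wait_for_restarted_namenode(get_state, name, timeout=300.0, interval=5.0,
                                sleep=time.sleep, clock=time.monotonic):
    """Block until the NameNode `name` itself reports active or standby.

    `get_state` is a hypothetical callable that returns the HA state of the
    named NameNode as a string, or raises if the NameNode cannot be reached.
    The state of any other NameNode is deliberately ignored.
    """
    deadline = clock() + timeout
    while True:
        try:
            state = get_state(name)
        except Exception:
            # Still starting up or unreachable: keep waiting.
            state = None
        if state in ("active", "standby"):
            return state
        if clock() >= deadline:
            raise RuntimeError("%s did not become active or standby in time" % name)
        sleep(interval)


# Demo with a fake probe: unreachable twice, then standby.
responses = iter([None, None, "standby"])


def fake_probe(name):
    state = next(responses)
    if state is None:
        raise ConnectionError(name + " is not reachable yet")
    return state


# The sleep function is replaced so that the demo does not actually wait.
result = wait_for_restarted_namenode(fake_probe, "nn1", interval=0,
                                     sleep=lambda _seconds: None)
print(result)
```

The point of the design is that the other NameNode being active is never accepted as a substitute, since it may be the next one to restart.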

      Attachments

        1. AMBARI-18240.patch
          32 kB
          Jonathan Hurley

            People

              Assignee: Jonathan Hurley
              Reporter: Jonathan Hurley