Uploaded image for project: 'Oozie'
  1. Oozie
  2. OOZIE-1940

StatusTransitService has race condition

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 4.2.0
    • None
    • None

    Description

      StatusTransitService doesn't acquire lock while updating DB.

      We noticed one such issue while doing HA testing, thanks to mchiang
      We issue a change command to change pause time, which got executed on one server. While change command was running on one server, other server started executing StatusTransitService.

      Server 1 log

      2014-07-16 17:28:05,268  INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Acquired lock for [org.apache.oozie.service.StatusTransitService]
      2014-07-16 17:28:09,694  INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Set coordinator job [0011385-140716042555-oozie-oozi-C] status to 'SUCCEEDED' from 'RUNNING' 
      2014-07-16 17:28:15,416  INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Released lock for [org.apache.oozie.service.StatusTransitService]
      

      Server 2 log

      2014-07-16 17:28:06,499 DEBUG CoordChangeXCommand:545 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] New pause/end date is : Wed Jul 16 17:30:00 UTC 2014 and last action number is : 3
      2014-07-16 17:28:06,508  INFO CoordChangeXCommand:539 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] ENDED CoordChangeXCommand for jobId=0011385-140716042555-oozie-oozi-C
      

      CoordMaterializeTransitionXCommand has created all actions( few were in waiting and few were in running state) and set doneMaterialization to true.

      Change command deletes all waiting coords, except 3 running/SUCCEEDED action and reset doneMaterialization.

      StatusTransitService first loads a set of pending jobs and for each job it make DB calls to check coord action status. Coord jobs are loaded only once in beginning.

      This is what happened.
      1.StatusTransitService loads the coord job which doneMaterialization is set to true at 17:28:05,268 (server 1)
      2.Change command deletes waiting cation and reset doneMaterialization at 17:28:06,508 (server 2)
      3.StatusTransitService load actions for job, only 3 and in SUCCEEDED status. It never reload the doneMaterialization at 17:28:09,694 (server 1)
      StatusTransitService overrides set job status to SUCCEEDED, bcz it's doneMaterialization and all action are SUCCEEDED.

      Attachments

        1. OOZIE-1940-V8.patch
          112 kB
          Purshotam Shah
        2. OOZIE-1940-V7.patch
          107 kB
          Purshotam Shah
        3. OOZIE-1940-V6.patch
          100 kB
          Purshotam Shah
        4. OOZIE-1940-V5.patch
          100 kB
          Purshotam Shah

        Issue Links

          Activity

            People

              puru Purshotam Shah
              puru Purshotam Shah
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: