  1. Oozie
  2. OOZIE-2509

SLA job status can get stuck in running state


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3.0
    • Component/s: None
    • Labels: None

    Description

      There are a few places where the job status is not updated properly.
      1. Events are received out of order.

      For example, suppose "oozie.service.EventHandlerService.batch.size" is set to 50 and "oozie.service.EventHandlerService.worker.threads" is set to 15. This means that 15 threads process events in batches of 50.

      As a result, the 51st event can be processed before the 49th event.

      If the 49th event is a job-started event and the 51st is a job-completed event, the job status is overwritten with running.
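
      The ordering problem above could be guarded against along these lines. This is only a minimal sketch under stated assumptions: SlaStatusGuard, its JobStatus enum, and the apply method are hypothetical names invented for illustration, not Oozie classes and not necessarily what the attached patches do. The idea is that once a terminal status has been recorded, a late "started" event is ignored.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: not Oozie's actual classes or the committed patch.
final class SlaStatusGuard {

    // Simplified job states, declared in lifecycle order (illustration only).
    enum JobStatus { NOT_STARTED, RUNNING, SUCCEEDED, FAILED, KILLED }

    private static final Set<JobStatus> TERMINAL =
            EnumSet.of(JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.KILLED);

    private JobStatus current = JobStatus.NOT_STARTED;

    // Applies the status carried by an incoming event and returns true if the
    // stored status changed. Synchronized because several worker threads may
    // deliver events for the same job concurrently.
    synchronized boolean apply(JobStatus incoming) {
        if (TERMINAL.contains(current)) {
            // A late "started" event must not override a completed job.
            return false;
        }
        if (incoming.ordinal() <= current.ordinal()) {
            // Duplicate or older event: keep the current status.
            return false;
        }
        current = incoming;
        return true;
    }

    synchronized JobStatus current() {
        return current;
    }
}
```

      With this guard, processing the job-completed event first and the job-started event second leaves the status at the completed value instead of running.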

      2.

      case COORDINATOR_ACTION:
          CoordinatorActionBean ca = jpaService.execute(new CoordActionGetForSLAJPAExecutor(slaCalc.getId()));
          if (ca.isTerminalWithFailure()) {
              isEndMiss = ended = true;
              slaCalc.setActualEnd(ca.getLastModifiedTime());
          }
          if (ca.getExternalId() != null) {
              wf = jpaService.execute(new WorkflowJobGetForSLAJPAExecutor(ca.getExternalId()));
              if (wf.getEndTime() != null) {
                  ended = true;
                  if (wf.getEndTime().getTime() > slaCalc.getExpectedEnd().getTime()) {
                      isEndMiss = true;
                  }
              }
              slaCalc.setActualEnd(wf.getEndTime());
              slaCalc.setActualStart(wf.getStartTime());
          }
      

      Oozie checks the wf status and updates the SLA status with the coord job status.
      There can be a case where the coord is still running, but the wf has ended.
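
      One possible way to keep the two views consistent is sketched below. It assumes simplified stand-in types: CoordSlaEndCheck and Snapshot are hypothetical and are not Oozie beans, and this is not necessarily how the issue was fixed. The sketch treats the SLA job as ended only once the coordinator action itself is terminal, so a workflow that has ended while the coordinator action still reports running is re-checked later instead of being closed with a running status.

```java
import java.util.Date;

// Hypothetical sketch with stand-in types: not Oozie's beans or the committed fix.
final class CoordSlaEndCheck {

    // Minimal view of a coordinator action and its workflow for this example.
    static final class Snapshot {
        final boolean coordActionTerminal; // coordinator action reached a final state
        final Date workflowEndTime;        // null while the workflow has no end time

        Snapshot(boolean coordActionTerminal, Date workflowEndTime) {
            this.coordActionTerminal = coordActionTerminal;
            this.workflowEndTime = workflowEndTime;
        }
    }

    // The job counts as ended only when the coordinator action is terminal.
    // If only the workflow has ended, the caller is expected to retry later.
    static boolean hasEnded(Snapshot s) {
        return s.coordActionTerminal;
    }

    // An end miss is reported only for an ended job whose workflow finished
    // after the expected end time.
    static boolean isEndMiss(Snapshot s, Date expectedEnd) {
        return hasEnded(s)
                && s.workflowEndTime != null
                && s.workflowEndTime.after(expectedEnd);
    }
}
```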

      3. HistoryPurgeWorker updates the end time but doesn't update the status.

      4. There are a few other locking issues.

      Attachments

        1. OOZIE-2509-V8.patch
          185 kB
          Purshotam Shah
        2. OOZIE-2509-V7.patch
          185 kB
          Purshotam Shah
        3. OOZIE-2509-V6.patch
          169 kB
          Purshotam Shah
        4. OOZIE-2509-V5.patch
          169 kB
          Purshotam Shah
        5. OOZIE-2509-V4.patch
          168 kB
          Purshotam Shah
        6. OOZIE-2509-V3.patch
          165 kB
          Purshotam Shah
        7. OOZIE-2509-V2.patch
          163 kB
          Purshotam Shah
        8. OOZIE-2509-V1.patch
          120 kB
          Purshotam Shah

            People

              Assignee: Purshotam Shah
              Reporter: Purshotam Shah