  1. OODT (Retired)
  2. OODT-692

Use lsof to stop Workflow/Resource Manager task/job PIDs

Details

    • Expert (Hard) - Guru knowledge of this project could be required

    Description

      We can combine lsof, the JobDir, and the WorkflowInstanceId to find and kill the actual process ID, fully stopping a job kicked off by the Resource Manager and Workflow Manager. I've been testing this by hand on the ASO process, and it is entirely usable in practice, so we should automate it. For example:

      [snowdeploy@trango-private bin]$ lsof -p 37558
      COMMAND   PID       USER   FD   TYPE   DEVICE SIZE/OFF      NODE NAME
      idl     37558 snowdeploy  cwd    DIR    253,2     4096 488284165 /data/jobs/CASI/ISSP/20140511f1_184151_1399903013836
      ..
      

      This reveals to us that process ID 37558 (one of the IDL jobs running in ASO for the ORTHO process) has its current working directory (the cwd entry) set to the JobDir

      /data/jobs/CASI/ISSP/20140511f1_184151_1399903013836
      

      We can also find out from the WorkflowInstanceMetadata that the JobDir for line 184151 belongs to Workflow Instance Id 726af17c-c131-4682-845e-4ef6b4a7eeee.

      So, from a Workflow Instance Id, we need:

      1. the JobDir as resolved by CAS-PGE. If it's not a CAS-PGE job, the WorkflowTask needs to specify a JobDir; otherwise this functionality will simply print a message saying "Kill without JobDir not supported".
      2. a map of process names to interrogate with lsof, e.g., PCS_JobKillProcessName.
      3. the use of lsof to interrogate the PID table, find the process whose working directory is the job's JobDir, and then kill it. If PCS_JobKillProcessName is not specified, interrogate all processes to determine which one to kill.
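
The three steps above could be automated roughly as follows. This is only a minimal sketch in Python, not the OODT implementation: the function names (pids_with_cwd, find_job_pids, kill_job) are hypothetical, the JobDir is assumed to have already been resolved from the Workflow Instance Id, and process_name stands in for the value of PCS_JobKillProcessName. It relies on lsof's field output (-F), restricted to cwd entries with -d cwd and optionally to a command name with -c.

```python
import os
import signal
import subprocess


def pids_with_cwd(lsof_field_output, job_dir):
    """Return PIDs whose current working directory equals job_dir.

    Expects text from `lsof -a -d cwd -Fpn`: a `p<pid>` line starts each
    process and an `n<path>` line carries the directory name. Other field
    lines (such as `fcwd`) are ignored.
    """
    target = os.path.normpath(job_dir)
    matches = []
    pid = None
    for line in lsof_field_output.splitlines():
        if line.startswith("p"):
            pid = int(line[1:])
        elif line.startswith("n") and pid is not None:
            if os.path.normpath(line[1:]) == target:
                matches.append(pid)
    return matches


def find_job_pids(job_dir, process_name=None):
    """Run lsof and return the PIDs working inside job_dir.

    process_name is a hypothetical stand-in for PCS_JobKillProcessName;
    when it is None, all processes visible to lsof are interrogated.
    """
    cmd = ["lsof", "-a", "-d", "cwd", "-Fpn"]
    if process_name:
        cmd += ["-c", process_name]
    # lsof exits non-zero when nothing matches, so do not raise on that.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return pids_with_cwd(result.stdout, job_dir)


def kill_job(job_dir, process_name=None, sig=signal.SIGTERM):
    """Signal every process whose cwd is job_dir; return the PIDs signalled."""
    if not job_dir:
        print("Kill without JobDir not supported")
        return []
    pids = find_job_pids(job_dir, process_name)
    for pid in pids:
        os.kill(pid, sig)
    return pids
```

For the example above, kill_job("/data/jobs/CASI/ISSP/20140511f1_184151_1399903013836", "idl") would look only at idl processes. Matching on the exact working directory is a design assumption of this sketch; child processes that change directory would not be found this way.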

          People

            chrismattmann Chris A. Mattmann