Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
0.18
None
None
Description
Today at 12:16 am the EmailBasedMonitor just appears to have stopped working and died silently:
From Hipchat
@marlon I didn't see any errors in the gfac log; it's just that the last log messages from the EmailBasedMonitor, where it is processing emails, occur at 2017-09-20 00:16:49,447
Here are the last messages from the EmailBasedMonitor in the logs
2017-09-20 00:16:25,815 [Thread-5] ERROR o.a.a.g.m.e.EmailBasedMonitor - FROM: root <root@ncsa.illinois.edu>
2017-09-20 00:16:25,815 [Thread-5] ERROR o.a.a.g.m.e.EmailBasedMonitor - TO: gw77jobs@scigap.org
2017-09-20 00:16:25,815 [Thread-5] ERROR o.a.a.g.m.e.EmailBasedMonitor - SUBJECT: Non-zero exit code for job 3231343
2017-09-20 00:16:41,930 [Thread-5] INFO o.a.a.g.m.e.EmailBasedMonitor - [EJM]: 5 job/s in job monitor map
2017-09-20 00:16:42,167 [Thread-5] INFO o.a.a.g.m.e.EmailBasedMonitor - [EJM]: Retrieving unseen emails
2017-09-20 00:16:42,913 [Thread-5] INFO o.a.a.g.m.e.EmailBasedMonitor - [EJM]: 75 new email/s received
2017-09-20 00:16:49,447 [Thread-5] ERROR o.a.a.g.m.e.p.PBSEmailParser - [EJM]: No matched found for content => PBS Job Id: 48.torque-server Job Name: A746448754 Exec host: compute-1/0-3 An error has occurred processing your job, see below. Post job file processing error; job 48.torque-server on host compute-1Unknown resource type REJHOST=compute-1 MSG=Root cannot open home directory '/home/grid_user' specified, errno=2 (No such file or directory) -- Ignore if root squashing is enabled
2017-09-20 00:16:49,447 [Thread-5] INFO o.a.a.g.m.e.EmailBasedMonitor - Returned null for job id, message subject--> PBS JOB 48.torque-server
2017-09-20 00:16:49,447 [Thread-5] INFO o.a.a.g.m.e.EmailBasedMonitor - Returned null for job name, message subject --> PBS JOB 48.torque-server
If an error had been thrown, I think it would have been logged, since the EmailBasedMonitor thread catches and logs Throwable.
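For reference, the catch-and-log pattern described above can be sketched roughly as follows. This is a minimal illustration only, not the actual EmailBasedMonitor code: the class MonitorLoopSketch, the PollStep interface, and the runLoop method are hypothetical names, and a plain list stands in for the logger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a monitor loop; NOT the real EmailBasedMonitor.
public class MonitorLoopSketch {

    /** One polling step (e.g. fetch and parse emails); may throw anything. */
    public interface PollStep {
        void run() throws Exception;
    }

    /**
     * Runs the step up to maxIterations times. Anything thrown by the step is
     * caught and recorded (the list stands in for log.error), so a failure in
     * one iteration does not end the loop.
     */
    public static List<String> runLoop(PollStep step, int maxIterations) {
        List<String> logged = new ArrayList<>();
        for (int i = 0; i < maxIterations; i++) {
            try {
                step.run();
            } catch (Throwable t) {
                // Every thrown error or exception leaves a trace here.
                logged.add(t.getClass().getSimpleName() + ": " + t.getMessage());
            }
        }
        return logged;
    }

    public static void main(String[] args) {
        List<String> logged = runLoop(() -> {
            throw new IllegalStateException("parse failed");
        }, 3);
        System.out.println(logged);
    }
}
```

With a handler like this, anything thrown inside the loop shows up in the log. A step that blocks without returning or throwing would leave no entry at all, which would be consistent with the log simply ending.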