OFBiz / OFBIZ-3867

JobManager.poll() enters an endless loop when it can't get a connection

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      JobManager.poll(), at line 157 where it calls storeByCondition, can fail when no connection is available from the database (due to a connection leak, or just load, or whatever). storeByCondition then throws an exception (deep inside ofbiz/commons-dbcp/postgres). The catch(Throwable) logs the error, and the loop tries again. Since pollDone never gets set to true, this loop is very tight, and the log file starts to fill up very fast as each thread of JobPoller tries the same thing over and over.
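A minimal sketch of the kind of bounded exponential backoff that would keep this retry loop from spinning (hypothetical helper, not the actual OFBiz code; the class and method names here are invented for illustration):

```java
// Hypothetical backoff helper for the poll loop: doubles the wait after
// each consecutive failure, caps it at a maximum, and resets on success.
public class PollBackoff {
    private final long baseMillis;
    private final long maxMillis;
    private int failures = 0;

    public PollBackoff(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    // Called after a failed storeByCondition; returns how long to sleep
    // before the next attempt.
    public long nextDelayMillis() {
        long delay = baseMillis << Math.min(failures, 20); // avoid overflow
        failures++;
        return Math.min(delay, maxMillis);
    }

    // Called after a successful poll.
    public void reset() {
        failures = 0;
    }
}
```

With a 1-second base and a 60-second cap, a dead database would produce one log entry per minute in the steady state instead of thousands per second.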

      I'm filing this bug mainly to see if anyone else works on it, but if not, it's a reminder for me.

        Activity

        Jacques Le Roux added a comment -

        Yes this should certainly be taken into consideration. BTW 200 tenants is impressive!

        Adam Heath added a comment -

        Bob brought up interesting points about when multiple tenants are active. Each tenant has its own pool of threads to do job management, so with lots of tenants you get lots of threads, and if each tenant also has lots of jobs, the system will go into overload (both memory for each job, and thread contention).

        Jacques Le Roux added a comment -

        Looks like Adrian's suggestion

        Adam Heath added a comment -

        Being offline for an extended period, then recovering, could be handled with a flag that says whether the job can catch up on its own, or whether it needs to be fired once for each missed period, with the date of the period then being passed in.

        Bob Morley added a comment -

        Naturally we are willing to share our code if you would like. Our trunk build currently runs on a single application server with about 200 tenants. We have a few enhancements in the hopper on this; as we bring application services online, we need to properly distribute the jobs across the managers, for example. We have had a number of issues with the job manager. One that we want to tackle is that, because the "job template" is a job_sandbox record like any other, it eventually gets purged. For example, say you are running a job with an "HOURLY" temporal expression and you then decide you want it to run "NIGHTLY". There is no clean way to do this other than updating every job_sandbox record that is either RUNNING or PENDING.

        Another thought is that we moved the rescheduling of the next instance of a recurring job to the end of the run (rather than when a job is init'd), and we made sure to schedule it in the future to avoid overloading. For example, if a database is offline for a week and then comes back online, you really do not want "sendEmail" to be scheduled for every 5-minute slot in that week.
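The "schedule it in the future" point above can be sketched as a small helper (hypothetical; the class and method names are invented, and the real implementation would work against job_sandbox fields rather than raw millis):

```java
// Hypothetical next-run calculator: instead of replaying every missed
// interval after downtime, jump the next run time strictly past "now",
// while keeping the job's original phase.
public class NextRun {
    // lastRun and now in epoch millis; intervalMillis is the recurrence
    // period. Returns the first scheduled time strictly after now.
    public static long after(long lastRun, long intervalMillis, long now) {
        if (now < lastRun) {
            // Already scheduled in the future; just step once.
            return lastRun + intervalMillis;
        }
        long missed = (now - lastRun) / intervalMillis;
        return lastRun + (missed + 1) * intervalMillis;
    }
}
```

After a week offline, a 5-minute job gets exactly one next run a few minutes out, rather than roughly 2,000 catch-up executions.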

        Adam Heath added a comment -

        Hmm, interesting points. Having a single, global pool of available worker threads makes sense. If you have 10 tenants, each with a 4-thread pool, then you could have 40 threads all trying to do work. That sucks.

        But, if you have a single global pool, with 4 threads, it's possible for a single tenant to overload the global pool with work, and starve the other tenants. So, some kind of PriorityQueue needs to be used, with appropriate configuration knobs.

        I think the jobs should not be stored in the main database. That makes database backups more difficult.

        But, maybe have a separate feature for wanting to run a job on all tenants.

        There are several ideas floating here, each one would need to be done separately.
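One way to sketch the starvation-avoidance idea from this comment (a hypothetical round-robin dispatcher, not OFBiz code; a weighted PriorityQueue with configuration knobs would be the fuller version):

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical fair dispatcher: one queue per tenant, served round-robin,
// so a tenant with thousands of pending jobs cannot starve the others.
public class TenantFairQueue {
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();

    public void submit(String tenant, String jobId) {
        queues.computeIfAbsent(tenant, t -> new ArrayDeque<>()).add(jobId);
    }

    // Takes one job from the first non-empty tenant queue, then moves that
    // tenant to the back of the rotation. Returns null when nothing is queued.
    public String next() {
        for (Map.Entry<String, Queue<String>> e : queues.entrySet()) {
            if (!e.getValue().isEmpty()) {
                String job = e.getValue().poll();
                // Rotate: re-insert this tenant at the end of the
                // LinkedHashMap's iteration order (safe: we return
                // immediately, so the iterator is not touched again).
                Queue<String> q = queues.remove(e.getKey());
                queues.put(e.getKey(), q);
                return job;
            }
        }
        return null;
    }
}
```

With two jobs from tenant "a" and one from tenant "b", dispatch interleaves a, b, a instead of draining tenant "a" first.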

        Bob Morley added a comment -

        We have made a number of changes to the JobScheduler to properly work with multi-tenancy. In this spot we created a list of the databases that were down, and when polling for jobs we would exclude those databases. We then had a separate polling period (default 5 minutes) that would check the offline databases to see if they had gone back online.

        This might not match what you are trying to do exactly because we have a technique of storing all persisted jobs in our "main" database which has a "delegatorName" column (which represents the tenant). Jobs that are destined to run for all tenants would be "exploded" into a job per tenant (targeted for it). This allows a "sendEmail" job (for example) to execute on all tenant databases that are online, and safely skip non-online tenants until they go back online. This also creates a singleton jobManager so you do not have one running for each tenant ...

        Anyway those are my thoughts on it

        Adrian Crum added a comment -

        One idea off the top of my head and without looking at the code would be to give the JobManager a state. It could enter a suspended state and then try switching to an active state from time to time. State change log entries would be informational, not warnings.

        Having a hook where outside events could monitor/trigger state changes could be useful. A process monitoring the request load could suspend the JobManager during peak traffic times.
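Adrian's state idea could look roughly like this (a minimal hypothetical sketch; names, states, and the probe hook are all invented for illustration):

```java
// Hypothetical JobManager state holder: the manager suspends itself when
// the database is unreachable and periodically probes before going active
// again. Outside monitors (e.g. a load watcher) can call suspend() too.
public class ManagerState {
    public enum State { RUNNING, SUSPENDED }

    private State state = State.RUNNING;

    public State getState() { return state; }

    // Called when the database becomes unreachable or load is too high.
    // A state-change log entry here would be informational, not a warning.
    public void suspend() {
        state = State.SUSPENDED;
    }

    // Called periodically; the probe (e.g. a cheap test query) decides
    // whether it is safe to go active again. Returns true when running.
    public boolean tryResume(java.util.function.BooleanSupplier probe) {
        if (state == State.SUSPENDED && probe.getAsBoolean()) {
            state = State.RUNNING;
        }
        return state == State.RUNNING;
    }
}
```

The poll loop would then skip its storeByCondition work entirely while suspended, which addresses the log-flood failure mode in the description.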

        Adam Heath added a comment -

        The question here is what to do. Setting pollDone is possible, which would cause the loop to exit with zero items. The outer poller would then retry after the period expires, logging another exception each time. So the disk can still fill up, but at a much slower rate.

        Should the poller be shut down in this case? Maybe go to a single-threaded version, with a longer poll period?

        Adrian Crum added a comment -

        I have experienced this problem and it is a serious issue. Our OFBiz server lost its connection to the database server one weekend, and by Monday the OFBiz server's hard disk had filled to capacity with JobManager log entries - shutting everything down.


          People

          • Assignee: Adam Heath
          • Reporter: Adam Heath
          • Votes: 0
          • Watchers: 0
