Apache Gobblin / GOBBLIN-336

Gobblin Cluster Job Isolation


Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved

    Description

      A Gobblin cluster runs Gobblin jobs. Each cluster worker host runs its jobs in a thread pool inside a single JVM, and the pool is reused for subsequent jobs once earlier ones finish.

      Gobblin cluster recently ran into issues with resource leakage. The cluster would fail all job executions when certain resources, such as threads, were exhausted. To recover, the whole cluster had to be restarted and the jobs retried. As the number of jobs executed grows, such errors will happen more frequently. We have identified the causes, and the fixes have been verified. However, there is concern that similar, as yet unknown bugs may surface later and bring the whole cluster down.

      In general, any bug in one job’s code may affect the execution of another job, since they run in the same JVM. It is also possible that a bug is triggered only by certain input data that is specific to a subset of jobs.
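
      As a rough illustration of the failure mode described above (not Gobblin's actual code), the following sketch shows how one job that pins threads in a shared pool can stall an unrelated job in the same JVM. The class name, method name, pool sizes, and timeouts are all hypothetical and chosen only for the demonstration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; it does not use or model any Gobblin API.
public class SharedPoolStarvationDemo {

    /**
     * Returns true if a healthy job finishes within timeoutMillis on a shared
     * pool of poolSize threads after a buggy job has pinned leakedTasks threads.
     */
    static boolean healthyJobCompletes(int poolSize, int leakedTasks, long timeoutMillis)
            throws Exception {
        ExecutorService sharedPool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch neverReleased = new CountDownLatch(1);
        try {
            // Buggy job: each task blocks indefinitely, so its worker thread
            // is never handed back to the pool (a simulated thread leak).
            for (int i = 0; i < leakedTasks; i++) {
                sharedPool.submit(() -> {
                    neverReleased.await();
                    return null;
                });
            }
            // Healthy job: trivial work, but it needs a free worker thread.
            Future<String> healthy = sharedPool.submit(() -> "done");
            try {
                healthy.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // no free thread: the healthy job is starved
            }
        } finally {
            // Interrupt the blocked tasks so the demo JVM can exit.
            sharedPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("all threads leaked, healthy job completes: "
                + healthyJobCompletes(2, 2, 200));
        System.out.println("one thread free, healthy job completes: "
                + healthyJobCompletes(2, 1, 2000));
    }
}
```

      In this sketch the healthy job itself is correct; it fails only because it shares the pool with the leaking job, which is the kind of cross-job interference that better isolation is meant to prevent.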

      The cluster will be more robust if each job execution is better isolated from the others.

      In the future, we expect jobs to become more diverse as more use cases are onboarded, so the need for job isolation will grow over time.
      Job isolation may also be required for security reasons.

          People

            Assignee: Unassigned
            Reporter: Ray Yang (HappyRay)
            Votes: 0
            Watchers: 4
