Uploaded image for project: 'Aurora'
  1. Aurora
  2. AURORA-1401

DB deadlock caused stuck job update

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.9.0
    • Scheduler
    • None
    • Twitter Aurora Q2'15 Sprint 7
    • 2

    Description

      We observed the symptom of a job update that was stuck in ROLLING_FORWARD, but was not active. Looking in scheduler logs revealed a DB-level deadlock:

      E0715 19:04:53.085 THREAD1471 org.apache.aurora.scheduler.updater.JobUpdateEventSubscriber.startUp: Failed to resume job updates: org.apache.aurora.scheduler.storage.Storage$StorageException: 
      ### Error querying database.  Cause: org.h2.jdbc.JdbcSQLException: Deadlock detected. The current transaction was rolled back. Details: "
      Session #3 (user: ) on thread RowGarbageCollector RUNNING is waiting to lock PUBLIC.LOCKS (shared) while locking PUBLIC.JOB_KEYS (exclusive).
      Session #2 (user: ) on thread JobUpdateEventSubscriber STARTING is waiting to lock PUBLIC.JOB_KEYS (shared) while locking PUBLIC.JOB_INSTANCE_UPDATE_EVENTS (exclusive), PUBLIC.JOB_UPDATE_EVENTS (exclusive), PUBLIC.LOCKS (exclusive), PUBLIC.JOB_UPDATE_LOCKS (exclusive)."; SQL statement:
      SELECT
            lock_token
          FROM job_update_locks AS l
          INNER JOIN job_updates u ON l.update_row_id = u.id
           
          INNER JOIN job_keys AS j ON j.id = u.job_key_id
         
          WHERE  
          u.update_id = ?
          AND j.role = ?
          AND j.environment = ?
          AND j.name = ? [40001-187]
      ### The error may exist in org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml
      ### The error may involve defaultParameterMap
      ### The error occurred while setting parameters
      ### SQL: SELECT       lock_token     FROM job_update_locks AS l     INNER JOIN job_updates u ON l.update_row_id = u.id           INNER JOIN job_keys AS j ON j.id = u.job_key_id         WHERE       u.update_id = ?     AND j.role = ?     AND j.environment = ?     AND j.name = ?
      ### Cause: org.h2.jdbc.JdbcSQLException: Deadlock detected. The current transaction was rolled back. Details: "
      Session #3 (user: ) on thread RowGarbageCollector RUNNING is waiting to lock PUBLIC.LOCKS (shared) while locking PUBLIC.JOB_KEYS (exclusive).
      Session #2 (user: ) on thread JobUpdateEventSubscriber STARTING is waiting to lock PUBLIC.JOB_KEYS (shared) while locking PUBLIC.JOB_INSTANCE_UPDATE_EVENTS (exclusive), PUBLIC.JOB_UPDATE_EVENTS (exclusive), PUBLIC.LOCKS (exclusive), PUBLIC.JOB_UPDATE_LOCKS (exclusive)."; SQL statement:
      SELECT
            lock_token
          FROM job_update_locks AS l
          INNER JOIN job_updates u ON l.update_row_id = u.id
           
          INNER JOIN job_keys AS j ON j.id = u.job_key_id
         
          WHERE  
          u.update_id = ?
          AND j.role = ?
          AND j.environment = ?
          AND j.name = ? [40001-187]
      org.apache.aurora.scheduler.storage.Storage$StorageException: 
      ### Error querying database.  Cause: org.h2.jdbc.JdbcSQLException: Deadlock detected. The current transaction was rolled back. Details: "
      Session #3 (user: ) on thread RowGarbageCollector RUNNING is waiting to lock PUBLIC.LOCKS (shared) while locking PUBLIC.JOB_KEYS (exclusive).
      Session #2 (user: ) on thread JobUpdateEventSubscriber STARTING is waiting to lock PUBLIC.JOB_KEYS (shared) while locking PUBLIC.JOB_INSTANCE_UPDATE_EVENTS (exclusive), PUBLIC.JOB_UPDATE_EVENTS (exclusive), PUBLIC.LOCKS (exclusive), PUBLIC.JOB_UPDATE_LOCKS (exclusive)."; SQL statement:
      SELECT
            lock_token
          FROM job_update_locks AS l
          INNER JOIN job_updates u ON l.update_row_id = u.id
           
          INNER JOIN job_keys AS j ON j.id = u.job_key_id
         
          WHERE  
          u.update_id = ?
          AND j.role = ?
          AND j.environment = ?
          AND j.name = ? [40001-187]
      ### The error may exist in org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml
      ### The error may involve defaultParameterMap
      ### The error occurred while setting parameters
      ### SQL: SELECT       lock_token     FROM job_update_locks AS l     INNER JOIN job_updates u ON l.update_row_id = u.id           INNER JOIN job_keys AS j ON j.id = u.job_key_id         WHERE       u.update_id = ?     AND j.role = ?     AND j.environment = ?     AND j.name = ?
      ### Cause: org.h2.jdbc.JdbcSQLException: Deadlock detected. The current transaction was rolled back. Details: "
      Session #3 (user: ) on thread RowGarbageCollector RUNNING is waiting to lock PUBLIC.LOCKS (shared) while locking PUBLIC.JOB_KEYS (exclusive).
      Session #2 (user: ) on thread JobUpdateEventSubscriber STARTING is waiting to lock PUBLIC.JOB_KEYS (shared) while locking PUBLIC.JOB_INSTANCE_UPDATE_EVENTS (exclusive), PUBLIC.JOB_UPDATE_EVENTS (exclusive), PUBLIC.LOCKS (exclusive), PUBLIC.JOB_UPDATE_LOCKS (exclusive)."; SQL statement:
      SELECT
            lock_token
          FROM job_update_locks AS l
          INNER JOIN job_updates u ON l.update_row_id = u.id
           
          INNER JOIN job_keys AS j ON j.id = u.job_key_id
         
          WHERE  
          u.update_id = ?
          AND j.role = ?
          AND j.environment = ?
          AND j.name = ? [40001-187]
              at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:145)
              at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
              at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
              at org.apache.aurora.scheduler.storage.log.LogStorage.doInTransaction(LogStorage.java:602)
              at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:635)
              at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:130)
              at org.apache.aurora.scheduler.updater.JobUpdateControllerImpl.systemResume(JobUpdateControllerImpl.java:263)
              at org.apache.aurora.scheduler.updater.JobUpdateEventSubscriber.startUp(JobUpdateEventSubscriber.java:85)
              at com.google.common.util.concurrent.AbstractIdleService$2$1.run(AbstractIdleService.java:54)
              at com.google.common.util.concurrent.Callables$3.run(Callables.java:93)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.ibatis.exceptions.PersistenceException: 
      ### Error querying database.  Cause: org.h2.jdbc.JdbcSQLException: Deadlock detected. The current transaction was rolled back. Details: "
      Session #3 (user: ) on thread RowGarbageCollector RUNNING is waiting to lock PUBLIC.LOCKS (shared) while locking PUBLIC.JOB_KEYS (exclusive).
      Session #2 (user: ) on thread JobUpdateEventSubscriber STARTING is waiting to lock PUBLIC.JOB_KEYS (shared) while locking PUBLIC.JOB_INSTANCE_UPDATE_EVENTS (exclusive), PUBLIC.JOB_UPDATE_EVENTS (exclusive), PUBLIC.LOCKS (exclusive), PUBLIC.JOB_UPDATE_LOCKS (exclusive)."; SQL statement:
      SELECT
            lock_token
          FROM job_update_locks AS l
          INNER JOIN job_updates u ON l.update_row_id = u.id
           
          INNER JOIN job_keys AS j ON j.id = u.job_key_id
         
          WHERE  
          u.update_id = ?
          AND j.role = ?
          AND j.environment = ?
          AND j.name = ? [40001-187]
      ### The error may exist in org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml
      ### The error may involve defaultParameterMap
      ### The error occurred while setting parameters
      ### SQL: SELECT       lock_token     FROM job_update_locks AS l     INNER JOIN job_updates u ON l.update_row_id = u.id           INNER JOIN job_keys AS j ON j.id = u.job_key_id         WHERE       u.update_id = ?     AND j.role = ?     AND j.environment = ?     AND j.name = ?
      ### Cause: org.h2.jdbc.JdbcSQLException: Deadlock detected. The current transaction was rolled back. Details: "
      Session #3 (user: ) on thread RowGarbageCollector RUNNING is waiting to lock PUBLIC.LOCKS (shared) while locking PUBLIC.JOB_KEYS (exclusive).
      Session #2 (user: ) on thread JobUpdateEventSubscriber STARTING is waiting to lock PUBLIC.JOB_KEYS (shared) while locking PUBLIC.JOB_INSTANCE_UPDATE_EVENTS (exclusive), PUBLIC.JOB_UPDATE_EVENTS (exclusive), PUBLIC.LOCKS (exclusive), PUBLIC.JOB_UPDATE_LOCKS (exclusive)."; SQL statement:
      SELECT
            lock_token
          FROM job_update_locks AS l
          INNER JOIN job_updates u ON l.update_row_id = u.id
           
          INNER JOIN job_keys AS j ON j.id = u.job_key_id
         
          WHERE  
          u.update_id = ?
          AND j.role = ?
          AND j.environment = ?
          AND j.name = ? [40001-187]
              at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
              at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:122)
              at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:113)
              at org.apache.ibatis.session.defaults.DefaultSqlSession.selectOne(DefaultSqlSession.java:73)
              at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:497)
              at org.apache.ibatis.session.SqlSessionManager$SqlSessionInterceptor.invoke(SqlSessionManager.java:334)
              at com.sun.proxy.$Proxy57.selectOne(Unknown Source)
              at org.apache.ibatis.session.SqlSessionManager.selectOne(SqlSessionManager.java:165)
              at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:69)
              at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:53)
              at org.apache.aurora.scheduler.storage.db.$Proxy64.selectLockToken(Unknown Source)
              at org.apache.aurora.scheduler.storage.db.DbJobUpdateStore.getLockToken(DbJobUpdateStore.java:257)
              at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
              at org.apache.aurora.scheduler.storage.ForwardingStore.getLockToken(ForwardingStore.java:178)
              at org.apache.aurora.scheduler.updater.JobUpdateControllerImpl.changeJobUpdateStatus(JobUpdateControllerImpl.java:454)
              at org.apache.aurora.scheduler.updater.JobUpdateControllerImpl.access$1000(JobUpdateControllerImpl.java:108)
              at org.apache.aurora.scheduler.updater.JobUpdateControllerImpl$4.execute(JobUpdateControllerImpl.java:283)
              at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:131)
              at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:148)
              at org.apache.aurora.scheduler.storage.log.LogStorage$24.apply(LogStorage.java:605)
              at org.apache.aurora.scheduler.storage.log.LogStorage$24.apply(LogStorage.java:602)
              at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:143)
              ... 10 more
      Caused by: org.h2.jdbc.JdbcSQLException: Deadlock detected. The current transaction was rolled back. Details: "
      Session #3 (user: ) on thread RowGarbageCollector RUNNING is waiting to lock PUBLIC.LOCKS (shared) while locking PUBLIC.JOB_KEYS (exclusive).
      Session #2 (user: ) on thread JobUpdateEventSubscriber STARTING is waiting to lock PUBLIC.JOB_KEYS (shared) while locking PUBLIC.JOB_INSTANCE_UPDATE_EVENTS (exclusive), PUBLIC.JOB_UPDATE_EVENTS (exclusive), PUBLIC.LOCKS (exclusive), PUBLIC.JOB_UPDATE_LOCKS (exclusive)."; SQL statement:
      SELECT
            lock_token
          FROM job_update_locks AS l
          INNER JOIN job_updates u ON l.update_row_id = u.id
           
          INNER JOIN job_keys AS j ON j.id = u.job_key_id
         
          WHERE  
          u.update_id = ?
          AND j.role = ?
          AND j.environment = ?
          AND j.name = ? [40001-187]
              at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
              at org.h2.message.DbException.get(DbException.java:179)
              at org.h2.message.DbException.get(DbException.java:155)
              at org.h2.table.RegularTable.doLock1(RegularTable.java:496)
              at org.h2.table.RegularTable.lock(RegularTable.java:472)
              at org.h2.table.TableFilter.lock(TableFilter.java:146)
              at org.h2.command.dml.Select.queryWithoutCache(Select.java:671)
              at org.h2.command.dml.Query.query(Query.java:322)
              at org.h2.command.dml.Query.query(Query.java:290)
              at org.h2.command.dml.Query.query(Query.java:36)
              at org.h2.command.CommandContainer.query(CommandContainer.java:90)
              at org.h2.command.Command.executeQuery(Command.java:197)
              at org.h2.jdbc.JdbcPreparedStatement.execute(JdbcPreparedStatement.java:192)
              at org.apache.ibatis.executor.statement.PreparedStatementHandler.query(PreparedStatementHandler.java:62)
              at org.apache.ibatis.executor.statement.RoutingStatementHandler.query(RoutingStatementHandler.java:78)
              at org.apache.ibatis.executor.SimpleExecutor.doQuery(SimpleExecutor.java:62)
              at org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:303)
              at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:154)
              at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:102)
              at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:82)
              at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:120)
              ... 32 more
      

      Attachments

        Issue Links

          Activity

            People

              wfarner Bill Farner
              wfarner Bill Farner
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: