  1. Ignite
  2. IGNITE-9612

Improve checkpoint mark phase speed.

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.7
    • Component/s: None
    • Labels: None

    Description

      I'm regularly observing slow checkpoints caused by a long mark duration that is unrelated to the number of dirty pages:

      2018-09-01 14:55:20.408 [INFO ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=01e0c7bf-842f-4ed6-8589-b4904063434f, startPtr=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], checkpointLockWait=0ms, checkpointLockHoldTime=951ms, walCpRecordFsyncDuration=39ms, pages=78477, reason='timeout']
      2018-09-01 14:55:21.307 [INFO ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] Checkpoint finished [cpId=01e0c7bf-842f-4ed6-8589-b4904063434f, pages=78477, markPos=FileWALPointer [idx=19814, fileOff=948996096, len=5233457], walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=1002ms*, pagesWrite=478ms, fsync=421ms, total=1901ms]
      
      2018-09-01 14:58:20.355 [INFO ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, startPtr=FileWALPointer [idx=19814, fileOff=1000024208, len=5233457], checkpointLockWait=0ms, checkpointLockHoldTime=926ms, walCpRecordFsyncDuration=14ms, pages=10837, reason='timeout']
      2018-09-01 14:58:20.480 [INFO ][db-checkpoint-thread-#241%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.p.GridCacheDatabaseSharedManager] Checkpoint finished [cpId=09d1f4bc-d3f3-4a16-b291-89d7fa745ea5, pages=10837, markPos=FileWALPointer [idx=19814, fileOff=1000024208, len=5233457], walSegmentsCleared=0, walSegmentsCovered=[], *markDuration=943ms*, pagesWrite=64ms, fsync=61ms, total=1068ms]
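
In the two checkpoints above, the page count differs by roughly 7x (78477 vs 10837) while the mark duration differs by only about 6% (1002ms vs 943ms). To check this over a longer log, the two fields can be pulled out of each "Checkpoint finished" line. The following is a minimal sketch, not part of Ignite: the class name `CheckpointLogSketch` and the regular expression are assumptions based only on the log format shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for comparing pages vs. markDuration in checkpoint logs. */
final class CheckpointLogSketch {
    /** Matches the first pages=N and the markDuration=Nms fields of a "Checkpoint finished" line. */
    private static final Pattern FINISHED = Pattern.compile(
        "Checkpoint finished \\[.*?pages=(\\d+),.*?markDuration=(\\d+)ms");

    /**
     * Returns {pages, markDurationMs} for a "Checkpoint finished" line,
     * or null if the line does not match the assumed format.
     */
    static long[] parse(String line) {
        Matcher m = FINISHED.matcher(line);

        if (!m.find())
            return null;

        return new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2))};
    }
}
```

Feeding each log line to `CheckpointLogSketch.parse` and plotting the resulting pairs would show whether mark duration tracks the page count or stays roughly constant, as it does in the excerpt.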
      

      Debugging revealed that this is due to the large amount of work required to save metadata for metapages and free/reuse lists. Because this work is done under the checkpoint write lock, all other activities are blocked, which increases the latency of transactional and atomic operations.

      Simple solution: parallelize metadata processing during the mark phase.

      The best way to solve the problem is described in IGNITE-9520.
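
The simple solution (parallelizing metadata processing during the mark phase) could look roughly like the sketch below. This is not the actual Ignite change: `ParallelMarkSketch`, `markPhase`, and the list of `Runnable` metadata tasks are hypothetical names, and a plain `ReentrantReadWriteLock` stands in for the checkpoint lock. The idea is only that independent metadata-saving tasks run on a pool while the write lock is held, so the lock hold time approaches that of the slowest task rather than the sum of all tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical illustration of a parallel mark phase; not Ignite code. */
final class ParallelMarkSketch {
    /** Stand-in for the checkpoint lock (assumption). */
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();

    /**
     * Runs all metadata tasks in parallel while holding the write lock.
     * The tasks must not acquire the checkpoint lock themselves, or they would deadlock.
     *
     * @return Time the write lock was held, in milliseconds.
     */
    long markPhase(List<Runnable> metadataTasks, int threads) {
        // A pool per call keeps the sketch short; real code would reuse one.
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        checkpointLock.writeLock().lock();

        long start = System.nanoTime();

        try {
            List<Future<?>> futures = new ArrayList<>();

            for (Runnable task : metadataTasks)
                futures.add(pool.submit(task));

            // Wait for every task before releasing the lock; get() rethrows task failures.
            for (Future<?> fut : futures)
                fut.get();
        }
        catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Metadata processing failed", e);
        }
        finally {
            checkpointLock.writeLock().unlock();

            pool.shutdown();
        }

        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

The write lock is still held for the whole phase, so this only shortens the blocking window; it does not remove it.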

            People

              ascherbakov Alexey Scherbakov