  1. Ignite
  2. IGNITE-15818

[Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and re-implementation

Details

    Description

      Goal

      Port and refactor core classes implementing page-based persistent store in Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager, PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.

      New checkpoint implementation to avoid excessive logging.

      Store lifecycle clarification to avoid complicated and invasive code of custom lifecycle managed mostly by DatabaseSharedManager.

      Items to pay attention to

      New checkpoint implementation based on split-file storage, new page index structure to maintain disk-memory page mapping.

      File page store implementation should be extracted from GridCacheOffheapManager into a separate entity. The target implementation should support the new version of checkpoint (split-file store, which enables an always-consistent store and eliminates the binary recovery phase).

      Support of big pages (256+ kB).

      Support of throttling algorithms.

      References

      New checkpoint design overview is available here

      Thoughts

      Although there is a technical opportunity to have independent checkpoints for different data regions, managing them could be a nightmare and it's definitely in the realm of optimizations and out of scope right now.

      So, let's assume that there's one good old checkpoint process. There's still a requirement to have checkpoint markers, but they will not have a reference to WAL, because there's no WAL. Instead, we will have to store RAFT log revision per partition. Or not, I'm not that familiar with a recovery procedure that's currently in development.

      Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new version will have DO and UNDO. This drastically simplifies both the checkpoint itself and node recovery, but it complicates data access.

      There will be two processes that share the storage resource: "checkpointer" and "compactor". Let's examine what the compactor should or shouldn't do:

      • it should not work in parallel with checkpointer, except for cases when there are too many layers (more on that later)
      • it should merge later checkpoint delta files into main partition files
      • it should delete checkpoint markers once all merges are completed for it, thus markers are decoupled from RAFT log

      About "cases when there are too many layers": too many layers could compromise reading speed, so the number of layers should not increase uncontrollably. When a threshold is exceeded, the compactor should start working no matter what. If anything, the write load can be throttled; reading matters more.
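
      The scheduling rule above can be sketched as a small decision function. This is only an illustration: the class name, the method names and the `maxLayers` threshold are hypothetical, and the issue does not specify a threshold value or how throttling is wired in.

```java
/** Hypothetical sketch of when the compactor is allowed to run. */
final class CompactorPolicy {
    /** Assumed threshold; the issue does not name a concrete value. */
    private final int maxLayers;

    CompactorPolicy(int maxLayers) {
        if (maxLayers < 1)
            throw new IllegalArgumentException("maxLayers must be positive");

        this.maxLayers = maxLayers;
    }

    /** Returns true if the compactor should merge delta files right now. */
    boolean shouldCompact(boolean checkpointInProgress, int layerCount) {
        if (layerCount == 0)
            return false; // Nothing to merge.

        if (layerCount > maxLayers)
            return true; // Too many layers: run even in parallel with the checkpointer.

        return !checkpointInProgress; // Otherwise wait for the checkpointer to finish.
    }

    /** Writes are throttled only while the layer count exceeds the threshold. */
    boolean shouldThrottleWrites(int layerCount) {
        return layerCount > maxLayers;
    }
}
```

      With `maxLayers = 4`, three layers during an active checkpoint means "wait", while five layers means "compact now and throttle writers".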

      Recovery procedure:

      • read the list of checkpoint markers on engine start
      • remove all data from an unfinished checkpoint, if there is one
      • trim main partition files to their proper size (should check if it's actually beneficial)
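
      The first two recovery steps could look roughly like the sketch below. Everything about naming here is an assumption made for illustration: the issue does not define marker or delta file names, so `cp-<id>.start`, `cp-<id>.end` and `part-<p>-delta-<id>.bin` are hypothetical, as is the idea that an unfinished checkpoint is one with a start marker but no end marker. Trimming of main partition files is left out.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: find unfinished checkpoints and the delta files they left behind. */
final class RecoverySketch {
    /** Checkpoint ids that have a "cp-<id>.start" marker but no "cp-<id>.end" marker. */
    static Set<Long> unfinishedCheckpoints(Collection<String> markerNames) {
        Set<Long> started = new TreeSet<>();
        Set<Long> ended = new HashSet<>();

        for (String name : markerNames) {
            if (!name.startsWith("cp-"))
                continue; // Not a marker in the assumed naming scheme.

            if (name.endsWith(".start"))
                started.add(Long.parseLong(name.substring(3, name.length() - ".start".length())));
            else if (name.endsWith(".end"))
                ended.add(Long.parseLong(name.substring(3, name.length() - ".end".length())));
        }

        started.removeAll(ended);

        return started;
    }

    /** Delta files named "part-<p>-delta-<id>.bin" whose checkpoint id is unfinished. */
    static List<String> filesToDelete(Collection<String> deltaFiles, Set<Long> unfinished) {
        List<String> res = new ArrayList<>();

        for (String name : deltaFiles) {
            int idx = name.lastIndexOf("-delta-");

            if (idx < 0 || !name.endsWith(".bin"))
                continue; // Main partition file or something unrelated.

            long cpId = Long.parseLong(name.substring(idx + "-delta-".length(), name.length() - ".bin".length()));

            if (unfinished.contains(cpId))
                res.add(name);
        }

        return res;
    }
}
```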

      Table start procedure:

      • read all layer files headers according to the list of checkpoints
      • construct a list of hash tables (pageId -> pageIndex) for all layers, make it as efficient as possible
      • everything else is just like before
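
      A minimal sketch of the per-layer lookup described above, assuming that newer layers shadow older ones and that a miss in every layer means the page lives in the main partition file. The class and method names are hypothetical, and plain `HashMap`s with `int` keys are used only for readability, not as the proposed in-memory structure.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one pageId -> pageIndex map per delta layer, newest first. */
final class LayeredPageIndex {
    /** Result of a lookup: which layer holds the page and at which index. */
    static final class Location {
        final int layer;     // 0 is the newest delta file.
        final int pageIndex; // Position of the page inside that delta file.

        Location(int layer, int pageIndex) {
            this.layer = layer;
            this.pageIndex = pageIndex;
        }
    }

    private final Deque<Map<Integer, Integer>> layers = new ArrayDeque<>();

    /** Registers the map of a delta file that is newer than all known ones. */
    void addNewestLayer(Map<Integer, Integer> pageIdToIndex) {
        layers.addFirst(new HashMap<>(pageIdToIndex));
    }

    /** Returns the newest layer containing the page, or null if only the main file has it. */
    Location find(int pageId) {
        int layer = 0;

        for (Map<Integer, Integer> map : layers) { // Iterates newest to oldest.
            Integer idx = map.get(pageId);

            if (idx != null)
                return new Location(layer, idx);

            layer++;
        }

        return null;
    }
}
```

      The cost of a lookup grows with the number of layers, which is one way to see why the layer count has to stay bounded.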

      Partition removal might be tricky, but we'll see. It's tricky in Ignite 2.x after all. "Restore partition states" procedure could be revisited, I don't know how this will work yet.

      How to store hashmaps:

      Regular maps might be too much; we should consider a roaring map implementation or something similar that occupies less space. This is only a concern for in-memory structures. Files on disk may have a list of pairs, that's fine. Generally speaking, checkpoints with a size of 100 thousand pages are close to the top limit for most users. Splitting that across 500 partitions, for example, gives us 200 pages per partition. The entire map should fit into a single page.

      The only exception to these calculations is index.bin. The number of pages per checkpoint can be orders of magnitude higher, so we should keep an eye on it. It'll be the main target for testing/benchmarking. Anyway, 4 kilobytes is enough to fit 512 integer pairs, scaling to 2048 for regular 16-kilobyte pages. The map won't be too big IMO.
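
      The capacity figures follow from 8 bytes per pair of 4-byte integers: 4096 / 8 = 512 and 16384 / 8 = 2048. The sketch below shows one possible on-disk form, a list of (pageId, pageIndex) pairs sorted by pageId and searched with binary search. The layout is an assumption for illustration (no page header is accounted for), and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: a sorted list of (pageId, pageIndex) int pairs stored in one page. */
final class PagePairIndex {
    static final int PAIR_BYTES = 2 * Integer.BYTES;

    /** Number of pairs that fit into a page, ignoring any header. */
    static int capacity(int pageSizeBytes) {
        return pageSizeBytes / PAIR_BYTES;
    }

    /** Serializes pairs into a page-sized buffer; pageIds must be sorted in ascending order. */
    static ByteBuffer write(int pageSizeBytes, int[] sortedPageIds, int[] pageIndexes) {
        if (sortedPageIds.length != pageIndexes.length)
            throw new IllegalArgumentException("arrays must have the same length");

        if (sortedPageIds.length > capacity(pageSizeBytes))
            throw new IllegalArgumentException("too many pairs for one page");

        ByteBuffer buf = ByteBuffer.allocate(pageSizeBytes);

        for (int i = 0; i < sortedPageIds.length; i++)
            buf.putInt(sortedPageIds[i]).putInt(pageIndexes[i]);

        buf.flip(); // Limit now marks the end of the written pairs.

        return buf;
    }

    /** Binary search over the pairs; returns the pageIndex or -1 if the pageId is absent. */
    static int lookup(ByteBuffer buf, int pageId) {
        int lo = 0;
        int hi = buf.limit() / PAIR_BYTES - 1;

        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int id = buf.getInt(mid * PAIR_BYTES);

            if (id < pageId)
                lo = mid + 1;
            else if (id > pageId)
                hi = mid - 1;
            else
                return buf.getInt(mid * PAIR_BYTES + Integer.BYTES);
        }

        return -1;
    }
}
```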

      Another important point: we should enable direct I/O. Java has supported it natively since version 10 (ExtendedOpenOption.DIRECT). There's a chance that not only will regular disk operations become somewhat faster, but fsync will become drastically faster as well. That would be good: fsync can easily take half of the checkpoint time, which is just unacceptable.
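
      A minimal sketch of a direct I/O write using the JDK's `com.sun.nio.file.ExtendedOpenOption.DIRECT` (available since Java 10). Direct I/O requires the buffer address, the file position and the transfer size to be aligned to the file store block size, so the payload is padded with zeros. The class name, the method names and the fallback to buffered I/O are assumptions of this sketch, not part of the proposed design; some file systems reject direct I/O entirely.

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: write a payload at offset 0 with direct I/O, falling back to buffered I/O. */
final class DirectIoSketch {
    /** Returns true if the direct I/O path was used, false if the fallback was taken. */
    static boolean writePadded(Path file, byte[] payload) throws IOException {
        try {
            int block = Math.toIntExact(Files.getFileStore(file).getBlockSize());

            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE, ExtendedOpenOption.DIRECT)) {
                writeFully(ch, alignedCopy(payload, block));
                ch.force(true);

                return true;
            }
        }
        catch (IOException | RuntimeException e) {
            // Direct I/O is unavailable here (file system, platform or block size): use buffered I/O.
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                writeFully(ch, ByteBuffer.wrap(payload));
                ch.force(true);

                return false;
            }
        }
    }

    /** Copies the payload into a block-aligned direct buffer, zero-padded to a block multiple. */
    static ByteBuffer alignedCopy(byte[] payload, int block) {
        int size = Math.max(block, (payload.length + block - 1) / block * block);

        ByteBuffer buf = ByteBuffer.allocateDirect(size + block - 1).alignedSlice(block);

        buf.limit(size);
        buf.put(payload);
        buf.position(0); // Write the whole padded region, starting from its beginning.

        return buf;
    }

    /** Positional writes may be partial, so loop until the buffer is drained. */
    private static void writeFully(FileChannel ch, ByteBuffer buf) throws IOException {
        long pos = 0;

        while (buf.hasRemaining())
            pos += ch.write(buf, pos);
    }
}
```

      On the direct path the file ends up padded to a multiple of the block size, so a real store would need to track the logical length separately.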

      Thoughts 2.0

      With high likelihood, we'll get rid of index.bin. This will remove the requirement of having checkpoint markers.

      All that we need is a monotonically growing local counter that will be used to mark partition delta files. But it doesn't need to be global even at the level of the local node; it can be a local counter per partition that's persisted in the meta page. This should be further discussed during the implementation.
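
      A minimal sketch of such a per-partition counter, assuming it is stored as a `long` at a fixed offset of the partition meta page. The offset, the class and method names, and the delta file naming scheme are all hypothetical; the issue leaves these details to the implementation.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: a per-partition counter kept in the partition meta page. */
final class PartitionDeltaCounter {
    /** Assumed offset of the counter inside the meta page. */
    static final int COUNTER_OFFSET = 0;

    /** Increments the counter stored in the given meta page and returns the new value. */
    static long next(ByteBuffer metaPage) {
        long next = Math.incrementExact(metaPage.getLong(COUNTER_OFFSET));

        metaPage.putLong(COUNTER_OFFSET, next);

        return next;
    }

    /** Assumed naming scheme for the delta file marked with the given counter value. */
    static String deltaFileName(int partId, long counter) {
        return "part-" + partId + "-delta-" + counter + ".bin";
    }
}
```

      Each partition advances its own counter independently, so two partitions may both have a delta file number 1 without any node-wide coordination.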

          People

            ktkalenko@gridgain.com Kirill Tkalenko
            sergeychugunov Sergey Chugunov