  1. HBase
  2. HBASE-27743 Enhancements for the persistent cache
  3. HBASE-28004

Persistent cache map can get corrupt if crash happens midway through the write

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0, 3.0.0-alpha-4, 4.0.0-alpha-1
    • Fix Version/s: 2.6.0, 3.0.0-beta-1
    • Component/s: None
    • Labels: None

    Description

      HBASE-27686 added a background thread that periodically saves the cache index map, together with a list of completely cached files, so that the cache state can be recovered after a crash or restart. The problem is that the cache index can grow to several GB (in a sample case, 1.6TB of used bucket cache mapped to between 8GB and 10GB of indexes), and each write takes a few seconds to complete. Any RS crash is therefore very likely to leave behind a corrupt index file that can't be recovered when the RS starts again. Worse, because the list of cached files is stored in a separate file, this also leads to cache inconsistencies: files that appear in the list of cached files are never cached again once the RS is restarted, even though we have no cache index for them, so every read of those files ends up going to the FS.

      This task aims to refactor the cache persistence as follows:
      1) Write both the list of completely cached files and the cache indexes to a single file, so that they are synced atomically;
      2) When writing the persistent cache file, use a temp name first, then rename it to the actual name once the write has finished successfully. This way, if a crash happens while the persistent cache is still being written, the temp file would be corrupt, but we could still recover from the last successful sync and would only lose the caching ops since that sync.
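
The two steps above can be sketched roughly as follows. This is a minimal illustration using only the JDK, not the actual HBase change: the class name `AtomicCachePersister`, its methods, the `.tmp` suffix and the on-disk layout (a count-prefixed file list followed by a length-prefixed index blob) are all assumptions made for the example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase.
public class AtomicCachePersister {

  /** Writes the file list and the index bytes to one temp file, then renames it. */
  public static void persist(Path target, List<String> cachedFiles, byte[] indexBytes)
      throws IOException {
    // Assumed naming scheme: the temp file sits next to the target.
    Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
    try (FileOutputStream fos = new FileOutputStream(tmp.toFile());
         DataOutputStream out = new DataOutputStream(new BufferedOutputStream(fos))) {
      // Step 1: both pieces of state go into the same file.
      out.writeInt(cachedFiles.size());
      for (String name : cachedFiles) {
        out.writeUTF(name);
      }
      out.writeInt(indexBytes.length);
      out.write(indexBytes);
      out.flush();
      fos.getFD().sync(); // ask the OS to flush the bytes to disk before renaming
    }
    // Step 2: only a fully written file takes the real name. On POSIX file
    // systems the rename replaces an existing target; elsewhere the behaviour
    // with an existing target is implementation specific.
    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
  }

  /** Reads back only the list of completely cached files from the persisted file. */
  public static List<String> loadCachedFiles(Path target) throws IOException {
    try (DataInputStream in =
        new DataInputStream(new BufferedInputStream(Files.newInputStream(target)))) {
      int count = in.readInt();
      List<String> files = new ArrayList<>(count);
      for (int i = 0; i < count; i++) {
        files.add(in.readUTF());
      }
      return files;
    }
  }
}
```

If the process dies during the write, only the `.tmp` file is affected and the previously renamed file still holds a consistent file list and index from the last successful sync.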

            People

              Assignee: wchevreuil Wellington Chevreuil
              Reporter: wchevreuil Wellington Chevreuil