  1. Jackrabbit Oak
  2. OAK-4577 BlobGC performance improvements
  3. OAK-4200

[BlobGC] Improve collection times of blobs available

    Details

    • Type: Technical task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.6, 1.6.0
    • Component/s: None
    • Labels: None

      Description

      The blob collection phase (identifying all the blobs available in the data store) is an expensive part of the whole GC process, sometimes taking a few hours on large repositories, because it iterates over the sub-folders in the data store.

      In an offline discussion with Terry Mueller and Chetan Mehrotra, the idea came up that this phase could be made faster if:

      • Blob ids are tracked when the blobs are added, e.g. in a simple file in the data store per cluster node.
      • GC then consolidates these files from all the cluster nodes and uses the result to get the candidates for GC.
      • This variant of the MarkSweepGC can be triggered more frequently. It would be OK to miss blob id additions to this file, e.g. during a crash, as these blobs can be cleaned up in the regular MarkSweepGC cycles that are triggered occasionally.
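
      The proposal above can be sketched roughly as follows. This is a minimal illustration only, not Oak's implementation: the class name `BlobIdTrackerSketch`, the `.blobids` file suffix, and all method names are hypothetical. It assumes every cluster node can write to a shared tracking directory and that the set of referenced blob ids comes from the usual mark phase.

      ```java
      import java.io.IOException;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.DirectoryStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.StandardOpenOption;
      import java.util.Collections;
      import java.util.Set;
      import java.util.TreeSet;

      /** Hypothetical sketch: one append-only id file per cluster node. */
      final class BlobIdTrackerSketch {
          // Assumed naming convention for the per-node tracking files.
          private static final String SUFFIX = ".blobids";

          private final Path trackingDir;
          private final Path localFile;

          BlobIdTrackerSketch(Path trackingDir, String clusterNodeId) throws IOException {
              this.trackingDir = Files.createDirectories(trackingDir);
              this.localFile = this.trackingDir.resolve(clusterNodeId + SUFFIX);
          }

          /** Called when this node adds a blob: append its id to the node-local file. */
          synchronized void track(String blobId) throws IOException {
              Files.write(localFile, Collections.singletonList(blobId), StandardCharsets.UTF_8,
                      StandardOpenOption.CREATE, StandardOpenOption.APPEND);
          }

          /** Merge the files of all cluster nodes into one sorted, de-duplicated set. */
          Set<String> consolidate() throws IOException {
              Set<String> ids = new TreeSet<>();
              try (DirectoryStream<Path> files =
                           Files.newDirectoryStream(trackingDir, "*" + SUFFIX)) {
                  for (Path file : files) {
                      ids.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
                  }
              }
              return ids;
          }

          /** GC candidates: tracked ids that the mark phase did not find referenced. */
          Set<String> candidates(Set<String> referencedIds) throws IOException {
              Set<String> result = consolidate();
              result.removeAll(referencedIds);
              return result;
          }
      }
      ```

      In this sketch, an id lost during a crash simply never shows up as a candidate, which is harmless for this variant: such a blob is left for the occasional regular MarkSweepGC run that still iterates the data store.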

      We may also be able to track other metadata along with the blob ids, such as paths and timestamps, for auditing/analytics, in conjunction with OAK-3140.

              People

              • Assignee: Amit Jain (amitjain)
              • Reporter: Amit Jain (amitjain)
              • Votes: 0
              • Watchers: 7
