Uploaded image for project: 'Jackrabbit Oak'
  1. Jackrabbit Oak
  2. OAK-4756

A parallel approach to garbage collection



    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Later
    • None
    • None
    • segment-tar


      Assuming that:

      1. Logic record IDs are implemented.
      2. TAR files are ordered in reverse chronological order.
      3. When reading segments, TAR files are consulted in order.
      4. Segments in recent TAR files shadow segments in older TAR files with the same segment ID.

      A new algorithm for garbage collection can be implemented:

      1. Define the input for the garbage collection process. The input consists of the current set of TAR files and a set of record IDs representing the GC roots.
      2. Traverse the GC roots and mark the records that are still in use. The mark phase traverses the record graph and produces a list of record IDs. These record IDs are referenced directly or indirectly by the given set of GC roots and need to be kept. The list of record IDs is ordered by segment ID first and record number next. This way, it is possible to process this list in one pass and figure out which segment and which record should be saved at the end of the garbage collection.
      3. Remove unused records from segments and rewrite them in a new set of TAR files. The list is produced in the previous step is traversed. For each segment encountered, a new segment is created containing only the records that were marked in the previous phase. This segment is then saved in a new set of TAR files. The set of new TAR files is the result of the garbage collection process.
      4. Add the new TAR files to the system. The system will append the new TAR files to the segment store. The segments in these TAR files will shadow the ones in older TAR files.
      5. Remove TAR files from the old generation. It is safe to do so because the new set of TAR files are currently shadowing the initial set of TAR files.

      While the garbage collection process is running, the system can work as usual by starting a fresh TAR file. The result of the garbage collection is made visible atomically only at the end, when the new TAR files are integrated into the running system.




            frm Francesco Mari
            frm Francesco Mari
            0 Vote for this issue
            4 Start watching this issue