  1. Jackrabbit Oak
  2. OAK-7105

Implement a traverse with sort strategy for DocumentStoreIndexer


    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: run
    • Labels:
      None

      Description

      Currently the DocumentStoreIndexer logic uses a StoreAndSortStrategy, which first dumps all node states to a JSON file, then sorts them in batches, and finally merges the sorted files. Over a full indexing run the sorting phase takes a significant share of the time (40 minutes out of a 3-hour run).

      This approach is also prone to OOM errors: ExternalSort creates in-memory batches whose actual size can considerably exceed the estimated size. As a result we need to constantly tweak "oak.indexer.maxSortMemoryInGB" (currently set to 2 GB).

      As an improvement we can make the following changes:

      1. Implement a traverse with sort strategy - Instead of first dumping all node states into a single big JSON file, add them to an in-memory buffer, and at some stage sort the batch and save it to a file
      2. Use better memory checks - Use the approach implemented in GCBarrier, i.e. monitor current memory usage and trigger the batch sort when available memory drops below a certain threshold
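
      The two changes above can be sketched roughly as follows. This is only an illustration of the idea, not the code that was committed for this issue: the class TraverseWithSortSketch, its method names, and the Runtime-based lowMemory check are all hypothetical, entries are plain strings rather than node states, and the merge returns a list in memory where a real implementation would stream to a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: buffer entries during traversal, sort and spill a
// batch whenever the memory check fires, then merge the sorted batch files.
class TraverseWithSortSketch {

    // Assumed memory check: true when the heap still available to the JVM
    // drops below minFreeBytes. A stand-in for the GCBarrier-style monitoring.
    static BooleanSupplier lowMemory(long minFreeBytes) {
        Runtime rt = Runtime.getRuntime();
        return () -> rt.maxMemory() - (rt.totalMemory() - rt.freeMemory()) < minFreeBytes;
    }

    // Traverse the entries once, spilling a sorted batch file on low memory.
    static List<Path> traverseAndSort(Iterator<String> entries,
                                      BooleanSupplier lowMemory,
                                      Path dir) throws IOException {
        List<Path> sortedFiles = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        while (entries.hasNext()) {
            buffer.add(entries.next());
            if (lowMemory.getAsBoolean()) {
                sortedFiles.add(sortAndSave(buffer, dir));
            }
        }
        if (!buffer.isEmpty()) {
            sortedFiles.add(sortAndSave(buffer, dir));
        }
        return sortedFiles;
    }

    // Sort the current batch, write it to a new file, and empty the buffer.
    private static Path sortAndSave(List<String> buffer, Path dir) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile(dir, "sorted-batch", ".txt");
        Files.write(file, buffer);
        buffer.clear();
        return file;
    }

    // One open batch file plus its current line, ordered by that line.
    private static final class Cursor implements Comparable<Cursor> {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }

        @Override
        public int compareTo(Cursor other) {
            return line.compareTo(other.line);
        }
    }

    // k-way merge of the sorted batch files using a priority queue.
    static List<String> merge(List<Path> sortedFiles) throws IOException {
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (Path file : sortedFiles) {
            Cursor c = new Cursor(Files.newBufferedReader(file));
            if (c.line != null) queue.add(c); else c.reader.close();
        }
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            merged.add(c.line);
            c.line = c.reader.readLine();
            if (c.line != null) queue.add(c); else c.reader.close();
        }
        return merged;
    }
}
```

      Passing the memory check in as a BooleanSupplier keeps the batching logic independent of how memory is measured, so a fixed-size trigger can be substituted when experimenting. Batch size then adapts to the real heap pressure instead of relying on an estimated entry size.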

            People

            • Assignee:
              chetanm Chetan Mehrotra
              Reporter:
              chetanm Chetan Mehrotra
