  1. Jackrabbit Oak
  2. OAK-2787

Faster multi threaded indexing / text extraction for binary content

    Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: lucene
    • Labels:
      None

      Description

      With Lucene-based indexing, the indexing process is single threaded. This hampers the indexing of binary content, because on a multi-processor system only a single thread can be used to perform the indexing.

      Ian Boston suggested a possible approach [1] involving two-phase indexing:

      1. In the first phase, detect the nodes to be indexed and start the full-text extraction of the binary content. After extraction, save the binary token stream back to the node as hidden data. In this phase the node properties can still be indexed, and a marker field would be added to indicate that the full-text index is still pending.
      2. Later, in the second phase, look for all such Lucene documents and update them with the saved token stream.

      This would allow the text extraction logic to be decoupled from the Lucene indexing logic.
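
      The two phases could be sketched roughly as follows. This is only an illustration under stated assumptions, not Oak or Lucene code: the class TwoPhaseIndexSketch, the Doc type, the hiddenExtractedText map (standing in for hidden data on the node) and the trivial extractText method are all hypothetical names, and plain strings stand in for the saved token stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical in-memory model of the two-phase idea; no Oak or Lucene APIs are used.
class TwoPhaseIndexSketch {

    // Stand-in for an index document with a "full text pending" marker field.
    static final class Doc {
        final String path;
        volatile boolean fulltextPending = true;
        volatile String fulltext;

        Doc(String path) {
            this.path = path;
        }
    }

    // Stand-in for the extracted text saved back to the node as hidden data.
    final Map<String, String> hiddenExtractedText = new ConcurrentHashMap<>();

    // Stand-in for the index, keyed by node path.
    final Map<String, Doc> index = new ConcurrentHashMap<>();

    // Placeholder extraction: decodes bytes as UTF-8 and lower-cases them.
    static String extractText(byte[] binary) {
        return new String(binary, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    }

    // Phase 1: index each node with the pending marker set and extract
    // the binary text on a pool of worker threads.
    void phaseOne(Map<String, byte[]> binaries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            for (Map.Entry<String, byte[]> entry : binaries.entrySet()) {
                String path = entry.getKey();
                byte[] content = entry.getValue();
                index.put(path, new Doc(path));
                tasks.add(pool.submit(() -> {
                    hiddenExtractedText.put(path, extractText(content));
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait for every extraction to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("extraction interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e);
        } finally {
            pool.shutdown();
        }
    }

    // Phase 2: a single thread finds the documents still marked as pending
    // and updates them with the saved text. Returns the number updated.
    int phaseTwo() {
        int updated = 0;
        for (Doc doc : index.values()) {
            if (!doc.fulltextPending) {
                continue;
            }
            String text = hiddenExtractedText.remove(doc.path);
            if (text == null) {
                continue; // nothing saved for this node yet
            }
            doc.fulltext = text;
            doc.fulltextPending = false;
            updated++;
        }
        return updated;
    }
}
```

      In this sketch only phase 1 is parallel, so the expensive extraction can use several threads while the index update in phase 2 stays single threaded, which mirrors the decoupling described above.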

      [1] http://markmail.org/thread/2w5o4bwqsosb6esu

              People

              • Assignee: Unassigned
              • Reporter: Chetan Mehrotra (chetanm)
              • Votes: 0
              • Watchers: 9