[OAK-2787] Faster multi threaded indexing / text extraction for binary content - ASF JIRA

Details

Type: Wish
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: lucene
Labels: None
Epic Link: indexer resilience

Description

With Lucene-based indexing, the indexing process is single-threaded. This hampers the indexing of binary content, because on a multiprocessor system only a single thread can be used to perform the indexing.

ianeboston suggested a possible approach [1] involving two-phase indexing:

In the first phase, detect the nodes to be indexed and start full-text extraction of their binary content. After extraction, save the binary token stream back to the node as hidden data. In this phase the node properties can still be indexed, and a marker field is added to indicate that the full-text index is still pending.
Later, in the second phase, look for all such Lucene documents and update them with the saved token stream.

This would allow the text extraction logic to be decoupled from the Lucene indexing logic.
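
The two phases described above can be sketched roughly as follows. This is only an illustration of the idea, not Oak or Lucene code: the types `Node` and `IndexDoc`, the fields `hiddenTokens` and `fulltextPending`, and the whitespace-splitting `extractTokens` helper are all hypothetical stand-ins, and an in-memory map takes the place of a real Lucene index. The sketch assumes extraction can run on a thread pool while the property-only documents are written immediately.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical model of two-phase indexing; none of these types are Oak or Lucene APIs. */
final class TwoPhaseIndexSketch {

    /** A content node: a path, a binary payload, and a slot for the hidden token data. */
    static final class Node {
        final String path;
        final byte[] binary;
        volatile List<String> hiddenTokens; // written in phase 1, read in phase 2

        Node(String path, byte[] binary) {
            this.path = path;
            this.binary = binary;
        }
    }

    /** An index document: created at once in phase 1, full text filled in during phase 2. */
    static final class IndexDoc {
        final String path;
        boolean fulltextPending = true; // marker: full-text index still pending
        List<String> fulltext = Collections.emptyList();

        IndexDoc(String path) {
            this.path = path;
        }
    }

    /** In-memory stand-in for the Lucene index, keyed by node path. */
    final Map<String, IndexDoc> index = new ConcurrentHashMap<>();

    /** Placeholder for real text extraction: lower-cases UTF-8 text and splits on whitespace. */
    static List<String> extractTokens(byte[] binary) {
        String text = new String(binary, StandardCharsets.UTF_8).trim();
        if (text.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** Phase 1: write marker documents immediately and extract text on several threads. */
    void phaseOne(List<Node> nodes, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Node node : nodes) {
                index.put(node.path, new IndexDoc(node.path));
                // Extraction runs off the indexing thread; the result is saved on the node.
                pending.add(pool.submit(() -> {
                    node.hiddenTokens = extractTokens(node.binary);
                }));
            }
            for (Future<?> f : pending) {
                try {
                    f.get(); // wait so that phase 2 sees every saved token list
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("extraction failed", e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
    }

    /** Phase 2: update every pending document from the saved tokens; returns the update count. */
    int phaseTwo(List<Node> nodes) {
        int updated = 0;
        for (Node node : nodes) {
            IndexDoc doc = index.get(node.path);
            if (doc != null && doc.fulltextPending && node.hiddenTokens != null) {
                doc.fulltext = node.hiddenTokens;
                doc.fulltextPending = false;
                updated++;
            }
        }
        return updated;
    }
}
```

In this sketch `phaseOne` blocks until all extraction tasks finish, which keeps the example deterministic. In the approach described in the issue, the second phase would presumably run later and pick up only those documents whose saved token stream is already available.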

[1] http://markmail.org/thread/2w5o4bwqsosb6esu

Issue Links

is related to

OAK-2892 Speed up lucene indexing post migration by pre extracting the text content from binaries (Closed)

OAK-3092 Cache recently extracted text to avoid duplicate extraction (Closed)

People

Assignee: Unassigned

Reporter: Chetan Mehrotra

Votes: 0

Watchers: 9

Dates

Created: 20/Apr/15 09:39

Updated: 26/May/21 14:52