[OAK-2892] Speed up lucene indexing post migration by pre extracting the text content from binaries - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.18, 1.2.3, 1.3.3, 1.4
Component/s: lucene, run
Labels:
- performance

Description

When migrating large repositories, for example one with 3 M documents (250k PDFs), Lucene indexing takes a long time to complete (at times 4 days!). Currently the text extraction logic is coupled with Lucene indexing and is therefore performed in single-threaded mode, which slows down the indexing process. Further, if reindexing has to be triggered, the extraction has to be done all over again.

To speed up Lucene indexing we can decouple the text extraction from the actual indexing. This is partly based on the discussion in OAK-2787.

- Introduce a new ExtractedTextProvider which can provide extracted text for a given Blob instance.
- In oak-run, introduce a new indexer mode. It would take a path in the repository, then traverse the repository, look for existing binaries, and extract text from them.
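
The offline pass described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the oak-run implementation: `ExtractedTextStore` and `PreExtractor` are hypothetical names, the map of content id to bytes stands in for a real repository traversal, and the extractor function stands in for whatever text extraction library is used. The point it shows is that, once decoupled from indexing, extraction can run on several threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical store for pre-extracted text, keyed by a binary's content id. */
class ExtractedTextStore {
    private final Map<String, String> textById = new ConcurrentHashMap<>();

    void put(String blobId, String text) {
        textById.put(blobId, text);
    }

    /** Returns the stored text, or null if this binary was never processed. */
    String get(String blobId) {
        return textById.get(blobId);
    }

    int size() {
        return textById.size();
    }
}

/** Hypothetical offline pass: extracts text from every binary using a thread pool. */
class PreExtractor {
    private final Function<byte[], String> extractor; // assumption: stand-in for real extraction
    private final int threads;

    PreExtractor(Function<byte[], String> extractor, int threads) {
        this.extractor = extractor;
        this.threads = threads;
    }

    /** The map of content id to raw bytes stands in for a repository traversal. */
    ExtractedTextStore run(Map<String, byte[]> binaries) {
        ExtractedTextStore store = new ExtractedTextStore();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map.Entry<String, byte[]> e : binaries.entrySet()) {
                pending.add(pool.submit(
                        () -> store.put(e.getKey(), extractor.apply(e.getValue()))));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface extraction failures
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pre-extraction interrupted", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("text extraction failed", ex.getCause());
        } finally {
            pool.shutdown();
        }
        return store;
    }
}
```

Keying the store by content id rather than by path means that identical binaries stored at several paths would only need to be extracted once; whether the actual tool does this is not stated in this issue.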

So before or after the migration is done, one can run this oak-run tool to create a store which has the text already extracted. Then, post startup, we need to wire up the ExtractedTextProvider instance (backed by the BlobStore populated earlier) so that the indexing logic can simply get the content from it. This would avoid performing expensive text extraction in the indexing thread.
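
The indexing-time lookup might look roughly like the sketch below. The name `ExtractedTextProvider` comes from this issue, but its method signature here is an assumption, and `BinaryRef`, `MapBackedTextProvider` and `IndexTextResolver` are hypothetical stand-ins (in particular, `BinaryRef` is not Oak's Blob interface). The sketch also assumes that the indexer falls back to inline extraction when no pre-extracted text is found, which this issue does not spell out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;

/** Minimal stand-in for a repository binary; not Oak's Blob interface. */
final class BinaryRef {
    final String contentId;
    final byte[] bytes;

    BinaryRef(String contentId, byte[] bytes) {
        this.contentId = contentId;
        this.bytes = bytes;
    }
}

/** Name taken from the issue; the method signature is an assumption. */
interface ExtractedTextProvider {
    Optional<String> getText(BinaryRef blob);
}

/** Provider backed by a map that an earlier offline pass has filled. */
final class MapBackedTextProvider implements ExtractedTextProvider {
    private final Map<String, String> textByContentId;

    MapBackedTextProvider(Map<String, String> textByContentId) {
        this.textByContentId = textByContentId;
    }

    @Override
    public Optional<String> getText(BinaryRef blob) {
        return Optional.ofNullable(textByContentId.get(blob.contentId));
    }
}

/** Indexing-side helper: use pre-extracted text when present, else extract inline. */
final class IndexTextResolver {
    private final ExtractedTextProvider provider;
    int inlineExtractions = 0; // counts how often the expensive path was taken

    IndexTextResolver(ExtractedTextProvider provider) {
        this.provider = provider;
    }

    String textFor(BinaryRef blob) {
        return provider.getText(blob).orElseGet(() -> extractInline(blob));
    }

    // Assumption: a trivial UTF-8 decode stands in for real text extraction.
    private String extractInline(BinaryRef blob) {
        inlineExtractions++;
        return new String(blob.bytes, StandardCharsets.UTF_8);
    }
}
```

With this shape, a reindex only pays the extraction cost for binaries that were added after the offline pass ran.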

See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66

Issue Links

is related to

OAK-3989 Add S3 datastore support for Text Pre Extraction

Closed

relates to

OAK-4036 LuceneIndexProviderService may miss on registering PreExtractedTextProvider

Closed

OAK-2787 Faster multi threaded indexing / text extraction for binary content

Open

Sub-Tasks

There are no Sub-Tasks for this issue.

People

Assignee: Chetan Mehrotra

Reporter: Chetan Mehrotra

Votes: 0

Watchers: 5

Dates

Created: 20/May/15 12:43

Updated: 08/Oct/19 15:22

Resolved: 15/Jul/15 06:43