Tika / TIKA-3835

tika pipes parse cache - avoid re-parsing content that has not changed


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: tika-pipes
    • Labels: None

    Description

      Tika pipes should have an optional configuration to archive parsed results. If the exact same version of a document has already been parsed, the archived output can be returned from a "parse cache" instead of repeating the fetch+parse.

      In other words, skip the fetch+parse if you did it previously.

      Benefits of this:

      • When the tika pipes fetcher is using a cloud service, document requests are often heavily rate limited. So if you manage to fetch and parse a document, storing the result for future use is very important.
      • Multi-tier environments can be populated faster. Example: you are pulling data from the same app in dev, staging, and production. When you run the tika pipes job, each document is parsed once; all the other environments can then re-use the parsed output, saving days of run time (in my case).
        • In other words, "full crawls" for your initial tika index on duplicate environments are reduced to cache lookups.

      So the process would be:

      • The pipe iterator yields the next document: {lastUpdated,docID}
        • Pipe iterator documents have an optional boolean field, cache (default=true). If cache=false, the document will not be cached.
      • If the parse cache is enabled, cache != false, and the parse cache contains {lastUpdated,docID}:
        • Get the {lastUpdated,docID} document from the cache, push it to the emit queue, and return.
      • Otherwise, fetch and parse the document.
      • If the parse cache is enabled and cache != false, put the result into the cache with key={lastUpdated,docID}, value={document,metadata}.
        • Additional conditions, such as numBytesInBody, can dictate which documents are stored in the cache and which are skipped.
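      The steps above can be sketched in Java. All names here (ParseCacheFlow, CacheKey, ParseResult) are illustrative assumptions, not existing Tika APIs, and the in-memory map stands in for the disk/network-backed store discussed below; the point is the control flow around the {lastUpdated,docID} key.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the pipes loop with a parse cache (hypothetical names).
public class ParseCacheFlow {

    // Key pairs the document id with its last-updated stamp, so a modified
    // document can never hit a stale cache entry.
    record CacheKey(String docID, long lastUpdated) {}

    record ParseResult(String body, Map<String, String> metadata) {}

    private final Map<CacheKey, ParseResult> cache = new HashMap<>();
    private final boolean parseCacheEnabled = true;
    private int parseCount = 0; // counts real fetch+parse calls, for illustration

    public ParseResult process(String docID, long lastUpdated, boolean cacheable) {
        CacheKey key = new CacheKey(docID, lastUpdated);
        // Skip the fetch+parse when this exact version was parsed before.
        if (parseCacheEnabled && cacheable && cache.containsKey(key)) {
            return cache.get(key); // would be pushed straight to the emit queue
        }
        ParseResult result = fetchAndParse(docID);
        // Extra conditions (e.g. numBytesInBody) could gate this put.
        if (parseCacheEnabled && cacheable) {
            cache.put(key, result);
        }
        return result;
    }

    public int getParseCount() {
        return parseCount;
    }

    private ParseResult fetchAndParse(String docID) {
        parseCount++; // stand-in for the real fetch + Tika parse
        return new ParseResult("parsed:" + docID, Map.of("docID", docID));
    }
}
```

      Note that a change to lastUpdated produces a different key, so an updated document is parsed again rather than served stale from the cache.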

      The cache would need to be disk- or network-based storage because of the size of the data; an in-memory cache would not be feasible.

      The parse cache should be defined by an interface so that users can plug in different implementations, such as:

      • File cache
      • S3 implementation cache
      • Others...
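      A minimal sketch of what such an interface and a file-backed implementation might look like. The interface shape (ParseCache, get/put keyed on docID + lastUpdated) and all class names are assumptions for illustration, not an existing Tika API; an S3 implementation would implement the same interface against object keys instead of file paths.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical ParseCache interface; file, S3, and other backends sit behind it.
interface ParseCache {
    Optional<byte[]> get(String docID, long lastUpdated) throws IOException;
    void put(String docID, long lastUpdated, byte[] parsedOutput) throws IOException;
}

// File-backed implementation: one file per {docID, lastUpdated} key.
class FileParseCache implements ParseCache {
    private final Path root;

    FileParseCache(Path root) throws IOException {
        this.root = Files.createDirectories(root);
    }

    // A real implementation would hash or escape docID to make it filesystem-safe.
    private Path entry(String docID, long lastUpdated) {
        return root.resolve(docID + "_" + lastUpdated + ".bin");
    }

    @Override
    public Optional<byte[]> get(String docID, long lastUpdated) throws IOException {
        Path p = entry(docID, lastUpdated);
        return Files.exists(p) ? Optional.of(Files.readAllBytes(p)) : Optional.empty();
    }

    @Override
    public void put(String docID, long lastUpdated, byte[] parsedOutput) throws IOException {
        Files.write(entry(docID, lastUpdated), parsedOutput);
    }
}
```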

          People

            Assignee: Unassigned
            Reporter: Nicholas DiPiazza (ndipiazza@apache.org)
            Votes: 0
            Watchers: 2
