Tika / TIKA-3835

tika pipes parse cache - avoid re-parsing content that has not changed


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: tika-pipes
    • Labels: None

    Description

      Tika pipes should have an optional configuration to archive parsed results. If the exact same version of a document has already been parsed, the archived output can be returned from a "parse cache" instead of repeating the fetch+parse.

      In other words, skip the fetch+parse if you did it previously.

      Benefits of this:

      • When the tika pipes fetcher is using a cloud service, document requests are often heavily rate limited. So if you manage to fetch and parse a document, storing the result for future use is very important.
      • Multi-tier environments can be populated faster. Example: you are pulling data from the same app in dev, staging, and production. When you run the tika pipes job, each document is parsed once; all the other environments can then re-use the parsed output, saving days of run time (in my case).
        • In other words, "full crawls" for your initial tika index on duplicate environments are reduced to cache lookups.

      So the process would be:

      • The pipe iterator yields the next document: {lastUpdated,docID}
        • Pipe iterator documents have an optional boolean field, cache (default=true). If cache=false, the document will not be cached.
      • If the parse cache is enabled, cache != false, and the parse cache contains {lastUpdated,docID}:
        • Get the {lastUpdated,docID} document from the cache, push it to the emit queue, and return.
      • Otherwise, fetch and parse the document.
      • If the parse cache is enabled and cache != false, put the result into the cache with key={lastUpdated,docID}, value={document,metadata}.
        • Additional conditions, such as numBytesInBody, can dictate which documents are stored in the cache and which are skipped.
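      The steps above can be sketched in Java. All names here (ParseCacheFlow, CacheKey, ParseResult) are illustrative assumptions, not existing Tika APIs, and the in-memory map stands in for the disk/network-backed store discussed below; the point is the control flow around the {lastUpdated,docID} key.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the pipes loop with a parse cache (hypothetical names).
public class ParseCacheFlow {

    // Key pairs the document id with its last-updated stamp, so a modified
    // document can never hit a stale cache entry.
    record CacheKey(String docID, long lastUpdated) {}

    record ParseResult(String body, Map<String, String> metadata) {}

    private final Map<CacheKey, ParseResult> cache = new HashMap<>();
    private final boolean parseCacheEnabled = true;
    private int parseCount = 0; // counts real fetch+parse calls, for illustration

    public ParseResult process(String docID, long lastUpdated, boolean cacheable) {
        CacheKey key = new CacheKey(docID, lastUpdated);
        // Skip the fetch+parse when this exact version was parsed before.
        if (parseCacheEnabled && cacheable && cache.containsKey(key)) {
            return cache.get(key); // would be pushed straight to the emit queue
        }
        ParseResult result = fetchAndParse(docID);
        // Extra conditions (e.g. numBytesInBody) could gate this put.
        if (parseCacheEnabled && cacheable) {
            cache.put(key, result);
        }
        return result;
    }

    public int getParseCount() {
        return parseCount;
    }

    private ParseResult fetchAndParse(String docID) {
        parseCount++; // stand-in for the real fetch + Tika parse
        return new ParseResult("parsed:" + docID, Map.of("docID", docID));
    }
}
```

      Note that a change to lastUpdated produces a different key, so an updated document is parsed again rather than served stale from the cache.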

      The cache would need to be disk- or network-based storage because of the size of the data; an in-memory cache would not be feasible.

      The parse cache should be defined by an interface so that users can plug in different implementations, such as:

      • File cache
      • S3 implementation cache
      • Others...
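      A minimal sketch of what such an interface and a file-backed implementation might look like. The interface shape (ParseCache, get/put keyed on docID + lastUpdated) and all class names are assumptions for illustration, not an existing Tika API; an S3 implementation would implement the same interface against object keys instead of file paths.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical ParseCache interface; file, S3, and other backends sit behind it.
interface ParseCache {
    Optional<byte[]> get(String docID, long lastUpdated) throws IOException;
    void put(String docID, long lastUpdated, byte[] parsedOutput) throws IOException;
}

// File-backed implementation: one file per {docID, lastUpdated} key.
class FileParseCache implements ParseCache {
    private final Path root;

    FileParseCache(Path root) throws IOException {
        this.root = Files.createDirectories(root);
    }

    // A real implementation would hash or escape docID to make it filesystem-safe.
    private Path entry(String docID, long lastUpdated) {
        return root.resolve(docID + "_" + lastUpdated + ".bin");
    }

    @Override
    public Optional<byte[]> get(String docID, long lastUpdated) throws IOException {
        Path p = entry(docID, lastUpdated);
        return Files.exists(p) ? Optional.of(Files.readAllBytes(p)) : Optional.empty();
    }

    @Override
    public void put(String docID, long lastUpdated, byte[] parsedOutput) throws IOException {
        Files.write(entry(docID, lastUpdated), parsedOutput);
    }
}
```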

          People

            Assignee: Unassigned
            Reporter: Nicholas DiPiazza (ndipiazza@apache.org)
            Votes: 0
            Watchers: 2
