[TIKA-456] Support timeouts for parsers - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.15, 2.0.0
Component/s: parser
Labels: None

Description

There are a number of reasons why Tika could hang while parsing. One common case is when a parser is fed an incomplete document, for example when a web crawl limits the amount of data fetched per document.

One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:

Parser parser = new AutoDetectParser();
Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
Thread t = new Thread(task);
t.start();

// Throws TimeoutException if the parse does not finish in time.
ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);

And TikaCallable() looks like:

class TikaCallable implements Callable<ParsedDatum> {
    public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
        _parser = parser;
        _handler = handler;
        _input = is;
        _metadata = metadata;
        ...
    }

    public ParsedDatum call() throws Exception {
        ....
        _parser.parse(_input, _handler, _metadata, new ParseContext());
        ....
    }
}

This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers being wrapped by Tika could ever hang.
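The FutureTask pattern above can be factored into a small reusable helper. The following is only a sketch: TimedRunner and runWithTimeout are hypothetical names, not part of Tika, and the helper takes any Callable so that it compiles without Tika on the classpath. It also cancels the task on timeout, which the snippet above leaves out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Tika): runs a Callable on a fresh
// thread and gives up waiting after the given timeout.
final class TimedRunner {
    private TimedRunner() {}

    static <T> T runWithTimeout(Callable<T> work, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        FutureTask<T> task = new FutureTask<T>(work);
        Thread worker = new Thread(task, "timed-runner");
        // Daemon thread, so a hung worker cannot keep the JVM alive.
        worker.setDaemon(true);
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // Best effort: this interrupts the worker, but code that
            // ignores interrupts will keep running in the background.
            task.cancel(true);
            throw e;
        }
    }
}
```

A TikaCallable instance could be passed as the work argument. Note that cancellation relies on thread interruption, so a parser stuck in a tight loop would still occupy its thread after the caller has moved on.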

One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:

Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);

Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get().
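A rough sketch of such a wrapper follows. To keep it compilable without Tika, it uses a hypothetical SimpleParser interface as a stand-in for Tika's Parser; a real version would implement Parser and its four-argument parse method instead. Reporting a timeout as an IOException is likewise just an assumption made for this sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for Tika's Parser interface, so the sketch
// compiles without Tika on the classpath.
interface SimpleParser {
    void parse(InputStream in, StringBuilder out) throws IOException;
}

// Decorator that enforces a time limit on the wrapped parser.
final class TimeoutParser implements SimpleParser {
    private final SimpleParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutParser(SimpleParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public void parse(InputStream in, StringBuilder out) throws IOException {
        // Run the wrapped parser as a task on its own thread.
        FutureTask<Void> task = new FutureTask<Void>(() -> {
            delegate.parse(in, out);
            return null;
        });
        Thread worker = new Thread(task, "timeout-parser");
        worker.setDaemon(true);
        worker.start();
        try {
            task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // best effort: only interrupts the worker
            throw new IOException("Parse exceeded " + timeout + " " + unit, e);
        } catch (ExecutionException e) {
            // Surface the delegate's own failure rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            throw new IOException(cause);
        } catch (InterruptedException e) {
            task.cancel(true);
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for parse", e);
        }
    }
}
```

One caveat with this design: after a timeout the worker may still be writing to the output, so the caller should discard that output rather than reuse it.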

One drawback of this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant compared to the cost of a typical parse operation.
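If thread creation did turn out to matter, a shared thread pool would be one alternative. The sketch below assumes a cached pool from java.util.concurrent; PooledTimedRunner is a hypothetical name, and nothing in this issue says Tika takes this approach.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: like thread-per-request, but work is submitted to
// a cached pool, which can reuse idle threads between parse requests.
final class PooledTimedRunner implements AutoCloseable {
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "pooled-parse");
        t.setDaemon(true); // hung workers must not block JVM exit
        return t;
    });

    <T> T run(Callable<T> work, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = pool.submit(work);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the pool thread, best effort
            throw e;
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

The trade-off is that a parser which ignores interrupts keeps its pool thread occupied, so a run of hung documents would still grow the pool.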

Issue Links

relates to

TIKA-2170 Tika 1.13 ForkParser fails intermittently with very large MS Word docx

Resolved

People

Assignee: Tim Allison

Reporter: Kenneth William Krugler

Votes: 6

Watchers: 9

Dates

Created: 05/Jul/10 20:42

Updated: 12/Apr/21 13:01

Resolved: 10/Nov/16 14:33