Description
Tika can hang while parsing for a number of reasons. One common case is a parser being fed an incomplete document, which happens when a web crawl limits the amount of data it fetches.
One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:
Parser parser = new AutoDetectParser();
Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
Thread t = new Thread(task);
t.start();
ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
And TikaCallable looks like:
class TikaCallable implements Callable<ParsedDatum> {
    public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) { .... }
    public ParsedDatum call() throws Exception {
        ....
        _parser.parse(_input, _handler, _metadata, new ParseContext());
        ....
    }
}
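The snippet above does not show what happens when task.get() times out. As a rough, JDK-only sketch of that part (the TimedCall class, the runWithTimeout name and the fallback parameter are made up for illustration and are not part of Tika), something like this would catch the TimeoutException and try to cancel the worker:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika: runs any Callable with a time limit.
final class TimedCall {
    private TimedCall() {}

    /** Runs work on a new thread and returns fallback if it does not finish in time. */
    static <T> T runWithTimeout(Callable<T> work, long timeout, TimeUnit unit, T fallback)
            throws InterruptedException, ExecutionException {
        FutureTask<T> task = new FutureTask<>(work);
        Thread worker = new Thread(task, "timed-call");
        worker.setDaemon(true); // a hung worker should not keep the JVM alive
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker thread; code that ignores
            // interrupts will keep running in the background.
            task.cancel(true);
            return fallback;
        }
    }
}
```

With a TikaCallable this would presumably be called as TimedCall.runWithTimeout(c, MAX_PARSE_DURATION, TimeUnit.SECONDS, null). Note that cancelling only interrupts the thread, so a parser stuck in a loop that never checks for interruption would still consume CPU.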
This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers being wrapped by Tika could ever hang.
One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:
Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get().
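To make the wrapper idea concrete, here is a minimal sketch of how such a decorator might look. To keep it self-contained it uses a hypothetical one-method DocParser interface in place of Tika's Parser, and a StringBuilder in place of the ContentHandler. Reporting the timeout as an IOException is also just an assumption; a real TimeoutParser could equally throw a TikaException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for Tika's Parser interface, reduced to one method. */
interface DocParser {
    void parse(InputStream stream, StringBuilder out) throws Exception;
}

/** Decorator that delegates to another parser but gives up after a timeout. */
final class TimeoutParser implements DocParser {
    private final DocParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutParser(DocParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public void parse(InputStream stream, StringBuilder out) throws Exception {
        Callable<Void> work = () -> {
            delegate.parse(stream, out);
            return null;
        };
        FutureTask<Void> task = new FutureTask<>(work);
        Thread worker = new Thread(task, "timeout-parser");
        worker.setDaemon(true); // a hung parse should not block JVM exit
        worker.start();
        try {
            task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // interrupts the worker; it may still be running
            throw new IOException("Parse exceeded " + timeout + " " + unit, e);
        } catch (ExecutionException e) {
            // Unwrap so callers see the delegate's own exception.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}
```

One caveat with this design: after a timeout the worker may still be writing to the output object, so the caller should discard it rather than reuse it.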
One drawback of this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant compared to the cost of a typical parse operation.
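If the per-request thread did turn out to matter, one possible variation (not something proposed above, just a sketch with a made-up PooledTimedCall class) would be to submit the work to a shared cached thread pool, which reuses idle threads between requests:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika.
final class PooledTimedCall {
    // Cached pool: idle threads are reused, new ones are created on demand.
    // Daemon threads so that a hung task does not block JVM exit.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "pooled-parse");
        t.setDaemon(true);
        return t;
    });

    private PooledTimedCall() {}

    /** Runs work on a pooled thread; rethrows TimeoutException after cancelling. */
    static <T> T call(Callable<T> work, long timeout, TimeUnit unit) throws Exception {
        Future<T> future = POOL.submit(work);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the pooled thread running the task
            throw e;
        }
    }
}
```

A task that truly hangs still ties up its pooled thread indefinitely, so this only saves thread creation cost for parses that finish normally.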
Issue Links
- relates to TIKA-2170 Tika 1.13 ForkParser fails intermittently with very large MS Word docx (Resolved)