  1. Tika
  2. TIKA-456

Support timeouts for parsers



    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: parser
    • Labels:


      Tika can hang while parsing for a number of reasons. One common case is a parser being fed an incomplete document, which happens when a web crawl limits the amount of data it fetches.

      One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:

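      parser = new AutoDetectParser();
      Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
      FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
      Thread t = new Thread(task);
      t.start(); // the task only runs once the thread is started

      // Throws TimeoutException if the parse takes longer than MAX_PARSE_DURATION seconds
      ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);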

      And TikaCallable looks like:

      class TikaCallable implements Callable<ParsedDatum> {
          private Parser _parser;
          private ContentHandler _handler;
          private InputStream _input;
          private Metadata _metadata;

          public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
              _parser = parser; _handler = handler; _input = is; _metadata = metadata; ...
          }

          public ParsedDatum call() throws Exception {
              .... _parser.parse(_input, _handler, _metadata, new ParseContext()); ....
          }
      }


      This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers being wrapped by Tika could ever hang.
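
      To make the timeout handling concrete, here is a minimal sketch of the same FutureTask pattern with no Tika dependency. The names TimedCall and callWithTimeout, and the idea of returning a fallback value on timeout, are assumptions made for illustration; they are not part of Tika. Note that cancel(true) only interrupts the worker thread, so a parser that never checks for interruption would keep running in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a task on its own thread and bounds the wait. */
final class TimedCall {
    private TimedCall() {}

    /**
     * Waits at most the given timeout for the work to finish.
     * Returns the fallback value if the deadline passes first.
     */
    static <T> T callWithTimeout(Callable<T> work, long timeout, TimeUnit unit, T fallback)
            throws ExecutionException, InterruptedException {
        FutureTask<T> task = new FutureTask<>(work);
        Thread worker = new Thread(task, "timed-call");
        worker.setDaemon(true); // a hung worker should not keep the JVM alive
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that ignores interrupts keeps running.
            task.cancel(true);
            return fallback;
        }
    }
}
```

      A failure inside the task still surfaces as an ExecutionException, so the caller can tell "parse failed" apart from "parse timed out".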

      One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:

      Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);

      Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get().
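
      A rough sketch of how such a wrapper might look is below. To keep it self-contained it does not use Tika's real Parser interface: StreamParser is a hypothetical one-method stand-in, and reporting a timeout as an IOException is just one possible choice, not something this issue specifies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for Tika's Parser interface, reduced to one method. */
interface StreamParser {
    String parse(InputStream in) throws IOException;
}

/** Decorator that bounds how long the wrapped parser may run. */
final class TimeoutParser implements StreamParser {
    private final StreamParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutParser(StreamParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public String parse(InputStream in) throws IOException {
        Callable<String> call = () -> delegate.parse(in);
        FutureTask<String> task = new FutureTask<>(call);
        Thread worker = new Thread(task, "timeout-parser");
        worker.setDaemon(true);
        worker.start(); // one new thread per parse request
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // interrupt the worker; it may not react
            throw new IOException("Parse exceeded " + timeout + " " + unit, e);
        } catch (ExecutionException e) {
            // Unwrap the delegate's own failure where possible.
            Throwable cause = e.getCause();
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            throw new IOException("Parse failed", cause);
        } catch (InterruptedException e) {
            task.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            throw new IOException("Interrupted while waiting for parse", e);
        }
    }
}
```

      Because the wrapper implements the same interface as the parser it wraps, callers would not need to know whether a timeout is in effect.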

      One drawback of this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant compared with the cost of a typical parse operation.
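
      If the per-request thread ever did turn out to matter, one alternative would be to submit parse tasks to a shared pool. The sketch below is only an illustration under that assumption: PooledTimeoutRunner and its methods are hypothetical names, not Tika classes. It also has a cost of its own: a task that ignores interruption keeps occupying a pool thread after it times out, and time spent waiting in the queue counts against the timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical runner that reuses a fixed set of daemon worker threads. */
final class PooledTimeoutRunner implements AutoCloseable {
    private final ExecutorService pool;

    PooledTimeoutRunner(int threads) {
        this.pool = Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // hung workers should not block JVM exit
            return t;
        });
    }

    /** Runs the work on a pooled thread, waiting at most the given timeout. */
    <T> T run(Callable<T> work, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = pool.submit(work);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException | InterruptedException e) {
            future.cancel(true); // interrupt the worker, then report the problem
            throw e;
        }
    }

    @Override
    public void close() {
        pool.shutdownNow(); // interrupts any tasks still running
    }
}
```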





              • Assignee:
                tallison Tim Allison
                kkrugler Kenneth William Krugler
              • Votes:
                6
              • Watchers:
                9


                • Created: