  Tika / TIKA-456

Support timeouts for parsers

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: parser
    • Labels: None

      Description

      There are a number of reasons why Tika could hang while parsing. One common case is a parser being fed an incomplete document, as happens when a web crawl limits the amount of data fetched.

      One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:

      Parser parser = new AutoDetectParser();
      Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
      FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
      Thread t = new Thread(task);
      t.start();

      // Throws TimeoutException if the parse does not finish in time.
      ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
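
      The timed get() on a FutureTask throws TimeoutException when the limit is exceeded, but the worker thread keeps running unless it is cancelled. A minimal sketch of handling that case follows. It does not depend on Tika: TimedRun and runWithTimeout are hypothetical names, and any Callable stands in for the parse operation.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Tika): runs a task on its own thread and
// gives up after timeoutMillis, mirroring the FutureTask pattern above.
final class TimedRun {
    /** Returns the task's result, or an empty Optional if it timed out (or returned null). */
    static <T> Optional<T> runWithTimeout(Callable<T> work, long timeoutMillis) {
        FutureTask<T> task = new FutureTask<>(work);
        Thread worker = new Thread(task, "timed-parse");
        worker.setDaemon(true); // a hung task should not keep the JVM alive
        worker.start();
        try {
            return Optional.ofNullable(task.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Interrupts the worker; this only stops tasks that respond to interruption.
            task.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            task.cancel(true);
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

      Note that cancel(true) only interrupts the worker thread. A parser stuck in a loop that never checks for interruption would keep running in the background, so this bounds how long the caller waits, not how long the parse consumes CPU.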

      And TikaCallable looks like:

      class TikaCallable implements Callable<ParsedDatum> {
          private final Parser _parser;
          private final ContentHandler _handler;
          private final InputStream _input;
          private final Metadata _metadata;

          public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
              _parser = parser;
              _handler = handler;
              _input = is;
              _metadata = metadata;
              ...
          }

          public ParsedDatum call() throws Exception {
              ....
              _parser.parse(_input, _handler, _metadata, new ParseContext());
              ....
          }
      }

      This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers being wrapped by Tika could ever hang.

      One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:

      Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);

      Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get().
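
      As a rough illustration of that idea, here is a sketch of such a wrapper. To keep it compilable without Tika on the classpath, it uses a hypothetical one-method SimpleParser interface as a stand-in for Tika's Parser. The TimeoutParser constructor arguments follow the proposal above, and TimeoutParserDemo is only there to show usage.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Stand-in for Tika's Parser interface (an assumption for this sketch); the
// real parse method also takes a ContentHandler, Metadata and ParseContext.
interface SimpleParser {
    String parse(InputStream in) throws Exception;
}

// Hypothetical decorator: same interface as the wrapped parser, plus a time limit.
final class TimeoutParser implements SimpleParser {
    private final SimpleParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutParser(SimpleParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public String parse(InputStream in) throws Exception {
        // Run the wrapped parser on a separate thread, one thread per call.
        FutureTask<String> task = new FutureTask<>(() -> delegate.parse(in));
        Thread worker = new Thread(task, "timeout-parser");
        worker.setDaemon(true);
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // best effort: interrupts the worker thread
            throw e;
        } catch (ExecutionException e) {
            // Unwrap so callers see the delegate's own exception where possible.
            Throwable cause = e.getCause();
            throw (cause instanceof Exception) ? (Exception) cause : e;
        }
    }
}

final class TimeoutParserDemo {
    /** Returns the parse result, or a short description of what went wrong. */
    static String describe(SimpleParser parser) {
        try {
            return parser.parse(new ByteArrayInputStream(new byte[0]));
        } catch (TimeoutException e) {
            return "timed out";
        } catch (Exception e) {
            return "failed: " + e;
        }
    }
}
```

      A version built against Tika itself would presumably implement org.apache.tika.parser.Parser, delegate getSupportedTypes to the wrapped parser, and need to decide which of the exception types declared by parse should signal a timeout.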

      One drawback of this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant compared to the cost of a typical parse operation.
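
      If the per-request thread did turn out to matter, one alternative would be to submit parse tasks to a shared pool. The sketch below is an assumption about how that could look and is not part of the proposal. PooledTimedRunner and its run method are hypothetical names, and any Callable stands in for the parse.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: reuses a fixed pool of worker threads across parse
// requests instead of starting a new Thread for each one.
final class PooledTimedRunner implements AutoCloseable {
    private final ExecutorService pool;

    PooledTimedRunner(int threads) {
        this.pool = Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // hung workers should not block JVM shutdown
            return t;
        });
    }

    /** Returns the task's result, or fallback if it does not finish in time. */
    <T> T run(Callable<T> work, long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(work);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker if it has started
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

      The trade-off is that a parser which ignores interruption keeps occupying a pool thread after its timeout, and the timeout clock starts at submission, so time spent waiting in the queue counts against the limit. With a thread per request neither issue arises, which supports the simpler approach described above.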

              People

              • Assignee: Tim Allison (tallison@apache.org)
              • Reporter: Ken Krugler (kkrugler)
              • Votes: 6
              • Watchers: 9
