TIKA-456: Support timeouts for parsers


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.15, 2.0.0
    • Component/s: parser
    • Labels: None

    Description

      There are a number of reasons why Tika could hang while parsing. One common case is when a parser is fed an incomplete document, as happens when a web crawl limits the amount of data it fetches.

      One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:

      Parser parser = new AutoDetectParser();
      Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
      FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
      Thread t = new Thread(task);
      t.start();

      ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);

      And the TikaCallable class looks like:

      class TikaCallable implements Callable<ParsedDatum> {
          public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
              _parser = parser; _handler = handler; _input = is; _metadata = metadata; ...
          }

          public ParsedDatum call() throws Exception {
              .... _parser.parse(_input, _handler, _metadata, new ParseContext()); ....
          }
      }

      This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers being wrapped by Tika could ever hang.
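
The Callable plus FutureTask pattern described above can be sketched without any Tika dependency. The snippet below is only an illustration under stated assumptions: `TimedCall` and its `run` method are hypothetical names, and the `Callable` stands in for the parse operation. It also cancels the task on timeout, a step the original snippet does not show.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Tika): runs any Callable on its own thread
// and gives up waiting after the specified timeout.
final class TimedCall {
    private TimedCall() {}

    static <T> T run(Callable<T> work, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        FutureTask<T> task = new FutureTask<>(work);
        Thread t = new Thread(task, "timed-call");
        t.setDaemon(true); // a stuck worker should not keep the JVM alive
        t.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that ignores interrupts keeps running.
            task.cancel(true);
            throw e;
        }
    }
}
```

Note that cancelling only interrupts the worker thread. A parser stuck in a loop that never checks for interruption will keep consuming CPU even after the caller has moved on.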

      One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:

      Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);

      Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get().
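
A rough sketch of how such a wrapper might look follows. To keep it compilable without Tika on the classpath, it uses a hypothetical one-method `SimpleParser` interface in place of Tika's `Parser`; a real implementation would implement `Parser` and forward the handler, metadata, and context arguments. Wrapping the timeout in an `IOException` is also just an assumption made for this sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for Tika's Parser interface, reduced to one method
// so that this sketch compiles without Tika on the classpath.
interface SimpleParser {
    String parse(InputStream input) throws Exception;
}

// Sketch of the proposed wrapper: delegates to another parser, but waits
// no longer than the configured timeout for the result.
final class TimeoutParser implements SimpleParser {
    private final SimpleParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutParser(SimpleParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public String parse(InputStream input) throws Exception {
        // One new thread per parse request, as in the proposal.
        FutureTask<String> task = new FutureTask<>(() -> delegate.parse(input));
        Thread worker = new Thread(task, "timeout-parser");
        worker.setDaemon(true);
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // best effort: only interrupts the worker
            throw new IOException("Parse exceeded " + timeout + " " + unit, e);
        } catch (ExecutionException e) {
            // Unwrap so callers see the delegate's own exception.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) throw (Exception) cause;
            throw e;
        }
    }
}
```

With this shape, callers keep using the ordinary `parse` call and only the construction site changes, for example `new TimeoutParser(realParser, 20, TimeUnit.SECONDS)` where `realParser` is whatever parser is being wrapped.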

      One drawback of this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant compared to the cost of a typical parse operation.
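
If the per-request thread cost ever did matter, one possible variation (not part of the original proposal) is to submit the work to a shared `ExecutorService` so that idle threads are reused. The sketch below assumes a hypothetical `PooledTimedCall` helper and again uses a plain `Callable` in place of the parse operation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical variant: reuses idle worker threads across parse requests.
final class PooledTimedCall {
    // Cached pool: creates threads on demand, reuses idle ones, and
    // reclaims threads that stay idle for 60 seconds.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "pooled-parse");
        t.setDaemon(true); // do not block JVM shutdown
        return t;
    });

    private PooledTimedCall() {}

    static <T> T run(Callable<T> work, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        Future<T> future = POOL.submit(work);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; a truly hung task still occupies its thread.
            future.cancel(true);
            throw e;
        }
    }
}
```

The trade-off is that a hung parse permanently ties up a pool thread, so an unbounded cached pool can still grow if many documents hang.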

            People

              Assignee: Tim Allison (tallison)
              Reporter: Kenneth William Krugler (kkrugler)
              Votes: 6
              Watchers: 9
