Tika / TIKA-2787

Make WriteLimitReachedException public and not subclass of SAXException

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.19.1
    • Fix Version/s: 2.0.0
    • Component/s: core
    • Labels: None

    Description

      The point of setting a limit on text extraction is to get back up to N extracted characters. We just got tripped up by the fact that Tika throws an exception once the limit is reached.

      This, in and of itself, is not a major hindrance, especially since the error message clearly states that the extracted text is, "however, available".

      OK, but why is WriteLimitReachedException private? Why not make it public so it can be caught explicitly when the parse() method is called? And why not add it to the signature of the parse method? I don't think it should extend SAXException, either; it should simply be thrown as is.

      Right now, our code makes this cumbersome adjustment around the condition:

      ContentHandler handler = new BodyContentHandler(limit); // <-- e.g. set to 1000000
      try {
          parser.parse(dataStream, handler, metadata, parseCtx);
      } catch (IOException | TikaException ex) {
          throw ex;
      } catch (SAXException ex) {
          String message = (ex.getMessage() == null) ? "" : ex.getMessage();
          if (!message.contains("Your document contained more than")) {
              throw new TikaException("Tika error has occurred.", ex);
          } else {
              log.warn("TE limit reached on file {}.", filePath);
          }
      }
      
      // Keep the extracted text regardless of WriteLimitReachedException
      String text = handler.toString();
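
      For comparison, here is a minimal, self-contained sketch of what caller code could look like if the limit condition were a dedicated public exception type, so that no message matching is needed. It uses only the JDK's SAX classes, not Tika. The names WriteLimitSketch, LimitReachedException, CappedTextHandler, isLimitReached and extractUpTo are hypothetical and are not Tika's API. Note one assumption that differs from the request above: the sketch's exception still extends SAXException, because ContentHandler callbacks such as characters() declare only SAXException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {

    /** Hypothetical public exception type; NOT Tika's actual class. */
    public static class LimitReachedException extends SAXException {
        public LimitReachedException(int limit) {
            super("Write limit of " + limit + " characters reached");
        }
    }

    /** Hypothetical handler that keeps at most `limit` characters of text. */
    static class CappedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        CappedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                // Keep what still fits, then signal that the limit was hit.
                text.append(ch, start, room);
                throw new LimitReachedException(limit);
            }
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Checks by type, walking the cause chain in case the parser wrapped the exception. */
    static boolean isLimitReached(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Returns up to `limit` characters of text; any other parse failure is rethrown. */
    static String extractUpTo(String xml, int limit) {
        CappedTextHandler handler = new CappedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException ex) {
            if (!isLimitReached(ex)) {
                throw new IllegalStateException("Parse failed", ex);
            }
            // Limit reached: the text collected so far is still usable.
        } catch (IOException | ParserConfigurationException ex) {
            throw new IllegalStateException("Parse failed", ex);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractUpTo("<doc>hello world</doc>", 5));
    }
}
```

      The design point is that the caller distinguishes "limit reached" from a real parse failure by exception type rather than by inspecting the message text, which would break if the wording ever changed.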
      
      

       

          People

            Assignee: Unassigned
            Reporter: Dmitry Goldenberg (dgoldenberg123)
            Votes: 0
            Watchers: 4
