Tika / TIKA-2098

Tika.parseToString() with maxLength doesn't work correctly for PDF files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14, 2.0.0
    • Component/s: parser

    Description

      When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata metadata, int maxLength) where maxLength is smaller than the content size, the call throws an exception instead of returning the text truncated to maxLength:

      org.apache.tika.exception.TikaException: Unable to extract all PDF content
      
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
      	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      	at org.apache.tika.Tika.parseToString(Tika.java:568)
      Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
      	at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
      	at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
      	at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
      	at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
      	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
      	at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
      	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
      	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
      	... 35 more
      Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      	at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
      	at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
      	at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
      	at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
      	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
      	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
      	at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
      	... 43 more
      Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      	at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	... 51 more
      Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      	at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	... 52 more
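
      The trace suggests why the limit is not handled for PDFs: the WriteLimitReachedException is wrapped by PDF2XHTML.writeString in an IOExceptionWithCause, which is then wrapped in a TikaException, so a caller that checks only the exception it catches would presumably not recognize the limit condition. Below is a minimal JDK-only sketch of detecting such a condition by walking the cause chain. LimitReachedException, CauseChain and both of its methods are hypothetical names used for illustration; they are not Tika classes, the chain is shorter than the real one, and this is not necessarily how the issue was fixed.

```java
import java.io.IOException;

// Hypothetical stand-in for Tika's WriteOutContentHandler$WriteLimitReachedException.
final class LimitReachedException extends Exception {
    LimitReachedException(String message) {
        super(message);
    }
}

final class CauseChain {
    private CauseChain() {
    }

    /** Returns true if t, or any exception in its cause chain, signals the write limit. */
    static boolean causedByLimit(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++) {
            if (t instanceof LimitReachedException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /** Builds a simplified chain shaped like the trace: outer -> IOException -> limit. */
    static Exception wrappedLikePdfParser() {
        Exception limit = new LimitReachedException("limit of 100 characters reached");
        IOException io = new IOException("Unable to write a string", limit);
        return new Exception("Unable to extract all PDF content", io);
    }
}
```

      With a check like this, a caller could treat the wrapped limit exception as "truncate and return" and rethrow anything else.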
      

People

    • Assignee: tallison Tim Allison
    • Reporter: alexshadow007 Alexander Kazakov