
Tika.parseToString() with maxLength doesn't work correctly for PDF files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: parser
    • Labels:

      Description

      When a PDF file is parsed with Tika.parseToString(InputStream stream, Metadata metadata, int maxLength) and maxLength is smaller than the content length, the call throws the following exception rather than returning the text up to the limit.

      org.apache.tika.exception.TikaException: Unable to extract all PDF content
      
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
      	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      	at org.apache.tika.Tika.parseToString(Tika.java:568)
      Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
      	at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
      	at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
      	at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
      	at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
      	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
      	at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
      	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
      	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
      	... 35 more
      Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      	at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
      	at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
      	at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
      	at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
      	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
      	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
      	at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
      	... 43 more
      Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      	at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	... 51 more
      Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      	at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
      	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      	... 52 more
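
      The trace above shows the write-limit signal (WriteLimitReachedException) reaching the caller wrapped in a TaggedSAXException, then an IOExceptionWithCause, then a TikaException. The following is a minimal, stdlib-only sketch of that situation, assuming the caller wants the truncated text back: it walks the cause chain to recognise the limit signal under the wrappers. Every class and method name in it is a hypothetical stand-in, not Tika's API, and it is not meant to represent the fix that was actually committed.

```java
/** Stdlib-only sketch; every name here is hypothetical and not part of Tika. */
final class WriteLimitSketch {

    /** Stand-in for the "write limit reached" signal at the bottom of the trace. */
    static final class LimitReachedException extends RuntimeException {
        LimitReachedException(String message) { super(message); }
    }

    /** Collects text and throws once more than maxLength characters are offered. */
    static final class LimitedSink {
        private final StringBuilder out = new StringBuilder();
        private final int maxLength;

        LimitedSink(int maxLength) { this.maxLength = maxLength; }

        void write(String text) {
            int room = maxLength - out.length();
            if (text.length() > room) {
                out.append(text, 0, room);   // keep the text up to the limit
                throw new LimitReachedException("more than " + maxLength + " characters");
            }
            out.append(text);
        }

        String text() { return out.toString(); }
    }

    /** Stand-in for a parser that wraps handler failures twice, as in the PDF trace. */
    static void parseWithWrapping(String[] chunks, LimitedSink sink) {
        try {
            try {
                for (String chunk : chunks) {
                    sink.write(chunk);
                }
            } catch (RuntimeException e) {
                throw new IllegalStateException("Unable to write a string", e);
            }
        } catch (IllegalStateException e) {
            throw new RuntimeException("Unable to extract all content", e);
        }
    }

    /** Returns true if the limit signal appears anywhere in the cause chain. */
    static boolean causedByLimit(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Returns at most maxLength characters; rethrows anything that is not the limit signal. */
    static String parseToString(String[] chunks, int maxLength) {
        LimitedSink sink = new LimitedSink(maxLength);
        try {
            parseWithWrapping(chunks, sink);
        } catch (RuntimeException e) {
            if (!causedByLimit(e)) {
                throw e;
            }
        }
        return sink.text();
    }
}
```

      In this sketch a check that inspects only the outermost exception would miss the limit signal, because the stand-in parser wraps it twice; walking getCause() is one way to see through the wrappers.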
      

              People

              • Assignee:
                tallison@mitre.org Tim Allison
                Reporter:
                alexshadow007 Alexander Kazakov
              • Votes:
                0
                Watchers:
                4
