Description
When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata metadata, int maxLength) where maxLength is smaller than the content size, the call throws a TikaException instead of returning the text up to the limit:
org.apache.tika.exception.TikaException: Unable to extract all PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:568)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
    at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
    at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
    at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
    at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
    ... 35 more
Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available). org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available). org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
    at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
    ... 43 more
Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available). org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    ... 51 more
Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    ... 52 more
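The trace shows the write-limit signal buried three levels deep: WriteLimitReachedException is wrapped in a TaggedSAXException, then an IOExceptionWithCause, then the TikaException that reaches the caller. As a rough illustration only, a caller could tell this case apart from a genuine parse failure by walking the cause chain and matching on the class name. The sketch below uses only the JDK; the class WriteLimitCauseCheck, the method causedByWriteLimit, and the nested stand-in exception are hypothetical names invented for this example and are not part of Tika.

```java
import java.io.IOException;

final class WriteLimitCauseCheck {

    // Hypothetical stand-in for Tika's WriteLimitReachedException, so the
    // sketch compiles without Tika on the classpath.
    static class WriteLimitReachedException extends Exception {
        WriteLimitReachedException(String message) {
            super(message);
        }
    }

    // Returns true if any throwable in the cause chain has a class name
    // ending in "WriteLimitReachedException". Matching by name is an
    // assumption made here because the real class is not referenced directly.
    static boolean causedByWriteLimit(Throwable t) {
        // Bound the walk to guard against cyclic cause chains.
        for (int depth = 0; t != null && depth < 64; depth++, t = t.getCause()) {
            if (t.getClass().getName().endsWith("WriteLimitReachedException")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mimic the nesting seen in the trace: outer exception -> IOException -> limit signal.
        Throwable limit = new WriteLimitReachedException("limit reached");
        Throwable wrapped = new Exception("Unable to extract all PDF content",
                new IOException("Unable to write a string", limit));

        System.out.println(causedByWriteLimit(wrapped));
        System.out.println(causedByWriteLimit(new IOException("corrupt file")));
    }
}
```

A check like this only classifies the failure; it does not recover the text extracted before the limit was hit, which is the behavior this issue asks for.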
Issue Links
- is depended upon by: TIKA-2151 Imposed Write Limit Causes Lost Data With Pdfs (Resolved)