Tika / TIKA-2151

Imposed Write Limit Causes Lost Data With PDFs

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14, 2.0.0
    • Component/s: core
    • Labels: None

    Description

      When we upgraded to 1.13, we noticed a new exception in our logs:

      org.apache.tika.exception.TikaException: Unable to extract all PDF content
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:184)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.Tika.parseToString(Tika.java:527)
      at org.apache.tika.Tika.parseToString(Tika.java:602)
      at com.attask.tika.WriteLimitAllCatchTikaTest.testStillNeedOverride(WriteLimitAllCatchTikaTest.java:31)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
      at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
      at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
      at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
      at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
      at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
      at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
      at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
      at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
      at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
      at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
      at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
      at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
      at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
      at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78)
      at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212)
      at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
      Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: One will of mine to make thy large will more.
      at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:500)
      at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
      at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
      at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
      at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
      at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:214)
      at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
      at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
      ... 33 more
      Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      org.apache.tika.sax.TaggedSAXException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
      at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
      at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
      at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
      at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
      at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:305)
      at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:498)
      ... 41 more
      Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      ... 49 more
      Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
      at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      ... 50 more

      This appears to happen because the top-level exception is a TikaException rather than a SAXException, so Tika#parseToString never catches it and never checks whether its root cause is a WriteLimitReachedException. As a result, the first 100000 parsed characters are not returned.
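The mechanism can be illustrated without Tika on the classpath. The following is only a sketch under stated assumptions: `LimitReached` and `WrapperException` are hypothetical stand-ins for WriteLimitReachedException and TikaException, and `parse()` simply rebuilds the nesting seen in the stack trace above. It is not Tika's actual code.

```java
import java.io.IOException;
import org.xml.sax.SAXException;

public class CauseChainDemo {
    /** Hypothetical stand-in for WriteLimitReachedException (a SAXException). */
    static class LimitReached extends SAXException {
        LimitReached(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for TikaException, which is not a SAXException. */
    static class WrapperException extends Exception {
        WrapperException(String msg, Throwable cause) { super(msg, cause); }
    }

    /** Walks the cause chain and reports whether any link is a LimitReached. */
    static boolean causedByLimit(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof LimitReached) {
                return true;
            }
        }
        return false;
    }

    /** Simulated parser: the limit signal is buried two levels deep. */
    static void parse() throws SAXException, WrapperException {
        throw new WrapperException("Unable to extract all PDF content",
                new IOException("Unable to write a string",
                        new LimitReached("limit reached")));
    }

    /** Shows which catch clause actually sees the failure. */
    static String classify() {
        try {
            parse();
            return "ok";
        } catch (SAXException e) {
            // Never reached here: the top-level exception is not a SAXException.
            return "caught as SAXException";
        } catch (WrapperException e) {
            return causedByLimit(e)
                    ? "escaped, but root cause is the write limit"
                    : "escaped, unrelated failure";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify());
    }
}
```

In this model the `catch (SAXException e)` clause is skipped even though the root cause is a SAXException subclass, because only the outermost exception type is matched.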

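      Here is a quick repro:

      import java.io.InputStream;
      import org.apache.tika.Tika;

      Tika tika = new Tika();
      InputStream is = this.getClass().getClassLoader().getResourceAsStream("pg100.pdf");

      try {
          // Expected: the first 100000 characters are returned without an exception.
          String s = tika.parseToString(is);
          System.out.println("It works!");
      } catch (Exception e) {
          // Actual on 1.13: the TikaException shown above lands here.
          System.out.println("Tika missed the WriteLimitReachedException");
      }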

      The PDF used here (pg100.pdf) can be any PDF with more than 100000 parseable characters.

      Not sure I understand all the ins and outs, but we fixed it by extending Tika.java and overriding Tika#parseToString to catch Exception instead of SAXException.
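The idea behind that workaround can be modeled in plain Java. This is a sketch, not our actual override and not Tika's API: `LimitedSink`, `LimitReached`, and this `parseToString` are hypothetical names that only mimic the shape of the problem. One deliberate difference from simply catching Exception is that failures unrelated to the limit are rethrown, which is an assumption about what a safer version would do.

```java
import java.io.IOException;
import org.xml.sax.SAXException;

public class PartialTextWorkaround {
    /** Hypothetical stand-in for WriteLimitReachedException. */
    static class LimitReached extends SAXException {
        LimitReached() { super("write limit reached"); }
    }

    /** Hypothetical sink that keeps at most `limit` characters, then signals. */
    static class LimitedSink {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedSink(int limit) { this.limit = limit; }

        void write(String s) throws SAXException {
            int room = limit - text.length();
            if (s.length() > room) {
                text.append(s, 0, room); // keep the text up to the limit
                throw new LimitReached();
            }
            text.append(s);
        }

        String text() { return text.toString(); }
    }

    /** Simulated parser: wraps any sink failure in a non-SAX exception. */
    static void parse(String[] chunks, LimitedSink sink) throws IOException {
        for (String chunk : chunks) {
            try {
                sink.write(chunk);
            } catch (SAXException e) {
                throw new IOException("Unable to write a string: " + chunk, e);
            }
        }
    }

    /** Returns true if any exception in the cause chain is a LimitReached. */
    static boolean hitLimit(Throwable t) {
        for (; t != null; t = t.getCause()) {
            if (t instanceof LimitReached) {
                return true;
            }
        }
        return false;
    }

    /** Catches broadly, but swallows the failure only when the limit caused it. */
    static String parseToString(String[] chunks, int limit) throws Exception {
        LimitedSink sink = new LimitedSink(limit);
        try {
            parse(chunks, sink);
        } catch (Exception e) {
            if (!hitLimit(e)) {
                throw e; // unrelated failures still propagate
            }
        }
        return sink.text();
    }

    public static void main(String[] args) throws Exception {
        // Prints "abcde": the text up to the 5-character limit.
        System.out.println(parseToString(new String[] {"abc", "defg", "hij"}, 5));
    }
}
```

Checking the cause chain before swallowing the exception keeps genuine parse errors visible while still returning the truncated text.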

      People

        Assignee: Unassigned
        Reporter: Josh Cummings (josh.cummings)