TIKA-1999

org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Ubuntu 16.04 (64 bit)
      Oracle Java 1.8.0_91-b14 (64 bit)

      Description

      When trying to read the following PDF document:

      http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf

      Tika crashes for me with a java.lang.StackOverflowError, caused by very deep recursion in:

          at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
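
To illustrate how a prefix lookup that recurses through a chain of parent elements can exhaust the stack, here is a minimal sketch. The `Element` class and its methods are hypothetical stand-ins, not Tika's actual `ElementInfo` code; the assumption is only that each lookup falls back to its parent, so the recursion depth equals the nesting depth.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixLookupSketch {

    /** Hypothetical stand-in for a nested-element record; not Tika's class. */
    static final class Element {
        final Element parent;
        final Map<String, String> namespaces = new HashMap<>();

        Element(Element parent) {
            this.parent = parent;
        }

        /** Recursive lookup: uses one stack frame per ancestor visited. */
        String prefixRecursive(String uri) {
            String prefix = namespaces.get(uri);
            if (prefix != null) {
                return prefix;
            }
            return parent == null ? null : parent.prefixRecursive(uri);
        }

        /** Iterative lookup: same result, constant stack depth. */
        String prefixIterative(String uri) {
            for (Element e = this; e != null; e = e.parent) {
                String prefix = e.namespaces.get(uri);
                if (prefix != null) {
                    return prefix;
                }
            }
            return null;
        }
    }

    /** Builds a chain of the given depth; only the root declares the namespace. */
    static Element chain(int depth, String uri, String prefix) {
        Element root = new Element(null);
        root.namespaces.put(uri, prefix);
        Element current = root;
        for (int i = 1; i < depth; i++) {
            current = new Element(current);
        }
        return current;
    }

    public static void main(String[] args) {
        String uri = "http://www.w3.org/1999/xhtml";
        Element leaf = chain(200_000, uri, "html");
        System.out.println("iterative: " + leaf.prefixIterative(uri));
        try {
            System.out.println("recursive: " + leaf.prefixRecursive(uri));
        } catch (StackOverflowError e) {
            System.out.println("recursive: StackOverflowError");
        }
    }
}
```

Whether the recursive call overflows depends on the JVM's thread stack size; the iterative form does not depend on it.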
      

      For some reason, the Tika App doesn't exhibit this behavior, but the following minimal working example (MWE) exposes the issue for me:

      import java.io.ByteArrayOutputStream;
      import java.io.File;
      import java.io.FileInputStream;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.sax.ToHTMLContentHandler;
      
      public class test
      {
          public static void main(String [] args) throws Exception {
              String p = "/home/eggie/faulty_pdf_document.pdf";
              
              FileInputStream input = new FileInputStream(new File(p));
              AutoDetectParser tk = new AutoDetectParser();
              ByteArrayOutputStream os = new ByteArrayOutputStream();
              ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
              ParseContext pc = new ParseContext();
              System.out.println("Parsing");
              tk.parse(input, handler, new Metadata(), pc);
          }
      }
      

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Thank you for opening this and sharing a triggering file. If you use pdfbox-app's ExtractText, do you run into the same issue? That'd be PDFBox 2.0.1.

        I'll take a look in the next few days.

        tallison@mitre.org Tim Allison added a comment -

        PDFBox's ExtractText works on this file.

        The StackOverflowError is caused by the XMP Media Management (XMPMM) metadata extraction that I added in TIKA-1894. This document has 6,269 media management "events".

        I don't know why this is only a problem for ToHTMLContentHandler and not for the plain ToXMLContentHandler.

        We could add a limit on the number of events that are added to the metadata. Say, 1,000?

        tallison@mitre.org Tim Allison added a comment -

        As a temporary workaround, you can increase your stack size: -Xss4m.
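
If changing JVM flags is not an option, a larger stack can also be requested for a single thread through the four-argument `Thread` constructor. The sketch below is an illustration only: `BigStackRunner` and `callWithStack` are hypothetical names, `depth` is a stand-in for the deeply recursive parse call, and the Java documentation notes that the requested stack size is a hint that some platforms may ignore.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

public class BigStackRunner {

    /** Runs the task on a new thread that requests the given stack size in bytes. */
    public static <T> T callWithStack(long stackBytes, Callable<T> task) throws Exception {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(null, () -> {
            try {
                result.set(task.call());
            } catch (Throwable t) {
                failure.set(t); // rethrown on the calling thread below
            }
        }, "big-stack-worker", stackBytes);
        worker.start();
        worker.join();
        Throwable t = failure.get();
        if (t instanceof Exception) {
            throw (Exception) t;
        }
        if (t instanceof Error) {
            throw (Error) t;
        }
        if (t != null) {
            throw new RuntimeException(t);
        }
        return result.get();
    }

    /** Stand-in for a deeply recursive call such as the parse in the report. */
    static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    public static void main(String[] args) throws Exception {
        // 4 MB mirrors the -Xss4m suggestion above.
        int reached = callWithStack(4L * 1024 * 1024, () -> depth(10_000));
        System.out.println("depth reached: " + reached);
    }
}
```

In the reporter's example, the lambda would wrap the `tk.parse(...)` call instead of `depth(...)`.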

        MadEgg Egbert added a comment -

        I'm sorry, I don't really know what the effect of the limit would be. I am using Tika to extract plain text from PDF documents to be able to import them into a search index, so I do not have a lot of interest in the metadata.

        I'll try your suggested workaround to increase the stack size. Thanks!

        tallison@mitre.org Tim Allison added a comment -

        Added a configurable limit, with the default set to 1024.
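
As a rough sketch of what such a cap looks like in general (this is not the code committed to JempboxExtractor; the class, field, and method names here are hypothetical), entries are copied in order until the configured maximum is reached and the rest are skipped:

```java
import java.util.ArrayList;
import java.util.List;

public class CappedEvents {

    /** Default cap, matching the 1024 mentioned above. */
    public static final int DEFAULT_MAX_EVENTS = 1024;

    private int maxEvents = DEFAULT_MAX_EVENTS;

    /** Hypothetical setter that makes the cap configurable. */
    public void setMaxEvents(int maxEvents) {
        if (maxEvents < 0) {
            throw new IllegalArgumentException("maxEvents must be >= 0");
        }
        this.maxEvents = maxEvents;
    }

    /** Copies at most maxEvents entries from the source, preserving order. */
    public List<String> collect(Iterable<String> source) {
        List<String> kept = new ArrayList<>();
        for (String event : source) {
            if (kept.size() >= maxEvents) {
                break; // remaining events are ignored
            }
            kept.add(event);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        for (int i = 0; i < 6269; i++) { // event count reported for the triggering PDF
            events.add("event-" + i);
        }
        CappedEvents capped = new CappedEvents();
        System.out.println("kept: " + capped.collect(events).size());
    }
}
```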

        hudson Hudson added a comment -

        FAILURE: Integrated in tika-2.x-windows #12 (See https://builds.apache.org/job/tika-2.x-windows/12/)
        TIKA-1999: add configurable limit to number of events extracted in XMPMM (tallison: rev 89062edb0584980d09d55ead215373214cb2895d)

        • tika-test-resources/src/test/resources/test-documents/testXMP.xmp
        • tika-parser-modules/tika-parser-xmp-commons/src/test/java/org/apache/tika/parser/xmp/JempboxExtractorTest.java
        • tika-parser-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #108 (See https://builds.apache.org/job/tika-2.x/108/)
        TIKA-1999: add configurable limit to number of events extracted in XMPMM (tallison: rev 89062edb0584980d09d55ead215373214cb2895d)

        • tika-parser-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
        • tika-test-resources/src/test/resources/test-documents/testXMP.xmp
        • tika-parser-modules/tika-parser-xmp-commons/src/test/java/org/apache/tika/parser/xmp/JempboxExtractorTest.java
        hudson Hudson added a comment -

        UNSTABLE: Integrated in tika-trunk-jdk1.7 #1007 (See https://builds.apache.org/job/tika-trunk-jdk1.7/1007/)
        TIKA-1999 add limit to number of events extracted from the XMPMM section (tallison: rev 3e14505381eefa603adabe61171c0c19fc685b2f)

        • tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java
        • tika-parsers/src/test/java/org/apache/tika/parser/image/xmp/JempboxExtractorTest.java
        • tika-parsers/src/test/resources/test-documents/testXMP.xmp
        hudson Hudson added a comment -

        FAILURE: Integrated in tika-2.x-windows #13 (See https://builds.apache.org/job/tika-2.x-windows/13/)
        TIKA-1999: fix setter, update changes.txt (tallison: rev ac52e5c15852231d003526045124e3fafdafaf90)

        • tika-parser-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
        • CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #109 (See https://builds.apache.org/job/tika-2.x/109/)
        TIKA-1999: fix setter, update changes.txt (tallison: rev ac52e5c15852231d003526045124e3fafdafaf90)

        • tika-parser-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
        • CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #1008 (See https://builds.apache.org/job/tika-trunk-jdk1.7/1008/)
        TIKA-1999 small fix and update CHANGES.txt (tallison: rev 99aa587d171207c0c557ce65397f767d6a42cdfd)

        • tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java
        • CHANGES.txt

          People

          • Assignee: Tim Allison (tallison@mitre.org)
          • Reporter: Egbert (MadEgg)