Tika / TIKA-3224

StackOverflowError with Embedded PDF in DOCX document

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.24.1
    • Fix Version/s: 1.27
    • Component/s: parser
    • Labels: None

    Description

      This issue has been reported by a user on discuss.elastic.co.

      I can reproduce the problem with the latest version of Tika (1.24.1) in the FSCrawler project.

      When extracting the data, we see:

      java.lang.StackOverflowError: null
      	at java.util.regex.Pattern$BmpCharPredicate.lambda$union$2(Pattern.java:5692) ~[?:?]
      	at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:4019) ~[?:?]
      	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4855) ~[?:?]
      	at java.util.regex.Pattern$BranchConn.match(Pattern.java:4763) ~[?:?]
      	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4886) ~[?:?]
      	at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:4020) ~[?:?]
      	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4855) ~[?:?]
      	at java.util.regex.Pattern$Branch.match(Pattern.java:4800) ~[?:?]
      	at java.util.regex.Pattern$Branch.match(Pattern.java:4798) ~[?:?]
      	at java.util.regex.Pattern$Branch.match(Pattern.java:4798) ~[?:?]
      	at java.util.regex.Pattern$BranchConn.match(Pattern.java:4763) ~[?:?]
      	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4886) ~[?:?]
      	at java.util.regex.Pattern$BmpCharPropertyGreedy.match(Pattern.java:4394) ~[?:?]
      	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4855) ~[?:?]
      	at java.util.regex.Pattern$Branch.match(Pattern.java:4800) ~[?:?]
      	at java.util.regex.Pattern$BranchConn.match(Pattern.java:4763) ~[?:?]
      	at java.util.regex.Pattern$GroupTail.match(Pattern.java:4886) ~[?:?]
      	at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:4020) ~[?:?]
      	at java.util.regex.Pattern$BmpCharPropertyGreedy.match(Pattern.java:4394) ~[?:?]
      	at java.util.regex.Pattern$GroupHead.match(Pattern.java:4855) ~[?:?]
      	at java.util.regex.Pattern$Branch.match(Pattern.java:4800) ~[?:?]
      	at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:4020) ~[?:?]
      	at java.util.regex.Pattern$Start.match(Pattern.java:3673) ~[?:?]
      	at java.util.regex.Matcher.search(Matcher.java:1729) ~[?:?]
      	at java.util.regex.Matcher.find(Matcher.java:773) ~[?:?]
      	at java.util.Formatter.parse(Formatter.java:2702) ~[?:?]
      	at java.util.Formatter.format(Formatter.java:2655) ~[?:?]
      	at java.util.Formatter.format(Formatter.java:2609) ~[?:?]
      	at java.lang.String.format(String.java:3292) ~[?:?]
      	at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:176) ~[?:?]
      	at java.util.logging.StreamHandler.publish(StreamHandler.java:199) ~[?:?]
      	at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:95) ~[?:?]
      	at java.util.logging.Logger.log(Logger.java:979) ~[?:?]
      	at java.util.logging.Logger.doLog(Logger.java:1006) ~[?:?]
      	at java.util.logging.Logger.logp(Logger.java:1172) ~[?:?]
      	at org.apache.commons.logging.impl.Jdk14Logger.log(Jdk14Logger.java:87) ~[?:?]
      	at org.apache.commons.logging.impl.Jdk14Logger.warn(Jdk14Logger.java:260) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:159) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:41) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:183) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
      

      This looks related to the PDFBox project, but I thought it would be useful to report it here as well.
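
      The repeated PDPageTree$PageIterator.enqueueKids frames in the trace suggest unbounded recursion while walking the PDF page tree. One plausible cause (an assumption, not confirmed in this report) is a page tree that is either cyclic or nested very deeply. The sketch below is not PDFBox code: PageTreeWalk, Node and collectPages are hypothetical names used only to illustrate how an iterative walk with an explicit stack and a visited set avoids exhausting the call stack in both cases.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a page-tree walk; these are NOT PDFBox classes. */
final class PageTreeWalk {

    /** Hypothetical node: a node without kids stands for a page, any other for an intermediate node. */
    static final class Node {
        final String name;
        final List<Node> kids = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        boolean isPage() {
            return kids.isEmpty();
        }
    }

    /** Collects page names in document order without using recursion. */
    static List<String> collectPages(Node root) {
        List<String> pages = new ArrayList<>();
        // Identity-based set: a node counts as visited only if it is the same object.
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);

        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (!visited.add(node)) {
                continue; // already seen: a cycle or a shared node, so skip it
            }
            if (node.isPage()) {
                pages.add(node.name);
                continue;
            }
            // Push kids in reverse so that they are popped in their original order.
            for (int i = node.kids.size() - 1; i >= 0; i--) {
                stack.push(node.kids.get(i));
            }
        }
        return pages;
    }
}
```

      With this approach the depth of the tree is limited by heap memory rather than by the thread stack size, and a node that refers back to one of its ancestors is visited only once.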

      Attachments

        1. oleObject1_cleaned.pdf
          4.30 MB
          Tim Allison
        2. issue-stackoverflow.docx
          3.92 MB
          David Pilato

            People

              Assignee: Unassigned
              Reporter: David Pilato (dadoonet)
              Votes: 0
              Watchers: 3
