Tika / TIKA-617

Series of exceptions from PDFBox

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

      Description

      Hi,

      I am getting the following exceptions from PDFBox when running tika-app on the PDF below. Thank you!

      (If I should file these upstream at PDFBox first, please let me know.)

      $ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
      ERROR - Stop reading corrupt stream
      INFO - unsupported/disabled operation: f24.481
      INFO - unsupported/disabled operation: ree)n.
      WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
      java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
      	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
      	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
      	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
      	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
      	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
      	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
      	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
      	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
      	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
      	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
      	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
      	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
      INFO - unsupported/disabled operation: i-
      INFO - unsupported/disabled operation: R4%
      INFO - unsupported/disabled operation: )
      INFO - unsupported/disabled operation: Re.8
      INFO - unsupported/disabled operation: e.
      INFO - unsupported/disabled operation: FE)-
      WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
      java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
      	at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
      	at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
      	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
      	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
      	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
      	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
      	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
      	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
      	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
      	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
      	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
      	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
      INFO - unsupported/disabled operation: R3%
      INFO - unsupported/disabled operation: T
      Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
      	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
      	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
      	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
      Caused by: java.lang.RuntimeException: java.io.IOException: Error: Expected operator 'ID' actual='I8'
      	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
      	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
      	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
      	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
      	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
      	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
      	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
      	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
      	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      	... 5 more
      Caused by: java.io.IOException: Error: Expected operator 'ID' actual='I8'
      	at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:382)
      	at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
      	at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:175)
      	... 15 more
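For triage, here is a minimal sketch of the failure pattern that the two ClassCastException traces suggest: an operator handler that casts its first operand to an array without checking its type, which fails when a corrupt stream supplies an integer instead. This is an assumption drawn from the trace alone. The types and methods below (Operand, IntOperand, ArrayOperand, countUnchecked, countGuarded) are hypothetical stand-ins, not PDFBox classes or PDFBox's actual code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OperandGuardDemo {
    // Hypothetical stand-ins for parsed content-stream operands (not PDFBox types).
    interface Operand {}

    static final class IntOperand implements Operand {
        final long value;
        IntOperand(long value) { this.value = value; }
    }

    static final class ArrayOperand implements Operand {
        final List<Operand> items;
        ArrayOperand(List<Operand> items) { this.items = items; }
    }

    // Unchecked variant: casts the first operand blindly, so an IntOperand
    // in that position raises ClassCastException.
    static int countUnchecked(List<Operand> arguments) {
        ArrayOperand array = (ArrayOperand) arguments.get(0);
        return array.items.size();
    }

    // Guarded variant: returns -1 when the first operand is missing or is
    // not an array, so the caller can skip the operator instead of failing.
    static int countGuarded(List<Operand> arguments) {
        if (arguments.isEmpty() || !(arguments.get(0) instanceof ArrayOperand)) {
            return -1;
        }
        return ((ArrayOperand) arguments.get(0)).items.size();
    }

    public static void main(String[] args) {
        // Operand list as a corrupt stream might produce it: an integer
        // where an array was expected.
        List<Operand> corrupt = Collections.<Operand>singletonList(new IntOperand(24));
        try {
            countUnchecked(corrupt);
        } catch (ClassCastException e) {
            System.out.println("unchecked: ClassCastException");
        }
        System.out.println("guarded: " + countGuarded(corrupt));

        // A well-formed operand list goes through the guarded path normally.
        List<Operand> ok = Collections.<Operand>singletonList(
                new ArrayOperand(Arrays.<Operand>asList(new IntOperand(1), new IntOperand(2))));
        System.out.println("guarded (well-formed): " + countGuarded(ok));
    }
}
```

Whether skipping a malformed operator is the right recovery is a design choice for the parser. The sketch only shows that a type check turns the exception into a condition the caller can handle.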
      

          People

          • Assignee: Unassigned
          • Reporter: Erik Hetzner