Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.0.7
- None
- None
Description
I got an exception while extracting HTML from a PDF. The source PDF is not available.
Main cause:
org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ....
Caused by: java.io.IOException: name for 'gs' operator not found in resources: /R8
    at org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters.process(SetGraphicsStateParameters.java:54)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    ... 27 more
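Since the source PDF is not available, the failure can only be illustrated in the abstract. The message suggests that the page's content stream invokes the `gs` operator with the name /R8, and that no graphics state with that name is present in the page resources. The following is a minimal, self-contained sketch of that lookup pattern. It does not use the PDFBox API: the class `GsLookupSketch`, the method `applyGs`, and the map standing in for the resource dictionary are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class GsLookupSketch {

    // Hypothetical stand-in for handling the 'gs' operator: look up the named
    // graphics state in a map that plays the role of the page's resources.
    static String applyGs(Map<String, String> extGStates, String name) throws IOException {
        String state = extGStates.get(name);
        if (state == null) {
            // Mirrors the message seen in the stack trace above.
            throw new IOException("name for 'gs' operator not found in resources: /" + name);
        }
        return state;
    }

    public static void main(String[] args) {
        // Assumed resources: only /R7 is defined, /R8 is missing.
        Map<String, String> extGStates = new HashMap<>();
        extGStates.put("R7", "line width 1.0");

        try {
            applyGs(extGStates, "R8");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch a missing name aborts processing with an exception, which matches the behavior in the trace; whether a parser should instead skip the operator and continue is a separate design question that this ticket does not settle.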
Issue Links
- is duplicated by: PDFBOX-3950 NPE in PageIterator.enqueueKids (Closed)