Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 1.8.16, 2.0.20, 2.0.21
- None
Description
Dear Devs,
We've encountered an issue with versions 2.0.20 and 2.0.21 of PDFBox when parsing a PDF for text extraction. The issue seems to have existed before; see FOP-2751.
I reproduced this issue with the pdfbox-app and the FuturaStd-Book.pdf of FOP-2751:
Console output
java -jar pdfbox-app-2.0.21.jar ExtractText FuturaStd-Book.pdf
Dez 04, 2020 11:06:00 AM org.apache.pdfbox.pdmodel.font.PDType1CFont <init>
SCHWERWIEGEND: Can't read the embedded Type1C font FuturaStd-Book
java.io.IOException: illegal offset value 2949166 in CFF font
    at org.apache.fontbox.cff.CFFParser.readIndexDataOffsets(CFFParser.java:192)
    at org.apache.fontbox.cff.CFFParser.readIndexData(CFFParser.java:201)
    at org.apache.fontbox.cff.CFFParser.parseFont(CFFParser.java:484)
    at org.apache.fontbox.cff.CFFParser.parse(CFFParser.java:122)
    at org.apache.fontbox.cff.CFFParser.parse(CFFParser.java:75)
    at org.apache.pdfbox.pdmodel.font.PDType1CFont.<init>(PDType1CFont.java:102)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:74)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:397)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:325)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:272)
    at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:377)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:274)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
Dez 04, 2020 11:06:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNUNG: New fonts found, font cache will be re-built
Dez 04, 2020 11:06:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNUNG: Building on-disk font cache, this may take a while
Dez 04, 2020 11:06:02 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNUNG: Finished building on-disk font cache, found 550 fonts
Dez 04, 2020 11:06:02 AM org.apache.pdfbox.pdmodel.font.PDType1CFont <init>
WARNUNG: Using fallback font Courier for FuturaStd-Book
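For readers unfamiliar with the message, here is a minimal sketch of the kind of check that could produce "illegal offset value ... in CFF font". It is not the actual CFFParser code. It assumes the INDEX layout from Adobe's CFF specification (a 2-byte count, a 1-byte offSize, then count + 1 big-endian offsets) and assumes the parser rejects any offset larger than the font data. The class name CffIndexSketch, the method readIndexOffsets, the fontLength parameter, and the sample bytes are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

final class CffIndexSketch {

    /**
     * Reads the offset array of a CFF INDEX header.
     * fontLength is the assumed upper bound for a plausible offset.
     */
    static int[] readIndexOffsets(byte[] index, int fontLength) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(index));
        int count = in.readUnsignedShort();          // number of objects in the INDEX
        if (count == 0) {
            return new int[0];                       // an empty INDEX has no offset array
        }
        int offSize = in.readUnsignedByte();         // bytes per offset, 1 to 4
        if (offSize < 1 || offSize > 4) {
            throw new IOException("illegal offSize " + offSize);
        }
        int[] offsets = new int[count + 1];
        for (int i = 0; i <= count; i++) {
            int offset = 0;
            for (int b = 0; b < offSize; b++) {      // assemble a big-endian integer
                offset = (offset << 8) | in.readUnsignedByte();
            }
            if (offset < 0 || offset > fontLength) { // offset points outside the font data
                throw new IOException("illegal offset value " + offset + " in CFF font");
            }
            offsets[i] = offset;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical INDEX: count = 1, offSize = 3, offsets 0x000001 and 0x2D002E.
        byte[] bad = {0, 1, 3, 0, 0, 1, 0x2D, 0x00, 0x2E};
        try {
            readIndexOffsets(bad, 1000);             // 1000 is a made-up font length
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the three bytes 0x2D 0x00 0x2E decode to 2949166, the value in the log above. Whether the real font contains those bytes at that position is not known from this report.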
Other example fonts causing this issue are:
- Can't read the embedded Type1C font COGXUZ+MetaPlusNormal-Caps
- Can't read the embedded Type1C font DJTRFS+MetaPlusBold-CapsItalic
- Can't read the embedded Type1C font EAFTRP+MetaPlusNormal-Caps
- Can't read the embedded Type1C font GQHJVM+MetaPlusNormal-CapsItalic
- Can't read the embedded Type1C font GUEVYR+MetaPlusBold-CapsItalic
- Can't read the embedded Type1C font HYTBMP+MetaPlusNormal-CapsItalic
- Can't read the embedded Type1C font IJCQXI+MetaPlusMedium-Caps
- Can't read the embedded Type1C font JRIYJF+MetaPlusNormal-Caps
- Can't read the embedded Type1C font JSQSJF+NeuzeitGro-Reg
- Can't read the embedded Type1C font KUZTXD+MetaPlusBook-Roman
- Can't read the embedded Type1C font LWIPLB+1496148105355.00001Arial.000-1
- Can't read the embedded Type1C font MCDJBA+MetaSerif-BoldIta
- Can't read the embedded Type1C font UNLUJK+Barmeno-Medium
I couldn't find another issue about this. Is this already known?
Issue Links
- relates to FOP-2751 Acrobat Reader error with some Latin Fonts (Resolved)