[PDFBOX-620] Text extract fails on some PDF files but not others... - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Implemented
Affects Version/s: 0.7.3, 0.8.0-incubator
Fix Version/s: None
Component/s: Text extraction
Labels: None
Environment: Tried in Java 5 and 6

Description

I have the same problem with 0.7.3, 0.7.4-dev and 0.8.0. In 0.7.3 the extracted text contains the literal string "null" in place of some characters, e.g.:

"Dermoapo made 'interactive updates' a key part onullits stratenull nullr launnull chinnulla new skincare rannull in a competitive market. nulle resultnullIncreased sales nullr pharmacies that used the updates."

In 0.8.0 the same characters come out as "?" instead:

"Dermoapo made 'interactive updates' a key part o?its strate? ?r laun? chin?a new skincare ran? in a competitive market. ?e result?Increased sales ?r pharmacies that used the updates."

Maybe this is a font problem? Or an encoding problem? I debugged the code in PDFTextStripper, and these values are already present in the charactersByArticle field before normalization.
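To illustrate why an encoding gap would produce exactly these two symptoms, here is a minimal sketch in plain Java. It is not PDFBox code: the `MissingGlyphDemo` class, the `sampleEncoding` table and the `decode` method are all hypothetical, and the assumption is simply that some character codes have no text mapping. Appending a null `String` to a `StringBuilder` writes the four letters "null", while substituting a placeholder gives "?".

```java
import java.util.HashMap;
import java.util.Map;

class MissingGlyphDemo {

    // Hypothetical code-to-text table standing in for a font encoding.
    // Code 2 (imagine the "f" in "of") is deliberately left unmapped.
    static Map<Integer, String> sampleEncoding() {
        Map<Integer, String> enc = new HashMap<Integer, String>();
        enc.put(1, "o");
        enc.put(4, "i");
        enc.put(5, "t");
        enc.put(6, "s");
        return enc;
    }

    // Decodes character codes to text. When a code is unmapped, the lookup
    // returns null; with no fallback, append(String) writes "null".
    static String decode(int[] codes, Map<Integer, String> encoding, String fallback) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String text = encoding.get(code);
            if (text == null && fallback != null) {
                text = fallback;
            }
            out.append(text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 4, 5, 6};
        System.out.println(decode(codes, sampleEncoding(), null)); // onullits
        System.out.println(decode(codes, sampleEncoding(), "?"));  // o?its
    }
}
```

If something like this is what happens inside the library, the difference between 0.7.3 and 0.8.0 would only be how the missing mapping is rendered, not whether it is missing.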

In 0.8.0 I get some info logs from the engine:

SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: re
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: W
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: n
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: cs
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: scn
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: f
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: CS
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: SCN
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: M
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: m
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: l
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: S
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: BDC
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: c
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: v
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: y
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: h
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: g
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: G
SP INFO 20:52:12 PDFStreamEngine - unsupported/disabled operation: EMC
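As a quick sanity check on whether these log lines matter, the sketch below compares the logged operators against the text operators defined in the PDF specification (text object, text state, text positioning and text showing). The `OperatorCheck` class and its method name are hypothetical. None of the logged operators is a text operator; they are path, color and marked-content operators, so the messages are probably unrelated to the garbled characters, although that is only an inference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OperatorCheck {

    // Text-related content stream operators from the PDF specification.
    static final Set<String> TEXT_OPERATORS = new HashSet<String>(Arrays.asList(
            "BT", "ET",                              // text object
            "Tc", "Tw", "Tz", "TL", "Tf", "Tr", "Ts", // text state
            "Td", "TD", "Tm", "T*",                  // text positioning
            "Tj", "TJ", "'", "\""));                 // text showing

    // Operators reported as unsupported/disabled in the log above.
    static final String[] LOGGED = {
            "re", "W", "n", "cs", "scn", "f", "CS", "SCN", "M", "m",
            "l", "S", "BDC", "c", "v", "y", "h", "g", "G", "EMC"};

    // Returns the subset of the given operators that are text operators.
    static List<String> textOperatorsAmong(String[] operators) {
        List<String> found = new ArrayList<String>();
        for (String op : operators) {
            if (TEXT_OPERATORS.contains(op)) {
                found.add(op);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("Text operators in log: " + textOperatorsAmong(LOGGED));
    }
}
```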

I get the same result with icu4j 3.6.1 and 4.2.1.

Attachments

pdf620-fails.pdf, 14/Feb/10 20:25, 569 kB, Nicholas Cottrell
pdf620-works.pdf, 14/Feb/10 20:25, 604 kB, Nicholas Cottrell

Issue Links

requires: PDFBOX-619 Adobe CFF/Type2 font encoding enhancements (Closed)
requires: PDFBOX-542 Support for Adobe CFF/Type2 fonts (Closed)

People

Assignee: Unassigned
Reporter: Nicholas Cottrell
Votes: 0
Watchers: 4

Dates

Created: 14/Feb/10 20:23
Updated: 10/Jun/14 07:54
Resolved: 09/Jun/14 20:43