[PDFBOX-2247] Regression in text extraction between 1.8.5 and 1.8.6 - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.8.6
Fix Version/s: 1.8.7, 2.0.0
Component/s: Text extraction
Labels: None

Description

It looks like a character mapping issue crept in sometime between 1.8.5 and 1.8.6 for the attached file.

With both the seq and NonSeq parsers, ExtractText extracted the correct text in 1.8.5. In 1.8.6, java -jar pdfbox-app-1.8.6.jar ExtractText yields text starting with:

7>PFLK>I 9>NH ;BNRF@B
=%;% .BM>NPJBKP LC PEB 3KPBNFLN
9>@FCF@ -L>OP ;@FBK@B >KA 5B>NKFKD -BKPBN
:BOB>N@E 9NLGB@P ;QJJ>NT .B@BJ?BN (&&*
"&++&,-+Æ$( #&+-&%+$-& !).&)-*+Æ&,

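The comparison described above can be sketched as a small harness that runs ExtractText from both releases against the same file and reports the first point where the outputs diverge. This is only an illustration under stated assumptions: the 1.8.5 jar name is inferred by analogy with the 1.8.6 jar named in the report, sample.pdf is a hypothetical placeholder for the attached file, and the -console and -nonSeq options are given as recalled from the 1.8.x ExtractText usage text and should be checked against the release actually in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ExtractTextRegressionCheck {

    /** Builds the ExtractText command line for one pdfbox-app jar. */
    public static List<String> buildCommand(String appJar, String pdf, boolean nonSeq) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(appJar);
        cmd.add("ExtractText");
        cmd.add("-console");              // assumed flag: write text to stdout
        if (nonSeq) {
            cmd.add("-nonSeq");           // assumed flag: use the NonSeq parser
        }
        cmd.add(pdf);
        return cmd;
    }

    /** Runs a command and returns its stdout; stderr is passed through untouched. */
    public static String run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = p.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        p.waitFor();
        // Assumption: decoding as UTF-8 is good enough here, because both
        // outputs are decoded the same way before being compared.
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Returns the index of the first differing char, or -1 if the strings are equal. */
    public static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) throws Exception {
        // "sample.pdf" is a hypothetical stand-in for the file attached to the issue.
        String pdf = args.length > 0 ? args[0] : "sample.pdf";
        for (boolean nonSeq : new boolean[] {false, true}) {
            String before = run(buildCommand("pdfbox-app-1.8.5.jar", pdf, nonSeq));
            String after = run(buildCommand("pdfbox-app-1.8.6.jar", pdf, nonSeq));
            int idx = firstDifference(before, after);
            System.out.println((nonSeq ? "NonSeq" : "seq") + " parser: "
                    + (idx < 0 ? "outputs match" : "outputs differ at char " + idx));
        }
    }
}
```

Running it with the path to the problem file as the only argument, and both app jars in the working directory, should print one line per parser. A difference at or near character 0 would be consistent with the garbled output quoted above.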

Issue Links

is broken by: PDFBOX-2058 The text of pdfs using Type1C can't be extracted correct (Closed)

is related to: PDFBOX-2377 Apparent regression in character mapping in a few files from govdocs1 (Closed)

People

Assignee: Andreas Lehmkühler
Reporter: Tim Allison
Votes: 0
Watchers: 4

Dates

Created: 28/Jul/14 13:13
Updated: 23/Sep/14 19:31
Resolved: 29/Jul/14 19:46