[PDFBOX-1155] setSuppressDuplicateOverlappingText sometimes removes characters that it shouldn't - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.8.7, 2.0.0
Fix Version/s: 2.0.0
Component/s: Text extraction
Labels: None

Description

The duplicate detection in PDFTextStripper.java checks whether the
same character has already been placed near the position where we are
about to place another, and drops the new one if so. This is meant to
catch documents that rewind and overwrite text in order to bold one or
more words.

In some cases, however, I see it removing valid characters that are
not duplicates.
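
To make the failure mode concrete, here is a minimal sketch of a position-based duplicate check of the kind described above. It is not the PDFTextStripper code: the class name OverlapDedupSketch, the accept method, and the toleranceFactor parameter are all hypothetical, and the rule of scaling the tolerance by the glyph width is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of overlap-based duplicate suppression.
 * Not the actual PDFTextStripper implementation.
 */
class OverlapDedupSketch {
    // Positions of characters already kept, grouped by character.
    private final Map<String, List<float[]>> seen = new HashMap<>();

    // Assumed knob: fraction of the glyph width treated as "near".
    private final float toleranceFactor;

    OverlapDedupSketch(float toleranceFactor) {
        this.toleranceFactor = toleranceFactor;
    }

    /**
     * Returns true if the character should be kept, or false if the same
     * character was already kept within the tolerance of (x, y).
     */
    boolean accept(String ch, float x, float y, float width) {
        float tolerance = width * toleranceFactor;
        List<float[]> positions = seen.computeIfAbsent(ch, k -> new ArrayList<>());
        for (float[] p : positions) {
            boolean nearX = Math.abs(p[0] - x) < tolerance;
            boolean nearY = Math.abs(p[1] - y) < tolerance;
            if (nearX && nearY) {
                return false; // treated as an overwrite (e.g. fake bold)
            }
        }
        positions.add(new float[] {x, y});
        return true;
    }
}
```

In this sketch, a tolerance that is too generous relative to the glyph advance produces exactly the symptom reported here. With a factor of 1.0 and a glyph width of 3, a second "l" placed 2.5 units to the right of the first (a legitimate "ll") is dropped, whereas with a factor of 0.3 it is kept and only a re-draw 0.2 units away is suppressed.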

Attachments

000527.pdf (61 kB), 01/Nov/11 22:23, Michael McCandless
dedup.diffs.txt (7 kB), 01/Nov/11 22:23, Michael McCandless

People

Assignee: Andreas Lehmkühler

Reporter: Michael McCandless

Votes: 0

Watchers: 1

Dates

Created: 01/Nov/11 22:21

Updated: 17/Mar/16 19:08

Resolved: 17/Dec/14 11:23