[PDFBOX-571] Dubious handling of word spacing (Tw) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0-incubator
Fix Version/s: 1.0.0
Component/s: Text extraction, Utilities
Labels: None

Description

I want to provide a counterexample to the current handling of word spacing.

The sample page (pg_0005.pdf) uses a Type1C font for text rendering. The problem is that this Type1C font uses a custom encoding in which code values are assigned sequentially, starting from 1. Thus code value 32 is assigned to the digit "3", not to the space character " " as one would expect.

The PDF producer software has (mis-)used word spacing to break up longer character sequences. For example, on table line 3, the character sequence "0.831.05" is broken into two cells, "0.83" and "1.05". Other uses of this "optimization" can be seen by opening the sample page in Acrobat Reader (tested on version 7.0) and performing the "Select all" operation. I've attached a screenshot of Acrobat Reader (pg_0005_selectall.png) for your convenience.
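To illustrate the behaviour described above, here is a minimal Java sketch. It is not PDFBox code and not the attached patch. It relies on the PDF text-state rule that word spacing (Tw) is added after every occurrence of single-byte character code 32, whatever glyph that code selects. The class name, the method names, the glyph width of 500, and all code assignments except 32 for "3" are hypothetical; horizontal scaling is assumed to be 100% and TJ adjustments are ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class WordSpacingSketch {

    /** Single-byte character code that receives word spacing (Tw). */
    static final int WORD_SPACING_CODE = 32;

    /**
     * Horizontal advance after one glyph, in text space units.
     * Simplified: assumes 100% horizontal scaling and no TJ adjustment.
     * glyphWidth is given in 1/1000 of a text space unit.
     */
    static double advance(int code, double glyphWidth, double fontSize,
                          double charSpacing, double wordSpacing) {
        // Tw is keyed on the character code, not on the glyph being a space.
        double tw = (code == WORD_SPACING_CODE) ? wordSpacing : 0.0;
        return glyphWidth / 1000.0 * fontSize + charSpacing + tw;
    }

    /** Splits a shown string into cells wherever word spacing opens a gap. */
    static List<String> splitCells(int[] codes, char[] glyphs, double wordSpacing) {
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < codes.length; i++) {
            current.append(glyphs[i]);
            boolean gapFollows = codes[i] == WORD_SPACING_CODE && wordSpacing > 0.0;
            if (gapFollows && i < codes.length - 1) {
                cells.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            cells.add(current.toString());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Hypothetical custom encoding: only 32 -> "3" is taken from the report.
        char[] glyphs = {'0', '.', '8', '3', '1', '.', '0', '5'};
        int[] codes = {29, 14, 37, 32, 30, 14, 29, 34};
        System.out.println(splitCells(codes, glyphs, 20.0)); // prints [0.83, 1.05]
    }
}
```

An extractor that applies word spacing only when the decoded character is a space would see no gap after the "3" here and would report "0.831.05" as a single token.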

Attachments

PDFStreamEngine.patch (1 kB), 27/Nov/09 17:07, Villu Ruusmann
pg_0005_selectall.png (199 kB), 27/Nov/09 17:06, Villu Ruusmann
pg_0005.pdf (32 kB), 27/Nov/09 17:04, Villu Ruusmann

Issue Links

relates to

PDFBOX-583 TextPosition#getIndividualWidths returns negative values (Closed)
PDFBOX-508 Lost spacing as a result of operator "Tc" ignoring. (Closed)
PDFBOX-520 Ignores char spacing (Tc) and word space (Tw) when rendering PDFs to images (Closed)

People

Assignee: Unassigned
Reporter: Villu Ruusmann
Votes: 0
Watchers: 0

Dates

Created: 27/Nov/09 17:02
Updated: 22/Feb/10 18:28
Resolved: 28/Nov/09 11:37