[PDFBOX-3498] Unexpected spaces in text extraction - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.2, 2.0.3
Fix Version/s: 2.0.4, 3.0.0 PDFBox
Component/s: Text extraction
Labels:
None

Description

In the tests by tallison@apache.org regressions were found with files from Delaware courts, see reduced file attached.

The extracted text with 2.0.2 and 2.0.3 is

IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE

in 2.0.1 and 1.8 it was

IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE

The cause is the /W ranges table:

/W [1 1 0 2 3 250 4 10 0 11
12 333 13 14 0 15 15 250 16 16
333 17 17 250 18 18 277 19 19 0
20 23 500 24 35 0 36 36 722 37
37 666 38 39 722 40 40 666 41 41
610 42 43 777 44 44 389 45 45 0
46 46 777 47 47 666 48 48 943 49
49 722 50 50 777 51 51 610 52 52
0 53 53 722 54 54 556 55 55 666
56 57 722 59 59 0 60 60 722 61
67 0 68 68 500 69 69 556 70 70
443 71 71 556 72 72 443 73 73 333
74 74 500 75 75 556 76 76 277 77
77 0 78 78 556 79 79 277 80 80
833 81 81 556 82 82 500 83 84 556
85 85 443 86 86 389 87 87 333 88
88 556 89 89 0 90 90 722 91 92
500 93 178 0 179 180 500 181 181 0
182 182 333 183 751 0 752 752 198 753
794 0 795 795 612 796 1126 0 1127 1127
125 1129 1129 2000 1130 65534 0]

For text extraction, the width of a space is caculated by taking an average of all widths. However these files have a lot (over 60000) of widths that are 0. So I'll just ignore widths <= 0, as it is already done in PDFont.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

PDFBOX-3498-Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.pdf
15/Sep/16 16:08
43 kB
Tilman Hausherr

Activity

People

Assignee:: Tilman Hausherr

Reporter:: Tilman Hausherr

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 15/Sep/16 16:06

Updated:: 25/Mar/17 18:13

Resolved:: 15/Sep/16 16:26