[PDFBOX-4481] Text extraction error with Thai combined glyph depending on space after it - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.14
Fix Version/s: None
Component/s: Text extraction
Labels:
- Thai

External issue URL:
https://stackoverflow.com/questions/54981236/how-to-set-ttf-for-pdftextstripper

Description

In the first extracted line of the reduced file, the "accent" (somebody please correct me on what that mark is called) comes out separated from its base character. On the second line it is in the proper place. Content stream:

BT
  1 0 0 1 67.3 756.98 Tm
  [ (\000\203\000\227\000q) ] TJ
  1 0 0 1 77.5 756.98 Tm
  [ (\000\003) ] TJ
  1 0 0 1 67.3 730 Tm
  [ (\000\203\000\227\000q\000\003) ] TJ
ET

The weird thing is that "\003" is just a space. So when the space is part of the same string as the Thai glyphs, the extraction works, and when it is shown by a separate TJ operation, it doesn't.
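To make the comparison concrete, here is a minimal Java sketch that decodes the string operands of the TJ operations above into character codes. It is not PDFBox code: the class `TjStringDecoder` and its methods are hypothetical names made up for illustration, it only handles octal escapes, and it assumes the font uses 2-byte character codes (as the leading `\000` bytes suggest).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
public class TjStringDecoder {

    private static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    /** Decodes the body of a PDF literal string (without parentheses) into byte values. */
    public static List<Integer> decodeBytes(String literal) {
        List<Integer> bytes = new ArrayList<>();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length() && isOctal(literal.charAt(i + 1))) {
                // Octal escape: a backslash followed by one to three octal digits.
                int value = 0;
                int digits = 0;
                i++;
                while (digits < 3 && i < literal.length() && isOctal(literal.charAt(i))) {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                bytes.add(value & 0xFF);
            } else {
                // Simplification: other escapes such as \n or \( are not handled here.
                bytes.add((int) c);
                i++;
            }
        }
        return bytes;
    }

    /** Pairs the decoded bytes into 2-byte codes (assumption: 2-byte font encoding). */
    public static int[] decodeCodes(String literal) {
        List<Integer> bytes = decodeBytes(literal);
        if (bytes.size() % 2 != 0) {
            throw new IllegalArgumentException("odd number of bytes: " + bytes.size());
        }
        int[] codes = new int[bytes.size() / 2];
        for (int k = 0; k < codes.length; k++) {
            codes[k] = (bytes.get(2 * k) << 8) | bytes.get(2 * k + 1);
        }
        return codes;
    }

    /** Concatenates two code arrays. */
    public static int[] concat(int[] a, int[] b) {
        int[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        // Line 1 of the content stream: glyphs and space in two separate TJ operations.
        int[] glyphs = decodeCodes("\\000\\203\\000\\227\\000q");
        int[] space = decodeCodes("\\000\\003");
        // Line 2: the same glyphs and the space in a single string.
        int[] combined = decodeCodes("\\000\\203\\000\\227\\000q\\000\\003");

        System.out.println("line 1, first TJ:  " + Arrays.toString(glyphs));
        System.out.println("line 1, second TJ: " + Arrays.toString(space));
        System.out.println("line 2, single TJ: " + Arrays.toString(combined));
        System.out.println("same codes overall: "
                + Arrays.equals(concat(glyphs, space), combined));
    }
}
```

Under these assumptions both lines carry the same code sequence, 0x0083 0x0097 0x0071 0x0003. The only difference is whether the trailing 0x0003 sits in the same string as the three glyph codes or is positioned separately with its own Tm.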

Attachments

- SO54981236-reduced.txt, 05/Mar/19 17:17, 0.0 kB, Tilman Hausherr
- SO54981236-reduced.pdf, 05/Mar/19 17:17, 26 kB, Tilman Hausherr
- SO54981236.pdf, 05/Mar/19 17:17, 51 kB, Tilman Hausherr

Issue Links

is related to: PDFBOX-4531 Extraction of Arabic PDF has incorrect ordering of normalized ligatures (Closed)

People

Assignee: Unassigned
Reporter: Tilman Hausherr
Votes: 0
Watchers: 1

Dates

Created: 05/Mar/19 17:18
Updated: 01/May/19 08:28