[PDFBOX-1652] TextPosition: Japanese alphabetic characters 30fc and 3005 treated as diacritics - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Affects Version/s: 1.8.1
Fix Version/s: None
Component/s: Text extraction
Labels:
- PatchAvailable

Description

For the purpose of determining text position, the Japanese characters U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) and U+3005 (IDEOGRAPHIC ITERATION MARK) are currently treated as "simple" diacritics. In terms of text positioning, however, they appear to be fully fledged characters.

As a result, some characters can come out reversed during text extraction; in particular, ーン can become ンー.

A patch to fix this is attached.
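
The classification described above can be reproduced with the JDK alone. The following is a minimal sketch, not the attached patch and not PDFBox API: it assumes the diacritic check is based on the Unicode general category, which the issue text does not confirm. The class `DiacriticCheck` and both method names are hypothetical. Both code points fall in category Lm (modifier letter), so a category-only check would flag them; the second method shows one possible way to exclude them explicitly.

```java
public final class DiacriticCheck {
    private DiacriticCheck() {}

    /**
     * Hypothetical category-only test (an assumption about how such a
     * check might work): treats non-spacing marks, modifier symbols and
     * modifier letters as diacritics.
     */
    public static boolean isDiacriticByCategory(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.MODIFIER_SYMBOL
                || type == Character.MODIFIER_LETTER;
    }

    /**
     * Same test, but excluding the two Japanese marks named in this
     * issue, which occupy a full character cell of their own.
     */
    public static boolean isDiacritic(int codePoint) {
        if (codePoint == 0x30FC || codePoint == 0x3005) {
            return false;
        }
        return isDiacriticByCategory(codePoint);
    }

    public static void main(String[] args) {
        // U+0301 (combining acute accent) and U+30F3 (ン) are included for comparison.
        int[] samples = {0x30FC, 0x3005, 0x0301, 0x30F3};
        for (int cp : samples) {
            System.out.printf("U+%04X byCategory=%b withExclusion=%b%n",
                    cp, isDiacriticByCategory(cp), isDiacritic(cp));
        }
    }
}
```

Under this assumption, a hard-coded exclusion is the narrowest possible change. Other modifier letters with their own advance width may exist, and the sketch does not attempt to cover them.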

Attachments

- PDFBOX-1652.patch (1 kB), attached 26/Jun/13 18:14 by Christian Kohlschütter

People

Assignee: Unassigned
Reporter: Christian Kohlschütter
Votes: 0
Watchers: 3

Dates

Created: 26/Jun/13 18:13
Updated: 25/Nov/16 09:46
Resolved: 11/Oct/14 00:43