[PDFBOX-654] Extracting CJK text - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: Text extraction
Labels: None

Description

This is an update for PDFBOX-420 filed by Takashi Komatsubara.

In this patch, if "Identity-H" is used as the encoding of a font and the font does not supply a TO_UNICODE table, the encoding name is generated from the font's CID information (Registry and Ordering). The idea is borrowed from pdfminer[1], another PDF library, written in Python. I don't see any test failures with this patch.
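The fallback described above can be sketched roughly as follows. This is only an illustration, not the code in identity-h.patch: the class and method names are hypothetical, and the `Registry-Ordering-UCS2` naming (for example `Adobe-Japan1-UCS2`) is an assumption based on the names of Adobe's CID-to-Unicode CMap resources. The issue itself does not state the exact name that is generated.

```java
import java.util.Optional;

/** Hypothetical sketch of the Identity-H fallback; not taken from the patch. */
public final class CidEncodingFallback {

    private CidEncodingFallback() {
    }

    /**
     * Returns a fallback encoding (CMap) name built from the CID system info,
     * or empty when the fallback does not apply.
     *
     * Assumption: the name follows the "Registry-Ordering-UCS2" pattern.
     */
    public static Optional<String> fallbackEncodingName(String encoding,
                                                        boolean hasToUnicode,
                                                        String registry,
                                                        String ordering) {
        // The fallback only applies to Identity-H fonts without a ToUnicode table.
        if (!"Identity-H".equals(encoding) || hasToUnicode) {
            return Optional.empty();
        }
        // Without Registry and Ordering there is nothing to derive a name from.
        if (registry == null || registry.isEmpty()
                || ordering == null || ordering.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(registry + "-" + ordering + "-UCS2");
    }

    public static void main(String[] args) {
        // A Japanese font with Identity-H encoding and no ToUnicode table.
        System.out.println(
            fallbackEncodingName("Identity-H", false, "Adobe", "Japan1").orElse("(none)"));
        // A font that already supplies a ToUnicode table needs no fallback.
        System.out.println(
            fallbackEncodingName("Identity-H", true, "Adobe", "GB1").orElse("(none)"));
    }
}
```

Returning an empty `Optional` when the conditions are not met leaves the caller's existing decoding path untouched, which matches the "only if no TO_UNICODE table" condition in the description.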

I published this patch last year[2] and got some good feedback from Japanese users[3].

[1] http://www.unixuser.org/~euske/python/pdfminer/index.html
[2] https://code.launchpad.net/~aishimoto/+junk/pdfbox-ja
    https://code.launchpad.net/~aishimoto/+junk/pdfbox-1.0.0-ja
[3] http://d.hatena.ne.jp/atsuoishimoto/20091211/1260533539

Attachments

China.pdf, 243 kB, 13/Mar/10 07:38, Takashi Komatsubara
identity-h.patch, 5 kB, 10/Mar/10 06:37, Atsuo Ishimoto

Issue Links

is depended upon by

PDFBOX-55 Invalid character while extracting text from a chinese pdf (Closed)
PDFBOX-5 CJK decoding (Closed)

relates to

PDFBOX-420 Japanese Characters are garbled. (Closed)
PDFBOX-259 support request chinese-traditional (Closed)

People

Assignee: Unassigned
Reporter: Atsuo Ishimoto
Votes: 0
Watchers: 0

Dates

Created: 10/Mar/10 06:36
Updated: 30/Mar/10 08:23
Resolved: 10/Mar/10 18:16