[PDFBOX-903] Unicode text getting mangled via TextToPDF + PDFTextStripper - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.4.0
Fix Version/s: None
Component/s: Text extraction, Writing
Labels: None

Description

I'm trying to round-trip some text through PDFBox, but I'm finding that along the way Unicode text is getting mangled and coming back as the wrong characters.

The process I'm following is to use TextToPDF to generate a PDF, then read it back in again with PDFTextStripper. I'm not yet sure whether the problem arises during generation or during reading, but I've a nasty feeling there might be an issue with both. (I've seen issues with code that does one part or the other.)

Attached is a unit test written against trunk. It creates a series of Reader objects based on both ASCII and non-ASCII text, creates a PDF from each using TextToPDF, reads the text back with PDFTextStripper, then compares the result with the original. It includes one test that verifies that the corruption isn't caused by the readers, and another that fails, showing that the text was corrupted by the round trip.
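The attached test needs PDFBox on the classpath, but the kind of comparison it makes can be sketched with the JDK alone. The block below is not PDFBox code and does not claim to reproduce what TextToPDF does internally. It only shows how text changes when it passes through a single-byte encoding, which the title of the linked PDFBOX-922 (fonts limited to WinAnsiEncoding) suggests as one possible cause. The class name `SingleByteRoundTrip`, its methods, and the use of ISO-8859-1 as a stand-in for WinAnsiEncoding are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of PDFBox or the attached test.
class SingleByteRoundTrip {

    // Encode the text with the given charset and decode it again.
    // String.getBytes(Charset) replaces characters the charset cannot
    // represent with the charset's replacement byte instead of failing.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    // True when the text comes back unchanged from the round trip.
    static boolean survives(String text, Charset charset) {
        return text.equals(roundTrip(text, charset));
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-1 stands in for a single-byte PDF font encoding.
        Charset singleByte = StandardCharsets.ISO_8859_1;

        String[] samples = {
            "Hello",                // plain ASCII
            "caf\u00E9",            // Latin-1 character (e with acute accent)
            "\u65E5\u672C\u8A9E"    // three Japanese characters
        };
        for (String sample : samples) {
            System.out.println(survives(sample, singleByte));
        }
    }
}
```

In this sketch the ASCII and Latin-1 samples survive, while each Japanese character comes back as `?`. A real check against PDFBox would replace `roundTrip` with the TextToPDF and PDFTextStripper steps described above.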

Ideally the test would also look in the dictionary to check what was stored there, but I don't know enough about the file format to manage that. I'll hopefully look into that shortly, though.

Attachments

- TestUnicodeText.java, 25/Nov/10 15:42, 7 kB, Neil McErlean
- TestUnicodeText.java, 24/Nov/10 18:17, 6 kB, Nick Burch

Issue Links

depends upon
- PDFBOX-922 True type PDFont subclass only supports WinAnsiEncoding (hardcoded!) [Closed]

is superseded by
- PDFBOX-922 True type PDFont subclass only supports WinAnsiEncoding (hardcoded!) [Closed]

relates to
- PDFBOX-1071 Can not generate chinese character PDF file [Closed]
- PDFBOX-553 writing pdf file in Japanese, garbled [Closed]

People

Assignee: Unassigned
Reporter: Nick Burch
Votes: 5
Watchers: 5

Dates

Created: 24/Nov/10 18:16
Updated: 12/Dec/14 04:36
Resolved: 12/Dec/14 04:36