[PDFBOX-398] Russian extraction encoding failure - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.7.3, 0.8.0-incubator
Fix Version/s: None
Component/s: Text extraction
Labels: None
Environment: Windows XP 32-bit, CentOS 5.2 32-bit

Description

I am extracting text from Russian documents with PDFTextStripper, and some of them do not extract correctly.
When I extract on Windows with UTF-8 specified as the encoding, the output is garbage.
When I extract on Linux with any encoding, the output is garbage.
The only way I can get usable output is to extract the PDF on Windows without specifying an encoding. The output is then correct when viewed in UltraEdit, but not in Notepad. Notepad displays it correctly only after I convert the file to UTF-8 with iconv.
It appears that the encoding is not being read correctly from the PDF, and that the text is possibly being double encoded when it is output as UTF-8. As a workaround, I can detect the double encoding, rerun the extraction with no encoding specified, and convert the result to UTF-8 with iconv.
This workaround does not work on Linux, where I cannot get the file to extract correctly with any encoding.
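
For illustration, here is a minimal Java sketch of what the suspected double encoding could look like, and how it might be detected and reversed. It uses only the JDK and does not call PDFBox. The class and method names are hypothetical, and ISO-8859-1 is only an assumed stand-in for the wrong intermediate charset, since the report does not say which charset is actually involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class DoubleEncodingDemo {
    // Assumption: the bytes were misread as ISO-8859-1, which maps every
    // byte value to exactly one character and therefore round-trips losslessly.
    static final Charset WRONG = StandardCharsets.ISO_8859_1;

    // Simulates the suspected fault: UTF-8 bytes are decoded with the wrong
    // charset, so each Cyrillic letter turns into two Latin-1 characters.
    static String doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, WRONG);
    }

    // Reverses the fault by recovering the original bytes and decoding
    // them as UTF-8.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(WRONG);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    // Heuristic only: looks for a character in the UTF-8 two-byte lead range
    // (0xC2..0xDF) followed by one in the continuation range (0x80..0xBF).
    // Genuine Latin-1 text can occasionally match this pattern too.
    static boolean looksDoubleEncoded(String text) {
        for (int i = 0; i + 1 < text.length(); i++) {
            char lead = text.charAt(i);
            char trail = text.charAt(i + 1);
            if (lead >= 0x00C2 && lead <= 0x00DF
                    && trail >= 0x0080 && trail <= 0x00BF) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A Russian sample word, written with escapes so the source file
        // encoding does not matter.
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442";
        String garbled = doubleEncode(original);
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("looks double encoded: " + looksDoubleEncoded(garbled));
        System.out.println("repaired matches: " + repair(garbled).equals(original));
    }
}
```

The sketch also prints the JVM default charset, because the platform default usually differs between Windows and Linux, which may be related to the different results on the two systems.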

Attachments

7.pdf, 241 kB, 05/Jan/09 19:57, Adrian Romano
garbage output.jpg, 118 kB, 05/Jan/09 20:06, Adrian Romano
working output.jpg, 112 kB, 05/Jan/09 20:07, Adrian Romano

People

Assignee: Unassigned
Reporter: Adrian Romano
Votes: 0
Watchers: 0

Dates

Created: 05/Jan/09 19:55
Updated: 21/Oct/09 10:01
Resolved: 07/Apr/09 15:12