[TIKA-2256] Japanese character substituted when reading PDF - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.14
Fix Version/s: None
Component/s: parser
Labels: None

Description

The attached file contains “日本語” in its first line. It was created on Mac OS X 10.11.6 by choosing “Save As PDF” in the system print dialog opened from Microsoft Word.

When the text is read from the PDF, the first character comes back as U+2F47 (KANGXI RADICAL SUN) instead of U+65E5 (the CJK ideograph 日). Copying and pasting from Preview.app yields the correct U+65E5. (The two characters look the same in some fonts, but they are distinct code points.)

The MATLAB code used for reading looks as follows:

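% Extract the document text as XHTML through Tika's Java API.
handler = org.apache.tika.sax.ToXMLContentHandler;
parser = org.apache.tika.parser.AutoDetectParser;
metadata = org.apache.tika.metadata.Metadata;
fh = java.io.FileInputStream(fullname);  % fullname: path to the PDF
parser.parse(fh, handler, metadata);
fh.close();  % release the file handle once parsing is done
s = handler.toString;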
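
One possible caller-side workaround, which the issue itself does not propose: U+2F47 (KANGXI RADICAL SUN) carries a Unicode compatibility mapping to U+65E5, so applying NFKC normalization to the extracted text folds the radical into the ordinary ideograph. The sketch below uses Java's standard `java.text.Normalizer`; the class name `KangxiNormalize` and the helper `foldCompatibility` are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of Tika.
final class KangxiNormalize {

    // Apply NFKC so that compatibility characters such as the Kangxi radical
    // U+2F47 are replaced by their compatibility equivalents (here U+65E5).
    static String foldCompatibility(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The first line of the PDF as described in the report:
        // U+2F47 followed by U+672C and U+8A9E.
        String extracted = "\u2F47\u672C\u8A9E";
        String normalized = foldCompatibility(extracted);

        // Prints: U+2F47 -> U+65E5
        System.out.printf("U+%04X -> U+%04X%n",
                (int) extracted.charAt(0), (int) normalized.charAt(0));
    }
}
```

NFKC is lossy: it also rewrites other compatibility characters (full-width Latin letters, ligatures, and so on), so it may not suit callers that need the extracted text exactly as it is encoded in the PDF.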

Attachments

mixed-fonts.pdf (17 kB), attached 30/Jan/17 14:15 by Christopher Creutzig

People

Assignee: Unassigned
Reporter: Christopher Creutzig
Votes: 0
Watchers: 3

Dates

Created: 30/Jan/17 14:14
Updated: 22/Jun/17 19:36
Resolved: 22/Jun/17 19:36