Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.14
Fix Version/s: None
Component/s: None
Description
The attached file contains “日本語” in its first line. It was created on Mac OS X 10.11.6 by choosing “Save As PDF” in the system print dialog opened from Microsoft Word.
When the text is extracted from the PDF, the first character comes back as U+2F47 (KANGXI RADICAL SUN) instead of U+65E5 (the CJK ideograph 日). Copying and pasting the same text from Preview.app yields the correct U+65E5. (The two characters look identical in some fonts, but they are different code points.)
The MATLAB code used for reading looks as follows:
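% fullname holds the path to the attached PDF
handler = org.apache.tika.sax.ToXMLContentHandler;
parser = org.apache.tika.parser.AutoDetectParser;
metadata = org.apache.tika.metadata.Metadata;
fh = java.io.FileInputStream(fullname);
parser.parse(fh, handler, metadata);
fh.close();  % release the file handle once parsing is done
s = handler.toString;  % extracted content as XHTML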
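
The difference between the two code points can be folded away on the caller's side after extraction. The following is a minimal sketch of one possible workaround, not something the report proposes: it applies Unicode NFKC normalization, which maps the compatibility character U+2F47 to U+65E5. The class name `KangxiFold` and the helper `foldCompatibility` are hypothetical names chosen for illustration. Only `java.text.Normalizer` is a real JDK API.

```java
import java.text.Normalizer;

public class KangxiFold {
    // Hypothetical helper: applies NFKC normalization to extracted text.
    // NFKC replaces compatibility characters such as U+2F47 (KANGXI RADICAL SUN)
    // with their unified equivalents such as U+65E5.
    public static String foldCompatibility(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Stand-in for the extracted text: U+2F47 followed by U+672C and U+8A9E.
        String extracted = "\u2F47\u672C\u8A9E";
        String folded = foldCompatibility(extracted);
        // Print the first code point before and after normalization.
        System.out.printf("U+%04X -> U+%04X%n",
                (int) extracted.charAt(0), (int) folded.charAt(0));
    }
}
```

Note that NFKC is not limited to Kangxi radicals. It also folds full-width forms, ligatures, and other compatibility characters, so it is only appropriate when those distinctions do not matter for the extracted text.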