Tika / TIKA-2256

Japanese character substituted when reading PDF

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.14
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      The attached file contains “日本語” in its first line. It was created on Mac OS X 10.11.6 by selecting “Save As PDF” in the system print dialog opened from Microsoft Word.

      When the text is read from the PDF, the first character comes back as U+2F47 (KANGXI RADICAL SUN) instead of U+65E5 (日). Copying and pasting from Preview.app yields the correct U+65E5. (The two characters look the same in some fonts, but they are distinct code points.)

      The MATLAB code used for reading looks as follows:

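      % Handler that collects the extracted content as XML text
      handler = org.apache.tika.sax.ToXMLContentHandler;
      parser = org.apache.tika.parser.AutoDetectParser;
      metadata = org.apache.tika.metadata.Metadata;
      fh = java.io.FileInputStream(fullname);  % fullname: path to the PDF
      parser.parse(fh, handler, metadata);
      fh.close();  % release the file handle after parsing
      s = handler.toString;  % extracted content as text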
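
      The substitution described above can be checked, and worked around if desired, on the extracted string. The following is a minimal sketch and is not taken from the issue: the class name KangxiRadicalCheck and its helper methods are hypothetical, and the string literal stands in for the text Tika returns. It relies on the fact that U+2F47 has a Unicode compatibility mapping to U+65E5, so NFKC normalization via the standard java.text.Normalizer folds the radical into the ordinary ideograph.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only; not part of Tika.
class KangxiRadicalCheck {

    // Formats each code point of the input as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Applies Unicode compatibility normalization (NFKC), which replaces
    // Kangxi radicals such as U+2F47 with their unified ideographs.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumed stand-in for the extracted first line: U+2F47 followed by 本語.
        String extracted = "\u2F47\u672C\u8A9E";
        System.out.println(codePoints(extracted));            // U+2F47 U+672C U+8A9E
        System.out.println(codePoints(normalize(extracted))); // U+65E5 U+672C U+8A9E
    }
}
```

      NFKC is lossy: it also rewrites other compatibility characters, such as full-width Latin letters and ligatures, so it may not be appropriate when the exact code points in the PDF matter.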

      Attachments

        1. mixed-fonts.pdf (17 kB, Christopher Creutzig)

          People

            Assignee: Unassigned
            Reporter: Christopher Creutzig (ccreutzig)
            Votes: 0
            Watchers: 3