PDFBox / PDFBOX-4549

No Unicode mapping

    Details

      Description

      Hello, when I try to extract text from the attached PDF, I get empty output and many warnings. The font is attached as well.
      Acrobat Reader opens the file successfully; I can select and copy the text and save it as text.

      My code:

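      import java.io.File;
      import java.io.IOException;
      import org.apache.pdfbox.io.MemoryUsageSetting;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.text.PDFTextStripper;

      private static void parseOne(String path) throws IOException {
          File file = new File(path);
          PDFTextStripper tStripper = new PDFTextStripper();
          // Mixed buffering: 0 bytes of main memory, up to 500,000,000 bytes of storage,
          // with temporary files written to the given directory.
          MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMixed(0, 500000000)
                  .setTempDir(new File("/home/user/pdfBoxTest/newFiles/"));
          // try-with-resources closes the document even if text extraction throws.
          try (PDDocument document = PDDocument.load(file, memUsageSetting)) {
              if (!document.isEncrypted()) {
                  String pdfFileInText = tStripper.getText(document);
                  System.out.print(pdfFileInText);
              }
          }
      }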

      Error:

      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
      WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
      WARNING: Invalid ToUnicode CMap in font HPDFAB+DejaVuSansMono,Book
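
A side observation on the warnings above: the six reported CIDs (83, 116, 97, 114, 87, 115) all fall in the printable ASCII range. The sketch below is only a diagnostic aid and not part of PDFBox. It pulls the CID numbers out of the log text and renders them as characters, on the unconfirmed assumption that the CIDs coincide with character codes. The class name `UnmappedCidDecoder` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnmappedCidDecoder {
    // Matches the PDType0Font warning text shown in the log, capturing the CID number.
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for CID\\+(\\d+) \\(\\d+\\) in font \\S+");

    // Collects the CID numbers in the order in which the warnings appear.
    public static List<Integer> extractCids(String log) {
        List<Integer> cids = new ArrayList<>();
        Matcher m = WARNING.matcher(log);
        while (m.find()) {
            cids.add(Integer.parseInt(m.group(1)));
        }
        return cids;
    }

    // ASSUMPTION: treats each CID as a Unicode code point, which may not hold for this font.
    public static String asCharacters(List<Integer> cids) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            sb.appendCodePoint(cid);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String log = String.join("\n",
                "WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames");
        List<Integer> cids = extractCids(log);
        System.out.println(cids);               // [83, 116, 97, 114, 87, 115]
        System.out.println(asCharacters(cids)); // StarWs
    }
}
```

The decoded characters are consistent with the attachment name our_star_wars.pdf, but that alone does not prove how the font encodes its text.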
      

       

        Attachments

        1. our_star_wars.pdf
          31 kB
          Sergey Makarov
        2. XO_Thames.zip
          171 kB
          Sergey Makarov

            People

            • Assignee: Tilman Hausherr (tilman)
            • Reporter: Sergey Makarov (sergey.makarov)
            • Votes: 0
            • Watchers: 8
