  1. PDFBox
  2. PDFBOX-4549

No Unicode mapping

Details

    Description

      Hello, when I try to extract text from the attached PDF, I get empty output and many warnings. The font is attached as well.
      Acrobat Reader opens the file without problems; I can select and copy text and save the document as text.

      My code:

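      import java.io.File;
      import java.io.IOException;
      import org.apache.pdfbox.io.MemoryUsageSetting;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.text.PDFTextStripper;

      private static void parseOne(String path) throws IOException {
          File file = new File(path);
          PDFTextStripper tStripper = new PDFTextStripper();
          // Mixed buffering: 0 bytes of main memory, at most 500000000 bytes of storage,
          // with temp files created in the given directory
          MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMixed(0, 500000000)
                  .setTempDir(new File("/home/user/pdfBoxTest/newFiles/"));
          // try-with-resources closes the document even if text extraction throws
          try (PDDocument document = PDDocument.load(file, memUsageSetting)) {
              if (!document.isEncrypted()) {
                  String pdfFileInText = tStripper.getText(document);
                  System.out.print(pdfFileInText);
              }
          }
      }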

      Error:

      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
      WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames
      May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
      WARNING: Invalid ToUnicode CMap in font HPDFAB+DejaVuSansMono,Book
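
The CIDs in the warnings above look like plain character codes. Below is a minimal sketch in plain Java with no PDFBox calls; the class name `CidLogDecoder` and its two helper methods are hypothetical and exist only for illustration. It pulls the CIDs out of the log and reads each one as a Unicode code point. Treating a CID as a code point is only an assumption that happens to fit this log, not something PDFBox or the PDF specification guarantees:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the log; not part of PDFBox.
class CidLogDecoder {
    // Captures the number after "CID+" in lines such as
    // "WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames"
    private static final Pattern CID_WARNING =
            Pattern.compile("No Unicode mapping for CID\\+(\\d+)");

    /** Collects the CID of every matching warning, in log order. */
    public static List<Integer> extractCids(String log) {
        List<Integer> cids = new ArrayList<>();
        Matcher m = CID_WARNING.matcher(log);
        while (m.find()) {
            cids.add(Integer.parseInt(m.group(1)));
        }
        return cids;
    }

    /** Reads each CID as a Unicode code point (an assumption, see above). */
    public static String decodeAsCodePoints(List<Integer> cids) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            sb.appendCodePoint(cid);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The six "No Unicode mapping" warnings from the report, timestamps omitted
        String log = String.join("\n",
                "WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames");
        System.out.println(decodeAsCodePoints(extractCids(log))); // prints "StarWs"
    }
}
```

For the six warnings above this gives `StarWs`, which is the distinct letters of "Star Wars" in order of first use and fits the attachment name our_star_wars.pdf. The sketch only helps to read the log; it does not repair the invalid ToUnicode CMap.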
      

       

      Attachments

        1. XO_Thames.zip
          171 kB
          Sergey Makarov
        2. our_star_wars.pdf
          31 kB
          Sergey Makarov

          People

            tilman Tilman Hausherr
            sergey.makarov Sergey Makarov