  1. PDFBox
  2. PDFBOX-3962

No unicode mapping / Text not extracting


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      In the attached 72083_qdf.pdf file, the following text (the large letters at the top of the page) is not extracted by PDFTextStripper:

      AGGIE NIGHT
      AT ENRON FIELD
      FRIDAY, JUNE 15, 2001 at 7:05
      HOUSTON ASTROS VS. TEXAS RANGERS
      

      Acrobat Reader does not extract this text correctly either, but some other PDF viewers do extract it properly.

      I also found a workaround that makes the extraction work; see the steps below.

      1. Find this code block in LegacyPDFStreamEngine.java

              if(unicode == null) {
                  if(!(font instanceof PDSimpleFont)) {
                      return;
                  }
                  char c = (char)code;
                  unicode = new String(new char[]{c});
              }
      

      2. Insert this code block just before the block found in step 1.

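              // Workaround: if a Type 1C (CFF) font has no Unicode mapping for this code, look the
              // glyph name up in the encoding built into the font program and use that code instead.
              // Requires imports of java.lang.reflect.Method and java.lang.reflect.InvocationTargetException.
              if (unicode == null) {
                  if (font instanceof PDType1CFont) {
                      String name = ((PDType1CFont) font).codeToName(code);
                      try {
                          // readEncodingFromFont() is not public, so it is called via reflection
                          Method method = PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
                          method.setAccessible(true);
                          Encoding encoding = (Encoding) method.invoke(font);
                          Integer newCode = encoding.getNameToCodeMap().get(name);
                          if (newCode != null && newCode.intValue() != 0) {
                              // intValue() instead of byteValue(): a byte is signed, so codes
                              // above 127 would be converted to the wrong character
                              unicode = new String(new char[]{(char) newCode.intValue()});
                          }
                      } catch (NoSuchMethodException e) {
                          e.printStackTrace();
                      } catch (IllegalAccessException e) {
                          e.printStackTrace();
                      } catch (InvocationTargetException e) {
                          e.printStackTrace();
                      }
                  }
              }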
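
      The idea behind the workaround can be sketched without PDFBox as a two-step lookup: character code to glyph name, then glyph name to the code stored in the font's built-in encoding, which is finally treated as a character. The class below is only an illustration under that assumption. GlyphNameFallback, mapCode, mapFontName and resolve are hypothetical names, and the two maps are plain stand-ins for what codeToName() and getNameToCodeMap() return in the snippet above. Treating the built-in code as a character is a heuristic that gives readable text only when the font's built-in encoding is ASCII-like.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the fallback lookup; hypothetical code, not part of PDFBox. */
class GlyphNameFallback {
    // Stand-in for the PDF-level encoding: character code -> glyph name.
    private final Map<Integer, String> codeToName = new HashMap<>();
    // Stand-in for the encoding built into the font program: glyph name -> code.
    private final Map<String, Integer> fontNameToCode = new HashMap<>();

    public void mapCode(int code, String glyphName) {
        codeToName.put(code, glyphName);
    }

    public void mapFontName(String glyphName, int code) {
        fontNameToCode.put(glyphName, code);
    }

    /**
     * Returns the text for a character code. An existing Unicode mapping
     * always wins; otherwise the glyph name is looked up in the font's
     * built-in encoding. Returns null when nothing can be found.
     */
    public String resolve(int code, String unicode) {
        if (unicode != null) {
            return unicode;
        }
        String name = codeToName.get(code);
        if (name == null) {
            return null;
        }
        Integer newCode = fontNameToCode.get(name);
        if (newCode == null || newCode == 0) {
            return null;
        }
        // Heuristic: interpret the built-in code as a character value.
        return String.valueOf((char) newCode.intValue());
    }

    public static void main(String[] args) {
        GlyphNameFallback font = new GlyphNameFallback();
        font.mapCode(1, "A");      // the PDF uses code 1 for glyph "A"
        font.mapFontName("A", 65); // the font program stores "A" at code 65
        System.out.println(font.resolve(1, null)); // prints "A"
    }
}
```

      In this model a code with no glyph name, or a glyph name missing from the built-in encoding, still yields null, so the existing PDSimpleFont fallback from step 1 would run afterwards as before.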
      

      Attachments

        1. 72083_qdf.pdf
          164 kB
          Roman


          People

            Assignee: Unassigned
            Reporter: Roman (rmakarov)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: