  1. PDFBox
  2. PDFBOX-3962

No unicode mapping / Text not extracting

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

      Description

      PDFTextStripper does not extract the following text (the large letters at the top of the page) from the attached 72083_qdf.pdf file:

      AGGIE NIGHT
      AT ENRON FIELD
      FRIDAY, JUNE 15, 2001 at 7:05
      HOUSTON ASTROS VS. TEXAS RANGERS
      

      Acrobat Reader does not handle this text well either, although some other PDF viewers extract it correctly.

      I also found a workaround that makes the extraction work; see below.

      1. Find this code block in LegacyPDFStreamEngine.java:

              if(unicode == null) {
                  if(!(font instanceof PDSimpleFont)) {
                      return;
                  }
                  char c = (char)code;
                  unicode = new String(new char[]{c});
              }
      

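      2. Insert this code block just before the one found in step 1:

              // Needs imports for java.lang.reflect.Method and
              // java.lang.reflect.InvocationTargetException, plus PDFBox's
              // Encoding and PDType1CFont classes if they are not already imported.
              if (unicode == null) {
                  if (font instanceof PDType1CFont) {
                      // Glyph name that the font assigns to this character code.
                      String name = ((PDType1CFont) font).codeToName(code);
                      try {
                          // readEncodingFromFont() is not accessible from here, so call it via reflection.
                          Method method = PDType1CFont.class.getDeclaredMethod("readEncodingFromFont");
                          method.setAccessible(true);
                          Encoding encoding = (Encoding) method.invoke(font);
                          // Look the glyph name up in the font's built-in encoding.
                          Integer newCode = encoding.getNameToCodeMap().get(name);
                          if (newCode != null && newCode.intValue() != 0) {
                              // intValue() rather than byteValue(): a byte is signed, so codes
                              // above 127 would turn into wrong characters after the cast.
                              unicode = new String(new char[]{(char) newCode.intValue()});
                          }
                      } catch (NoSuchMethodException e) {
                          e.printStackTrace();
                      } catch (IllegalAccessException e) {
                          e.printStackTrace();
                      } catch (InvocationTargetException e) {
                          e.printStackTrace();
                      }
                  }
              }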
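
      The idea behind the workaround can be sketched without PDFBox. The class below is only an illustration under stated assumptions: NameToCodeFallback, resolve, and both maps are hypothetical stand-ins (they are not PDFBox API), and the sample codes are made up. It maps a character code to a glyph name, looks that name up in a second "built-in" name-to-code table, and treats the resulting code as the character. This can only give the right text when the built-in code happens to equal the character's Unicode value, which is likely why it works for this file but is not a general fix.

```java
import java.util.HashMap;
import java.util.Map;

public class NameToCodeFallback {

    /**
     * Hypothetical stand-in for the workaround: code -> glyph name -> code in
     * the font's built-in encoding -> character. Returns null when there is
     * no usable mapping, so the caller can fall through to its default path.
     */
    static String resolve(int code,
                          Map<Integer, String> codeToName,
                          Map<String, Integer> builtInNameToCode) {
        String name = codeToName.get(code);
        if (name == null) {
            return null;
        }
        Integer newCode = builtInNameToCode.get(name);
        if (newCode == null || newCode.intValue() == 0) {
            return null;
        }
        // Cast from int, not byte: a signed byte would corrupt codes above 127.
        return String.valueOf((char) newCode.intValue());
    }

    public static void main(String[] args) {
        // Made-up sample data: the PDF uses codes 1 and 2 for these glyphs.
        Map<Integer, String> codeToName = new HashMap<>();
        codeToName.put(1, "A");
        codeToName.put(2, "eacute");

        // Made-up built-in encoding whose codes match Latin-1 values.
        Map<String, Integer> builtIn = new HashMap<>();
        builtIn.put("A", 65);
        builtIn.put("eacute", 233);

        // Print the resolved character values as numbers to avoid console encoding issues.
        System.out.println((int) resolve(1, codeToName, builtIn).charAt(0));
        System.out.println((int) resolve(2, codeToName, builtIn).charAt(0));
        System.out.println(resolve(3, codeToName, builtIn));
    }
}
```

      Returning null for an unknown code or glyph name mirrors the workaround, where unicode stays null and the existing PDSimpleFont fallback from step 1 still runs afterwards.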
      

            People

            • Assignee: Unassigned
            • Reporter: rmakarov Roman