  1. PDFBox
  2. PDFBOX-373

(null) printed when characters cannot be decoded during text extraction

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Parsing
    • Labels:
      None

      Description

      We have some PDF files in which the TO_UNICODE map is corrupt, so PDFBox cannot decode some of the characters during text extraction. For those characters font.encode() returns null, and PDFStreamEngine.showString() appends the null to the result, which is then printed as "(null)".

      Here is a patch (against the trunk) that replaces the null with "?".

      --- PDFStreamEngine.java 2008-09-17 16:09:13.529318500 -0400
      +++ PDFStreamEngine-new.java 2008-09-17 16:12:51.617318500 -0400
      @@ -422,6 +422,11 @@
       }
       }

      + // Replace a null entry with "?" so it is not printed as "(null)"
      + if (c == null)
      + {
      +     c = "?";
      + }
       totalStringWidth += width;
       stringResult.append( c );
       }
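
      The effect of the guard in the patch can be reproduced outside PDFBox. The following is a minimal standalone sketch, not PDFBox code: the class NullGlyphDemo and its methods are hypothetical, and the list of strings stands in for the per-character results of font.encode(). It relies only on the standard Java behaviour that appending a null String to a StringBuilder appends the four characters "null".

```java
import java.util.Arrays;
import java.util.List;

public class NullGlyphDemo {

    // Hypothetical helper mirroring the patch: substitute "?" for an undecodable character.
    static String orPlaceholder(String c) {
        return (c == null) ? "?" : c;
    }

    // Concatenate decoded characters; a null entry models a failed decode.
    static String join(List<String> decoded, boolean guard) {
        StringBuilder result = new StringBuilder();
        for (String c : decoded) {
            // Without the guard, append((String) null) adds the literal text "null".
            result.append(guard ? orPlaceholder(c) : c);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<String> decoded = Arrays.asList("a", null, "b");
        System.out.println(join(decoded, false)); // prints "anullb"
        System.out.println(join(decoded, true));  // prints "a?b"
    }
}
```

      Replacing the null at the point where it is appended, as the patch does, keeps the character count of the extracted text stable and marks the undecodable position visibly.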

              People

              • Assignee: Jukka Zitting (jukkaz)
              • Reporter: Brian Carrier (carrier)
              • Votes: 0
              • Watchers: 0
