PDFBox / PDFBOX-347

Spaces removed after text extraction

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Text extraction
    • Labels: None

    Description

      [Issue from SourceForge]
      http://sourceforge.net/tracker/index.php?func=detail&aid=1912364&group_id=78314&atid=552832

      The spaces between words from the attached PDF file are removed upon text
      extraction.

      I traced the code and found that the cause appears to be a
      division-by-zero bug in PDCIDFont.java.

      In PDCIDFont.getAverageFontWidth(), widths is returned as null from

      COSArray widths = (COSArray)font.getDictionaryObject( COSName.getPDFName( "W" ) );

      which causes characterCount to remain 0.

      As a result, the following line

      float average = totalWidths / characterCount;

      evaluates to NaN. The NaN propagates up through the calling methods, and
      the spaces end up being removed.
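
      The behaviour described above can be reproduced in isolation. The
      sketch below is not PDFBox code: the class NanAverageDemo, the
      looksLikeWordGap helper, and its 0.5 factor are hypothetical and only
      illustrate how a NaN average could make every "is this gap wide enough
      to be a space?" test fail, since any ordered comparison with NaN is
      false in Java.

```java
public class NanAverageDemo {

    // Same shape as the reported expression: a float divided by an int.
    // With totalWidths == 0 and characterCount == 0 this is 0.0f / 0.0f,
    // which is NaN (no ArithmeticException is thrown for floating point).
    public static float average(float totalWidths, int characterCount) {
        return totalWidths / characterCount;
    }

    // Hypothetical stand-in for a space-detection check; the real PDFBox
    // logic may differ. A gap counts as a word break when it exceeds a
    // fraction of the average character width.
    public static boolean looksLikeWordGap(float gap, float averageCharWidth) {
        return gap > averageCharWidth * 0.5f;
    }

    public static void main(String[] args) {
        float broken = average(0f, 0);
        System.out.println(Float.isNaN(broken));            // true
        System.out.println(looksLikeWordGap(10f, broken));  // false: comparison with NaN
        System.out.println(looksLikeWordGap(10f, 12f));     // true: 10 > 6
    }
}
```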

      I suggest the following fix. Instead of:

      float average = totalWidths / characterCount;

      use:

      float average = defaultWidth;

      if (characterCount > 0) {
          average = totalWidths / characterCount;
      }
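
      A self-contained sketch of the guarded computation follows. It is an
      illustration under assumptions, not the actual PDFBox patch: the class
      GuardedAverageDemo is hypothetical, and a plain float[] (possibly null)
      stands in for the COSArray of widths.

```java
public class GuardedAverageDemo {

    // Returns the mean of the given widths, or defaultWidth when there is
    // nothing to average (null or empty array), so the result is never NaN
    // because of a zero character count.
    public static float averageWidth(float[] widths, float defaultWidth) {
        float totalWidths = 0f;
        int characterCount = 0;

        if (widths != null) {
            for (float width : widths) {
                totalWidths += width;
                characterCount++;
            }
        }

        // Fall back to the default first, then override only when safe.
        float average = defaultWidth;
        if (characterCount > 0) {
            average = totalWidths / characterCount;
        }
        return average;
    }

    public static void main(String[] args) {
        System.out.println(averageWidth(null, 1000f));                      // 1000.0
        System.out.println(averageWidth(new float[] {500f, 700f}, 1000f));  // 600.0
    }
}
```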

      [Comment on SourceForge]
      Date: 2008-03-12 03:01
      Sender: choongyong
      Logged In: YES
      user_id=2033885
      Originator: NO

      I realised that I was not logged in when I raised this request.
      I am adding this comment so that the developers can contact me.

      [Comment on SourceForge]
      Date: 2008-03-17 21:50
      Sender: nobody
      Logged In: NO

      I have noticed that there is no space between two words if they are
      separated by a new line (or if the second word is on the next line
      because the first reaches the right margin).

      Could you please correct this?

            People

              Assignee: Unassigned
              Reporter: Jukka Zitting (jukkaz)