[PDFBOX-571] Dubious handling of word spacing (Tw)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 1.0.0
    • Component/s: Text extraction, Utilities
    • Labels: None

    Description

      I want to provide a counterexample to the current handling of word spacing.

      The sample page (pg_0005.pdf) uses a Type1C font for text rendering. This font uses a custom encoding in which code values are assigned sequentially, starting from 1. As a result, code value 32 is assigned to the digit "3", not to the space character " " as one would expect.

      The PDF producer software has (mis-)used word spacing to break up longer character sequences. For example, on table line 3, the character sequence "0.831.05" is broken into two cells, "0.83" and "1.05". Other uses of this "optimization" can be seen by opening the sample page in Acrobat Reader (tested with version 7.0) and performing the "Select all" operation. I've attached a screenshot of Acrobat Reader (pg_0005_selectall.png) for your convenience.
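
      To illustrate the behaviour described above, here is a minimal Java sketch. It assumes that word spacing (Tw) is keyed on the single-byte code value 32 rather than on the decoded character, which is how I read the PDF specification. The class WordSpacingSketch, its methods and the sample encoding table are all hypothetical and are not PDFBox code. The codes chosen for "0.831.05" are made up, except that code 32 maps to "3" as in the sample page.

      ```java
      import java.util.ArrayList;
      import java.util.List;

      // Hypothetical illustration only; none of these names exist in PDFBox.
      class WordSpacingSketch {
          // Tw is tied to this code value, not to the glyph it decodes to.
          static final int WORD_SPACING_CODE = 32;

          // Horizontal advance of one glyph: (w0 / 1000 * Tfs + Tc + Tw) * Th,
          // where Tw is added only when the code value is 32.
          static double advance(int code, double glyphWidth, double fontSize,
                                double charSpacing, double wordSpacing,
                                double horizontalScaling) {
              double tx = glyphWidth / 1000.0 * fontSize + charSpacing;
              if (code == WORD_SPACING_CODE) {
                  tx += wordSpacing;
              }
              return tx * horizontalScaling;
          }

          // Made-up encoding table with codes assigned sequentially from 1:
          // codes 1..28 are fillers, 29='0', 30='.', 31='8', 32='3', 33='1', 34='5'.
          static String sampleTable() {
              StringBuilder table = new StringBuilder();
              for (int i = 0; i < 28; i++) {
                  table.append('?');
              }
              return table.append("0.8315").toString();
          }

          // Decodes a code through the sequential table (code 1 is index 0).
          static String decode(int code, String glyphsInCodeOrder) {
              return String.valueOf(glyphsInCodeOrder.charAt(code - 1));
          }

          // Splits the decoded text after every code 32 when Tw is positive.
          // A real extractor would compare the resulting gap against a threshold.
          static List<String> splitOnWordSpacing(int[] codes, String glyphsInCodeOrder,
                                                 double wordSpacing) {
              List<String> parts = new ArrayList<>();
              StringBuilder current = new StringBuilder();
              for (int code : codes) {
                  current.append(decode(code, glyphsInCodeOrder));
                  if (code == WORD_SPACING_CODE && wordSpacing > 0) {
                      parts.add(current.toString());
                      current.setLength(0);
                  }
              }
              if (current.length() > 0) {
                  parts.add(current.toString());
              }
              return parts;
          }
      }
      ```

      With this made-up table, the code sequence 29, 30, 31, 32, 33, 30, 29, 34 decodes to "0.831.05". Because the extra spacing is applied after code 32 (the digit "3"), the text falls apart into "0.83" and "1.05" even though no space character is present.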

      Attachments

        1. PDFStreamEngine.patch
          1 kB
          Villu Ruusmann
        2. pg_0005_selectall.png
          199 kB
          Villu Ruusmann
        3. pg_0005.pdf
          32 kB
          Villu Ruusmann

            People

              Assignee: Unassigned
              Reporter: Villu Ruusmann (vfed)