PDFBox / PDFBOX-2508

Text extraction returns zero font height, bad widths, and '?' for text in a PDF with Type 3 fonts

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.7, 2.0.0
    • Fix Version/s: 1.8.11, 2.0.0
    • Component/s: Text extraction

      Description

      The attached file contains only the first line of a larger file in which every piece of text shows these problems. All other lines were removed with Nitro to produce a small test case.

      This is the output from the PrintTextLocations example:
      String[211.92,356.8801 fs=58.0 xscale=58.0 height=1.75392 space=190528.28 width=1.7052002]?
      String[129.84,347.04 fs=58.0 xscale=58.0 height=2.72832 space=288435.66 width=2.679596]?
      String[70.32,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=7.0643997]?
      String[77.3844,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=4.8720016]?
      String[82.2564,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.333603]?
      String[88.590004,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.577202]?
      String[95.167206,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.0899963]?
      String[101.2572,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.333603]?
      String[107.590805,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.0899963]?
      String[113.6808,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=4.8720016]?
      String[118.5528,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=3.1668015]?
      String[121.719604,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.333603]?
      String[128.0532,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.577194]?
      String[134.63042,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.0899963]?
      String[140.72041,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=3.1667938]?
      String[522.95984,293.28 fs=58.0 xscale=58.0 height=1.36416 space=150394.36 width=1.4616089]?

      The font size is far too large (it should be around 8), the space width is implausibly large, and the height is too small. In addition, each character is extracted as a '?'. The original file shows this on every piece of text.
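
The three symptoms described above can be checked mechanically against the PrintTextLocations output. The following is only a rough sketch: the class name `TextLocationCheck`, the method `problems`, and both thresholds (height below 20% of the font size, space width above 10 times the font size) are assumptions made for illustration, not part of PDFBox. It parses lines in exactly the format quoted above and uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: flags the symptoms reported in this issue.
final class TextLocationCheck {
    // Matches one line of the output format quoted in the description:
    // String[x,y fs=.. xscale=.. height=.. space=.. width=..]text
    private static final Pattern LINE = Pattern.compile(
        "String\\[([^,\\s]+),(\\S+) fs=(\\S+) xscale=(\\S+) height=(\\S+)"
        + " space=(\\S+) width=([^\\]\\s]+)\\](.*)");

    // Returns a human-readable list of symptoms found on one output line.
    static List<String> problems(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        double fontSize = Double.parseDouble(m.group(3));
        double height = Double.parseDouble(m.group(5));
        double space = Double.parseDouble(m.group(6));
        String text = m.group(8);

        List<String> found = new ArrayList<>();
        // Assumed heuristic: glyph height should be a sizable fraction of the font size.
        if (height < 0.2 * fontSize) {
            found.add("height " + height + " is tiny relative to fs " + fontSize);
        }
        // Assumed heuristic: a space should not be many times wider than the font size.
        if (space > 10 * fontSize) {
            found.add("space width " + space + " is implausibly large");
        }
        // A lone '?' is what the reporter sees for every character.
        if (text.equals("?")) {
            found.add("text extracted as '?'");
        }
        return found;
    }

    public static void main(String[] args) {
        // First line of the output quoted in the description.
        String reported = "String[211.92,356.8801 fs=58.0 xscale=58.0 "
            + "height=1.75392 space=190528.28 width=1.7052002]?";
        for (String p : problems(reported)) {
            System.out.println(p);
        }
    }
}
```

With these assumed thresholds, every line in the output above trips all three checks. A genuine question mark in the text would also be flagged by the third check, so that one is only meaningful in combination with the other two.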

      In Acrobat everything looks fine, both in the original and in this test case.

        Attachments

        1. screenshot of acrobat.png
          45 kB
          Fred Andrews
        2. badtext.pdf
          242 kB
          Fred Andrews

              People

              • Assignee: Tilman Hausherr (tilman)
              • Reporter: Fred Andrews (fred_andrews)