Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-2508

Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.8.7, 2.0.0
    • 1.8.11, 2.0.0
    • Text extraction

    Description

      Attached file is just line one from a file where every piece of text has these problems. All the other lines were removed with Nitro to make a small test case.

      This is the output from PrintTextLocations example:
      String[211.92,356.8801 fs=58.0 xscale=58.0 height=1.75392 space=190528.28 width=1.7052002]?
      String[129.84,347.04 fs=58.0 xscale=58.0 height=2.72832 space=288435.66 width=2.679596]?
      String[70.32,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=7.0643997]?
      String[77.3844,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=4.8720016]?
      String[82.2564,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.333603]?
      String[88.590004,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.577202]?
      String[95.167206,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.0899963]?
      String[101.2572,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.333603]?
      String[107.590805,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.0899963]?
      String[113.6808,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=4.8720016]?
      String[118.5528,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=3.1668015]?
      String[121.719604,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.333603]?
      String[128.0532,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.577194]?
      String[134.63042,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=6.0899963]?
      String[140.72041,299.28 fs=58.0 xscale=58.0 height=3.31296 space=349985.12 width=3.1667938]?
      String[522.95984,293.28 fs=58.0 xscale=58.0 height=1.36416 space=150394.36 width=1.4616089]?

      Fontsize is way too big (should be more like 8), value for space is ridiculous, height is too small. And each character is coming through as a '?'. The original file has this on every piece of text.

      In Acrobat everything looks fine, both in the original and in this test case.

      Attachments

        1. screenshot of acrobat.png
          45 kB
          Fred Andrews
        2. badtext.pdf
          242 kB
          Fred Andrews

        Issue Links

          Activity

            People

              tilman Tilman Hausherr
              fred_andrews Fred Andrews
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: