  1. PDFBox
  2. PDFBOX-3464

character height 3 times higher than expected

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None

      Description

      This issue is essentially the same as PDFBOX-2749, but the wrong sample was attached to that issue by mistake. The correct PDF is attached here.

      The core of the problem is that the font height for this specific font is determined incorrectly; see the commented code below.

      The issue was reproduced on PDFBox 1.8.4; as we tested earlier, versions 1.8.9 and 2.0 give the same result.

      public class Extractor extends PDFTextStripper {
      //<...CUT...>
      	protected void writePage() throws IOException {
      		for (List<TextPosition> textList : charactersByArticle) { // charactersByArticle is inherited from the base class
      			Iterator<TextPosition> textIter = textList.iterator();
      //<...CUT...>
      			while (textIter.hasNext()) {
      				TextPosition position = textIter.next();
      //<...CUT...>
      		PDFontDescriptor fontDescriptor = position.getFont().getFontDescriptor();
      //<...CUT...>
      
      		float yscale = position.getTextPos().getYScale();
      		float asc = Math.abs(fontDescriptor.getAscent() / 1000 * yscale);
      		float rh = Math.abs(fontDescriptor.getFontBoundingBox().getUpperRightY() / 1000 * yscale);
      
      		float desc = Math.abs(fontDescriptor.getDescent() / 1000 * yscale);
      		float capHeight = Math.abs(fontDescriptor.getCapHeight() / 1000 * yscale);
      		if (capHeight == 0)
      			capHeight = position.getHeight();
      
      		float h = (rh + Math.max(Math.max(capHeight, position.getHeight()), asc)) / 2;
      
      //"h" evaluates to 37.39 (should be between 11 and 12)
      //"desc" evaluates to 2.664
      //"capHeight" evaluates to 37.39
      //"position.getHeight()" evaluates to 33.48
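
The arithmetic behind these numbers can be checked in isolation. The sketch below is not part of the reporter's extractor: the class name `HeightCheck` and the method `combinedHeight` are hypothetical, and it uses plain floats instead of PDFBox objects. It applies the same formula as the code above and sets `rh` and `asc` to 0, the smallest values they can take since both are absolute values. Even then the result is about 18.7, above the expected 11 to 12, so the reported `capHeight` and `position.getHeight()` alone are enough to push `h` too high.

```java
// Hypothetical stand-alone check; no PDFBox dependency.
class HeightCheck {
    // Same formula as in the report: the average of the bounding-box height (rh)
    // and the largest of cap height, glyph height and ascent.
    static float combinedHeight(float rh, float capHeight, float glyphHeight, float asc) {
        return (rh + Math.max(Math.max(capHeight, glyphHeight), asc)) / 2;
    }

    public static void main(String[] args) {
        float capHeight = 37.39f;   // value reported in the issue
        float glyphHeight = 33.48f; // position.getHeight(), value reported in the issue

        // ASSUMPTION: rh and asc are not given in the report. Both come from
        // Math.abs(...), so 0 is the smallest possible value for each.
        float lowerBound = combinedHeight(0f, capHeight, glyphHeight, 0f);

        System.out.println("smallest possible h: " + lowerBound);
        System.out.println("still above expected maximum of 12: " + (lowerBound > 12f));
    }
}
```

Because the lower bound already exceeds the expected range, adjusting how `rh` or `asc` enter the average would not bring `h` back to 11 to 12; the inputs reported for this font are themselves about three times too large.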
      
      

        Attachments

        1. nowItsHelped.png
          385 kB
          Roman
        2. notHelped.png
          381 kB
          Roman
        3. screenshot.png
          157 kB
          Tilman Hausherr
        4. screenshot-1.png
          377 kB
          Roman
        5. subnode.docx.pdf
          25 kB
          Roman

            People

            • Assignee:
              Unassigned
              Reporter:
              rmakarov Roman
            • Votes:
              0
              Watchers:
              3