PDFBox / PDFBOX-577

TextPosition should expose its bounding box

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: PDModel
    • Labels:
      None

      Description

      It does not seem to be possible to calculate the bounding box of a TextPosition.

      IIUC, TextPosition#getY is the baseline of the text and TextPosition#getHeight is the absolute height of the text. When I subtract the latter from the former I get a top line, but this is only correct if the text does not contain descender characters.

      Below is a screenshot (AFM-getHeight.png) which shows the bounding boxes of TextPositions calculated as

      {#getX(), #getY() - #getHeight(), #getWidth(), #getHeight()}

      painted in random colors. For example, the bounding boxes of parentheses are severely misplaced, which makes line-by-line text extraction impossible.

      Right now I've solved the problem by tweaking AFM FontMetrics code so that it returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight when queried via PDSimpleFont#getFontHeight(byte[], int, int). Another screenshot (AFM-getUpperRightY.png) shows how this restores the previously broken text extraction ability.
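To make the difference concrete, here is a minimal sketch in plain Java that does not use any PDFBox types. The class `NaiveBoxSketch`, its `Rect` helper, and the glyph extents are all hypothetical and chosen only for illustration. It assumes y-down coordinates, as the formula above does, and a parenthesis-like glyph that dips below the baseline.

```java
// Sketch only: plain Java, no PDFBox types. Coordinates are y-down
// (a smaller y is higher on the page), matching the formula in the report.
final class NaiveBoxSketch {

    /** Axis-aligned rectangle given by its top-left corner and size. */
    static final class Rect {
        final double x, y, width, height;

        Rect(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        double bottom() {
            return y + height;
        }

        @Override
        public String toString() {
            return "top=" + y + " bottom=" + bottom();
        }
    }

    /** Formula from the report: {x, baseline - height, width, height}. */
    static Rect naive(double x, double baselineY, double width, double height) {
        return new Rect(x, baselineY - height, width, height);
    }

    /**
     * Box built from the glyph's extents relative to the baseline:
     * upperRightY is how far the glyph rises above the baseline,
     * lowerLeftY is negative when the glyph dips below it.
     */
    static Rect baselineAware(double x, double baselineY, double width,
                              double lowerLeftY, double upperRightY) {
        return new Rect(x, baselineY - upperRightY, width, upperRightY - lowerLeftY);
    }

    public static void main(String[] args) {
        double baseline = 100.0;
        double lowerLeftY = -2.0;  // hypothetical descender depth, page units
        double upperRightY = 7.5;  // hypothetical rise above the baseline

        // Height = full box height: the box is shifted up by the descender depth.
        System.out.println("full height:    " + naive(10, baseline, 3, upperRightY - lowerLeftY));
        // Height = upperRightY (the workaround): correct top, descender cut off.
        System.out.println("upperRightY:    " + naive(10, baseline, 3, upperRightY));
        // Both extents known: the box covers the whole glyph.
        System.out.println("baseline-aware: " + baselineAware(10, baseline, 3, lowerLeftY, upperRightY));
    }
}
```

With the full box height, the rectangle ends at the baseline and starts too high, which would explain the misplaced parentheses. With upperRightY the top edge is right, but anything below the baseline is still outside the box.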

      It seems like a good idea to rework TextPosition so that it would be aware of its bounding box:
      *) Replace methods PDSimpleFont#getFontWidth(byte[], int, int) and PDSimpleFont#getFontHeight(byte[], int, int) with a single method PDSimpleFont#getFontBoundingBox(byte[], int, int)
      *) Replace the constructor TextPosition(Matrix, Matrix) with TextPosition(Matrix, BoundingBox)
      *) Add new methods TextPosition#getBoundingBox, TextPosition#getBoundingBoxDir. This shouldn't affect existing application clients, because TextPosition#getY and TextPosition#getHeight remain in place.
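As a rough illustration of the proposed shape, the sketch below keeps the existing accessors and adds a bounding-box accessor next to them. `TextPositionSketch` is a hypothetical stand-in, not the real TextPosition class: its constructor takes plain numbers instead of Matrix and BoundingBox, it assumes y-down page coordinates with glyph extents already scaled to page units, and it leaves out getBoundingBoxDir.

```java
// Hypothetical API shape only; these are not the real PDFBox classes.
final class TextPositionSketch {
    private final double x;          // glyph origin, page space
    private final double baselineY;  // baseline, page space, y-down
    // Glyph box relative to the origin (y-up), already scaled to page units.
    private final double llx, lly, urx, ury;

    TextPositionSketch(double x, double baselineY,
                       double llx, double lly, double urx, double ury) {
        this.x = x;
        this.baselineY = baselineY;
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    // Existing accessors keep their meaning, so current callers are unaffected.
    double getX() { return x; }
    double getY() { return baselineY; }
    double getWidth() { return urx - llx; }
    double getHeight() { return ury - lly; }

    /** New accessor: {left, top, width, height} in y-down page space. */
    double[] getBoundingBox() {
        return new double[] { x + llx, baselineY - ury, urx - llx, ury - lly };
    }
}
```

Because the box carries the baseline offset itself, callers would no longer need to guess it from getY() and getHeight().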

      1. textposition-randombg.zip
        3 kB
        Villu Ruusmann
      2. AFM-getUpperRightY.png
        1.07 MB
        Villu Ruusmann
      3. AFM-getHeight.png
        1.07 MB
        Villu Ruusmann
      4. 0001-PDFont.java-Add-methods-to-retreive-the-Ascent-and-D.patch
        4 kB
        Karl Ward

        Activity

        Villu Ruusmann added a comment -

        Replaced the sample application. Thanks to Arun for pointing out that it's safer to use PDPage#findMediaBox instead of PDPage#getMediaBox.

        Villu Ruusmann added a comment -

        Due to popular demand, here's the sample application that paints the text background in random colors. I suspect that its PDFBox API usage is outdated, so take care.

        Arun K. M added a comment -

        I would be very grateful if you could post your code for drawing the bounding boxes if possible (I assume you use the pdf to image routines) and also share any other hints or directions you may have. I have been thinking of intercepting the graphics context and recording all the "pixels" actually drawn on the canvas for each character and then computing the bounding box - a bit extreme I know, but I am also wondering if that information is available as you have started down the path. I am interested in figuring out superscripts and subscripts and using the Unicode codes to provide those in pdf to text for better extracts. Any hints, guidance, pointers or experimental code is welcome. I am new to PDFBOX and learning it at this time and I think that highly accurate bounding box computations would be well worthwhile. Thanks!
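The pixel-recording idea reduces to finding the tightest rectangle around the pixels that were actually painted. The sketch below shows only that last step, assuming the drawn pixels of one character have already been captured into a boolean raster; capturing them from a graphics context is not shown. `InkBoundsSketch` and `inkBounds` are hypothetical names.

```java
import java.util.Arrays;

// Sketch only: computes the tight bounds of the set pixels in a raster.
final class InkBoundsSketch {

    /**
     * Returns {minCol, minRow, maxCol, maxRow} over all pixels that are set,
     * or null if the raster contains no set pixel. raster[row][col] is true
     * where the character painted something.
     */
    static int[] inkBounds(boolean[][] raster) {
        int minCol = Integer.MAX_VALUE;
        int minRow = Integer.MAX_VALUE;
        int maxCol = -1;
        int maxRow = -1;
        for (int row = 0; row < raster.length; row++) {
            for (int col = 0; col < raster[row].length; col++) {
                if (raster[row][col]) {
                    minCol = Math.min(minCol, col);
                    maxCol = Math.max(maxCol, col);
                    minRow = Math.min(minRow, row);
                    maxRow = Math.max(maxRow, row);
                }
            }
        }
        return maxRow < 0 ? null : new int[] { minCol, minRow, maxCol, maxRow };
    }

    public static void main(String[] args) {
        boolean[][] raster = new boolean[6][5];
        raster[1][2] = true;
        raster[3][3] = true;
        raster[4][1] = true;
        System.out.println(Arrays.toString(inkBounds(raster)));
    }
}
```

The result depends on the render resolution, so it approximates the glyph outline rather than reproducing the font's own metrics.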

        Karl Ward added a comment -

        Not so much copied as implied from. I doubt that would be an issue.

        But thanks for pointing out the existing methods. It's a shame PDFontDescriptorDictionary#getDescent() does not account for the common problem of an incorrectly positive value. Perhaps a reading of prior work, such as xpdf, would have been beneficial here.
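A defensive wrapper for the sign problem could look like the sketch below. The PDF specification defines the font descriptor's Descent as a negative number, so a positive value is treated here as a sign error. `DescentSketch` and `normalizeDescent` are hypothetical names; this is neither PDFBox code nor taken from xpdf.

```java
// Sketch only: a sign-normalizing helper, not part of PDFBox.
final class DescentSketch {

    /**
     * Returns the descent as a non-positive number. Font descriptors are
     * supposed to store a negative Descent, but some files store it with a
     * positive sign; this assumes such values are sign errors.
     */
    static float normalizeDescent(float rawDescent) {
        return rawDescent > 0 ? -rawDescent : rawDescent;
    }

    public static void main(String[] args) {
        System.out.println(normalizeDescent(207f));
        System.out.println(normalizeDescent(-207f));
    }
}
```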

        Andreas Lehmkühler added a comment -

        The class PDFontDescriptor already implements getters and setters for both values.

        Besides, it is always a problem to just copy some code from elsewhere. Even if it is open source, the license has to be compatible too. xpdf is licensed under GPL2 and that is a problem for us. See [1] for further details.

        [1] http://www.apache.org/legal/resolved.html

        Karl Ward added a comment -

        PDFont#getAscent() and #getDescent() patch.

        Karl Ward added a comment -

        I'm actually only interested in the bounding box that fully encapsulates all the characters from a run of text. So, I am using the Ascent and Descent values from the font's font descriptor dictionary, along with baseline position, to calculate a maximum and minimum y for a particular run of text.

        Attached is a patch that adds getAscent() and getDescent() to PDFont. These new methods mimic those found in GfxFont in the xpdf project (which are in fact used by the pdf2html tool to perform text extraction).
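The calculation described here can be sketched as follows. It is not the attached patch: `RunExtentSketch` and `verticalExtent` are hypothetical names, and the sketch assumes y-up coordinates, Ascent and Descent expressed in 1/1000 of the font size (the usual glyph space for non-Type 3 fonts), and no additional scaling from the text matrix.

```java
// Sketch only: vertical extent of a text run from font-wide metrics.
final class RunExtentSketch {

    /**
     * Returns {minY, maxY} for a run of text in y-up coordinates.
     * ascent and descent are font descriptor values in 1/1000 of the
     * font size; descent is expected to be negative.
     */
    static double[] verticalExtent(double baselineY, double ascent,
                                   double descent, double fontSize) {
        double minY = baselineY + descent / 1000.0 * fontSize;
        double maxY = baselineY + ascent / 1000.0 * fontSize;
        return new double[] { minY, maxY };
    }

    public static void main(String[] args) {
        // Illustrative values: 12 pt text on a baseline at y = 100.
        double[] extent = verticalExtent(100.0, 750.0, -250.0, 12.0);
        System.out.println("minY=" + extent[0] + " maxY=" + extent[1]);
    }
}
```

Since the metrics are font-wide, the extent covers every character in the run, but it can be taller than the glyphs actually present.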

        Villu Ruusmann added a comment -

        I did not go on to implement the proposed solution, because it seemed like too much work to accomplish my modest PDF text extraction goals.

        The patch should include implementations of PDSimpleFont#getFontBoundingBox(byte[], int, int) for all subclasses of PDSimpleFont. There are many obstacles in the way. Some font types lack proper FontBox support, whereas others do not seem to support the concept of "bounding boxes" at the desired level of detail (e.g. they provide the dimensions of the box, but no information about the baseline location within the box).

        Karl Ward added a comment -

        Did anything become of this proposed solution? I'm facing the same problem and considering a patch similar to the one described here, but I'd rather not duplicate work.

        Villu Ruusmann created issue -

          People

          • Assignee: Unassigned
          • Reporter: Villu Ruusmann
          • Votes: 1
          • Watchers: 1
