PDFBox / PDFBOX-3970

x,y coordinates of the text inside a table cell are not extracted correctly.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.7
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
    • Environment:
      Operating system: Windows 7 (64 bit).
    • Flags:
      Important

      Description

      Hello Support Team,

      I am working on a project that parses a whole PDF document and stores the extracted text in a .txt file that can be read by another product.

      My issue concerns extracting the text inside a table cell:
      the x,y coordinates of the text inside the cell are not reported correctly.
      The Y value of the last text line in the cell is larger than the cell's max-Y value.

      I have attached a test file that reproduces this bug.

      As you can see in the test document, there is one cell with text in it and a text paragraph next to that cell.

      The x,y coordinates that I get from PDFBox for the four paths of the cell (two horizontal and two vertical lines) are:
      (in x1,y1,x2,y2 format)
      Horizontal line 1: [100,88,220,88]
      Horizontal line 2: [100,120,220,120]
      Vertical line 1: [100,88,100,120]
      Vertical line 2: [220,88,220,120]

      (The Y values of the above paths are final values, obtained by subtracting the value given by PDFBox from the height of the page, because for paths the y-values run from the bottom of the page upward.)
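
The flip described above can be sketched as follows. This is only a minimal illustration, not PDFBox code: the class `YFlip` and its methods are hypothetical names, and the page height is passed in as a parameter because the actual page height of the test file is not stated in this report.

```java
// Hypothetical helper, not part of PDFBox.
final class YFlip {
    private YFlip() {}

    /** Converts a bottom-up y value (origin at the bottom of the page)
     *  into a top-down y value (origin at the top of the page). */
    static float toTopDown(float pageHeight, float yFromBottom) {
        return pageHeight - yFromBottom;
    }

    /** Flips both y values of a line given as {x1, y1, x2, y2}; x values are unchanged. */
    static float[] flipLine(float pageHeight, float[] line) {
        return new float[] {
            line[0], toTopDown(pageHeight, line[1]),
            line[2], toTopDown(pageHeight, line[3])
        };
    }
}
```

For the comparison with text positions to be meaningful, the text bounding boxes have to be expressed in the same top-down space as the flipped path values.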

      The bounding box of the last line in that cell is [102,114,59,7], so the max-Y of that line is 121 (min-Y + height = 114 + 7).

      So, comparing the max-Y value of the cell (120) with that of the last line in the cell (121), the line clearly extends 1 unit beyond the bottom of the cell.
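
The comparison above can be reproduced with a small check. This is a sketch under the assumption that the bounding box is given as [x, min-Y, width, height] in the same top-down coordinate space as the flipped cell lines; `CellCheck` and its methods are hypothetical names, not PDFBox API.

```java
// Hypothetical helper, not part of PDFBox.
final class CellCheck {
    private CellCheck() {}

    /** Bottom edge (max-Y) of a text box given as {x, minY, width, height}. */
    static float maxY(float[] textBox) {
        return textBox[1] + textBox[3];
    }

    /** True if the text box lies vertically within [cellMinY, cellMaxY]. */
    static boolean fitsVertically(float[] textBox, float cellMinY, float cellMaxY) {
        return textBox[1] >= cellMinY && maxY(textBox) <= cellMaxY;
    }

    public static void main(String[] args) {
        float[] lastLine = {102f, 114f, 59f, 7f};   // values from this report
        System.out.println(maxY(lastLine));                       // prints 121.0
        System.out.println(fitsVertically(lastLine, 88f, 120f));  // prints false
    }
}
```

With the values from this report, the last line's max-Y (121) exceeds the cell's max-Y (120), so the check fails.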

      What could be the reason for this?

      Thank you in advance!
      Regards,
      Navnath Kumbhar

        Attachments

        1. wrong_space_parsed_sample.pdf
          61 kB
          Tilman Hausherr
        2. simpleAnnotation.pdf
          85 kB
          Navnath Kumbhar
        3. paragraphNextToTable-marked-1.png
          48 kB
          Tilman Hausherr
        4. paragraphNextToTable.pdf
          0.9 kB
          Navnath Kumbhar
        5. LegacyPDFStreamEngine.java
          16 kB
          Tilman Hausherr
        6. LegacyPDFStreamEngine.java
          15 kB
          Tilman Hausherr
        7. formula-marked-34.png
          309 kB
          Navnath Kumbhar

            People

            • Assignee: Unassigned
            • Reporter: Navnath@3DS Navnath Kumbhar
            • Votes: 0
            • Watchers: 3
