  PDFBox / PDFBOX-3749

void writeString(String text, List<TextPosition> textPositions) is not called per line

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.4
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Environment:
      Windows 10 64-bit

      Description

      We overrode the void writeString(String text, List<TextPosition> textPositions) method of PDFTextStripper to extract additional position and style information from PDFs. We assumed this method would be called once per line, and that the List<TextPosition> textPositions parameter would contain every character of that line, including the spaces.

      This is indeed the case for thousands of documents. However, for one particular document it is not: writeString is called once per word, and textPositions contains only the letters of that word.

      I am not sure if this would be counted as a bug because the final extracted text is not affected.

      The problematic PDF is attached.
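
A minimal sketch of one way to avoid depending on how often writeString is called: buffer each chunk and close the line only when a line separator is signalled. It is deliberately library-independent so it runs without PDFBox. `LineBuffer`, `Line`, and the `on...` method names are hypothetical; the assumption is that a PDFTextStripper subclass would forward its overridden writeString, writeWordSeparator, and writeLineSeparator calls to them, with `P` standing in for TextPosition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: merges per-call text chunks into whole lines. */
final class LineBuffer<P> {

    /** One completed line: its text plus the positions collected for it. */
    static final class Line<P> {
        final String text;
        final List<P> positions;

        Line(String text, List<P> positions) {
            this.text = text;
            // Copy, because the caller reuses and clears its working list.
            this.positions = Collections.unmodifiableList(new ArrayList<>(positions));
        }
    }

    private final List<Line<P>> lines = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private final List<P> positions = new ArrayList<>();

    /** Assumed to be called from an overridden writeString(text, textPositions). */
    void onString(String chunk, List<P> chunkPositions) {
        text.append(chunk);
        positions.addAll(chunkPositions);
    }

    /** Assumed to be called from an overridden writeWordSeparator(). */
    void onWordSeparator() {
        text.append(' ');
    }

    /** Assumed to be called from an overridden writeLineSeparator(). */
    void onLineSeparator() {
        flush();
    }

    /** Closes the current line, if any; call once more after the last chunk. */
    void flush() {
        if (text.length() > 0 || !positions.isEmpty()) {
            lines.add(new Line<>(text.toString(), positions));
            text.setLength(0);
            positions.clear();
        }
    }

    List<Line<P>> lines() {
        return Collections.unmodifiableList(lines);
    }

    public static void main(String[] args) {
        LineBuffer<Integer> buffer = new LineBuffer<>();

        // Document A: chunks arrive word by word (integers stand in for positions).
        buffer.onString("Hello", Arrays.asList(0, 1, 2, 3, 4));
        buffer.onWordSeparator();
        buffer.onString("World", Arrays.asList(6, 7, 8, 9, 10));
        buffer.onLineSeparator();

        // Document B: the whole line arrives in a single chunk.
        buffer.onString("Hello World", Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        buffer.flush();

        for (Line<Integer> line : buffer.lines()) {
            System.out.println(line.text + " (" + line.positions.size() + " positions)");
        }
    }
}
```

Both inputs produce the same line text. In the word-by-word case the inserted separator carries no position of its own, so code that needs a position for every space would have to derive it from the neighbouring entries.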

        Attachments

        1. helloworld-marked-1.png
          41 kB
          Tilman Hausherr
        2. helloworld.pdf
          16 kB
          Tilman Hausherr
        3. contract_00105_SEDAR-marked-1.png
          551 kB
          Tilman Hausherr
        4. contract_00105_SEDAR.pdf
          53 kB
          Harun Reşit Zafer

            People

            • Assignee: Unassigned
            • Reporter: Harun Reşit Zafer (hrzafer)