  PDFBox / PDFBOX-3749

void writeString(String text, List<TextPosition> textPositions) is not called per line

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.4
    • None
    • None
    • Environment: Windows 10 64-bit

    Description

      We overrode the writeString(String text, List<TextPosition> textPositions) method of PDFTextStripper to extract additional position and style information from PDFs. We assumed this method would be called once per line, with textPositions containing every character in that line, including the spaces.

      This is indeed the case for thousands of documents. For one particular document, however, writeString is called once per word, and textPositions contains only the letters of that word.

      I am not sure whether this counts as a bug, because the final extracted text is not affected.

      The problematic PDF is attached.
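
      One way to make such an override independent of how often writeString is called is to buffer the positions and close a line only when a line separator arrives. The sketch below shows that buffering on its own, without PDFBox on the classpath. Glyph is a hypothetical stand-in for TextPosition, and LineCollector with its onWriteString, onWordSeparator and onLineSeparator methods is an illustrative name, not part of PDFBox. The assumption is that a PDFTextStripper subclass would forward to these from its writeString, writeWordSeparator and writeLineSeparator overrides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for PDFBox's TextPosition: one glyph and its origin. */
final class Glyph {
    final String unicode;
    final float x;
    final float y;

    Glyph(String unicode, float x, float y) {
        this.unicode = unicode;
        this.x = x;
        this.y = y;
    }
}

/** Buffers glyphs across writeString-style calls until a line separator arrives. */
final class LineCollector {
    private final List<List<Glyph>> lines = new ArrayList<>();
    private List<Glyph> current = new ArrayList<>();

    /** Forward here from writeString; works for whole lines and for single words. */
    void onWriteString(String text, List<Glyph> glyphs) {
        current.addAll(glyphs);
    }

    /** Forward here from the word-separator hook; adds a synthetic space without a position. */
    void onWordSeparator() {
        current.add(new Glyph(" ", Float.NaN, Float.NaN));
    }

    /** Forward here from the line-separator hook; closes the line being collected. */
    void onLineSeparator() {
        if (!current.isEmpty()) {
            lines.add(current);
            current = new ArrayList<>();
        }
    }

    /** Flushes any trailing glyphs and returns all collected lines. */
    List<List<Glyph>> finish() {
        onLineSeparator();
        return lines;
    }

    /** Joins the glyphs of one collected line back into its text. */
    static String textOf(List<Glyph> line) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.unicode);
        }
        return sb.toString();
    }
}
```

      With this in place, the per-line processing moves out of writeString and into the point where a line is closed, so a document that delivers one word per call yields the same per-line lists as one that delivers a whole line per call. The synthetic space has no coordinates, so code that reads positions needs to skip or interpolate it.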

      Attachments

        1. contract_00105_SEDAR.pdf
          53 kB
          Harun Reşit Zafer
        2. contract_00105_SEDAR-marked-1.png
          551 kB
          Tilman Hausherr
        3. helloworld.pdf
          16 kB
          Tilman Hausherr
        4. helloworld-marked-1.png
          41 kB
          Tilman Hausherr

          People

            Assignee: Unassigned
            Reporter: Harun Reşit Zafer (hrzafer)
            Votes: 0
            Watchers: 2
