PDFBox / PDFBOX-4553

Break of backward compatibility from 2.0.14 to 2.0.15


    Details

      Description

      We use PDFTextStripper to parse some PDF documents. Our parsing sometimes relies on the document's template and on the order of the words within it.

      The following Kotlin code prints the text content of the attached file, sorted by position.

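      import java.io.File
      import org.apache.pdfbox.pdmodel.PDDocument
      import org.apache.pdfbox.text.PDFTextStripper

      fun main() {
        val pdfTextStripper = PDFTextStripper()
        pdfTextStripper.sortByPosition = true
        // Load the document, extract its text, and close it afterwards.
        val text = PDDocument.load(File("/path/to/file/KYPolicy2.pdf").readBytes()).use { pdfTextStripper.getText(it) }
        print(text)
      }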
      

      Running this code with PDFBox 2.0.14 and with 2.0.15 gives different output for the following line:

      POLICY PERIOD:  FROM 02/18/2018 TO 02/18/2019 (2.0.14)

      POLICY PERIOD:  FROM 02/18/2018 02/18/2019TO (2.0.15)
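
      For illustration only, here is a minimal sketch of how calling code could read the two dates without depending on where the FROM and TO keywords end up. It assumes the start date always precedes the end date in the extracted line, which holds for both outputs above. `parsePolicyPeriod` is a hypothetical helper, not part of PDFBox or of our actual parser.

```kotlin
// Hypothetical helper: not part of PDFBox or of the original report.
private val DATE = Regex("""\d{2}/\d{2}/\d{4}""")

// Returns (from, to) if the line is a POLICY PERIOD line containing exactly
// two dates, regardless of where the FROM/TO keywords land after extraction.
// Assumption: the start date always appears before the end date.
fun parsePolicyPeriod(line: String): Pair<String, String>? {
    if (!line.trimStart().startsWith("POLICY PERIOD:")) return null
    val dates = DATE.findAll(line).map { it.value }.toList()
    return if (dates.size == 2) dates[0] to dates[1] else null
}

fun main() {
    // The two variants of the line quoted in this report.
    println(parsePolicyPeriod("POLICY PERIOD:  FROM 02/18/2018 TO 02/18/2019"))
    println(parsePolicyPeriod("POLICY PERIOD:  FROM 02/18/2018 02/18/2019TO"))
}
```

      Both calls yield the same pair of dates. This only works around the symptom for this one line; it does not address the change in word order itself.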

      I suspect the cause is the change made in this commit:

      https://github.com/apache/pdfbox/commit/068146a9c9fe59becbd82814b6a245f8158fce22

       

      This prevents us from safely upgrading to the newer version.
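
      As a rough sketch of how such differences could be caught before an upgrade, the text extracted with the old version can be saved and compared line by line with the text extracted with the new one. `changedLines` is a hypothetical helper that pairs lines by index; it assumes both extractions produce the same number of lines in the same order, which may not hold in general.

```kotlin
// Hypothetical helper: pairs up lines by index and reports those that differ
// as (1-based line number, old line, new line). Missing lines count as "".
fun changedLines(before: String, after: String): List<Triple<Int, String, String>> {
    val a = before.lines()
    val b = after.lines()
    return (0 until maxOf(a.size, b.size)).mapNotNull { i ->
        val old = a.getOrElse(i) { "" }
        val new = b.getOrElse(i) { "" }
        if (old != new) Triple(i + 1, old, new) else null
    }
}

fun main() {
    // Stand-ins for text extracted with each PDFBox version.
    val with2014 = "POLICY PERIOD:  FROM 02/18/2018 TO 02/18/2019"
    val with2015 = "POLICY PERIOD:  FROM 02/18/2018 02/18/2019TO"
    for ((lineNo, old, new) in changedLines(with2014, with2015)) {
        println("line $lineNo: '$old' -> '$new'")
    }
}
```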

      KYPolicy2.pdf

        Attachments

        1. KYPolicy2.pdf
          25 kB
          Uziel Sulkies


            People

            • Assignee: Tilman Hausherr (tilman)
            • Reporter: Uziel Sulkies (usulkies)
