PDFBox / PDFBOX-3045

File that read fine in 1.8 does not in 2.0

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:

      Description

      Attached is one page of a file that PDFBox 1.8 parsed fine.

      In 2.0, using pdfbox/examples/util/PrintTextLocations.java,
      much of the text is missing, for example all of the text like
      "MERCH BANKCARD NET SETLMT".

      Also, width_of_space has a bad value: 561591.3
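
A quick way to see how widespread the bad space width is would be to scan the PrintTextLocations output for implausible `space=` values. The sketch below is only an illustration: `SpaceWidthCheck`, `isSuspicious`, and the threshold factor are hypothetical names and assumptions, not part of PDFBox, and the line format is assumed from the output quoted in this report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: flags PrintTextLocations-style
// lines whose "space=" value is implausibly large for the font size.
final class SpaceWidthCheck {

    // Assumed line shape, taken from the output quoted in this report:
    // String[x,y fs=.. xscale=.. height=.. space=.. width=..]glyph
    private static final Pattern LINE = Pattern.compile(
        "String\\[([^,]+),(\\S+) fs=(\\S+) xscale=(\\S+) height=(\\S+) "
        + "space=(\\S+) width=([^\\]]+)\\](.*)");

    // Assumed heuristic: a space wider than maxFactor * font size is suspicious.
    static boolean isSuspicious(String line, double maxFactor) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return false; // not a text-position line (e.g. a log message)
        }
        try {
            double fontSize = Double.parseDouble(m.group(3));
            double space = Double.parseDouble(m.group(6));
            return space > maxFactor * fontSize;
        } catch (NumberFormatException e) {
            return false; // fields were not numeric; ignore the line
        }
    }

    // Returns only the lines that the heuristic flags.
    static List<String> suspiciousLines(List<String> lines, double maxFactor) {
        List<String> flagged = new ArrayList<>();
        for (String line : lines) {
            if (isSuspicious(line, maxFactor)) {
                flagged.add(line);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "String[161.94,422.1 fs=10.0 xscale=10.0 height=7.2857146 space=561591.3 width=6.6857147]B",
            "WARNING: Cannot execute restore, the graphics stack is empty");
        for (String line : suspiciousLines(sample, 2.0)) {
            System.out.println("suspicious: " + line);
        }
    }
}
```

The factor of 2.0 is an arbitrary choice for illustration; any threshold that separates a normal space width from a value like 561591.3 would do.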

      Start of the PrintTextLocations output:

      Oct 21, 2015 10:36:22 PM org.apache.pdfbox.filter.FlateFilter decode
      SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
      Oct 21, 2015 10:36:22 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
      WARNING: java.util.zip.DataFormatException: incorrect data check
      Oct 21, 2015 10:36:22 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
      WARNING: Cannot execute restore, the graphics stack is empty
      String[161.94,422.1 fs=10.0 xscale=10.0 height=7.2857146 space=561591.3 width=6.6857147]B
      String[168.62572,422.1 fs=10.0 xscale=10.0 height=7.2857146 space=561591.3 width=4.457138]e
      String[173.08286,422.1 fs=10.0 xscale=10.0 height=7.2857146 space=561591.3 width=4.9714355]g
      String[178.05429,422.1 fs=10.0 xscale=10.0 height=7.2857146 space=561591.3 width=2.742859]i
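
For context on the `DataFormatException` in the log above, here is a minimal sketch of how `java.util.zip.Inflater` rejects a zlib stream whose trailing checksum does not match the decoded data. It is not PDFBox's FlateFilter code, and it assumes the corruption sits in the checksum, which this report does not confirm; `FlateCheckDemo` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo, not PDFBox code: shows java.util.zip rejecting a
// zlib stream after its trailing checksum has been damaged.
final class FlateCheckDemo {

    // Compresses data into a zlib-wrapped deflate stream.
    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Returns the exception message if inflating fails, or null otherwise.
    // Truncated input is not reported as an error by this sketch.
    static String inflateError(byte[] zlibStream) {
        Inflater inflater = new Inflater();
        inflater.setInput(zlibStream);
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // no more progress possible
                }
            }
            return null;
        } catch (DataFormatException e) {
            return String.valueOf(e.getMessage());
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] good = compress("MERCH BANKCARD NET SETLMT".getBytes(StandardCharsets.UTF_8));
        byte[] bad = good.clone();
        bad[bad.length - 1] ^= 0x55; // damage the last byte of the checksum
        System.out.println("intact stream error:  " + inflateError(good));
        System.out.println("damaged stream error: " + inflateError(bad));
    }
}
```

A failure of this kind is raised only when the checksum at the end of the stream is verified, so a decoder may already have produced some or all of the stream's bytes before the exception appears.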

        Attachments

        1. PDFbox2.pdf
          114 kB
          Fred Andrews

              People

              • Assignee: Unassigned
              • Reporter: Fred Andrews (fred_andrews)