  1. PDFBox
  2. PDFBOX-2069

PDFs with Tc before Tm get incorrect spacing in PDFTextArea

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.5
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
    • Environment: Windows

      Description

      The attached PDF is extracted with incorrect spacing by the example program ExtractTextByArea.java, as follows:

      Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
      Transaction Activity
      Date D e s c r i p t i o n Deposits W i t h d r a w a l s
      0 4 / 0 8 B E G I N N I N G BALANCE
      04 / 0 8 W I THDRAWAL - ATM 3 1 1 7 3 0 0 . 0 0 -
      62 M I L L H I L L ROAD WOODSTOCK N Y
      04 / 1 0 W I THDRAWAL - ACH 2 0 0 . 0 0 -
      HUMAN RIGHTS WAT-B I L L PAYMT
      04 / 12 C K # 1 2 7 3 11 0 . 0 0 -
      0 4 / 1 5 W I THDRAWAL - ACH 2 0 2 . 5 7 -
      NEW SOUTH INSURA -B I LL PAYMT
      04 / 1 5 W I THDRAWAL - ACH 3 6 . 2 6 -
      WASTE CONNECTION-BILL PAYMT
      04 / 1 7 W I THDRAWAL - ACH 71 2 . 0 0 -
      N PYMT T
      04 / 1 8 W I THDRAWAL - ACH 2958 9 . 0 0 3
      N PYMT T
      04 / 1 9 W I THDRAWAL - ACH 76 8 . 1 2 -

      I believe this is because, in PDF streams where Tc appears before Tm, the matrix is being applied to the Tc value, which is contrary to my experience with graphics pipelines. Most PDF streams seem to have Tc after Tm, and thus do not hit this situation.
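
      To illustrate the expectation that spacing should not depend on operator order, here is a minimal sketch. It is not the attached patch and does not use PDFBox classes; the class TextStateSketch and all of its methods are hypothetical names invented for illustration. It assumes the usual PDF text-state model, in which Tc is stored unchanged when the operator is read and is combined with the text matrix only when a glyph is positioned. The text matrix is simplified here to a single uniform scale factor.

```java
import java.util.Locale;

// Hypothetical illustration only: not PDFBox code and not the attached patch.
public class TextStateSketch {
    // Tc: character spacing, stored in unscaled text space units.
    private double charSpacing = 0.0;
    // Tm: simplified here to one uniform scale factor (an assumption).
    private double textScale = 1.0;
    // Tfs: font size, fixed at 1 for this sketch.
    private final double fontSize = 1.0;

    // "Tc" operator: store the raw value, do not touch the matrix.
    public void setCharSpacing(double tc) {
        charSpacing = tc;
    }

    // "Tm" operator: replace the matrix, do not touch the stored Tc.
    public void setTextMatrixScale(double scale) {
        textScale = scale;
    }

    // Advance of one glyph in user space, computed only at show time:
    // (glyph width * font size + Tc) scaled by the text matrix.
    public double glyphAdvance(double glyphWidth) {
        double textSpaceAdvance = glyphWidth * fontSize + charSpacing;
        return textSpaceAdvance * textScale;
    }

    // Stream order: Tc first, then Tm (the order reported as problematic).
    public static double tcBeforeTm(double tc, double scale, double glyphWidth) {
        TextStateSketch state = new TextStateSketch();
        state.setCharSpacing(tc);
        state.setTextMatrixScale(scale);
        return state.glyphAdvance(glyphWidth);
    }

    // Stream order: Tm first, then Tc (the more common order).
    public static double tmBeforeTc(double tc, double scale, double glyphWidth) {
        TextStateSketch state = new TextStateSketch();
        state.setTextMatrixScale(scale);
        state.setCharSpacing(tc);
        return state.glyphAdvance(glyphWidth);
    }

    public static void main(String[] args) {
        // Example values are made up; they are not taken from the attached PDF.
        double a = tcBeforeTm(0.05, 9.0, 0.5);
        double b = tmBeforeTc(0.05, 9.0, 0.5);
        System.out.println(String.format(Locale.ROOT, "Tc before Tm: %.3f", a));
        System.out.println(String.format(Locale.ROOT, "Tm before Tc: %.3f", b));
        System.out.println("Same advance: " + (Math.abs(a - b) < 1e-9));
    }
}
```

      Because neither operator modifies the value stored by the other, both orders give the same glyph advance. An implementation that scaled Tc when the operator was parsed, and then again when the glyph was shown, would inflate the spacing only for streams with Tc before Tm, which would match the letter-by-letter gaps in the output above.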

      I have attached a patch to two files that corrects the problem for this file, and also works correctly on my test suite of about 40 files from other sources.

      The result for the attached file now becomes:
      Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
      Transaction Activity
      Date Description Deposits Withdrawals
      04/08 BEGINNING BALANCE
      04/08 WITHDRAWAL-ATM 3 117 300.00-
      62 MILL HILL ROAD WOODSTOCK NY
      04/10 WITHDRAWAL-ACH 200.00-
      HUMAN RIGHTS WAT-BILL PAYMT
      04/12 CK# 1273 110.00-
      04/15 WITHDRAWAL-ACH 202.57-
      NEW SOUTH INSURA-BILL PAYMT
      04/15 WITHDRAWAL-ACH 36.26-
      WASTE CONNECTION-BILL PAYMT
      04/17 WITHDRAWAL-ACH 712.00-
      N PYMT T
      04/18 WITHDRAWAL-ACH 29589.00 3
      N PYMT T
      04/19 WITHDRAWAL-ACH 768.12-

        Attachments

        1. PDFBOX-2609.pdf
          472 kB
          Joel Hirsh
        2. PDFBox-2609-patch.zip
          0.8 kB
          Joel Hirsh
        3. PDFBOX-2609-visible.pdf
          833 kB
          Tilman Hausherr

            People

            • Assignee: Unassigned
            • Reporter: Joel Hirsh (jhirsh)
            • Votes: 0
            • Watchers: 6

            Time Tracking

            • Original Estimate: 1h
            • Remaining Estimate: 1h
            • Time Spent: Not Specified