PDFBox / PDFBOX-374

Text areas not sorted properly because of page rotation

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Text extraction
    • Labels:
      None

      Description

      When PDFTextStripper is set to sort the text before outputting it, the sort order is incorrect if the page has a rotation. This happens because both TextPositionComparator and PDFStreamEngine take the rotation into account, so the rotation has been applied twice by the time TextPositionComparator does the comparison.
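The double application described above can be reproduced in isolation. The following is a minimal sketch, not PDFBox code: the `Pos` class, the `adjustFor90` helper, and both comparators are hypothetical stand-ins, and the 90-degree adjustment is simplified to an axis swap. It assumes the engine has already stored rotation-adjusted coordinates, then compares a plain comparator against one that adjusts for rotation again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DoubleRotationDemo {
    /** Hypothetical stand-in for a piece of positioned text. */
    static final class Pos {
        final String text;
        final double x;
        final double y;

        Pos(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Simplified 90-degree adjustment (assumption): swap the two axes. */
    static Pos adjustFor90(Pos p) {
        return new Pos(p.text, p.y, p.x);
    }

    /** Reading order on already-adjusted coordinates: top to bottom, then left to right. */
    static final Comparator<Pos> PLAIN =
            Comparator.comparingDouble((Pos p) -> p.y).thenComparingDouble(p -> p.x);

    /** A comparator that adjusts for rotation itself before comparing. */
    static final Comparator<Pos> ROTATING =
            (a, b) -> PLAIN.compare(adjustFor90(a), adjustFor90(b));

    /** Raw positions after ONE adjustment, as the engine would hand them to the sorter. */
    static List<Pos> adjustedSample() {
        Pos first = adjustFor90(new Pos("first", 10, 100));   // becomes (100, 10)
        Pos second = adjustFor90(new Pos("second", 50, 10));  // becomes (10, 50)
        return Arrays.asList(second, first);
    }

    static List<String> sortedText(List<Pos> positions, Comparator<Pos> order) {
        List<Pos> copy = new ArrayList<>(positions);
        copy.sort(order);
        List<String> out = new ArrayList<>();
        for (Pos p : copy) {
            out.add(p.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Rotation applied once (by the engine only): correct reading order.
        System.out.println(sortedText(adjustedSample(), PLAIN));    // prints [first, second]
        // Rotation applied twice (engine, then comparator): order is reversed.
        System.out.println(sortedText(adjustedSample(), ROTATING)); // prints [second, first]
    }
}
```

In this sketch the second adjustment undoes the first, so the rotating comparator effectively sorts by the raw, unrotated coordinates. Removing the adjustment from one of the two places restores the expected order.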

      Also, the rotation code in PDFStreamEngine does not seem consistent. I verified that the code for 0 and 90 degrees works, but the 180 and 270 cases do not seem consistent with the goal of adjusting the X and Y values so that 0,0 is in the upper left, which is what the 0 and 90 code does. I do not have examples of 180 and 270 to test with. There are no comments in this section of the code, so I have been guessing about its purpose.
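To make the stated goal concrete (0,0 in the upper left for every rotation), here is a geometric sketch. It is not the PDFStreamEngine code and may not match what the patch does; the class and method names are hypothetical. It assumes PDF user space with the origin at the lower left and y pointing up, a page of unrotated size `width` x `height`, and a clockwise page rotation in multiples of 90 degrees.

```java
import java.util.Arrays;

public class UpperLeftOrigin {
    /**
     * Maps (x, y) in PDF user space (origin lower left, y up) to display
     * coordinates (origin upper left, y down) for a page rotated clockwise
     * by the given number of degrees. Returns {displayX, displayY}.
     */
    static double[] toUpperLeft(double x, double y, double width, double height, int rotation) {
        int r = ((rotation % 360) + 360) % 360; // normalize, including negative values
        switch (r) {
            case 0:
                return new double[] {x, height - y};
            case 90:
                return new double[] {y, x};
            case 180:
                return new double[] {width - x, y};
            case 270:
                return new double[] {height - y, width - x};
            default:
                throw new IllegalArgumentException("rotation must be a multiple of 90: " + rotation);
        }
    }

    public static void main(String[] args) {
        double w = 612;
        double h = 792;
        // For each rotation, the unrotated corner that is displayed in the
        // upper left should map to (0, 0).
        System.out.println(Arrays.toString(toUpperLeft(0, h, w, h, 0)));   // prints [0.0, 0.0]
        System.out.println(Arrays.toString(toUpperLeft(0, 0, w, h, 90)));  // prints [0.0, 0.0]
        System.out.println(Arrays.toString(toUpperLeft(w, 0, w, h, 180))); // prints [0.0, 0.0]
        System.out.println(Arrays.toString(toUpperLeft(w, h, w, h, 270))); // prints [0.0, 0.0]
    }
}
```

A check like the one in `main` is one way to test the 180 and 270 cases without a sample PDF: each rotation has exactly one page corner that must land on the origin.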

      The attached patches:

      • Remove the rotation from TextPositionComparator.
      • Add comments and change the 180 and 270 cases to make them consistent with 0 and 90.
      1. rotation.pdf
        6 kB
        Brian Carrier
      2. text-rotation-081117.zip
        18 kB
        Brian Carrier
      3. text-rotation-081117-take2.zip
        32 kB
        Brian Carrier


          Activity

          Jukka Zitting made changes -
          Status Resolved [ 5 ] -> Closed [ 6 ]
          Jukka Zitting made changes -
          Assignee Jukka Zitting [ jukkaz ]
          Resolution Fixed [ 1 ]
          Fix Version/s 0.8.0-incubator [ 12313346 ]
          Status Open [ 1 ] -> Resolved [ 5 ]
          Brian Carrier made changes -
          Attachment text-rotation-081117-take2.zip [ 12394170 ]
          Brian Carrier made changes -
          Link This issue is related to PDFBOX-363 [ PDFBOX-363 ]
          Brian Carrier made changes -
          Link This issue relates to PDFBOX-66 [ PDFBOX-66 ]
          Brian Carrier made changes -
          Link This issue relates to PDFBOX-272 [ PDFBOX-272 ]
          Brian Carrier made changes -
          Link This issue relates to PDFBOX-133 [ PDFBOX-133 ]
          Brian Carrier made changes -
          Link This issue incorporates PDFBOX-118 [ PDFBOX-118 ]
          Brian Carrier made changes -
          Attachment text-rotation-081117.zip [ 12394102 ]
          Brian Carrier made changes -
          Attachment TextPositionComparator.diff [ 12390306 ]
          Brian Carrier made changes -
          Attachment PDFTextStripper.diff [ 12390305 ]
          Brian Carrier made changes -
          Attachment PDFStreamEngine.diff [ 12390304 ]
          Brian Carrier made changes -
          Attachment rotation.pdf [ 12390310 ]
          Brian Carrier made changes -
          Field Original Value New Value
          Attachment PDFTextStripper.diff [ 12390305 ]
          Attachment TextPositionComparator.diff [ 12390306 ]
          Attachment PDFStreamEngine.diff [ 12390304 ]
          Brian Carrier created issue -

            People

            • Assignee:
              Jukka Zitting
              Reporter:
              Brian Carrier
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:
