PDFBox / PDFBOX-374

text areas not properly being sorted because of page rotation

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 0.8.0-incubator
    • Component/s: Text extraction
    • Labels:
      None

      Description

      When PDFTextStripper is set to sort the text before outputting it, the sort order is wrong if the page has a rotation. Both TextPositionComparator and PDFStreamEngine take the rotation into account, so the rotation has been applied twice by the time TextPositionComparator performs the comparison.
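
A minimal sketch of why applying the rotation twice scrambles the sort order. This is not PDFBox code: the `Glyph` class, the `rotateQuarterTurn` method, and the 100-unit square page are assumptions made up for illustration, and both coordinate frames are taken to have y increasing downward. One quarter turn puts the four glyphs into reading order; a second quarter turn, applied to the already adjusted positions, turns the page upside down and the sort produces a different sequence.

```java
import java.util.ArrayList;
import java.util.List;

class DoubleRotationSketch {
    /** Hypothetical stand-in for a positioned piece of text; not PDFBox's TextPosition. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Assumed side length of a square page, used only for this illustration. */
    static final double PAGE_SIZE = 100;

    /** Illustrative quarter-turn rotation that keeps coordinates inside the page. */
    static Glyph rotateQuarterTurn(Glyph g) {
        return new Glyph(g.text, g.y, PAGE_SIZE - g.x);
    }

    /** Rotates the sample glyphs the given number of times, then sorts top-to-bottom, left-to-right. */
    static String sortedText(int timesRotated) {
        List<Glyph> glyphs = new ArrayList<>();
        // Raw positions chosen so that one rotation yields the lines "AB" and "CD".
        glyphs.add(new Glyph("A", 100, 0));
        glyphs.add(new Glyph("B", 100, 10));
        glyphs.add(new Glyph("C", 90, 0));
        glyphs.add(new Glyph("D", 90, 10));

        List<Glyph> adjusted = new ArrayList<>();
        for (Glyph g : glyphs) {
            Glyph current = g;
            for (int i = 0; i < timesRotated; i++) {
                current = rotateQuarterTurn(current);
            }
            adjusted.add(current);
        }
        // Compare by y first (line), then by x (position within the line).
        adjusted.sort((p, q) -> p.y != q.y ? Double.compare(p.y, q.y) : Double.compare(p.x, q.x));

        StringBuilder out = new StringBuilder();
        for (Glyph g : adjusted) {
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortedText(1)); // prints "ABCD": rotation applied once
        System.out.println(sortedText(2)); // prints "BDAC": rotation applied twice
    }
}
```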

      Also, it seems that the rotation code in PDFStreamEngine is not consistent. I verified the code for 0 and 90 degrees works, but the 180 and 270 situations do not seem consistent with the goal of adjusting the X and Y values so that 0,0 is in the upper left, which is what the 0 and 90 code does. I do not have examples of 180 and 270 to test with. There are no comments in this section, so I have been guessing about its purpose.
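
For reference, here is a geometric sketch of what "0,0 in the upper left" means for each of the four rotations. It is not the code from PDFStreamEngine or from the patch. It assumes the usual PDF conventions: raw coordinates have their origin at the lower left of the unrotated page with y increasing upward, and the rotation value turns the page clockwise for display. The class and method names are hypothetical.

```java
class RotationNormalizer {
    /**
     * Maps a raw point (x, y) on an unrotated page of size width x height to
     * display coordinates whose origin is the upper-left corner of the page as
     * shown, with x increasing to the right and y increasing downward.
     *
     * @param rotation clockwise page rotation in degrees; must be a multiple of 90
     * @return a two-element array {displayX, displayY}
     */
    static double[] toUpperLeftOrigin(double x, double y, double width, double height, int rotation) {
        int normalized = ((rotation % 360) + 360) % 360;
        switch (normalized) {
            case 0:
                // Only the y axis is flipped.
                return new double[] {x, height - y};
            case 90:
                // The raw lower-left corner becomes the displayed upper-left corner.
                return new double[] {y, x};
            case 180:
                // The raw lower-right corner becomes the displayed upper-left corner.
                return new double[] {width - x, y};
            case 270:
                // The raw upper-right corner becomes the displayed upper-left corner.
                return new double[] {height - y, width - x};
            default:
                throw new IllegalArgumentException("rotation must be a multiple of 90: " + rotation);
        }
    }
}
```

A quick consistency check is that, for every rotation, the raw corner that ends up at the top left of the displayed page maps to (0, 0). For a 200 x 100 page those corners are (0, 100), (0, 0), (200, 0), and (200, 100) for 0, 90, 180, and 270 degrees respectively.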

      The attached patches:

      • Remove the rotation from TextPositionComparator.
      • Add comments and change the 180 and 270 cases to make them consistent with 0 and 90.
      1. rotation.pdf
        6 kB
        Brian Carrier
      2. text-rotation-081117.zip
        18 kB
        Brian Carrier
      3. text-rotation-081117-take2.zip
        32 kB
        Brian Carrier

          Activity

          Brian Carrier added a comment -

          Patches against the branch for this bug.

          Brian Carrier added a comment -

          Example file that reproduces the bug. When ExtractText is run with the "--sort" argument, the following is produced for the second page:
          This is m
          Second line here.
          will have a landscape layout. second page. It y

          Instead of:
          This is my second page. It will have a landscape layout.
          Second line here.

          Andreas Lehmkühler added a comment -

          You should have a look at PDFBOX-363. I tried to fix a problem with the page rotation and I provided a patch for some minor problems which are perhaps related to your problem, too.

          Brian Carrier added a comment -

          Ah, I did not see that. The fix for TextPositionComparator is the same. The logic in PDFTextStripper that adjusts the rotation is different though. I'll look into the differences.

          Brian Carrier added a comment -

          After reviewing the patches in PDFBOX-363 and finding more examples that the previous patch in this entry did not fix, I have attached a new patch. Note: the landscape_rot90.pdf file that was later attached to PDFBOX-363 is an example that the previous patch does not solve but this patch does.

          This patch moves all knowledge about page rotation and the direction of text to the TextPosition class. The text matrix is now relied on instead of the page rotation value. New APIs were added so that callers could get text direction adjusted coordinates. The functionality of the original APIs is maintained for other parts of PDFBox. Other code was adjusted accordingly. I also did some cleanup in PDFStreamEngine and PDFTextStripper to remove unused variables and rename some variables to make their contents easier to understand.
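
As a rough illustration of relying on the text matrix rather than the page rotation value, the sketch below derives a text direction from the first two entries of a matrix `[a b c d e f]`. It assumes the standard convention that a counterclockwise rotation by θ gives a = cos θ and b = sin θ, and it snaps the angle to the nearest multiple of 90 degrees. The class and method names are hypothetical and are not the APIs added by the patch.

```java
class TextDirectionSketch {
    /**
     * Returns the text direction in degrees (0, 90, 180, or 270) implied by the
     * a and b entries of a text matrix, snapping to the nearest quarter turn.
     */
    static int direction(double a, double b) {
        // Angle of the baseline relative to the x axis, in the range (-180, 180].
        double degrees = Math.toDegrees(Math.atan2(b, a));
        if (degrees < 0) {
            degrees += 360;
        }
        // Snap to the nearest multiple of 90 and wrap 360 back to 0.
        return (int) ((Math.round(degrees / 90.0) * 90) % 360);
    }

    public static void main(String[] args) {
        System.out.println(direction(1, 0));  // prints 0: ordinary left-to-right text
        System.out.println(direction(0, 1));  // prints 90
        System.out.println(direction(-1, 0)); // prints 180
        System.out.println(direction(0, -1)); // prints 270
    }
}
```

With a direction in hand, a caller can choose the matching coordinate mapping once, in one place, so that a comparator only ever sees positions that are already in reading order.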

          There are some failures in the regression tests, but most of the mismatches are improvements:

          • The two mismatches in "hexnumberproblem.pdf" are because the new code produces better output.
          • The mismatches in ocalc.pdf are all because the new code produces better output.
          • The mismatches in test_rotate_270.pdf are because the new code put "t" on its own line and caused every line after it to fail. The previous version of the code produced better results in this case, but it is not clear how. The text is on an angle relative to the other text and the height of "t" is such that it is equivalent to being on another line of text. I tried to adjust the code so that it was more liberal with making new lines, but it caused lots of other failures in the regression tests.

          Note that the regression tests do not currently sort the text based on location, so the page rotation issues are not tested. New regression tests must be created.

          Brian Carrier added a comment -

          Updated set of files to fix page rotation issues (and basic code cleanup).

          Brian Carrier added a comment -

          Both of these issues are on the same topic.

          Brian Carrier added a comment -

          Attached "take2" file as "diffs" along with updated regression test files.

          Jukka Zitting added a comment -

          Changes committed in revision 719294. Thanks!


            People

            • Assignee:
              Jukka Zitting
              Reporter:
              Brian Carrier
            • Votes:
              0
              Watchers:
              0