PDFBOX-600

PDFBox performance issue: PDFTextStripper performance tweak

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 1.0.0
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      All

      Description

      During text extraction, PDFTextStripper has to compare text positions to determine whether text elements overlap, either vertically or horizontally.

      As part of this, the PDFTextStripper.within(float first, float second, float variance) method is used.

      The current (0.8.0) version of this method uses the following test: second > first - variance && second < first + variance

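      This is correct, but in my test documents it is slower than the same test with the two comparisons swapped: second < first + variance && second > first - variance

      This is because the swapped form fails out faster on left-to-right text: when the first comparison is false, && short-circuits and the other comparison is never evaluated. I believe left-to-right text should be treated as the default case.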

      Please change the PDFTextStripper.within() method to use the second form of the test, i.e. to:

      private boolean within( float first, float second, float variance )
      {
          return second < first + variance && second > first - variance;
      }

      Thanks!
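
      To illustrate the short-circuit argument above, here is a minimal sketch that counts how many comparisons each ordering performs. It is not PDFBox code: the class WithinOrderDemo, the counting helpers gt/lt, and the synthetic input (x positions increasing by 10 with a variance of 1, called as within(previous, next, variance)) are all assumptions made for the demonstration. It only counts comparisons and says nothing about measured run time.

```java
final class WithinOrderDemo {
    // Demo instrumentation only: number of float comparisons evaluated so far.
    static long comparisons = 0;

    static boolean gt(float a, float b) { comparisons++; return a > b; }
    static boolean lt(float a, float b) { comparisons++; return a < b; }

    // Ordering described for 0.8.0: lower bound checked first.
    static boolean withinOriginal(float first, float second, float variance) {
        return gt(second, first - variance) && lt(second, first + variance);
    }

    // Proposed ordering: upper bound checked first.
    static boolean withinFlipped(float first, float second, float variance) {
        return lt(second, first + variance) && gt(second, first - variance);
    }

    // Compares each position with its successor and returns the comparison count.
    static long countOriginal(float[] xs, float variance) {
        comparisons = 0;
        for (int i = 0; i + 1 < xs.length; i++) {
            withinOriginal(xs[i], xs[i + 1], variance);
        }
        return comparisons;
    }

    static long countFlipped(float[] xs, float variance) {
        comparisons = 0;
        for (int i = 0; i + 1 < xs.length; i++) {
            withinFlipped(xs[i], xs[i + 1], variance);
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // Hypothetical left-to-right run: each glyph starts 10 units after the previous one.
        float[] xs = new float[1000];
        for (int i = 0; i < xs.length; i++) {
            xs[i] = i * 10f;
        }
        System.out.println("original: " + countOriginal(xs, 1f)); // 1998
        System.out.println("flipped:  " + countFlipped(xs, 1f));  // 999
    }
}
```

      On this input every pair fails the upper-bound check, so the flipped form stops after one comparison per pair while the original form always needs two. Both forms return the same boolean for any input; only the evaluation order differs. If the second argument were usually smaller than the first (for example, right-to-left text under the same calling assumption), the advantage would reverse.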

        Activity

        Mel Martinez added a comment -

        Flips the order of the two comparisons in the within() method to speed up the test on left-to-right text.

        Jukka Zitting added a comment -

        Simple yet effective, nice! Committed in revision 899474.

        Andreas Lehmkühler added a comment -

        Closed after the release of version 1.0.0.


          People

          • Assignee: Jukka Zitting
          • Reporter: Mel Martinez
          • Votes: 0
          • Watchers: 0
