PDFBox / PDFBOX-956

Poor text extraction performance in PDFTextStripper.java

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.7.0
    • Component/s: Text extraction
    • Labels: None

    Description

      The worst-case performance of the suppressDuplicateOverlappingText logic in processTextPosition is O(n^2).
      The patch uses a TreeMap to achieve O(n log n) performance.
      Before the patch, extracting the text of the example PDF took over 2 hours; with the patch it takes less than 10 minutes.
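
To illustrate the idea, here is a minimal sketch of a duplicate check backed by sorted maps. It is not the attached patch: the class name DuplicateTextFilter, the method isDuplicate, and the explicit tolerance parameter are hypothetical and chosen only for this example. It assumes a glyph counts as a duplicate when the same string was already placed within a non-negative tolerance of the same (x, y) position. Instead of comparing each new glyph against every earlier one, it asks the sorted structures for the narrow range of candidates.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox: remembers where each string was
// drawn so that overlapping repeats can be detected with range queries.
final class DuplicateTextFilter {
    // text -> (x coordinate -> sorted y coordinates) of glyphs accepted so far
    private final Map<String, TreeMap<Float, TreeSet<Float>>> seen = new HashMap<>();

    /**
     * Returns true if the same text was already accepted within the given
     * tolerance of (x, y). Otherwise records the position and returns false.
     * Assumes tolerance >= 0.
     */
    boolean isDuplicate(String text, float x, float y, float tolerance) {
        TreeMap<Float, TreeSet<Float>> byX =
                seen.computeIfAbsent(text, k -> new TreeMap<>());

        // Range query on x: only entries inside [x - tolerance, x + tolerance]
        // are visited, and the window itself is located in logarithmic time.
        for (TreeSet<Float> ys
                : byX.subMap(x - tolerance, true, x + tolerance, true).values()) {
            // Range query on y within each matching x entry.
            if (!ys.subSet(y - tolerance, true, y + tolerance, true).isEmpty()) {
                return true;
            }
        }

        // No overlapping occurrence found: remember this position.
        byX.computeIfAbsent(x, k -> new TreeSet<>()).add(y);
        return false;
    }
}
```

With a structure like this, each lookup costs a logarithmic search plus the handful of entries that fall inside the tolerance window, which is where the O(n log n) total comes from when few glyphs share nearly the same x coordinate.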

      BTW: The extracted text is also quite different compared to Adobe Reader. Not sure which is correct but for this document it doesn't matter.

      Attachments

        1. PDFTextStripper.pdf
          29 kB
          Stefan Magnus Landrø
        2. PDFBOX956-c4ce2fcd_69.txt
          611 kB
          Andreas Lehmkühler
        3. c4ce2fcd_69.pdf
          2.85 MB
          Kevin Jackson
        4. PDFTextStripper.java.patch
          4 kB
          Kevin Jackson

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Kevin Jackson (kevinjackson)