  1. PDFBox
  2. PDFBOX-956

Poor text extraction performance in PDFTextStripper.java

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.7.0
    • Component/s: Text extraction
    • Labels: None

    Description

      The worst-case performance of the suppressDuplicateOverlappingText logic in processTextPosition is O(n^2).
      The patch uses a TreeMap to achieve O(n log n) performance.
      The example PDF took over 2 hours to extract the text before this patch and less than 10 minutes after.
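
The idea behind the patch can be illustrated with a minimal sketch. This is not the attached patch itself: the class name DuplicateTextFilter, the method accept, and the nested-map layout are assumptions made for illustration. Instead of comparing each new glyph against every glyph seen so far, previously accepted positions are kept in sorted maps so that only positions within the tolerance window are examined.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox: remembers where each character
// has already been drawn and rejects a repeat that lands within `tolerance`.
final class DuplicateTextFilter {
    // character -> (x coordinate -> sorted y coordinates) of accepted glyphs
    private final Map<String, TreeMap<Float, TreeSet<Float>>> seen = new HashMap<>();

    /**
     * Returns true if the glyph is new (and records it), false if the same
     * character was already accepted within `tolerance` in both x and y.
     * Assumes tolerance >= 0; a negative value would make the range invalid.
     */
    boolean accept(String ch, float x, float y, float tolerance) {
        TreeMap<Float, TreeSet<Float>> byX =
                seen.computeIfAbsent(ch, k -> new TreeMap<>());

        // Sorted range view: only x values inside [x - tol, x + tol] are visited,
        // rather than every earlier occurrence of the character.
        for (TreeSet<Float> ys
                : byX.subMap(x - tolerance, true, x + tolerance, true).values()) {
            if (!ys.subSet(y - tolerance, true, y + tolerance, true).isEmpty()) {
                return false; // overlapping duplicate
            }
        }

        byX.computeIfAbsent(x, k -> new TreeSet<>()).add(y);
        return true;
    }
}
```

Locating the window in a TreeMap costs O(log n), so as long as few earlier glyphs fall inside any one window the total work over n glyphs stays close to O(n log n), compared with the O(n^2) of scanning a flat list for every glyph.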

      BTW: The extracted text also differs considerably from what Adobe Reader produces. I am not sure which is correct, but for this document it doesn't matter.

      Attachments

        1. PDFTextStripper.java.patch
          4 kB
          Kevin Jackson
        2. c4ce2fcd_69.pdf
          2.85 MB
          Kevin Jackson
        3. PDFBOX956-c4ce2fcd_69.txt
          611 kB
          Andreas Lehmkühler
        4. PDFTextStripper.pdf
          29 kB
          Stefan Magnus Landrø

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Kevin Jackson (kevinjackson)