PDFBox
  1. PDFBox
  2. PDFBOX-956

Poor text extraction performance in PDFTextStripper.java

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.7.0
    • Component/s: Text extraction
    • Labels:
      None

      Description

      The worst case performance of the suppressDuplicateOverlappingText logic in processTextPosition is O(n^2).
      The patch is to use a TreeMap to achieve O(N log N) performance.
      The example PDF took over 2 hours to extract the text before this patch and less than 10 minute after.

      BTW: The extracted text is also quite different compared to Adobe Reader. Not sure which is correct but for this document it doesn't matter.

      1. c4ce2fcd_69.pdf
        2.85 MB
        Kevin Jackson
      2. PDFBOX956-c4ce2fcd_69.txt
        611 kB
        Andreas Lehmkühler
      3. PDFTextStripper.java.patch
        4 kB
        Kevin Jackson
      4. PDFTextStripper.pdf
        29 kB
        Stefan Magnus Landrø

        Issue Links

          Activity

          Hide
          Mel Martinez added a comment -

          Kevin,

          I like the look of this optimization, though I haven't run it yet.

          One comment: The 52b22bd6.pdf file that you attached as a test case seems a bit problematic. It apparently makes use of javascript. For security reasons I normally run with javascript disabled on PDFs. When I try to open this file with Adobe Reader from Firefox, it properly prompted me with whether to allow the javascript to run and before I could answer it hung up the Adobe Reader for about 90 seconds and then blew away both the Reader (v9.4.0) AND Firefox (v3.6.13). This is on a 64bit Win XP Pro machine (latest patches as of a couple of days ago).

          After restarting everything I tried once more (thinking maybe Firefox had gotten to a cluttered state) but it did it again right off the bat.

          I'm not sure what happened at a micro level to cause that but I wonder if the file has some sort of corruption?

          Show
          Mel Martinez added a comment - Kevin, I like the look of this optimization, though I haven't run it yet. One comment: The 52b22bd6.pdf file that you attached as a test case seems a bit problematic. It apparently makes use of javascript. For security reasons I normally run with javascript disabled on PDFs. When I try to open this file with Adobe Reader from Firefox, it properly prompted me with whether to allow the javascript to run and before I could answer it hung up the Adobe Reader for about 90 seconds and then blew away both the Reader (v9.4.0) AND Firefox (v3.6.13). This is on a 64bit Win XP Pro machine (latest patches as of a couple of days ago). After restarting everything I tried once more (thinking maybe Firefox had gotten to a cluttered state) but it did it again right off the bat. I'm not sure what happened at a micro level to cause that but I wonder if the file has some sort of corruption?
          Hide
          Kevin Jackson added a comment -

          I replaced the original sample PDF file with one that did not contain JavaScript.
          Yes, Adobe Reader ALSO has performance problems with this evil file.
          This fix addresses the performance problem in PDFBox.

          Show
          Kevin Jackson added a comment - I replaced the original sample PDF file with one that did not contain JavaScript. Yes, Adobe Reader ALSO has performance problems with this evil file. This fix addresses the performance problem in PDFBox.
          Hide
          Andreas Lehmkühler added a comment -

          The provided pdf contains a lot of crap. There is a section with round about 21000 lines like the following

          (!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!)Tj

          That leads to round about 632000 "!" in the text (see the attached extraction result). That text is invisible because of its size, but it triggers the suppress duplicates algorithm od PDFBox which doesn't perform that good.

          Show
          Andreas Lehmkühler added a comment - The provided pdf contains a lot of crap. There is a section with round about 21000 lines like the following (!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!)Tj That leads to round about 632000 "!" in the text (see the attached extraction result). That text is invisible because of its size, but it triggers the suppress duplicates algorithm od PDFBox which doesn't perform that good.
          Hide
          Andreas Lehmkühler added a comment -

          I added the patch in revision 1070125 as proposed.

          Thanks for the contribution!!

          Show
          Andreas Lehmkühler added a comment - I added the patch in revision 1070125 as proposed. Thanks for the contribution!!
          Hide
          Lars Torunski added a comment -

          Using NavigableMap in the patch will result in the dependency to Java 6.

          Show
          Lars Torunski added a comment - Using NavigableMap in the patch will result in the dependency to Java 6.
          Hide
          Andreas Lehmkühler added a comment - - edited

          The minimum requirements for PDFBox are java 1.5 and I guess that won't be changed in a near future.

          As I introduced the issue I'll take care of this.

          Show
          Andreas Lehmkühler added a comment - - edited The minimum requirements for PDFBox are java 1.5 and I guess that won't be changed in a near future. As I introduced the issue I'll take care of this.
          Hide
          Andreas Lehmkühler added a comment -

          I reverted the changes in revision 1072665.

          I'm working on a solution to solve this without using a NavigableMap, but I still have to fight some missing/addtional spaces ....

          Show
          Andreas Lehmkühler added a comment - I reverted the changes in revision 1072665. I'm working on a solution to solve this without using a NavigableMap, but I still have to fight some missing/addtional spaces ....
          Hide
          Lars Torunski added a comment -

          Kevin, is it possible to realise O(N log N) performance with Java 5 without new library dependencies like backport-util-concurrent?

          Show
          Lars Torunski added a comment - Kevin, is it possible to realise O(N log N) performance with Java 5 without new library dependencies like backport-util-concurrent?
          Hide
          Stefan Magnus Landrø added a comment -

          PDF which is extremely slow to extract text from

          Show
          Stefan Magnus Landrø added a comment - PDF which is extremely slow to extract text from
          Hide
          Stefan Magnus Landrø added a comment -

          Added PDF which takes a long time to extract text from. Tested in 1.5 also - it seems like something happened after 1.2.1.

          Show
          Stefan Magnus Landrø added a comment - Added PDF which takes a long time to extract text from. Tested in 1.5 also - it seems like something happened after 1.2.1.
          Hide
          Michael McCandless added a comment -

          I'm also hitting this performance problem... it's quite severe: on my
          test case (~550 various PDFs), with
          setSuppressDuplicateOverlappingText on it takes 73.6 sec and with it
          off it's 24.031 sec: 3X slower.

          Looking at the code... I think we need some sort of spatial data
          structure here (rtree, k-d tree, quadtree, or something?), to
          efficiently query for overlapping rectangles for the new incoming
          character.

          But, even once we switch to a more efficient data structure... maybe
          we could add some simple heuristics to restrict when we search for
          dups. For example, if the text is only ever "moving forward" (ie,
          right to left or left to right, and "downwards", so that each glyph is
          placed into a previously unused space) then we can know nothing can
          overlap. On seeing a glpyh "move backwards" (or, pu) then we could
          turn on dup removal until it catches up to the unused space again...
          I think this would mean most characters don't need to be further
          checked.

          Show
          Michael McCandless added a comment - I'm also hitting this performance problem... it's quite severe: on my test case (~550 various PDFs), with setSuppressDuplicateOverlappingText on it takes 73.6 sec and with it off it's 24.031 sec: 3X slower. Looking at the code... I think we need some sort of spatial data structure here (rtree, k-d tree, quadtree, or something?), to efficiently query for overlapping rectangles for the new incoming character. But, even once we switch to a more efficient data structure... maybe we could add some simple heuristics to restrict when we search for dups. For example, if the text is only ever "moving forward" (ie, right to left or left to right, and "downwards", so that each glyph is placed into a previously unused space) then we can know nothing can overlap. On seeing a glpyh "move backwards" (or, pu) then we could turn on dup removal until it catches up to the unused space again... I think this would mean most characters don't need to be further checked.
          Hide
          Andreas Lehmkühler added a comment -

          I improved the text extraction performance in revision 1199634. My changes are based on Kevins patch, I simply replaced the NavigableMap/Set with a TreeMap/Set. The results seem to be similar, the performance is way faster than before.

          Thanks for the contribution!

          Show
          Andreas Lehmkühler added a comment - I improved the text extraction performance in revision 1199634. My changes are based on Kevins patch, I simply replaced the NavigableMap/Set with a TreeMap/Set. The results seem to be similar, the performance is way faster than before. Thanks for the contribution!

            People

            • Assignee:
              Andreas Lehmkühler
              Reporter:
              Kevin Jackson
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development