I'm also hitting this performance problem, and it's quite severe: on my
test case (~550 assorted PDFs), extraction takes 73.6 sec with
setSuppressDuplicateOverlappingText enabled versus 24.031 sec with it
disabled, i.e. roughly 3X slower.
Looking at the code... I think we need some sort of spatial data
structure here (rtree, k-d tree, quadtree, or something?), to
efficiently query for rectangles overlapping each new incoming glyph.
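To make the idea concrete, here's a minimal sketch of one of the simpler options, a uniform grid: bucket each glyph's bounding box into fixed-size cells, so an overlap query only scans rectangles in the cells the new glyph touches instead of every glyph seen so far. All names here are hypothetical, not the PDFBox API, and a real rtree/quadtree would adapt better to skewed layouts:

```java
import java.util.*;

public class GlyphGrid {
    // Simple axis-aligned rectangle (hypothetical type, not PDFBox's).
    static final class Rect {
        final double x, y, w, h;
        Rect(double x, double y, double w, double h) {
            this.x = x; this.y = y; this.w = w; this.h = h;
        }
        boolean overlaps(Rect o) {
            return x < o.x + o.w && o.x < x + w
                && y < o.y + o.h && o.y < y + h;
        }
    }

    private final double cell;
    private final Map<Long, List<Rect>> buckets = new HashMap<>();

    GlyphGrid(double cellSize) { this.cell = cellSize; }

    // Pack a (cellX, cellY) pair into one map key.
    private long key(int cx, int cy) {
        return ((long) cx << 32) | (cy & 0xffffffffL);
    }

    void add(Rect r) {
        int x0 = (int) Math.floor(r.x / cell), x1 = (int) Math.floor((r.x + r.w) / cell);
        int y0 = (int) Math.floor(r.y / cell), y1 = (int) Math.floor((r.y + r.h) / cell);
        for (int cx = x0; cx <= x1; cx++)
            for (int cy = y0; cy <= y1; cy++)
                buckets.computeIfAbsent(key(cx, cy), k -> new ArrayList<>()).add(r);
    }

    boolean overlapsAny(Rect r) {
        // Only inspect rectangles stored in the cells r touches.
        int x0 = (int) Math.floor(r.x / cell), x1 = (int) Math.floor((r.x + r.w) / cell);
        int y0 = (int) Math.floor(r.y / cell), y1 = (int) Math.floor((r.y + r.h) / cell);
        for (int cx = x0; cx <= x1; cx++)
            for (int cy = y0; cy <= y1; cy++)
                for (Rect o : buckets.getOrDefault(key(cx, cy), Collections.emptyList()))
                    if (o.overlaps(r)) return true;
        return false;
    }
}
```

With a cell size near the typical glyph height, each query touches only a handful of buckets, so the per-glyph cost stays roughly constant instead of growing with the page's glyph count.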
But, even once we switch to a more efficient data structure... maybe
we could add some simple heuristics to restrict when we search for
dups at all. For example, if the text is only ever "moving forward" (ie,
left to right or right to left, and "downwards", so that each glyph is
placed into a previously unused space) then we know nothing can
overlap. On seeing a glyph "move backwards" (or up) we could
turn on dup removal until it catches up to the unused space again...
I think this would mean most characters never need any duplicate
check at all.
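A rough sketch of that heuristic, assuming y grows downwards and treating glyphs as points for simplicity (a real version would have to compare the glyph's leading edge against the frontier, since glyphs have extent). The class and method names are made up for illustration:

```java
// Track a "frontier": the furthest (lowest, then rightmost) position
// written so far. A glyph landing strictly past the frontier cannot
// overlap anything already placed, so the duplicate check is skipped;
// a glyph moving backwards or up triggers the full check.
public class ForwardMotionFilter {
    private double frontierX = Double.NEGATIVE_INFINITY;
    private double frontierY = Double.NEGATIVE_INFINITY;

    /** Returns true if the glyph placed at (x, y) needs a duplicate check. */
    boolean needsDupCheck(double x, double y) {
        boolean pastFrontier = y > frontierY || (y == frontierY && x > frontierX);
        if (pastFrontier) {
            // Fresh space: advance the frontier, no check needed.
            frontierY = Math.max(frontierY, y);
            frontierX = x;
            return false;
        }
        // Moved backwards/up: could overlap earlier text.
        return true;
    }
}
```

For typical PDFs that emit text in reading order, nearly every call returns false, so the spatial query above only runs on the rare backward jumps.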