Details
-
Improvement
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
3.0.3 PDFBox
-
None
Description
PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regexp to check whether a word contains whitespace (https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624).
For large documents (~800 pages) made up of short strings (such as regular words), this causes memory overhead (see attached) because of the several extra allocations made for each check. I've replaced the regexp with word.contains checks for space and \t. Since contains is a simple linear scan over the word (cheap for short strings) that does not require extra allocations, memory usage has been reduced.
What would be the implications of replacing this block with contains()?
Since \s matches [ \t\n\x0B\f\r], I believe a simplified check covering those characters would give the same result while allocating less memory.
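To make the question concrete, here is a minimal sketch of the two approaches side by side. It is not the PDFBox code: the class and method names (WhitespaceCheck, containsWhitespaceRegex, containsWhitespaceScan) are hypothetical, and the local PATTERN_SPACE is only assumed to behave like StringUtil.PATTERN_SPACE, i.e. a default (non-Unicode) \s pattern. The scan covers all six characters that \s matches, rather than only space and \t, so both methods should agree on any input.

```java
import java.util.regex.Pattern;

public final class WhitespaceCheck {
    // Assumption: stands in for StringUtil.PATTERN_SPACE (default \s semantics).
    private static final Pattern PATTERN_SPACE = Pattern.compile("\\s");

    private WhitespaceCheck() {
    }

    // Regex-based check: allocates a new Matcher on every call.
    public static boolean containsWhitespaceRegex(String word) {
        return PATTERN_SPACE.matcher(word).find();
    }

    // Allocation-free alternative: a linear scan for the six characters
    // that the default \s class matches (space, tab, LF, VT, FF, CR).
    public static boolean containsWhitespaceScan(String word) {
        for (int i = 0; i < word.length(); i++) {
            switch (word.charAt(i)) {
                case ' ':
                case '\t':
                case '\n':
                case '\u000B':
                case '\f':
                case '\r':
                    return true;
                default:
                    break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {"word", "two words", "tab\there", "line\nbreak", "vt\u000Bchar", ""};
        for (String s : samples) {
            // Both checks are expected to agree on every sample.
            if (containsWhitespaceRegex(s) != containsWhitespaceScan(s)) {
                throw new IllegalStateException("Mismatch for sample of length " + s.length());
            }
        }
        System.out.println("scan and regex agree on all samples");
    }
}
```

Checking only space and \t, as in my patch, would behave differently from the regexp for words containing \n, \x0B, \f or \r. Whether such words can reach this code path depends on how the text is tokenized beforehand.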
Attachments