PDFBox / PDFBOX-5823

StringUtil.PATTERN_SPACE memory optimisation


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.3 PDFBox
    • Fix Version/s: 3.0.3 PDFBox, 4.0.0
    • Component/s: PDModel
    • Labels: None

    Description

      PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regexp to check whether a word contains whitespace (https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624).

      For large documents (~800 pages) made up of short strings (such as regular words), this causes a memory overhead (see attached), because each check makes several extra allocations. I've replaced the regexp with word.contains checks for space and \t. Since contains is a plain linear scan over the word that requires no extra allocations, memory usage has been reduced.
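
      A minimal sketch of the two checks side by side, to make the trade-off concrete. The class and method names are hypothetical, and the local PATTERN_SPACE is only a stand-in that assumes StringUtil.PATTERN_SPACE is equivalent to \s; it is not the actual PDFBox code.

```java
import java.util.regex.Pattern;

class SpaceCheckSketch {
    // Stand-in for StringUtil.PATTERN_SPACE (assumption: it matches "\\s").
    static final Pattern PATTERN_SPACE = Pattern.compile("\\s");

    // Regex-based check: each call creates a new Matcher object.
    static boolean hasSpaceRegex(String word) {
        return PATTERN_SPACE.matcher(word).find();
    }

    // contains-based check, limited to space and tab as described above.
    static boolean hasSpaceContains(String word) {
        return word.contains(" ") || word.contains("\t");
    }

    public static void main(String[] args) {
        String[] samples = {"foobar", "foo bar", "foo\tbar", "foo\nbar"};
        for (String s : samples) {
            // Print both results so any disagreement is visible.
            System.out.println(hasSpaceRegex(s) + " / " + hasSpaceContains(s));
        }
    }
}
```

      Note that the two checks agree for space and tab but differ for the other characters covered by \s (for example \n), which is relevant to the question below.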

      What would be the implications of replacing this block with contains()?

      Since \s is [ \t\n\x0B\f\r], I believe a simplified check covering these characters would allocate less memory.
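
      One possible shape for such a simplified check, as a sketch only: a single pass over the word that tests for the six characters in \s without creating a Matcher. The class and method names are hypothetical and this is not necessarily the change that was committed.

```java
class WhitespaceScanSketch {
    // Returns true if the word contains any of the characters matched by \s
    // in its default (non-Unicode) form: space, \t, \n, 0x0B, \f, \r.
    static boolean containsRegexWhitespace(String word) {
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n'
                    || c == 0x0B || c == '\f' || c == '\r') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsRegexWhitespace("word"));
        System.out.println(containsRegexWhitespace("two words"));
    }
}
```

      The loop is linear in the length of the word and works on primitive char values only, so it keeps the semantics of \s while avoiding the per-call allocations of the regexp.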

       

      Attachments

        1. Main.java
          2 kB
          Jonathan Prates
        2. Main-1.java
          4 kB
          Jonathan Prates
        3. Screenshot 2024-05-19 at 22.39.10.png
          71 kB
          Jonathan Prates
        4. Screenshot 2024-05-19 at 22.40.17.png
          89 kB
          Jonathan Prates
        5. Screenshot 2024-05-21 at 20.21.43.png
          171 kB
          Jonathan Prates

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Jonathan Prates (thumbox)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: