[PDFBOX-5499] Performance issue since 2.0.18 - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.19
Fix Version/s: 2.0.27, 3.0.0 PDFBox
Component/s: PDModel
Labels: None

Description

Our PDF is parsed in less than 200 ms in 2.0.18 and in more than 8 seconds in 2.0.19. The same issue is still present in 2.0.26.

SmallMap was introduced in version 2.0.19, and we have faced this performance issue since that change.

We patched our code to replace the SmallMap implementation like this:

package org.apache.pdfbox.util;

import java.util.LinkedHashMap;

public class SmallMap<K, V> extends LinkedHashMap<K, V> {
    // Intentionally empty: fall back to the standard LinkedHashMap
}

With this patch, the performance issue disappears.

Our test is really simple:

    long start = System.currentTimeMillis();
    try (PDDocument document = PDDocument.load(new File(inFile))) {
      // Intentionally empty: only the parsing time is measured
    }
    long duration = System.currentTimeMillis() - start;

    assertTrue(duration < 500);

I understand that SmallMap can solve problems in some cases, but would it be possible to implement a factory that creates this map, so that we can configure which Map implementation to use?
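
A minimal sketch of what such a factory could look like. This is only an illustration of the request, not existing PDFBox code: the class name `DictionaryMapFactory` and its methods `setSupplier` and `newMap` are hypothetical, and the default of `LinkedHashMap` is an assumption chosen to match the patch above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical factory (not part of the PDFBox API) that lets callers choose the Map implementation. */
final class DictionaryMapFactory {
    // Assumed default: the standard LinkedHashMap, which preserves insertion order.
    private static volatile Supplier<Map<Object, Object>> supplier = LinkedHashMap::new;

    private DictionaryMapFactory() {
        // static utility, no instances
    }

    /** Replaces the supplier used for all maps created afterwards. */
    static void setSupplier(Supplier<Map<Object, Object>> newSupplier) {
        supplier = Objects.requireNonNull(newSupplier, "newSupplier");
    }

    /** Creates a new, empty map using the currently configured supplier. */
    @SuppressWarnings("unchecked")
    static <K, V> Map<K, V> newMap() {
        // The unchecked cast is safe because the supplied map is always freshly created and empty.
        return (Map<K, V>) supplier.get();
    }
}
```

With something like this in place, an application could call `DictionaryMapFactory.setSupplier(...)` once at startup to pick a memory-saving or a lookup-optimized map, and the library would call `DictionaryMapFactory.newMap()` wherever it currently instantiates SmallMap directly.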

Attachments

- image-2022-09-05-12-48-04-608.png, 05/Sep/22 10:48, 103 kB, Thomas Debray Luyat
- image-2022-09-05-17-37-55-155.png, 05/Sep/22 15:37, 27 kB, Thomas Debray Luyat
- image-2022-09-05-17-40-22-416.png, 05/Sep/22 15:40, 26 kB, Thomas Debray Luyat
- image-2022-09-05-19-55-40-753.png, 05/Sep/22 17:55, 16 kB, Thomas Debray Luyat

Issue Links

relates to: PDFBOX-3284 Big Pdf parsing to text - Out of memory (Closed)

People

Assignee: Tilman Hausherr

Reporter: Thomas Debray Luyat

Votes: 1

Watchers: 6

Dates

Created: 05/Sep/22 10:48

Updated: 29/Sep/22 17:57

Resolved: 20/Sep/22 17:17