[PDFBOX-4883] COSFloat is extremely slow - ASF JIRA

I am testing text extraction from PDF and profiling the execution.

I found that the biggest time consumer is the COSFloat class.

All other improvements I suggested so far are small compared to this.

But this is also the most complex one.

I have attached the profiler output for the same text extraction, with and without the COSFloat changes.

Extracting the same text took 4 times longer with the original COSFloat, because of its use of BigDecimal.
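
To illustrate the kind of difference I mean, here is a rough sketch that compares parsing a numeric string through BigDecimal with parsing it directly to a primitive float. It is not the actual COSFloat code: the class `FloatParseSketch` and both helper methods are hypothetical names made up for this example, and the sample values and loop count are arbitrary. The timings it prints are only indicative, since this is not a proper benchmark.

```java
import java.math.BigDecimal;

// Hypothetical sketch, not the PDFBox implementation.
final class FloatParseSketch {

    // Parse through BigDecimal, then narrow the result to a float.
    static float parseViaBigDecimal(String text) {
        return new BigDecimal(text).floatValue();
    }

    // Parse straight to a primitive float, with no BigDecimal allocation.
    static float parseDirect(String text) {
        return Float.parseFloat(text);
    }

    public static void main(String[] args) {
        // Arbitrary sample values, chosen only for illustration.
        String[] samples = {"0.5", "12.75", "-3.0", "100"};
        int rounds = 200_000;
        float sink = 0f; // accumulated so the JIT cannot drop the calls

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String s : samples) {
                sink += parseViaBigDecimal(s);
            }
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String s : samples) {
                sink += parseDirect(s);
            }
        }
        long t2 = System.nanoTime();

        System.out.println("via BigDecimal: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("direct float:   " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("checksum: " + sink);
    }
}
```

Both helpers return the same value for inputs like these, so the only thing being compared is the cost of the parsing path.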

I will try to write extra tests for all the cases I see in the original COSFloat code, if they are not already tested.

Then I will submit a new COSFloat version for review.

I think this affects parsing and displaying PDFs too, not just text extraction.