Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
2.0.20, 3.0.0 PDFBox
Description
I am testing text extraction from PDF and profiling the execution.
I found that the biggest time consumer is the COSFloat class.
All other improvements I suggested so far are small compared to this.
But this is also the most complex one.
I have attached the profiler output for the same text extraction, with and without the COSFloat changes.
Extracting the same text took 4 times longer with the original COSFloat, because of its use of BigDecimal.
I will try to write extra tests for all cases I see in the original COSFloat code, if they are not already tested.
Then I will submit for review a new COSFloat version.
I think this affects parsing and displaying PDFs too, not just text extraction.
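To illustrate the kind of overhead meant here, below is a minimal sketch. It is not the PDFBox code and not the proposed patch. The class `FloatParseSketch` and its two helper methods are hypothetical names made up for this example. It parses the same decimal strings once through `BigDecimal` and once through `Float.parseFloat`, and prints rough wall-clock timings. The absolute numbers depend on the JVM and the input, so treat them only as an indication; a real measurement should use a profiler or a benchmark harness.

```java
import java.math.BigDecimal;

public class FloatParseSketch {

    // Hypothetical helper: parse through BigDecimal, then narrow to float.
    public static float parseWithBigDecimal(String s) {
        return new BigDecimal(s).floatValue();
    }

    // Hypothetical helper: parse directly to a primitive float.
    public static float parseWithFloat(String s) {
        return Float.parseFloat(s);
    }

    public static void main(String[] args) {
        // Sample operands of the kind found in PDF content streams (assumed data).
        String[] samples = {"0.5", "-12.75", "100", "3.14159", "0.000123"};
        int rounds = 200_000;
        float sink = 0f; // accumulate results so the JIT cannot drop the loops

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String s : samples) {
                sink += parseWithBigDecimal(s);
            }
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String s : samples) {
                sink += parseWithFloat(s);
            }
        }
        long t2 = System.nanoTime();

        System.out.println("BigDecimal path:       " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("Float.parseFloat path: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("(ignore) sink = " + sink);
    }
}
```

Both helpers should return the same value for well-formed input, so any replacement mainly has to preserve the handling of the malformed and out-of-range cases that the original COSFloat code covers.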