PDFBox / PDFBOX-4883

COSFloat is extremely slow

    Details

      Description

      I am testing text extraction from PDF and profiling the execution.

      I found that the biggest time consumer is the COSFloat class.

       

      All other improvements I suggested so far are small compared to this.

      But this is also the most complex one.

       

      I have attached the profiler output for the same text extraction, with and without the COSFloat changes.

      Extracting the same text took 4 times longer with the original COSFloat, because of its use of BigDecimal.
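
      To illustrate the kind of overhead involved, here is a minimal sketch that parses the same numeric tokens once through BigDecimal and once directly to a float. It is not the actual COSFloat code nor the attached patch: the class name, the helper methods and the sample tokens are all made up for illustration, and it assumes the cost comes from building a BigDecimal for each token.

```java
import java.math.BigDecimal;

public class FloatParseSketch {

    // Hypothetical helper: parse through BigDecimal, then narrow to float.
    static float parseViaBigDecimal(String token) {
        return new BigDecimal(token).floatValue();
    }

    // Hypothetical helper: parse straight to a primitive float.
    static float parseDirect(String token) {
        return Float.parseFloat(token);
    }

    public static void main(String[] args) {
        // Made-up sample tokens, similar in shape to numbers in a content stream.
        String[] tokens = {"0.5", "-12.75", "100", "612.0", "0.125"};
        int rounds = 200_000;
        float sink = 0f; // accumulate results so the JIT cannot drop the loops

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String t : tokens) {
                sink += parseViaBigDecimal(t);
            }
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String t : tokens) {
                sink += parseDirect(t);
            }
        }
        long t2 = System.nanoTime();

        // Rough wall-clock numbers only; no warm-up or statistical treatment.
        System.out.println("BigDecimal route: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("Direct route:     " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("checksum: " + sink);
    }
}
```

      The timings from such a loop are only indicative. A proper comparison would use a benchmark harness such as JMH, and the profiler screenshots attached here remain the real evidence.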

      I will try to write extra tests for all cases I see in the original COSFloat code, if they are not already tested.

      Then I will submit a new COSFloat version for review.

       

      I think this affects parsing and displaying PDFs too, not just text extraction.

        Attachments

        1. Before.png
          53 kB
          Alfred
        2. After.png
          40 kB
          Alfred
        3. extreme-values-out.pdf
          26 kB
          Alfred
        4. PDFBOX-4883.patch
          11 kB
          Alfred
        5. PDFBOX-4883-b.patch
          4 kB
          Alfred

            People

            • Assignee:
              lehmi Andreas Lehmkühler
              Reporter:
              Faltiska Alfred
            • Votes:
              0
              Watchers:
              4
