PDFBox / PDFBOX-4883

COSFloat is extremely slow

Details

    Description

      I am testing text extraction from PDF and profiling the execution.

      I found that the biggest time consumer is the COSFloat class.

       

      All other improvements I suggested so far are small compared to this.

      But this is also the most complex one.

       

      I have attached the profiler output for the same text extraction, with and without the COSFloat changes.

      Extracting the same text took 4 times longer with the original COSFloat, because of its use of BigDecimal.
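
      To illustrate the kind of difference involved, here is a minimal sketch. It is not the actual COSFloat code or the attached patch; the class name FloatParseSketch and both helper methods are hypothetical. It only contrasts parsing a numeric string through BigDecimal with parsing it directly as a primitive float, and checks that both agree on a few values that a float represents exactly.

```java
import java.math.BigDecimal;

// Hypothetical illustration only: not PDFBox code.
class FloatParseSketch {

    // Parses through an arbitrary-precision BigDecimal, then narrows to float.
    // This allocates a BigDecimal object for every number parsed.
    static float parseViaBigDecimal(String s) {
        return new BigDecimal(s).floatValue();
    }

    // Parses straight to a primitive float, with no intermediate BigDecimal.
    static float parseDirect(String s) {
        return Float.parseFloat(s);
    }

    public static void main(String[] args) {
        // Sample values chosen because they are exactly representable as float.
        String[] samples = {"0.5", "12.75", "-3.25", "1000"};
        for (String s : samples) {
            float viaBigDecimal = parseViaBigDecimal(s);
            float direct = parseDirect(s);
            System.out.println(s + ": same result = " + (viaBigDecimal == direct));
        }
    }
}
```

      The sketch checks equivalence only; it does not measure speed. Any real replacement would also need to handle the edge cases covered by the original class, such as out-of-range values.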

      I will try to write extra tests for all the cases I see in the original COSFloat code, if they are not already tested.

      Then I will submit a new COSFloat version for review.

       

      I think this affects parsing and displaying PDFs too, not just text extraction.

      Attachments

        1. After.png
          40 kB
          Alfred
        2. Before.png
          53 kB
          Alfred
        3. extreme-values-out.pdf
          26 kB
          Alfred
        4. PDFBOX-4883.patch
          11 kB
          Alfred
        5. PDFBOX-4883-b.patch
          4 kB
          Alfred

        Activity

          People

            lehmi Andreas Lehmkühler
            Faltiska Alfred
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: