Tika / TIKA-3131

PDFParserConfig default values were accidentally swapped


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.24.1
    • Fix Version/s: 1.25
    • Component/s: config, parser
    • Labels:
      None

      Description

      When default values were added for averageCharTolerance and spacingTolerance as part of TIKA-3091, their values appear to have been inadvertently swapped.

      From PDFBox:

          private float spacingTolerance = .5f;
          private float averageCharTolerance = .3f;
      

      From Tika 1.24.1:

          //The character width-based tolerance value used to estimate where spaces in text should be added
          //Default taken from PDFBox.
          private Float averageCharTolerance = 0.5f;
      
          //The space width-based tolerance value used to estimate where spaces in text should be added
          //Default taken from PDFBox.
          private Float spacingTolerance = 0.3f;
      

      This effective change in defaults has caused PDFParser to start adding more spaces than it did in 1.24 and earlier.
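
      To illustrate why swapping the two values can change the output, here is a minimal sketch. It assumes a simplified model in which a space is emitted when the horizontal gap between two glyphs exceeds the smaller of (space width × spacingTolerance) and (average character width × averageCharTolerance). That model, the class ToleranceSwapDemo, the method insertsSpace, and the glyph measurements are all illustrative assumptions; the actual logic in PDFBox may differ.

```java
// Hypothetical, simplified model of a gap-based space heuristic.
// This is NOT the PDFBox implementation; it only shows how swapping the
// two tolerance values can move the threshold at which a space is added.
class ToleranceSwapDemo {
    // Defaults quoted from PDFBox in this issue.
    static final float PDFBOX_SPACING_TOLERANCE = 0.5f;
    static final float PDFBOX_AVERAGE_CHAR_TOLERANCE = 0.3f;

    // Defaults quoted from Tika 1.24.1 in this issue (swapped).
    static final float TIKA_SPACING_TOLERANCE = 0.3f;
    static final float TIKA_AVERAGE_CHAR_TOLERANCE = 0.5f;

    // Assumed rule: emit a space when the gap exceeds the smaller of the
    // space-width-based and the average-char-width-based thresholds.
    static boolean insertsSpace(float gap, float spaceWidth, float avgCharWidth,
                                float spacingTolerance, float averageCharTolerance) {
        float threshold = Math.min(spaceWidth * spacingTolerance,
                                   avgCharWidth * averageCharTolerance);
        return gap > threshold;
    }

    public static void main(String[] args) {
        // Made-up measurements for a font with a narrow space glyph.
        float gap = 1.0f;
        float spaceWidth = 2.5f;
        float avgCharWidth = 5.0f;

        System.out.println("PDFBox defaults insert space: "
                + insertsSpace(gap, spaceWidth, avgCharWidth,
                        PDFBOX_SPACING_TOLERANCE, PDFBOX_AVERAGE_CHAR_TOLERANCE));
        System.out.println("Tika 1.24.1 defaults insert space: "
                + insertsSpace(gap, spaceWidth, avgCharWidth,
                        TIKA_SPACING_TOLERANCE, TIKA_AVERAGE_CHAR_TOLERANCE));
    }
}
```

      Under this assumed model, a font whose space glyph is narrow relative to its average character width gets a lower threshold from the swapped values, so smaller gaps are treated as word breaks. For fonts where the two widths are similar, the swap may make no difference.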

              People

              • Assignee:
                Unassigned
                Reporter:
                Clark Perkins (clarkperkins)
              • Votes:
                1
                Watchers:
                5
