Tika / TIKA-3131

PDFParserConfig default values were accidentally swapped

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.24.1
    • Fix Version/s: 1.25
    • Component/s: config, parser
    • Labels: None

    Description

      When default values were added for averageCharTolerance and spacingTolerance as a part of TIKA-3091, their values appear to have been inadvertently swapped.

      From PDFBox:

          private float spacingTolerance = .5f;
          private float averageCharTolerance = .3f;
      

      From Tika 1.24.1:

          //The character width-based tolerance value used to estimate where spaces in text should be added
          //Default taken from PDFBox.
          private Float averageCharTolerance = 0.5f;
      
          //The space width-based tolerance value used to estimate where spaces in text should be added
          //Default taken from PDFBox.
          private Float spacingTolerance = 0.3f;
      

      This effective change in defaults has caused PDFParser to start adding more spaces than it did in 1.24 and earlier.
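
The mismatch between the two snippets above can be expressed as a small, dependency-free check. This is only a sketch: the class and method names are hypothetical, it does not call Tika or PDFBox, and the numeric values are the ones quoted in this report.

```java
// Minimal sketch; ToleranceDefaultsCheck and its methods are hypothetical names,
// not part of Tika or PDFBox.
class ToleranceDefaultsCheck {

    // Defaults as quoted from PDFBox in this report.
    static final float PDFBOX_SPACING_TOLERANCE = 0.5f;
    static final float PDFBOX_AVERAGE_CHAR_TOLERANCE = 0.3f;

    /** Returns true when both tolerances equal the PDFBox defaults quoted above. */
    public static boolean matchesPdfBoxDefaults(float averageCharTolerance,
                                                float spacingTolerance) {
        return Float.compare(averageCharTolerance, PDFBOX_AVERAGE_CHAR_TOLERANCE) == 0
                && Float.compare(spacingTolerance, PDFBOX_SPACING_TOLERANCE) == 0;
    }

    /** Returns true when the PDFBox defaults were assigned to the opposite fields. */
    public static boolean isSwapped(float averageCharTolerance,
                                    float spacingTolerance) {
        return matchesPdfBoxDefaults(spacingTolerance, averageCharTolerance)
                && !matchesPdfBoxDefaults(averageCharTolerance, spacingTolerance);
    }

    public static void main(String[] args) {
        // Values as they appear in the Tika 1.24.1 snippet.
        System.out.println("1.24.1 defaults swapped: " + isSwapped(0.5f, 0.3f));
        // Values after restoring the PDFBox assignment.
        System.out.println("restored defaults match: " + matchesPdfBoxDefaults(0.3f, 0.5f));
    }
}
```

A check of this shape could serve as a regression test if it read the defaults from the real configuration class instead of taking literals; wiring it to PDFParserConfig is left out here to keep the sketch self-contained.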

    People

      Assignee: Unassigned
      Reporter: Clark Perkins (clarkperkins)