Tika / TIKA-3307

extracted text strings have repeated characters

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      Extracted text from some PDF files includes some strings with repeated (doubled) characters.

      To reproduce the problem, download the attached PDF file and run the following command:

      java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
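
For readers unfamiliar with the filter, here is a minimal sketch of what the egrep pattern selects, written in Python rather than shell. The function name and the sample strings are hypothetical illustrations and are not taken from the attached PDF. The pattern matches any line containing two consecutive doubled characters, so it is only a heuristic: ordinary words such as "bookkeeper" also match.

```python
import re

# Same pattern as the egrep filter above: a character repeated once,
# immediately followed by another character repeated once.
DOUBLED_PAIR = re.compile(r"(.)\1(.)\2")


def flag_doubled_lines(text):
    """Return the lines of `text` that contain two adjacent doubled characters."""
    return [line for line in text.splitlines() if DOUBLED_PAIR.search(line)]


# Hypothetical sample text, not actual output from the attached PDF.
sample = "IInnssttaallllaattiioonn\nInstallation\nbookkeeper\n"
print(flag_doubled_lines(sample))
```

Because of such false positives, a flagged line is best treated as a candidate for manual inspection rather than proof of the doubling problem.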
      
      

      The bad strings all seem to be headings, so perhaps something in the font or other style features is causing the problem.

      First detected in version 1.19; retested with 1.25. Earlier versions were not tested.

      Attachments

        1. WSHP-PRC025F-EN_07132019.pdf
          12.86 MB
          Paul Tyson

          People

            Assignee: Unassigned
            Reporter: Paul Tyson (paul.tyson)
            Votes: 0
            Watchers: 2
