  1. Tika
  2. TIKA-3202

Tika duplicates the OCR text

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Works for Me
    • Affects Version/s: 1.24.1
    • None
    • None
    • None

    Description

      I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.

      The header X-Tika-PDFocrStrategy: OCR_AND_TEXT triggers the issue.

      The output from PDF processing is duplicated.
      For the attached PDF file, the output is:

      There is some text 
      [image: image0.jpg]
      
      There is some textT
      here is an image!!
      

      The curl command to reproduce:

      curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf  http://localhost:9998/tika
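
To narrow down where the repeated text comes from, it may help to send the same file once per OCR strategy and compare the results. The following is only a rough Python sketch of the curl call above, not something taken from the report: the function names `tika_headers`, `extract_text` and `compare_strategies` are made up for illustration, and the strategy values `NO_OCR` and `OCR_ONLY` are assumed to be accepted by the `X-Tika-PDFocrStrategy` header in the same way as `OCR_AND_TEXT`.

```python
import http.client
from urllib.parse import urlsplit


def tika_headers(ocr_strategy, extract_inline_images=True):
    """Build the two headers used in the curl command from the report."""
    return {
        "X-Tika-PDFextractInlineImages": "true" if extract_inline_images else "false",
        "X-Tika-PDFocrStrategy": ocr_strategy,
    }


def extract_text(pdf_path, ocr_strategy, url="http://localhost:9998/tika"):
    """PUT the file to the server (like `curl -T`) and return the response body."""
    parts = urlsplit(url)
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    # http.client sends header names exactly as written, matching the curl call.
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=120)
    try:
        conn.request("PUT", parts.path or "/", body=body,
                     headers=tika_headers(ocr_strategy))
        response = conn.getresponse()
        data = response.read()
        if response.status != 200:
            raise RuntimeError(f"server returned HTTP {response.status}")
        return data.decode("utf-8", errors="replace")
    finally:
        conn.close()


def compare_strategies(pdf_path, strategies=("NO_OCR", "OCR_ONLY", "OCR_AND_TEXT")):
    """Return a dict mapping each strategy name (assumed values) to its output."""
    return {name: extract_text(pdf_path, name) for name in strategies}
```

With the server from the Docker image running, `compare_strategies("text_and_image.pdf")` would return one output per strategy, which can then be printed side by side to see which setting contributes each copy of the text.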
      

      Attachments

        1. text_and_image.pdf
          42 kB
          marek kapowicki

          People

            Assignee: Unassigned
            Reporter: marek kapowicki