TIKA-93: OCR support

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: parser

    Description

      I don't know of any decent open-source, pure-Java OCR libraries, but there are command-line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that Tika could invoke to extract text content (where available) from image files.
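
      A minimal sketch of what invoking such a tool from Java could look like, assuming a `tesseract` binary is on the PATH and using its usual `tesseract <image> <outputbase> -l <lang>` form. The class and method names (`TesseractCliSketch`, `buildCommand`, `extractText`) are hypothetical and are not taken from Tika or from any of the attached patches.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Tika's API.
final class TesseractCliSketch {

    // Builds the argument list: tesseract <image> <outputBase> -l <language>.
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");            // assumption: the binary is on the PATH
        cmd.add(image.toString());
        cmd.add(outputBase.toString());  // Tesseract appends ".txt" to this base name
        cmd.add("-l");
        cmd.add(language);
        return cmd;
    }

    // Runs Tesseract on one image and returns the recognized text.
    static String extractText(Path image, String language, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr-sketch");
        Path outputBase = workDir.resolve("result");
        Path outputFile = workDir.resolve("result.txt");
        Path logFile = workDir.resolve("tesseract.log");
        try {
            // Send the tool's console output to a file so the child process
            // cannot block on a full pipe buffer.
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .redirectErrorStream(true)
                    .redirectOutput(logFile.toFile())
                    .start();

            // Give up on images that take too long to process.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("OCR timed out after " + timeoutSeconds + " s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with code " + process.exitValue());
            }
            return new String(Files.readAllBytes(outputFile), StandardCharsets.UTF_8);
        } finally {
            // Best-effort cleanup of the temporary files.
            Files.deleteIfExists(outputFile);
            Files.deleteIfExists(logFile);
            Files.deleteIfExists(workDir);
        }
    }
}
```

      A caller might use `TesseractCliSketch.extractText(Paths.get("scan.png"), "eng", 120)` and feed the returned string into its text handler. Writing to a temporary output file instead of reading the process's standard output keeps the sketch independent of how a given Tesseract version handles console output.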

      Attachments

        1. TIKA-93.patch
          21 kB
          Grant Ingersoll
        2. TIKA-93.patch
          28 kB
          Grant Ingersoll
        3. TIKA-93.patch
          38 kB
          Grant Ingersoll
        4. TIKA-93.patch
          40 kB
          Grant Ingersoll
        5. testOCR.pptx
          77 kB
          Grant Ingersoll
        6. testOCR.pdf
          41 kB
          Grant Ingersoll
        7. testOCR.docx
          61 kB
          Grant Ingersoll
        8. TesseractOCRParser.patch
          26 kB
          Luís Filipe Nassif
        9. TesseractOCRParser.patch
          25 kB
          Luís Filipe Nassif
        10. TesseractOCR_Tyler.patch
          17 kB
          Tyler Bui-Palsulich
        11. TesseractOCR_Tyler_v4.patch
          20 kB
          Tyler Bui-Palsulich
        12. TesseractOCR_Tyler_v3.patch
          20 kB
          Tyler Bui-Palsulich
        13. TesseractOCR_Tyler_v2.patch
          18 kB
          Tyler Bui-Palsulich
        14. Petr_tika-config.xml
          1 kB
          Petr Vas

            People

              Assignee: Chris A. Mattmann
              Reporter: Jukka Zitting
              Votes: 12
              Watchers: 27