TIKA-93: OCR support

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: parser

    Description

      I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.
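
A minimal sketch of the approach described above, assuming a `tesseract` binary is on the PATH and accepts the `tesseract <image> <outputbase> -l <lang>` form, writing its result to `<outputbase>.txt`. The class name `TesseractCliSketch`, its method names, and the timeout value are hypothetical; they are not taken from the attached patches or from Tika's own API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: shells out to the tesseract command line tool.
class TesseractCliSketch {

    // Builds the argument list for: tesseract <image> <outputbase> -l <lang>
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs tesseract on one image and returns the recognized text,
    // or an empty string if the tool fails or produces no output file.
    static String extractText(Path image, String language, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr-", "");
        // Assumption: tesseract appends ".txt" to the output base name.
        Path outputFile = Path.of(outputBase + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .redirectErrorStream(true)
                    // Discard console output so the child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out after " + timeoutSeconds + " s");
            }
            if (process.exitValue() != 0 || !Files.exists(outputFile)) {
                return "";
            }
            return Files.readString(outputFile, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(outputBase);
            Files.deleteIfExists(outputFile);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No image given: just show the command that would be run.
            System.out.println(buildCommand(Path.of("scan.png"), Path.of("out"), "eng"));
            return;
        }
        System.out.println(extractText(Path.of(args[0]), "eng", 60));
    }
}
```

Running `java TesseractCliSketch.java page.png` would print whatever text tesseract recognizes in `page.png`. A real parser would also need to decide which image types to hand to the tool and what to do when the binary is missing; here a missing binary simply surfaces as an `IOException` from `start()`.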

      Attachments

        1. Petr_tika-config.xml (1 kB, Petr Vas)
        2. TesseractOCR_Tyler_v2.patch (18 kB, Tyler Bui-Palsulich)
        3. TesseractOCR_Tyler_v3.patch (20 kB, Tyler Bui-Palsulich)
        4. TesseractOCR_Tyler_v4.patch (20 kB, Tyler Bui-Palsulich)
        5. TesseractOCR_Tyler.patch (17 kB, Tyler Bui-Palsulich)
        6. TesseractOCRParser.patch (25 kB, Luís Filipe Nassif)
        7. TesseractOCRParser.patch (26 kB, Luís Filipe Nassif)
        8. testOCR.docx (61 kB, Grant Ingersoll)
        9. testOCR.pdf (41 kB, Grant Ingersoll)
        10. testOCR.pptx (77 kB, Grant Ingersoll)
        11. TIKA-93.patch (40 kB, Grant Ingersoll)
        12. TIKA-93.patch (38 kB, Grant Ingersoll)
        13. TIKA-93.patch (28 kB, Grant Ingersoll)
        14. TIKA-93.patch (21 kB, Grant Ingersoll)

            People

              Assignee: Chris A. Mattmann (chrismattmann)
              Reporter: Jukka Zitting (jukkaz)
              Votes: 12
              Watchers: 27
