Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: parser
    • Labels:

      Description

      I don't know of any decent open-source, pure-Java OCR libraries, but there are command-line OCR tools such as Tesseract (http://code.google.com/p/tesseract-ocr/) that Tika could invoke to extract text content (where available) from image files.
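
As a rough illustration of the idea, the sketch below shells out to the `tesseract` command-line tool from Java and reads back the text it writes. It is not the parser that was eventually committed for this issue. It assumes `tesseract` is on the PATH and uses the basic `tesseract <image> <outputbase>` invocation, which writes its result to `<outputbase>.txt`. The class name `TesseractSketch`, its method names, and the timeout parameter are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: runs the tesseract CLI on one image.
final class TesseractSketch {

    // Builds the command line "tesseract <image> <outputBase>".
    // Assumes the tesseract executable is available on the PATH.
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    // Runs tesseract on the image and returns the recognized text.
    // Tesseract appends ".txt" to the output base name it is given.
    static String extractText(Path image, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr-", "");
        Path outputFile = Paths.get(outputBase.toString() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .redirectErrorStream(true)
                    // Discard console output so the child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Guard against a hung OCR process.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out after " + timeoutSeconds + " s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with code " + process.exitValue());
            }
            return new String(Files.readAllBytes(outputFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up both the placeholder temp file and tesseract's output.
            Files.deleteIfExists(outputBase);
            Files.deleteIfExists(outputFile);
        }
    }
}
```

A caller would use something like `TesseractSketch.extractText(Paths.get("scan.png"), 60)`. Keeping the command construction in its own method makes it possible to check the invocation without having Tesseract installed.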

        Attachments

        1. TIKA-93.patch (21 kB, Grant Ingersoll)
        2. TIKA-93.patch (28 kB, Grant Ingersoll)
        3. TIKA-93.patch (38 kB, Grant Ingersoll)
        4. TIKA-93.patch (40 kB, Grant Ingersoll)
        5. testOCR.pptx (77 kB, Grant Ingersoll)
        6. testOCR.pdf (41 kB, Grant Ingersoll)
        7. testOCR.docx (61 kB, Grant Ingersoll)
        8. TesseractOCRParser.patch (26 kB, Luis Filipe Nassif)
        9. TesseractOCRParser.patch (25 kB, Luis Filipe Nassif)
        10. TesseractOCR_Tyler.patch (17 kB, Tyler Palsulich)
        11. TesseractOCR_Tyler_v4.patch (20 kB, Tyler Palsulich)
        12. TesseractOCR_Tyler_v3.patch (20 kB, Tyler Palsulich)
        13. TesseractOCR_Tyler_v2.patch (18 kB, Tyler Palsulich)
        14. Petr_tika-config.xml (1 kB, Petr Vas)

              People

              • Assignee: Chris A. Mattmann (chrismattmann)
              • Reporter: Jukka Zitting (jukkaz)
              • Votes: 12
              • Watchers: 31