Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: parser
    • Labels:

      Description

      I don't know of any decent open-source, pure-Java OCR libraries, but there are command-line OCR tools such as Tesseract (http://code.google.com/p/tesseract-ocr/) that Tika could invoke to extract text content (where available) from image files.
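
As a rough illustration of the idea above, invoking a command-line OCR tool from Java could look like the sketch below. It is not the parser that was committed for this issue. It assumes a `tesseract` binary on the PATH and relies on the basic `tesseract <image> <outputbase>` form, which writes the recognized text to `<outputbase>.txt`. The names `TesseractSketch`, `buildCommand`, and `extractText` are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Tika.
final class TesseractSketch {

    // Builds the argument list for: tesseract <image> <outputBase>
    // Assumption: the "tesseract" executable is available on the PATH.
    static List<String> buildCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    // Runs the external process and returns whatever text it wrote,
    // or an empty string if no output file was produced.
    static String extractText(Path image, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr-sketch");
        Path outputBase = workDir.resolve("result");
        // Tesseract appends ".txt" to the output base name.
        Path outputFile = workDir.resolve("result.txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Guard against a hung external process.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract did not finish within "
                        + timeoutSeconds + " seconds");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with code "
                        + process.exitValue());
            }
            return Files.exists(outputFile)
                    ? Files.readString(outputFile, StandardCharsets.UTF_8)
                    : "";
        } finally {
            // Assumes result.txt is the only file written to the work directory.
            Files.deleteIfExists(outputFile);
            Files.deleteIfExists(workDir);
        }
    }
}
```

A caller would pass the path of an image file and a timeout, for example `TesseractSketch.extractText(Path.of("scan.png"), 120)`. Keeping command construction in its own method makes that part checkable without the binary installed, and the timeout matters because an external process can stall on large or malformed images.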

        Attachments

        1. Petr_tika-config.xml (1 kB, Petr Vas)
        2. TesseractOCR_Tyler_v2.patch (18 kB, Tyler Palsulich)
        3. TesseractOCR_Tyler_v3.patch (20 kB, Tyler Palsulich)
        4. TesseractOCR_Tyler_v4.patch (20 kB, Tyler Palsulich)
        5. TesseractOCR_Tyler.patch (17 kB, Tyler Palsulich)
        6. TesseractOCRParser.patch (25 kB, Luis Filipe Nassif)
        7. TesseractOCRParser.patch (26 kB, Luis Filipe Nassif)
        8. testOCR.docx (61 kB, Grant Ingersoll)
        9. testOCR.pdf (41 kB, Grant Ingersoll)
        10. testOCR.pptx (77 kB, Grant Ingersoll)
        11. TIKA-93.patch (40 kB, Grant Ingersoll)
        12. TIKA-93.patch (38 kB, Grant Ingersoll)
        13. TIKA-93.patch (28 kB, Grant Ingersoll)
        14. TIKA-93.patch (21 kB, Grant Ingersoll)

              People

              • Assignee: Chris A. Mattmann (chrismattmann)
              • Reporter: Jukka Zitting (jukkaz)
              • Votes: 12
              • Watchers: 31
