Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: parser
    • Labels:

      Description

      I don't know of any decent open-source, pure-Java OCR libraries, but there are command-line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that Tika could invoke to extract text content (where available) from image files.
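
      The idea above can be sketched as follows. This is only an illustration, not the parser that was eventually committed: it assumes a `tesseract` binary is on the PATH and uses the standard `tesseract <image> <outputbase> -l <lang>` invocation, which writes the recognized text to `<outputbase>.txt`. The class and method names (`TesseractCliSketch`, `buildCommand`, `ocr`) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: shell out to the Tesseract CLI and read back its text output.
class TesseractCliSketch {

    // Builds the argument list: tesseract <image> <outputBase> -l <language>
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs Tesseract on one image and returns the recognized text.
    // Assumes the "tesseract" executable is installed and on the PATH.
    static String ocr(Path image, String language) throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr-sketch");
        Path outputBase = workDir.resolve("out");
        Path textFile = workDir.resolve("out.txt"); // Tesseract appends ".txt" to the output base
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .inheritIO() // pass Tesseract's console messages through so the child never blocks on a full pipe
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up the temporary output file and directory.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(workDir);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TesseractCliSketch <image-file> [language]");
            return;
        }
        String language = args.length > 1 ? args[1] : "eng";
        System.out.print(ocr(Paths.get(args[0]), language));
    }
}
```

      A real parser would also need a timeout on the external process and a graceful fallback when Tesseract is not installed; both are left out here for brevity.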

      Attachments

      1. Petr_tika-config.xml (1 kB, Petr Vas)
      2. TesseractOCR_Tyler_v2.patch (18 kB, Tyler Palsulich)
      3. TesseractOCR_Tyler_v3.patch (20 kB, Tyler Palsulich)
      4. TesseractOCR_Tyler_v4.patch (20 kB, Tyler Palsulich)
      5. TesseractOCR_Tyler.patch (17 kB, Tyler Palsulich)
      6. TesseractOCRParser.patch (25 kB, Luis Filipe Nassif)
      7. TesseractOCRParser.patch (26 kB, Luis Filipe Nassif)
      8. testOCR.docx (61 kB, Grant Ingersoll)
      9. testOCR.pdf (41 kB, Grant Ingersoll)
      10. testOCR.pptx (77 kB, Grant Ingersoll)
      11. TIKA-93.patch (40 kB, Grant Ingersoll)
      12. TIKA-93.patch (38 kB, Grant Ingersoll)
      13. TIKA-93.patch (28 kB, Grant Ingersoll)
      14. TIKA-93.patch (21 kB, Grant Ingersoll)

            People

            • Assignee: Chris A. Mattmann
            • Reporter: Jukka Zitting
            • Votes: 12
            • Watchers: 31