Tika / TIKA-2021

Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14
    • Component/s: ocr, parser
    • Labels:

      Description

      The Tesseract OCR parser works well on images containing English text. However, there is room for improvement on alphanumeric and numeric content, which requires training Tesseract on relevant examples so that it extracts such content from images more accurately. This customization can help with extracting serial numbers from images of counterfeit electronics, and with other applications that focus on atypical textual content.
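      As a rough illustration of restricting OCR output to numeric content, the sketch below builds a Tesseract command line that limits recognition to digits and adds a post-filter for the recognized text. It is not the training approach this issue describes; it uses Tesseract's `tessedit_char_whitelist` config variable, a simpler related technique. The function names and the image path are hypothetical, and whitelist behaviour may vary between Tesseract versions and engine modes.

```python
import string


def build_tesseract_command(image_path, lang="eng", whitelist=string.digits):
    """Return an argv list for the tesseract CLI, restricted to `whitelist`.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    return [
        "tesseract",
        image_path,
        "stdout",  # output base "stdout" sends recognized text to standard output
        "-l", lang,  # language / traineddata name
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def keep_allowed(text, allowed=string.digits):
    """Post-filter: drop every character of OCR output that is not in `allowed`."""
    return "".join(ch for ch in text if ch in allowed)
```

      Assuming the `tesseract` binary is installed, the command could be run with `subprocess.run(build_tesseract_command("part.png"), capture_output=True, text=True)` and its `stdout` passed through `keep_allowed`. A custom-trained model, as proposed in this issue, would be selected by passing its traineddata name as `lang`.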

              People

              • Assignee: Chris A. Mattmann (chrismattmann)
              • Reporter: Zarana Parekh
              • Votes: 0
              • Watchers: 5
