[TIKA-2021] Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.14
Component/s: ocr, parser
Labels:
- memex

Description

The Tesseract OCR parser works well on images containing English text. Its accuracy on alphanumeric and numeric content can be improved, however, by training Tesseract on relevant examples so that it extracts such content from images more reliably. This customization can help with extracting serial numbers from images of counterfeit electronics, and with other applications that focus on atypical textual content.
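
As a rough illustration of how a custom-trained model might be selected at OCR time, the sketch below builds a Tesseract command line. It is not the change made for this issue. The language name `serialnum`, the image name `chip.png`, and the helper `build_tesseract_command` are hypothetical. `-l` selects the traineddata to use, and `tessedit_char_whitelist` is a Tesseract config variable that restricts recognized characters. Whether the whitelist is honored depends on the Tesseract version and engine.

```python
def build_tesseract_command(image_path, lang="eng", whitelist=None):
    """Return the argv list for a Tesseract CLI run that writes text to stdout.

    `lang` names the traineddata to load (a custom-trained model would go here);
    `whitelist`, if given, limits recognition to the listed characters.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if whitelist:
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


# Hypothetical usage: a model trained on serial numbers, digits only.
cmd = build_tesseract_command("chip.png", lang="serialnum", whitelist="0123456789")
print(cmd)
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract and the named traineddata are installed.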

Issue Links

links to

GitHub Pull Request #126

People

Assignee: Chris A. Mattmann

Reporter: Zarana Parekh

Votes: 0

Watchers: 5

Dates

Created: 24/Jun/16 21:07

Updated: 08/Jul/16 22:48

Resolved: 07/Jul/16 06:39