
Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14
    • Component/s: ocr, parser
    • Labels:

      Description

      The Tesseract OCR parser works well on images containing English text. However, there is room for improvement on alphanumeric and numeric content, which requires training Tesseract on the relevant cases in order to extract content from such images more accurately. This customization can help with extracting serial numbers from images of counterfeit electronics, and with other applications that focus on atypical textual content.
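For readers who want a quick way to bias Tesseract toward numeric output, here is a minimal sketch. It is not the approach taken in the patch for this issue and is shown only for illustration. It assembles (but does not run) a Tesseract 3.x command line that restricts recognition to a set of characters via the `tessedit_char_whitelist` variable. The class name, method name, and file names are hypothetical. The `-psm 7` flag (treat the image as a single text line) applies to Tesseract 3.x; newer releases spell it `--psm`.

```java
import java.util.ArrayList;
import java.util.List;

public class DigitsOnlyTesseractCommand {

    /**
     * Builds, without executing, a Tesseract 3.x command line that limits
     * recognition to the characters in allowedChars.
     * All argument values are supplied by the caller; nothing here is Tika API.
     */
    public static List<String> build(String tesseractExe, String inputImage,
                                     String outputBase, String allowedChars) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractExe);
        cmd.add(inputImage);   // image to recognize
        cmd.add(outputBase);   // Tesseract appends ".txt" to this base name
        cmd.add("-psm");       // page segmentation mode (Tesseract 3.x spelling)
        cmd.add("7");          // 7 = treat the image as a single text line
        cmd.add("-c");         // set a config variable on the command line
        cmd.add("tessedit_char_whitelist=" + allowedChars);
        return cmd;
    }

    public static void main(String[] args) {
        // Hypothetical file names, digits-only whitelist.
        List<String> cmd = build("tesseract", "serial.png", "serial", "0123456789");
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be passed to `ProcessBuilder` if Tesseract is installed. For alphanumeric serial numbers, the whitelist would be extended with the relevant letters.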

          Activity

          Zarana Parekh Zarana Parekh added a comment -

          Thank you Tim Allison and Chris A. Mattmann for the feedback. I will add the updates in a new issue.

          chrismattmann Chris A. Mattmann added a comment -

          Tim this would be great. Zarana Parekh can you open up a new issue with the updates Tim Allison suggests? Thanks!

          tallison@mitre.org Tim Allison added a comment - - edited

          Any chance you could make the check for Python static and remove the e.printStackTrace() calls? Thank you!

          Wait...it would also be good to apply this to 2.x
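
As an illustration of what this review suggestion could look like, here is a minimal sketch. It is not the code that was committed. It probes for a Python executable once, caches the result in a static field, and logs failures instead of calling `e.printStackTrace()`. The class name `PythonCheck`, the method names, and the use of `java.util.logging` are assumptions made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PythonCheck {

    private static final Logger LOG = Logger.getLogger(PythonCheck.class.getName());

    // Evaluated once when the class is initialized, then reused for every parse.
    private static final boolean HAS_PYTHON = probe("python");

    /** Returns true if the executable starts and exits with status 0. */
    public static boolean probe(String executable) {
        try {
            Process p = new ProcessBuilder(executable, "--version")
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            // Drain the output so the child process cannot block on a full pipe.
            try (InputStream in = p.getInputStream()) {
                while (in.read() != -1) {
                    // discard
                }
            }
            return p.waitFor() == 0;
        } catch (IOException e) {
            // Logged rather than printed with e.printStackTrace().
            LOG.log(Level.FINE, "Could not start " + executable, e);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    /** Cached result of the one-time probe. */
    public static boolean hasPython() {
        return HAS_PYTHON;
    }
}
```

Caching the result avoids spawning a process for every parsed image, at the cost of not noticing a Python installation that appears after the class has been loaded.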

          hudson Hudson added a comment -

          SUCCESS: Integrated in Tika-trunk #1079 (See https://builds.apache.org/job/Tika-trunk/1079/)

          fix for TIKA-2021 contributed by Zarana Parekh (zaranaparekh17: rev 48b27d219f791ee14f1e0ffa18e4e80583f3df54)
          • tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
          • tika-bundle/pom.xml
          • tika-parsers/pom.xml
          • tika-parsers/src/main/resources/org/apache/tika/parser/ocr/rotation.py
          • tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
          • tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties

          fix for TIKA-2021 contributed by Zarana Parekh (zaranaparekh17: rev de84d71b145045792b8a3bd175634251623188dc)
          • tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
          • tika-bundle/pom.xml

          Record TIKA-2021 change. (mattmann: rev 636060eb6c4a2ea4960ccc045f8bc5ae159c9117)
          • CHANGES.txt
          chrismattmann Chris A. Mattmann added a comment -

          Great work Zarana Parekh and thanks for the great review Lewis John McGibbney!

          LMC-053601:tika1.13 mattmann$ git commit -m "Fix to work if ImageMagick isn't present. Fix forbidden APIs."
          [master 6f16480] Fix to work if ImageMagick isn't present. Fix forbidden APIs.
           2 files changed, 3 insertions(+), 3 deletions(-)
          LMC-053601:tika1.13 mattmann$ git push -u origin master
          Counting objects: 267, done.
          Delta compression using up to 8 threads.
          Compressing objects: 100% (128/128), done.
          Writing objects: 100% (267/267), 29.68 KiB | 0 bytes/s, done.
          Total 267 (delta 93), reused 207 (delta 62)
          remote: tika git commit: Fix to work if ImageMagick isn't present. Fix forbidden APIs.
          remote: tika git commit: Merge branch 'TIKA-2021' of https://github.com/Zarana-Parekh/tika
          remote: tika git commit: fix orthogonal changes
          remote: tika git commit: formatting changes
          remote: tika git commit: added check for non-UNIX OS
          remote: tika git commit: formatting changes
          remote: tika git commit: rebasing pom.xml for tika-bundle
          remote: tika git commit: formatting chanages
          remote: tika git commit: updated config file
          remote: tika git commit: updated scope in pom.xml
          remote: tika git commit: updated Javadoc for Tesseract config and parser
          remote: tika git commit: updated property name, removed orthogonal changes
          remote: tika git commit: added validation tests for new processing features
          remote: tika git commit: optional processing enabled
          remote: tika git commit: fix for TIKA-2021 contributed by Zarana Parekh
          remote: tika git commit: fix for TIKA-2021 contributed by Zarana Parekh
          To https://git-wip-us.apache.org/repos/asf/tika.git
             95b2cd1..6f16480  master -> master
          Branch master set up to track remote branch master from origin.
          LMC-053601:tika1.13 mattmann$ 
          
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/126

          githubbot ASF GitHub Bot added a comment -

          GitHub user Zarana-Parekh opened a pull request:

          https://github.com/apache/tika/pull/126

          fix for TIKA-2021 contributed by Zarana Parekh

          Improving accuracy of Tesseract for better extraction of numeric and alphanumeric text from images.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/Zarana-Parekh/tika TIKA-2021

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/126.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #126


          commit 48b27d219f791ee14f1e0ffa18e4e80583f3df54
          Author: Zarana Parekh <zaranaparekh17@gmail.com>
          Date: 2016-06-25T01:53:00Z

          fix for TIKA-2021 contributed by Zarana Parekh



            People

            • Assignee: chrismattmann Chris A. Mattmann
            • Reporter: Zarana Parekh
            • Votes: 0
            • Watchers: 5
