Tika / TIKA-2190

Add "preserve_interword_spaces" option of tesseract

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: ocr
    • Labels:
      None

      Description

      This option preserves inter-word spaces in the TXT output type so that layout or context can be inferred during further parsing.

      To enable: -c preserve_interword_spaces=1
      To disable: -c preserve_interword_spaces=0, or simply omit the option
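
As a rough illustration of the flag described above, the following sketch assembles a Tesseract command line with or without the option. The class name `TesseractCommandSketch`, the method `buildCommand`, and the file names are hypothetical and chosen only for this example; it assumes a `tesseract` binary is on the PATH and is not meant to reflect how Tika's TesseractOCRParser builds its command.

```java
import java.util.ArrayList;
import java.util.List;

class TesseractCommandSketch {

    // Hypothetical helper: builds
    //   tesseract <image> <outputBase> [-c preserve_interword_spaces=1]
    // When preserveSpaces is false the option is simply left out,
    // which leaves Tesseract at its default behaviour.
    static List<String> buildCommand(String image, String outputBase, boolean preserveSpaces) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        if (preserveSpaces) {
            cmd.add("-c");
            cmd.add("preserve_interword_spaces=1");
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Only prints the command; pass the list to ProcessBuilder to actually run it.
        System.out.println(String.join(" ", buildCommand("page.png", "page", true)));
    }
}
```

The returned list can be handed to `new ProcessBuilder(cmd)` if the command should actually be executed.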

        Activity

        dasbipulkumar Bipul Kumar added a comment -

        Please review. I will raise a pull request for this.

        dasbipulkumar Bipul Kumar added a comment -

        Please provide the details of test-cases for this.

        Regards
        Bipul
        Imaginea Labs

        tallison@mitre.org Tim Allison added a comment -

        Thank you!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1164 (See https://builds.apache.org/job/Tika-trunk/1164/)
        TIKA-2190 – add configurability for preserve interword spacing (tallison: rev ae44b9e507dbb11b9b9f5c57cf342b47966ffb66)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        • (edit) tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties
        • (edit) CHANGES.txt
        • (add) tika-parsers/src/test/resources/test-documents/testOCR_spacing.png
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
        dasbipulkumar Bipul Kumar added a comment -

        Hi Tim,

        If it is okay with you, may I take this up? I want to start contributing,
        and I can take this on.

        Regards
        Bipul

        tallison@mitre.org Tim Allison added a comment -

        Doh! Sorry. Already done. Any other areas for improvement?

        dasbipulkumar Bipul Kumar added a comment -

        I will let you know if I find anything while working more on this.

        tallison@mitre.org Tim Allison added a comment -

        Also...I forgot to mention that you may want to check out the hocr option (if you haven't already)...this outputs coordinates and can help with maintaining structure.

        dasbipulkumar Bipul Kumar added a comment -

        Thanks, Tim. I know that option and am using it, but the issue with hOCR is that the y-coordinates of words on the same line sometimes do not match. The TXT format can therefore be used as extra information instead of writing code to predict which words are on the same line.

        Moreover, many users can simply use the TXT format with spacing information for simple, straightforward use cases instead of writing code to parse hOCR output. It is simple and user friendly.
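
To make the trade-off above concrete, here is a minimal sketch of the kind of code one would otherwise have to write to decide which hOCR words share a line. It is an illustrative heuristic only, not part of Tika or Tesseract: the class `HocrLineGrouper`, the `Word` type, and the `tolerance` parameter are all hypothetical, and the bounding boxes are assumed to have already been parsed from the hOCR `bbox x0 y0 x1 y1` values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HocrLineGrouper {

    /** A word with its bounding box, as it might be read from an hOCR "bbox x0 y0 x1 y1". */
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }

        double centerY() {
            return (y0 + y1) / 2.0;
        }
    }

    /**
     * Groups words into text lines. Words are visited top to bottom; a word joins the
     * current line if its vertical centre is within `tolerance` pixels of the centre
     * of the first word on that line, otherwise it starts a new line.
     */
    static List<String> groupIntoLines(List<Word> words, double tolerance) {
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble(Word::centerY));

        List<List<Word>> lines = new ArrayList<>();
        double lineCenter = 0;
        for (Word w : sorted) {
            if (lines.isEmpty() || Math.abs(w.centerY() - lineCenter) > tolerance) {
                lines.add(new ArrayList<>());
                lineCenter = w.centerY();
            }
            lines.get(lines.size() - 1).add(w);
        }

        List<String> result = new ArrayList<>();
        for (List<Word> line : lines) {
            // Restore left-to-right reading order within the line.
            line.sort(Comparator.comparingInt(w -> w.x0));
            StringBuilder sb = new StringBuilder();
            for (Word w : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(w.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

A fixed tolerance like this is fragile for skewed scans or mixed font sizes, which is the kind of guesswork the TXT output with preserved spacing avoids.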

        tallison@mitre.org Tim Allison added a comment -

        Got it. Thank you, and thank you for opening this issue.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #190 (See https://builds.apache.org/job/tika-2.x/190/)
        TIKA-2190 – Add test file for maintain spacing (tallison: rev f1a541378a89046b7e85e2e87c0a18013f414cd3)

        • (add) tika-test-resources/src/test/resources/test-documents/testOCR_spacing.png

          People

          • Assignee: tallison@mitre.org Tim Allison
          • Reporter: dasbipulkumar Bipul Kumar
          • Votes: 2
          • Watchers: 3
