Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15
    • Component/s: ocr
    • Labels: None

      Description

      From https://github.com/apache/tika/pull/177 by Rafael Ferreira

      Extend the supported range of page segmentation mode (PSM) values up to 13, matching the modes offered by modern versions of Tesseract.

      $ tesseract --version
      tesseract 3.05.00
       leptonica-1.74.1
        libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.8
      
      $ tesseract --help-psm
      Page segmentation modes:
        0    Orientation and script detection (OSD) only.
        1    Automatic page segmentation with OSD.
        2    Automatic page segmentation, but no OSD, or OCR.
        3    Fully automatic page segmentation, but no OSD. (Default)
        4    Assume a single column of text of variable sizes.
        5    Assume a single uniform block of vertically aligned text.
        6    Assume a single uniform block of text.
        7    Treat the image as a single text line.
        8    Treat the image as a single word.
        9    Treat the image as a single word in a circle.
       10    Treat the image as a single character.
       11    Sparse text. Find as much text as possible in no particular order.
       12    Sparse text with OSD.
       13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
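
The listing above implies that a PSM value is acceptable when it is an integer from 0 through 13. A minimal sketch of such a range check follows. The class `PsmSketch` and its methods are hypothetical names chosen for illustration; this is not the code from `TesseractOCRConfig.java`, which may validate the value differently.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: validates a Tesseract page segmentation mode (PSM)
// given as a string. Not the actual Tika implementation.
public class PsmSketch {

    // Accepts exactly "0".."9" or "10".."13"; leading zeros and signs are rejected.
    private static final Pattern VALID_PSM = Pattern.compile("[0-9]|1[0-3]");

    // Returns true only when the whole string is a PSM in the range 0..13.
    public static boolean isValidPsm(String value) {
        return value != null && VALID_PSM.matcher(value).matches();
    }

    // Returns the value unchanged, or throws if it is outside 0..13.
    public static String requireValidPsm(String value) {
        if (!isValidPsm(value)) {
            throw new IllegalArgumentException(
                    "Invalid page segmentation mode: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        // Print the validation result for a few sample inputs.
        for (String candidate : new String[] {"0", "3", "13", "14", "-1", "abc"}) {
            System.out.println(candidate + " -> " + isValidPsm(candidate));
        }
    }
}
```

Keeping the value as a string mirrors how it would be passed on a command line; a numeric range check on a parsed `int` would work equally well under the same assumption about the valid range.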
      

        Activity

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1261 (See https://builds.apache.org/job/Tika-trunk/1261/)
        TIKA-2357: Increased support for Tesseract PSM up to 13 from Rafael (david: https://github.com/apache/tika/commit/0aaa1215fd11632c349e9bdebac9829578276cb1)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRConfigTest.java
        • (edit) CHANGES.txt
        davemeikle Dave Meikle added a comment -

        Merged in 0aaa121. Thanks Rafael!


          People

          • Assignee: davemeikle Dave Meikle
          • Reporter: davemeikle Dave Meikle
          • Votes: 0
          • Watchers: 2
