Tika / TIKA-2357

Allow Tesseract PSM up to 13


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15
    • Component/s: ocr
    • Labels: None

    Description

      From https://github.com/apache/tika/pull/177 by Rafael Ferreira

      Extend Tika's OCR support to accept Tesseract page segmentation mode (PSM) values up to 13, as offered by modern versions of Tesseract.

      $ tesseract --version
      tesseract 3.05.00
       leptonica-1.74.1
        libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.8
      
      $ tesseract --help-psm
      Page segmentation modes:
        0    Orientation and script detection (OSD) only.
        1    Automatic page segmentation with OSD.
        2    Automatic page segmentation, but no OSD, or OCR.
        3    Fully automatic page segmentation, but no OSD. (Default)
        4    Assume a single column of text of variable sizes.
        5    Assume a single uniform block of vertically aligned text.
        6    Assume a single uniform block of text.
        7    Treat the image as a single text line.
        8    Treat the image as a single word.
        9    Treat the image as a single word in a circle.
       10    Treat the image as a single character.
       11    Sparse text. Find as much text as possible in no particular order.
       12    Sparse text with OSD.
       13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
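
The range check implied by the listing above can be sketched in Java as follows. This is a minimal illustration only, not the change made in the pull request: `PsmValidator`, `isValid`, and `requireValid` are hypothetical names that do not come from Tika, and the accepted range 0 to 13 is taken from the `--help-psm` output for Tesseract 3.05.00 shown above.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of Tika): checks that a page segmentation
 * mode is one of the values 0..13 listed by `tesseract --help-psm`.
 */
final class PsmValidator {
    // A single digit 0-9, or 10-13. The upper bound of 13 is an assumption
    // based on the Tesseract 3.05.00 listing; older builds may accept fewer modes.
    private static final Pattern VALID_PSM = Pattern.compile("[0-9]|1[0-3]");

    private PsmValidator() {}

    /** Returns true if psm is a decimal string in the range 0..13. */
    static boolean isValid(String psm) {
        return psm != null && VALID_PSM.matcher(psm).matches();
    }

    /** Returns psm unchanged if valid, otherwise throws IllegalArgumentException. */
    static String requireValid(String psm) {
        if (!isValid(psm)) {
            throw new IllegalArgumentException("Invalid page segmentation mode: " + psm);
        }
        return psm;
    }
}
```

With this sketch, `PsmValidator.requireValid("13")` returns `"13"`, while `PsmValidator.requireValid("14")` throws an `IllegalArgumentException`. Validating against a fixed pattern keeps the check simple, at the cost of needing an update whenever Tesseract adds modes.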
      

            People

              Assignee: Dave Meikle (davemeikle)
              Reporter: Dave Meikle (davemeikle)