Tika / TIKA-2444

JP2 codestream files not parsed

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16
    • Fix Version/s: None
    • Component/s: parser

    Description

      We've come across some embedded files in the wild that are detected by Tika as image/x-jp2-codestream. The identification is correct according to a description of the format [1].

      However, no Parser implementation declares support for this format.

      It would make sense to declare support for this format in the Tesseract OCR parser. However, the parser would need to contain functionality that either:

      1) wraps the codestream in a JP2 container;
      2) or transcodes the image to PNG.

      This is because while Tesseract supports JP2 (via Leptonica), it doesn't support the raw codestream as a file.
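
      As a rough illustration of option 1, the sketch below wraps a raw codestream in a minimal JP2 container (signature, ftyp, jp2h and jp2c boxes), reading the image size and bit depth from the codestream's SIZ marker. It is only a sketch under stated assumptions: the class and method names are hypothetical and not part of Tika, the colour space is guessed from the component count (so it is flagged as unknown in the header), and codestreams with mixed bit depths are rejected rather than given a bpcc box.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: wraps a raw JPEG 2000 codestream
// (SOC + SIZ + ...) in a minimal JP2 container.
class Jp2CodestreamWrapper {

    static byte[] concat(byte[]... parts) {
        int len = 0;
        for (byte[] p : parts) len += p.length;
        ByteBuffer b = ByteBuffer.allocate(len);
        for (byte[] p : parts) b.put(p);
        return b.array();
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // A JP2 box is a 4-byte big-endian length (header included),
    // a 4-byte type, then the payload.
    static byte[] box(String type, byte[]... parts) {
        byte[] payload = concat(parts);
        byte[] length = ByteBuffer.allocate(4).putInt(8 + payload.length).array();
        return concat(length, ascii(type), payload);
    }

    static byte[] wrap(byte[] cs) {
        ByteBuffer in = ByteBuffer.wrap(cs); // big-endian by default
        // A raw codestream starts with SOC (0xFF4F) followed by SIZ (0xFF51).
        if (cs.length < 45 || in.getShort(0) != (short) 0xFF4F
                || in.getShort(2) != (short) 0xFF51) {
            throw new IllegalArgumentException("not a raw JPEG 2000 codestream");
        }
        int width = in.getInt(8) - in.getInt(16);   // Xsiz - XOsiz
        int height = in.getInt(12) - in.getInt(20); // Ysiz - YOsiz
        int csiz = in.getShort(40) & 0xFFFF;        // number of components
        if (csiz == 0 || cs.length < 42 + 3 * csiz) {
            throw new IllegalArgumentException("truncated or invalid SIZ marker");
        }
        byte depth = cs[42];                        // Ssiz of component 0
        for (int i = 1; i < csiz; i++) {
            if (cs[42 + 3 * i] != depth) {
                throw new IllegalArgumentException("mixed bit depths need a bpcc box");
            }
        }
        // ihdr: height, width, components, bit depth, compression type 7,
        // colourspace-unknown flag (1, because the colour space is a guess), no IPR.
        byte[] ihdr = box("ihdr", ByteBuffer.allocate(14)
                .putInt(height).putInt(width).putShort((short) csiz)
                .put(depth).put((byte) 7).put((byte) 1).put((byte) 0).array());
        // colr: enumerated method; ASSUMPTION: 3+ components are sRGB (16),
        // anything else is greyscale (17).
        byte[] colr = box("colr", ByteBuffer.allocate(7)
                .put((byte) 1).put((byte) 0).put((byte) 0)
                .putInt(csiz >= 3 ? 16 : 17).array());
        return concat(
                box("jP  ", new byte[] {0x0D, 0x0A, (byte) 0x87, 0x0A}),
                box("ftyp", ascii("jp2 "), new byte[4], ascii("jp2 ")),
                box("jp2h", ihdr, colr),
                box("jp2c", cs));
    }
}
```

      The wrapped bytes could then be written to a temporary .jp2 file and handed to Tesseract in place of the raw codestream. Option 2 (transcoding to PNG) would instead need a JPEG 2000 decoder, which the Java standard library does not provide.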

      [1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream

      Attachments

        1. balloon.j2c
          614 kB
          Matthew Caruana Galizia

        Activity

          People

            Assignee: Unassigned
            Reporter: Matthew Caruana Galizia
            Votes: 0
            Watchers: 2