Tika / TIKA-2444

JP2 codestream files not parsed

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16
    • Fix Version/s: None
    • Component/s: parser
    • Labels:

      Description

      We've come across some embedded files in the wild that are detected by Tika as image/x-jp2-codestream. The identification is correct according to a description of the format [1].

      However, no Parser implementation declares support for this format.

      It would make sense to declare support for this format in the Tesseract OCR parser. However, the parser would need to contain functionality that either:

      1) wraps the codestream in a JP2 container;
      2) or transcodes the image to PNG.

      This is because while Tesseract supports JP2 (via Leptonica), it doesn't support the raw codestream as a file.
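
      Option 1 could look roughly like the sketch below. It is an illustration only, not existing Tika code: the Jp2Wrapper class and its wrap method are hypothetical names. It reads the image size, component count and bit depth from the SIZ marker segment at the start of the codestream and prepends the minimal JP2 boxes (signature, ftyp, jp2h with ihdr and colr, then jp2c). It assumes all components share one bit depth, and it guesses the colourspace from the component count (1 = greyscale, 3 = sRGB), flagging it as unknown in ihdr. Whether Leptonica accepts the result for every input is untested.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): wraps a raw JPEG 2000 codestream in a JP2 container. */
final class Jp2Wrapper {

    private static final int SOC = 0xFF4F; // start-of-codestream marker
    private static final int SIZ = 0xFF51; // image-and-tile-size marker
    private static final int HEADER_BYTES = 12 + 20 + 8 + 22 + 15 + 8; // all boxes before the codestream

    private static byte[] type(String fourChars) {
        return fourChars.getBytes(StandardCharsets.US_ASCII);
    }

    static byte[] wrap(byte[] cs) {
        if (cs.length < 45) {
            throw new IllegalArgumentException("Too short to hold SOC and SIZ");
        }
        ByteBuffer in = ByteBuffer.wrap(cs); // ByteBuffer is big-endian by default, like JPEG 2000
        if ((in.getShort(0) & 0xFFFF) != SOC || (in.getShort(2) & 0xFFFF) != SIZ) {
            throw new IllegalArgumentException("Not a JPEG 2000 codestream");
        }
        // SIZ layout: Xsiz@8, Ysiz@12, XOsiz@16, YOsiz@20, Csiz@40, then 3 bytes per component.
        long width = (in.getInt(8) & 0xFFFFFFFFL) - (in.getInt(16) & 0xFFFFFFFFL);
        long height = (in.getInt(12) & 0xFFFFFFFFL) - (in.getInt(20) & 0xFFFFFFFFL);
        int components = in.getShort(40) & 0xFFFF;
        if (width <= 0 || height <= 0 || cs.length < 42 + 3 * components) {
            throw new IllegalArgumentException("Invalid SIZ segment");
        }
        int depth = in.get(42) & 0xFF; // Ssiz uses the same encoding as the ihdr BPC field
        for (int c = 1; c < components; c++) {
            if ((in.get(42 + 3 * c) & 0xFF) != depth) {
                throw new IllegalArgumentException("Mixed bit depths would need a bpcc box");
            }
        }
        int colourspace; // assumption: guessed from the component count
        if (components == 1) {
            colourspace = 17; // enumerated greyscale
        } else if (components == 3) {
            colourspace = 16; // enumerated sRGB
        } else {
            throw new IllegalArgumentException("Cannot guess colourspace for " + components + " components");
        }

        ByteBuffer out = ByteBuffer.allocate(Math.addExact(HEADER_BYTES, cs.length));
        // Signature box
        out.putInt(12).put(type("jP  ")).putInt(0x0D0A870A);
        // File type box: brand "jp2 ", minor version 0, one compatible brand
        out.putInt(20).put(type("ftyp")).put(type("jp2 ")).putInt(0).put(type("jp2 "));
        // JP2 header superbox containing ihdr and colr
        out.putInt(8 + 22 + 15).put(type("jp2h"));
        out.putInt(22).put(type("ihdr"))
           .putInt((int) height).putInt((int) width)
           .putShort((short) components)
           .put((byte) depth)
           .put((byte) 7)  // compression type, always 7
           .put((byte) 1)  // UnkC = 1: colourspace not actually known
           .put((byte) 0); // no intellectual property box
        out.putInt(15).put(type("colr"))
           .put((byte) 1)  // METH = 1: enumerated colourspace
           .put((byte) 0).put((byte) 0)
           .putInt(colourspace);
        // Contiguous codestream box holding the original bytes unchanged
        out.putInt(Math.addExact(8, cs.length)).put(type("jp2c")).put(cs);
        return out.array();
    }
}
```

      A parser could write the returned bytes to a temporary .jp2 file and hand that to Tesseract. Option 2 (transcoding to PNG) would instead need a JPEG 2000 decoder on the classpath, which this sketch avoids.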

      [1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream

        Attachments

        1. balloon.j2c
          614 kB
          Matthew Caruana Galizia

            People

            • Assignee: Unassigned
            • Reporter: Matthew Caruana Galizia (mcaruanagalizia)
            • Votes: 0
            • Watchers: 2