[TIKA-2444] JP2 codestream files not parsed - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.16
Fix Version/s: None
Component/s: parser
Labels:
- imageio
- images
- ocr

Description

We've come across some embedded files in the wild that are detected by Tika as image/x-jp2-codestream. The identification is correct according to a description of the format [1].
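
For reference, a raw codestream is recognisable from its first four bytes: the SOC marker (0xFF4F) followed directly by the SIZ marker (0xFF51). The sketch below shows that check in plain Java. The class and method names are hypothetical and this is not Tika's own detector code, only an illustration of what distinguishes these files from a boxed JP2.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika.
final class CodestreamSniffer {

    // SOC marker (FF 4F) immediately followed by the SIZ marker (FF 51).
    private static final byte[] MAGIC = {(byte) 0xFF, 0x4F, (byte) 0xFF, 0x51};

    /** Returns true if the given leading bytes look like a raw JPEG 2000 codestream. */
    static boolean isRawCodestream(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }
}
```

A boxed JP2 file starts with a 12-byte signature box instead, so the same check returns false for it.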

However, no Parser implementation declares support for this format.

It would make sense to declare support for this format in the Tesseract OCR parser. To do so, the parser would need functionality that either:

1) wraps the codestream in a JP2 container; or
2) transcodes the image to PNG.

This is because while Tesseract supports JP2 (via Leptonica), it doesn't support the raw codestream as a file.
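
To make option 1 concrete, here is a rough sketch of wrapping a codestream in a minimal JP2 container using only the JDK. It reads the image size and component count from the SIZ segment and writes the signature, `ftyp`, `jp2h` (`ihdr` plus `colr`) and `jp2c` boxes. The class name `Jp2Wrapper` is hypothetical, and the colourspace is an assumption guessed from the component count (greyscale for fewer than three components, sRGB otherwise), so `ihdr` flags it as unknown. It has not been tried against Tesseract or against the attached file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
final class Jp2Wrapper {

    private static byte[] concat(byte[]... parts) {
        int len = 0;
        for (byte[] p : parts) len += p.length;
        ByteBuffer b = ByteBuffer.allocate(len);
        for (byte[] p : parts) b.put(p);
        return b.array();
    }

    /** One box: 4-byte length (header included), 4-character type, payload. */
    private static byte[] box(String type, byte[] payload) {
        byte[] header = ByteBuffer.allocate(8)
                .putInt(8 + payload.length)
                .put(type.getBytes(StandardCharsets.US_ASCII))
                .array();
        return concat(header, payload);
    }

    static byte[] wrap(byte[] cs) {
        ByteBuffer in = ByteBuffer.wrap(cs); // big-endian by default
        if (cs.length < 45 || in.getShort(0) != (short) 0xFF4F
                || in.getShort(2) != (short) 0xFF51) {
            throw new IllegalArgumentException("expected SOC + SIZ at start of codestream");
        }
        // SIZ fields: Xsiz @8, Ysiz @12, XOsiz @16, YOsiz @20, Csiz @40, Ssiz @42 + 3*i
        int width = in.getInt(8) - in.getInt(16);
        int height = in.getInt(12) - in.getInt(20);
        int nc = in.getShort(40) & 0xFFFF;
        if (nc == 0 || cs.length < 42 + 3 * nc) {
            throw new IllegalArgumentException("truncated or invalid SIZ segment");
        }
        byte[] depths = new byte[nc];
        boolean uniform = true;
        for (int i = 0; i < nc; i++) {
            depths[i] = cs[42 + 3 * i]; // Ssiz uses the same encoding as BPC
            uniform &= depths[i] == depths[0];
        }

        byte[] ihdr = ByteBuffer.allocate(14)
                .putInt(height).putInt(width).putShort((short) nc)
                .put(uniform ? depths[0] : (byte) 0xFF) // 0xFF: depths vary, see bpcc
                .put((byte) 7)   // compression type, always 7
                .put((byte) 1)   // UnkC = 1: colourspace is a guess
                .put((byte) 0)   // no IPR box
                .array();
        byte[] colr = ByteBuffer.allocate(7)
                .put((byte) 1).put((byte) 0).put((byte) 0) // enumerated method
                .putInt(nc < 3 ? 17 : 16)                  // ASSUMPTION: 17 grey, 16 sRGB
                .array();
        byte[] header = uniform
                ? concat(box("ihdr", ihdr), box("colr", colr))
                : concat(box("ihdr", ihdr), box("bpcc", depths), box("colr", colr));

        byte[] brand = "jp2 ".getBytes(StandardCharsets.US_ASCII);
        return concat(
                box("jP  ", new byte[]{0x0D, 0x0A, (byte) 0x87, 0x0A}),
                box("ftyp", concat(brand, new byte[4], brand)),
                box("jp2h", header),
                box("jp2c", cs));
    }
}
```

The output of `Jp2Wrapper.wrap(bytes)` could then be written to a temporary `.jp2` file before handing it to Tesseract. Images that need a channel definition or palette box (for example, ones with an alpha channel) are outside what this sketch handles.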

[1] http://fileformats.archiveteam.org/wiki/JPEG_2000_codestream

Attachments

balloon.j2c
22/Aug/17 16:02
614 kB
Matthew Caruana Galizia

People

Assignee: Unassigned
Reporter: Matthew Caruana Galizia
Votes: 0
Watchers: 2

Dates

Created: 22/Aug/17 15:00
Updated: 30/Aug/17 17:37