[TIKA-3416] Extract logical images from PDFs - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

PDFs, bless their hearts, can store a logical image as hundreds or thousands of subimages that, when rendered, look like one image.

We currently let the user either render the page and run OCR on the rendered image, or extract inline images (and optionally run OCR on those extracted images).

Processing inline images, e.g. running OCR, can lead to surprising behavior and user tears (for those paying attention). This is not only because of the split-image issue, but also because PDFs can apply filters and other modifiers to an image, so the image as stored in the PDF may not look at all like the image as rendered on the page.

There has to be a happier medium, and the user should be able to get back the renderings in, e.g., the /unpack endpoint (see TIKA-3348).

It would be handy for some use cases to do the geometry to find bounding boxes for image components and then render those bounding boxes so that a human gets a "logical image" <hand_waving>most of the time</hand_waving>.
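A rough sketch of the geometry, assuming we already have one page-space rectangle per drawn subimage (for instance, derived from the transformation matrix in effect when each image is painted). The class and method names (`LogicalImageBoxes`, `merge`, `expand`) and the `gap` tolerance are hypothetical, not part of Tika or PDFBox; the loop just keeps unioning rectangles that overlap or nearly touch until nothing changes, which is quadratic or worse and only meant to show the idea.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an existing Tika/PDFBox class.
final class LogicalImageBoxes {

    /** Grows a rectangle by gap on every side so near-touching tiles count as overlapping. */
    static Rectangle2D expand(Rectangle2D r, double gap) {
        return new Rectangle2D.Double(r.getX() - gap, r.getY() - gap,
                r.getWidth() + 2 * gap, r.getHeight() + 2 * gap);
    }

    /**
     * Repeatedly unions any two boxes that overlap once one of them is
     * expanded by gap, until no pair can be merged. Each surviving box is
     * a candidate "logical image" region to render.
     */
    static List<Rectangle2D> merge(List<Rectangle2D> tiles, double gap) {
        List<Rectangle2D> boxes = new ArrayList<>(tiles);
        boolean merged = true;
        while (merged) {
            merged = false;
            outer:
            for (int i = 0; i < boxes.size(); i++) {
                for (int j = i + 1; j < boxes.size(); j++) {
                    if (expand(boxes.get(i), gap).intersects(boxes.get(j))) {
                        Rectangle2D union = boxes.get(i).createUnion(boxes.get(j));
                        boxes.remove(j);      // j > i, so index i is unaffected
                        boxes.set(i, union);
                        merged = true;
                        break outer;          // restart: the union may now reach other boxes
                    }
                }
            }
        }
        return boxes;
    }
}
```

With a 2x2 grid of abutting 10x10 tiles plus one distant tile, `merge(tiles, 1.0)` should collapse the grid into a single 20x20 box and leave the distant tile on its own. A spatial index or a sweep-line pass would be the obvious next step if pages really do carry thousands of tiles.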

There would have to be some heuristics for when to give up and just render the whole page, but I think we could do something that performs well enough. More importantly, I'm sure this is a solved problem... any recs for efficient algorithms for this?
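One possible shape for the "give up" heuristic, purely as a strawman: fall back to a full-page render when there are too many candidate regions, or when they already cover most of the page. `RenderFallback`, `shouldRenderWholePage`, and both thresholds are made-up names and parameters for illustration; the right cutoffs would need to be tuned on real PDFs.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

// Hypothetical helper, not an existing Tika/PDFBox class.
final class RenderFallback {

    /**
     * Returns true when rendering the whole page looks cheaper or safer than
     * rendering each candidate region separately.
     *
     * @param regions     candidate logical-image boxes in page coordinates
     * @param page        the page's bounding box
     * @param maxRegions  give up if there are more regions than this (assumed threshold)
     * @param maxCoverage give up if regions cover more than this fraction of the page (assumed threshold)
     */
    static boolean shouldRenderWholePage(List<Rectangle2D> regions, Rectangle2D page,
                                         int maxRegions, double maxCoverage) {
        if (regions.isEmpty()) {
            return false; // no images on the page, nothing to render
        }
        if (regions.size() > maxRegions) {
            return true;
        }
        double covered = 0;
        for (Rectangle2D r : regions) {
            // Clip to the page; regions that miss the page yield an empty rectangle.
            Rectangle2D clipped = r.createIntersection(page);
            if (!clipped.isEmpty()) {
                covered += clipped.getWidth() * clipped.getHeight();
            }
        }
        // Note: overlapping regions are counted twice, so this overestimates coverage.
        double pageArea = page.getWidth() * page.getHeight();
        return pageArea > 0 && covered / pageArea > maxCoverage;
    }
}
```

For a 612x792 page, a single 600x780 region with `maxCoverage` of 0.8 would trigger the fallback, while a single 100x100 region would not.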

What do you think?

Issue Links

is related to

TIKA-3348 Improve the workflow for extracting and returning images from PDFs and other containers using Tika Server. (Open)

People

Assignee: Unassigned

Reporter: Tim Allison

Votes: 0

Watchers: 1

Dates

Created: 24/May/21 16:45

Updated: 24/May/21 17:46