[TIKA-3384] Convert new transcribe package to a Parser - ASF JIRA

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

This is a proposal to convert the awesome new transcribe code from lewismc et al. into a parser, along the lines of the Tesseract OCR parser.

In 2.x, I inverted the call order from 1.x. The image parsers now check whether there is a parser that supports a pseudo mime type, such as image/ocr-jpeg; if there is, they apply that parser to the stream. We could do the same thing for the media files that the new transcription package supports.

Those who want only OCR/transcription can turn off the image parsers and then decorate the OCR parser, for example, so that it supports "image/jpeg"; that parser will then be called directly.
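To make the two modes concrete, here is a minimal, self-contained sketch of the dispatch idea. It is not Tika code: the class `PseudoMimeSketch`, its methods, and the pseudo mime `audio/transcribe-mpeg` are all hypothetical names chosen for illustration, and parsers are reduced to functions from a file name to a string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of pseudo-mime delegation; none of these names are Tika APIs.
class PseudoMimeSketch {
    // Registry from mime type (real or pseudo) to a toy "parser" (file name -> text).
    private final Map<String, Function<String, String>> parsers = new HashMap<>();

    void register(String mime, Function<String, String> parser) {
        parsers.put(mime, parser);
    }

    // Builds a pseudo mime such as image/ocr-jpeg from image/jpeg and the marker "ocr".
    static String pseudoMime(String mime, String marker) {
        int slash = mime.indexOf('/');
        return mime.substring(0, slash + 1) + marker + "-" + mime.substring(slash + 1);
    }

    // A format parser: emits its own output, then delegates to the parser
    // registered for the pseudo mime, if there is one.
    Function<String, String> delegatingParser(String mime, String marker) {
        return file -> {
            String out = "metadata(" + file + ")";
            Function<String, String> extra = parsers.get(pseudoMime(mime, marker));
            return extra == null ? out : out + " " + extra.apply(file);
        };
    }

    // Top-level dispatch: look up the parser registered for the detected mime type.
    String parse(String mime, String file) {
        Function<String, String> parser = parsers.get(mime);
        return parser == null ? "" : parser.apply(file);
    }

    public static void main(String[] args) {
        Function<String, String> transcriber = file -> "transcript(" + file + ")";

        // Mode 1: the audio parser runs first and delegates to the transcriber
        // through the (assumed) pseudo mime audio/transcribe-mpeg.
        PseudoMimeSketch both = new PseudoMimeSketch();
        both.register("audio/mpeg", both.delegatingParser("audio/mpeg", "transcribe"));
        both.register(pseudoMime("audio/mpeg", "transcribe"), transcriber);
        System.out.println(both.parse("audio/mpeg", "talk.mp3"));

        // Mode 2: no audio parser; the transcriber is registered for the real
        // mime type and is therefore called directly.
        PseudoMimeSketch only = new PseudoMimeSketch();
        only.register("audio/mpeg", transcriber);
        System.out.println(only.parse("audio/mpeg", "talk.mp3"));
    }
}
```

In the first mode the transcriber only runs when something is registered under the pseudo mime, so leaving it unregistered disables transcription without touching the audio parser. The second mode mirrors decorating the OCR parser with "image/jpeg".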

What do you think?

Issue Links

is related to: TIKA-94 Speech-to-text transcription (Resolved)

People

Assignee: Unassigned
Reporter: Tim Allison
Votes: 0
Watchers: 3

Dates

Created: 04/May/21 21:25
Updated: 18/May/21 14:47