TIKA-3384

Convert new transcribe package to a Parser


    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This is a proposal to convert the awesome new transcribe code by Lewis John McGibbney et al. into a parser along the lines of the Tesseract OCR parser.

      In 2.x, I inverted the call order from 1.x. The image parsers now check whether there is a parser that supports a pseudo mime type, such as image/ocr-jpeg; if there is, they apply that parser to the stream. We could do the same thing for the media files that the new transcription package supports.
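
      The lookup described above can be sketched as a toy model. This is not Tika code: the class, its methods, and the "audio/transcribe-mpeg" pseudo type are hypothetical names chosen for illustration, and the order of work inside the format parser is an assumption. Only image/ocr-jpeg comes from the description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model only: none of these types are Tika classes.
final class PseudoMimeDispatchSketch {

    // Handlers keyed by media type; pseudo types live in the same registry.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Builds a pseudo type by prefixing the subtype: image/jpeg -> image/ocr-jpeg.
    static String pseudoType(String mediaType, String prefix) {
        int slash = mediaType.indexOf('/');
        return mediaType.substring(0, slash + 1) + prefix + "-" + mediaType.substring(slash + 1);
    }

    // The format parser does its own work, then hands the same bytes to the
    // pseudo-type parser if one is registered (assumed order, for illustration).
    String parseWithDelegate(String mediaType, String prefix, byte[] data) {
        String result = "metadata(" + mediaType + ")";
        Function<byte[], String> delegate = parsers.get(pseudoType(mediaType, prefix));
        return delegate == null ? result : result + " + " + delegate.apply(data);
    }

    public static void main(String[] args) {
        PseudoMimeDispatchSketch registry = new PseudoMimeDispatchSketch();
        registry.register("image/ocr-jpeg", data -> "ocr-text");
        // "audio/transcribe-mpeg" is a hypothetical pseudo type for transcription.
        registry.register("audio/transcribe-mpeg", data -> "transcript");
        System.out.println(registry.parseWithDelegate("image/jpeg", "ocr", new byte[0]));
        System.out.println(registry.parseWithDelegate("audio/mpeg", "transcribe", new byte[0]));
        System.out.println(registry.parseWithDelegate("image/png", "ocr", new byte[0]));
    }
}
```

      In this sketch, a type with no registered pseudo-type parser (image/png here) simply gets the format parser's own output, which mirrors the "if there is" condition in the description.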

      Those who want only OCR/transcription can turn off the image parsers and then decorate the OCR parser so that it supports, for example, "image/jpeg"; that parser will then be called directly.
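
      A minimal sketch of that OCR-only setup, again as a toy model rather than Tika's configuration API: the Entry type, withSupportedTypes, and route are hypothetical names, and "turning off" a parser is modeled simply as leaving it out of the enabled list.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types are Tika classes.
final class DirectOcrRoutingSketch {

    // A named parser plus the media types it claims to support.
    static final class Entry {
        final String name;
        final Set<String> supportedTypes;

        Entry(String name, Set<String> supportedTypes) {
            this.name = name;
            this.supportedTypes = supportedTypes;
        }

        // "Decorating" here means: same parser, overridden supported types.
        Entry withSupportedTypes(Set<String> types) {
            return new Entry(name, types);
        }
    }

    // Builds a media type -> parser name table from the enabled parsers.
    static Map<String, String> route(List<Entry> enabled) {
        Map<String, String> table = new HashMap<>();
        for (Entry entry : enabled) {
            for (String type : entry.supportedTypes) {
                table.put(type, entry.name);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Entry ocr = new Entry("ocr", Set.of("image/ocr-jpeg"));
        // The image parser is left out of the enabled list ("turned off"),
        // and the OCR parser is decorated to claim image/jpeg itself.
        Map<String, String> table = route(List.of(ocr.withSupportedTypes(Set.of("image/jpeg"))));
        System.out.println(table.get("image/jpeg"));
    }
}
```

      With the image parser absent and the OCR entry decorated, image/jpeg resolves straight to the OCR parser and the pseudo type is no longer needed.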

      What do you think?


            People

            • Assignee: Unassigned
            • Reporter: Tim Allison (tallison)
