Tika / TIKA-1994

Integrate OCR with PDFParser

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14, 2.0.0
    • Component/s: None
    • Labels: None

    Description

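      Users can now run OCR on individual inline images embedded in PDFs, but only if the parser is configured correctly (inline image extraction must be enabled and an OCR engine such as Tesseract must be available).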

      There are some drawbacks: 1) the OCR'd text appears as an attachment, rather than as part of the page text, when using the RecursiveParserWrapper, and 2) text may be extracted more cleanly from the fully rendered page than from the individual images (this is still to be determined).
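
      To illustrate the first drawback, here is a minimal sketch that models RecursiveParserWrapper-style output as a list of plain maps: the container document first, then one entry per embedded item. The class name `OcrTextMerger`, the map-based model, and the `content` / `contentType` keys are assumptions made for illustration, not Tika's API. The point is only that a consumer currently has to stitch the OCR text from the image entries back onto the container text itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OcrTextMerger {

    // Hypothetical keys; a real metadata model would define its own.
    public static final String CONTENT = "content";
    public static final String CONTENT_TYPE = "contentType";

    /**
     * Appends the text of every embedded image entry to the container text.
     * docs.get(0) is assumed to be the container (the PDF); later entries
     * are assumed to be its embedded items in extraction order.
     */
    public static String mergeOcrText(List<Map<String, String>> docs) {
        if (docs.isEmpty()) {
            return "";
        }
        StringBuilder merged = new StringBuilder();
        merged.append(docs.get(0).getOrDefault(CONTENT, "").trim());

        for (Map<String, String> embedded : docs.subList(1, docs.size())) {
            String type = embedded.getOrDefault(CONTENT_TYPE, "");
            String text = embedded.getOrDefault(CONTENT, "").trim();
            // Only image entries carry OCR text; skip images with no text.
            if (type.startsWith("image/") && !text.isEmpty()) {
                if (merged.length() > 0) {
                    merged.append('\n');
                }
                merged.append(text);
            }
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();

        Map<String, String> pdf = new HashMap<>();
        pdf.put(CONTENT_TYPE, "application/pdf");
        pdf.put(CONTENT, "Text layer of the PDF");
        docs.add(pdf);

        Map<String, String> image = new HashMap<>();
        image.put(CONTENT_TYPE, "image/png");
        image.put(CONTENT, "Text recognized in an inline image");
        docs.add(image);

        System.out.println(mergeOcrText(docs));
    }
}
```

      Note that this simply appends the image text after the container text, so the original position of each image on its page is lost. That loss is part of why the text showing up as an attachment is a drawback.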

      It might be useful to run OCR against each rendered page (instead of the component images).
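
      One way to picture the choice between the two approaches is a per-page decision like the sketch below. Everything in it is hypothetical: the `OcrTargetChooser` class, the `OcrTarget` enum, and both thresholds are illustrative assumptions, not part of Tika or PDFBox, and the heuristic itself is only one of many strategies that could be tried.

```java
public class OcrTargetChooser {

    /** Hypothetical options for what to hand to the OCR engine. */
    public enum OcrTarget { NONE, INLINE_IMAGES, RENDERED_PAGE }

    // Assumed thresholds, chosen only for illustration.
    static final int MIN_TEXT_CHARS = 20;
    static final int MAX_IMAGES_FOR_PER_IMAGE_OCR = 3;

    /**
     * Picks an OCR target for one page.
     *
     * @param extractedChars   characters found in the page's text layer
     * @param inlineImageCount number of inline images on the page
     */
    public static OcrTarget choose(int extractedChars, int inlineImageCount) {
        if (inlineImageCount == 0) {
            // No images on the page, so there is nothing to OCR.
            return OcrTarget.NONE;
        }
        if (extractedChars >= MIN_TEXT_CHARS
                && inlineImageCount <= MAX_IMAGES_FOR_PER_IMAGE_OCR) {
            // The page has a usable text layer and only a few images:
            // OCR just the component images.
            return OcrTarget.INLINE_IMAGES;
        }
        // Little or no text layer, or many image fragments:
        // OCR the fully rendered page instead.
        return OcrTarget.RENDERED_PAGE;
    }

    public static void main(String[] args) {
        System.out.println(choose(500, 0));
        System.out.println(choose(500, 2));
        System.out.println(choose(0, 1));
        System.out.println(choose(500, 40));
    }
}
```

      Whether a rule like this actually yields cleaner text than always using one approach is exactly the open question; the sketch only shows where such a switch could sit.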

      Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912). Adding page-level OCR in Tika in the meantime will allow us to experiment with strategies until the cleaner integration is available in PDFBox 2.1.

            People

              Assignee: Tim Allison (tallison)
              Reporter: Tim Allison (tallison)
              Votes: 0
              Watchers: 5