Description
Tika has two properties in PDFParser.properties that control what happens in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract for OCR. These are ocrDPI (default 300) and ocrImageScale (default 2.0).
ocrDPI is passed to ImageIOUtil.writeImage, which only writes it into the image metadata. It doesn't control scaling at all; it's just an advertised value.
ocrImageScale is passed to PDFBox's PDFRenderer.renderImage, which uses it as the rendering scale. A scale of 1.0 corresponds to 72dpi, so Tika's default of 2.0 requests rendering at 144dpi.
This means that Tika is asking PDFBox to render at 144dpi, and then advertising 300dpi in the image metadata. This makes no sense to me, and is surely going to confuse Tesseract.
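To make the mismatch concrete, here is a minimal sketch of the arithmetic, assuming only the 1.0 == 72dpi relationship described above. The class and method names are hypothetical and are not part of Tika or PDFBox.

```java
// Hypothetical helper, not part of Tika or PDFBox: illustrates the
// scale <-> DPI relationship described in this issue.
public final class OcrDpiCheck {
    // A render scale of 1.0 corresponds to 72 dpi.
    public static final float PDF_BASE_DPI = 72f;

    // DPI that a given render scale actually produces.
    public static float renderedDpi(float imageScale) {
        return imageScale * PDF_BASE_DPI;
    }

    // Render scale needed to actually produce a given DPI.
    public static float scaleForDpi(float dpi) {
        return dpi / PDF_BASE_DPI;
    }

    public static void main(String[] args) {
        float ocrImageScale = 2.0f; // current default
        int ocrDpi = 300;           // current default

        System.out.println("rendered dpi:   " + renderedDpi(ocrImageScale));
        System.out.println("advertised dpi: " + ocrDpi);
        System.out.println("scale needed to match: " + scaleForDpi(ocrDpi));
    }
}
```

With the current defaults the rendered value is 144dpi while the advertised value is 300dpi; matching them would need a scale of roughly 4.17.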
Instead of doing this, we should remove ocrImageScale and use the single ocrDPI value in both places: for rendering in PDFRenderer and for the image metadata.
We should keep the existing default DPI value, since Tesseract is trained at 300dpi by default. That way every stage between PDFRenderer and Tesseract defaults to 300dpi.
This change will have the side-effect that the temporary images between the PDF rendering and Tesseract will contain roughly 4.3x as many pixels (144dpi to 300dpi, i.e. (300/144)^2). This will have a memory and temporary disk space impact, but I think that it's still best to have the whole pipeline using 300dpi. People who have memory constraints will need to reduce ocrDPI and make the corresponding changes on the Tesseract side.
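As a rough illustration of the size impact, the sketch below compares pixel counts at 144dpi and 300dpi. The US Letter page size (8.5 x 11 inches) and the class name are assumptions chosen for the example; actual memory and disk usage will also depend on the image type and output format.

```java
// Hypothetical estimate, not Tika code: pixel counts for one rendered page.
public final class OcrImageSize {
    // Number of pixels in a page of the given physical size at the given DPI.
    public static long pixelCount(double widthInches, double heightInches, int dpi) {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        return widthPx * heightPx;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 8.5 x 11 inches.
        long at144 = pixelCount(8.5, 11.0, 144);
        long at300 = pixelCount(8.5, 11.0, 300);

        System.out.println("pixels at 144dpi: " + at144);
        System.out.println("pixels at 300dpi: " + at300);
        System.out.println("ratio: " + ((double) at300 / at144));
    }
}
```

For this page size the counts are 1,938,816 and 8,415,000 pixels, a ratio of about 4.34.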