[PDFBOX-5097] Rendered pdf image lacks all the text in this particular case - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Bug
Affects Version/s: 2.0.22
Fix Version/s: None
Component/s: Rendering
Labels:
- jbig2
Environment:
Linux DamianPad 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description

Hello,

I am using PDFBox to convert input PDF files to images, which are then fed to an OCR library. This works well in most cases, but for one particular file all of the text is missing from the rendered images.

Here is the method that converts the PDF into images:

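import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public List<BufferedImage> splitPdf(File pdfFile) throws IOException {
    List<BufferedImage> result = new ArrayList<>();

    // try-with-resources closes the document even if rendering throws
    try (PDDocument document = PDDocument.load(pdfFile)) {
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        for (int pageIndex = 0; pageIndex < document.getNumberOfPages(); pageIndex++) {
            BufferedImage pageImage = pdfRenderer.renderImage(pageIndex);
            result.add(pageImage);
            // debugPageImageInfo is a helper defined elsewhere in my code
            debugPageImageInfo(pageImage);
        }
    }

    return result;
}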
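
The `debugPageImageInfo` helper called above is not shown in this report. As a rough sketch of what such a check could look like, the class below reports the size of a rendered page and the fraction of its pixels that are not pure white, which is one simple way to spot a page that came out blank. The class name `PageImageDebug` and the methods `nonWhiteRatio` and `describe` are hypothetical names chosen for illustration; they are not part of PDFBox or of the reporter's code.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class PageImageDebug {

    /** Returns the fraction (0.0 to 1.0) of pixels that are not pure white. */
    static double nonWhiteRatio(BufferedImage image) {
        long nonWhite = 0;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                // Mask off the alpha channel and compare the RGB part only
                if ((image.getRGB(x, y) & 0xFFFFFF) != 0xFFFFFF) {
                    nonWhite++;
                }
            }
        }
        long total = (long) image.getWidth() * image.getHeight();
        return (double) nonWhite / total;
    }

    /** Builds a one-line summary suitable for logging (hypothetical format). */
    static String describe(BufferedImage image) {
        return image.getWidth() + "x" + image.getHeight()
                + ", non-white ratio: " + nonWhiteRatio(image);
    }

    public static void main(String[] args) {
        // Stand-in for a rendered page: white background with a small black box
        BufferedImage page = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 100, 100);
        g.setColor(Color.BLACK);
        g.fillRect(0, 0, 10, 10);
        g.dispose();

        System.out.println(describe(page));
    }
}
```

A ratio at or near zero for a page that should contain text would point to the kind of rendering gap described here. The threshold for "near zero" depends on the documents involved and would need tuning.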

I have attached the PDF file that shows the problem, along with the resulting images.

I hope this is helpful for anyone else encountering the same problem!
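
This issue carries the `jbig2` label. Assuming the missing text is stored in the PDF as JBIG2-compressed images (the report itself does not confirm this), one quick diagnostic is to ask Java's ImageIO registry whether any JBIG2 reader is available at runtime. The sketch below uses only the standard `javax.imageio` API; the class name `Jbig2ReaderCheck` and the method `hasReader` are hypothetical names for illustration.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

public class Jbig2ReaderCheck {

    /** Returns true if ImageIO has at least one reader registered for the format name. */
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // The JDK does not ship a JBIG2 reader; one must come from a plugin on the classpath
        System.out.println("JBIG2 reader available: " + hasReader("JBIG2"));
        // PNG is bundled with the JDK and is printed here for comparison
        System.out.println("PNG reader available: " + hasReader("png"));
    }
}
```

If the first line reports `false`, adding a JBIG2 ImageIO plugin to the classpath (for PDFBox 2.0.x, the `org.apache.pdfbox:jbig2-imageio` artifact) and rendering again would be a reasonable next step.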

Attachments

- 0.png, 04/Feb/21 09:45, 30 kB, Robert-Andrei Damian
- 1.png, 04/Feb/21 09:45, 26 kB, Robert-Andrei Damian
- document(3).pdf, 04/Feb/21 09:46, 140 kB, Robert-Andrei Damian

People

Assignee: Unassigned

Reporter: Robert-Andrei Damian

Votes: 0

Watchers: 2

Dates

Created: 04/Feb/21 09:49

Updated: 04/Feb/21 18:25

Resolved: 04/Feb/21 18:25