[PDFBOX-895] Infinite recursion when trying to extract text from specific types of PDFs - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.3.1
Fix Version/s: 1.7.0
Component/s: Text extraction
Labels: None

Description

Hello and thanks for PDFBox.

We just started using PDFBox for text extraction (through Tika),
and extraction never finishes: it falls into an infinite loop
and never returns the text.

Please note that this happens only for a specific type of PDF
document (used for handwriting recognition), such as the one attached.
I am not sure whether this is a bug in PDFBox or due to the nature of the PDFs,
but I think that PDFBox should at least break out if extraction is not possible.

I wish I could give you more information, but I know nothing about the PDF format, parsing, etc.
Please let me know if you need any more information or if I can help in any way.

Thanks a lot for your time.
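
Until a fixed version is in use, the request that extraction should "at least break out" can be approximated on the caller's side. The following is a minimal sketch, not part of PDFBox or Tika: the class name `ExtractionGuard` and the method `extractOrNull` are hypothetical, and the extraction call itself is left to the caller (for example, a task that wraps `new PDFTextStripper().getText(document)`). It assumes that giving up after a deadline and returning `null` is acceptable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds the time spent in a text-extraction task.
final class ExtractionGuard {

    private ExtractionGuard() {
    }

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Returns the task's result, or null if the deadline passes first.
     */
    static String extractOrNull(Callable<String> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; a loop that never
            // checks for interruption keeps running in the background.
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // Anything thrown by the task, including a StackOverflowError
            // caused by runaway recursion, arrives here as the cause.
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would pass the extraction as a lambda, for example `ExtractionGuard.extractOrNull(() -> extract(file), 30_000)`, where `extract` is whatever method performs the PDFBox or Tika call. This only stops the caller from waiting; it does not free the CPU used by a worker that ignores interruption, so running the extraction in a separate process may be a safer choice for untrusted input.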

Attachments

test.pdf (542 kB), 19/Nov/10 12:22, Panayiotis Vlissidis

Issue Links

requires: PDFBOX-956 Poor text extraction performance in PDFTextStripper.java (Closed)

People

Assignee: Andreas Lehmkühler
Reporter: Panayiotis Vlissidis
Votes: 0
Watchers: 0

Dates

Created: 19/Nov/10 12:21
Updated: 29/May/12 16:21
Resolved: 09/Nov/11 07:13