[PDFBOX-2451] Only gibberish extracted from certain PDF files - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

I was told to report this bug here. Text extraction fails for certain Dutch-language PDF files. The bug was originally reported as TIKA-1095 (https://issues.apache.org/jira/browse/TIKA-1095) and can be reproduced with the latest Tika version.

The extracted text shows only gibberish (or, in other cases, question marks and incorrect characters).

It was suggested that this could be a font problem. Could this be looked into?

Attachments

- tika-other-document.png, 49 kB, 24/Oct/14 09:35, Stefan Postema
- tika-metadata.png, 53 kB, 24/Oct/14 08:01, Stefan Postema
- tika-formatted-text.png, 26 kB, 24/Oct/14 08:01, Stefan Postema
- ALG 2010-05-19 03 bijlage 1 - besluitenlijst dagelijks bestuur d d 10 februari 2010.pdf, 126 kB, 24/Oct/14 08:01, Stefan Postema

Issue Links

relates to: TIKA-1095 Only gibberish extracted from this PDF (Closed)

People

Assignee: Unassigned

Reporter: Stefan Postema

Votes: 0

Watchers: 4

Dates

Created: 24/Oct/14 08:00

Updated: 14/Mar/15 20:28

Resolved: 06/Nov/14 07:05