[TIKA-3170] PDF extraction space issue - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.24.1
Fix Version/s: 1.25
Component/s: parser
Labels: None

Description

While extracting text from PDF files, we observe spaces inserted between individual letters in some passages.

According to the PDFParserConfig documentation,

https://tika.apache.org/1.24.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html

this can be resolved by disabling the Enable Auto Space property. However, when we disable it, spaces go missing between words elsewhere in the text.

With Enable Auto Space:

<p>2014 C H A M B R E 2 e S E S S I O N D E L A 5 4 e L É G I S L A T U R EK A M E R 2 e Z I T T I N G V A N D E 5 4 e Z I T T I N G S P E R I O D E 2015

Without Enable Auto Space:

<p>2014CHAMBRE 2e SESSION DE LA 54e LÉGISLATUREKAMER 2e ZITTING VAN DE 54e ZITTINGSPERIODE2015

Now there is no space between 2014 and CHAMBRE.

Is there a configuration that overcomes this issue?
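
For anyone reproducing the two outputs above, here is a minimal sketch of toggling the property per parse. It assumes Tika 1.24.1 with the PDF parser on the classpath and that the attached document_example.pdf is in the working directory. The class name AutoSpaceComparison and the helpers buildContext and extract are hypothetical, introduced only for illustration. It switches the setting and prints both results; it does not fix the spacing problem itself.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class AutoSpaceComparison {

    // Hypothetical helper: builds a ParseContext carrying a PDFParserConfig
    // with the requested Enable Auto Space setting.
    static ParseContext buildContext(boolean enableAutoSpace) {
        PDFParserConfig config = new PDFParserConfig();
        config.setEnableAutoSpace(enableAutoSpace);
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        return context;
    }

    // Hypothetical helper: extracts plain text from the PDF at the given path.
    static String extract(String path, boolean enableAutoSpace) throws Exception {
        // A write limit of -1 disables the default character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            new AutoDetectParser().parse(in, handler, new Metadata(),
                    buildContext(enableAutoSpace));
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed file name, taken from the attachment on this issue.
        String path = args.length > 0 ? args[0] : "document_example.pdf";
        System.out.println("With Enable Auto Space:");
        System.out.println(extract(path, true));
        System.out.println("Without Enable Auto Space:");
        System.out.println(extract(path, false));
    }
}
```

Passing the configuration through ParseContext keeps the setting local to a single parse call, so both variants can be compared in one run without editing a global tika-config.xml.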

Attachments

- image-2020-08-18-20-23-16-159.png (26 kB), added 18/Aug/20 14:53 by Akash
- document_example.pdf (139 kB), added 17/Aug/20 13:57 by Akash

People

Assignee: Unassigned

Reporter: Akash

Votes: 0

Watchers: 2

Dates

Created: 17/Aug/20 13:57

Updated: 18/Aug/20 18:08

Resolved: 18/Aug/20 18:08