[TIKA-2650] Soft-hyphen is not extracted properly - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Blocker
Resolution: Unresolved
Affects Version/s: 1.18
Fix Version/s: None
Component/s: app
Labels: None
Flags: Important

Description

We are trying to extract text from a PDF. When a long word falls at the end of a line, the PDF breaks it with a soft hyphen after the first part and continues the rest of the word on the next line. When extracting this text, Tika automatically replaces the hyphen with a space.
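The difference between the reported output and the presumably expected output can be illustrated with a small sketch. This is not Tika code: it assumes the soft hyphen reaches the text layer as U+00AD followed by a line break, and the function names `replace_with_space` and `join_soft_hyphenated` are hypothetical helpers made up for this illustration.

```python
import re

# Assumption: the PDF's soft hyphen is the Unicode character U+00AD.
SOFT_HYPHEN = "\u00ad"


def replace_with_space(text: str) -> str:
    """Mimic the reported behaviour: the soft hyphen becomes a space."""
    return text.replace(SOFT_HYPHEN, " ")


def join_soft_hyphenated(text: str) -> str:
    """Sketch of the presumably expected behaviour.

    A soft hyphen at the end of a line is dropped together with the
    line break, so the two halves of the word are rejoined. Any other
    soft hyphen is simply removed.
    """
    # Soft hyphen, optional trailing blanks, a line break, optional indentation.
    text = re.sub(SOFT_HYPHEN + r"[ \t]*\r?\n[ \t]*", "", text)
    return text.replace(SOFT_HYPHEN, "")


sample = "The rabbit was extra" + SOFT_HYPHEN + "\nordinary."
print(repr(replace_with_space(sample)))
print(repr(join_soft_hyphenated(sample)))
```

With the first function the word stays split in two, which matches the symptom described here; with the second the word comes out whole. Whether Tika should drop the line break as well is an assumption of this sketch, not something the report states.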

Attachments

document_example_w_sort.txt (43 kB), uploaded 19/Feb/20 17:00 by Yauheni Salopiy
document_example_wo_sort.txt (43 kB), uploaded 19/Feb/20 17:00 by Yauheni Salopiy
document_example.pdf (139 kB), uploaded 18/Feb/20 21:31 by Yauheni Salopiy
document_example.txt (45 kB), uploaded 18/Feb/20 21:31 by Yauheni Salopiy
output.txt (5 kB), uploaded 25/May/18 09:50 by Saurabh Patil
Peter Rabbit.pdf (3.12 MB), uploaded 24/May/18 14:32 by Saurabh Patil

People

Assignee: Unassigned
Reporter: Saurabh Patil
Votes: 1
Watchers: 5

Dates

Created: 24/May/18 14:35
Updated: 19/Feb/20 18:46