[PDFBOX-3782] Text extraction loses whitespace - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.0.4, 2.0.5, 2.0.6
Fix Version/s: None
Component/s: Text extraction
Labels: None
Environment: Java/Tika

Description

I am using Tika/PDFBox to extract the content of a PDF document. In several places the extracted text loses whitespace, which causes tokenization problems for indexing and searching.

I have attached the original document and the text output. If you search (Ctrl+F) the text output for "Another example", you will see that there is no space between "is" and the Japanese text. The same issue appears at the end of the sentence, where "which", "means" and “eraser” run together as "whichmeans“eraser”":
Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
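
To illustrate why the missing spaces matter for indexing, here is a minimal sketch that assumes a plain whitespace tokenizer. Real analyzers (for example, those used by a search engine) may split CJK text differently, so this is only an approximation of the problem. The class name `WhitespaceTokenDemo` and its `tokenize` helper are hypothetical and not part of Tika or PDFBox.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it does not call Tika or PDFBox.
class WhitespaceTokenDemo {

    // Line copied from the attached text output.
    static final String EXTRACTED =
        "Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”";

    // Splits on runs of ASCII whitespace, as a very simple tokenizer would.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize(EXTRACTED);
        for (String token : tokens) {
            System.out.println(token);
        }
        // "is" and "means" never appear as standalone tokens because
        // they are fused with the neighbouring text.
        System.out.println("contains \"is\": " + tokens.contains("is"));
        System.out.println("contains \"means\": " + tokens.contains("means"));
    }
}
```

With this kind of tokenizer, a search for "means" or "eraser" would not match the sentence, because both words are buried inside a single fused token.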

During extraction I also get the warning "WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho", but I have been unable to find any information about it.

Attachments

Test doc - Japanese writing system - Kanji Hiragana Katakana.txt
08/May/17 17:01
30 kB
Tony Bray
Test doc - Japanese writing system - Kanji Hiragana Katakana.pdf
08/May/17 17:01
150 kB
Tony Bray
PDFBOX-3782-reduced.pdf
08/May/17 19:00
18 kB
Tilman Hausherr

People

Assignee: Unassigned

Reporter: Tony Bray

Votes: 0

Watchers: 2

Dates

Created: 08/May/17 17:01

Updated: 10/May/17 15:58