[PDFBOX-2711] Japanese text not extracted - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8.8, 2.0.0
Fix Version/s: 2.0.0
Component/s: None
Labels: None

Description

ExtractText does not return the text content of the attached PDF (150218.pdf). With 1.8.8 the output contains only a few real characters, and with today's 2.0.0 snapshot it contains none.

I also attach the output from pdftotext 0.26.5 (from poppler-utils), which seems to get it mostly right.
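To put a number on "a few real characters" versus "none", one option is to count the Japanese characters in each attached output file. The sketch below is not part of PDFBox. The class name `JapaneseCharCount` and the method `countJapanese` are hypothetical, and it assumes the attached .txt files are UTF-8 encoded. It uses only the Java standard library and treats any code point whose Unicode script is Han, Hiragana, or Katakana as Japanese text.

```java
import java.io.IOException;
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for comparing extraction outputs; not a PDFBox API.
public class JapaneseCharCount {

    // Counts code points whose Unicode script is Han, Hiragana, or Katakana.
    // Punctuation and marks classified as COMMON are deliberately not counted.
    static long countJapanese(String text) {
        return text.codePoints()
                .filter(cp -> {
                    UnicodeScript script = UnicodeScript.of(cp);
                    return script == UnicodeScript.HAN
                            || script == UnicodeScript.HIRAGANA
                            || script == UnicodeScript.KATAKANA;
                })
                .count();
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path of an extracted-text file (assumed UTF-8).
        for (String path : args) {
            String text = new String(
                    Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
            System.out.println(path + ": " + countJapanese(text));
        }
    }
}
```

Running it with 150218-pdfbox-1.8.8.txt, 150218-pdfbox-2.0.0.txt, and 150218-pdftotext.txt as arguments would give one count per file, which makes the gap between the PDFBox outputs and the pdftotext output easy to compare.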

Attachments

- 150218.pdf, 71 kB, 16/Mar/15 15:04, Daniel Bonniot de Ruisselet
- 150218-pdfbox-1.8.8.txt, 0.9 kB, 16/Mar/15 15:04, Daniel Bonniot de Ruisselet
- 150218-pdfbox-2.0.0.txt, 0.2 kB, 16/Mar/15 15:04, Daniel Bonniot de Ruisselet
- 150218-pdftotext.txt, 5 kB, 16/Mar/15 15:04, Daniel Bonniot de Ruisselet

Issue Links

is related to: PDFBOX-2272 Can't extract vertical text correctly (Open)

relates to: PDFBOX-2509 Korean Text font substitution issues (Closed)

People

Assignee: John Hewson

Reporter: Daniel Bonniot de Ruisselet

Votes: 0

Watchers: 3

Dates

Created: 16/Mar/15 15:04

Updated: 17/Mar/16 19:08

Resolved: 08/May/15 23:20