[PDFBOX-770] Greek text extraction - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.2.0, 1.2.1, 1.3.1
Fix Version/s: 1.5.0
Component/s: Text extraction
Labels: None
Environment: Ubuntu 10.04

Description

Greek text extraction error
I have a Greek PDF, but:
a) After extraction, the Greek letter π is extracted as "pi".

For example:
Original text in the PDF:
"φυσικών προσώπων"

Extracted text:
"φυσικών piροσώpiων"

b) The Greek letter μ is extracted as µ.
The two look the same on screen, but they are encoded differently, so a search for a typed μ does not find the extracted character (it finds only the uppercase Μ).
If you copy the µ as displayed in the extracted text and search for that, the search works fine.

For example, the word is displayed as "κλίµακας", but it does not match the typed word κλίμακα because of the letter μ.
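
The mismatch in (b) can be illustrated with a minimal sketch. It assumes the extracted character is U+00B5 (MICRO SIGN) and the typed one is U+03BC (GREEK SMALL LETTER MU); the actual code points in the attached file have not been checked here. The helper names `describe` and `fix_micro_sign` are hypothetical, and this is only a possible post-processing step applied to the extracted text before indexing, not a PDFBox or Solr setting.

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # assumed to be the character in the extracted text
GREEK_MU = "\u03bc"    # the letter produced by a Greek keyboard layout


def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


def fix_micro_sign(text):
    """Replace every micro sign with Greek small letter mu; touch nothing else."""
    return text.replace(MICRO_SIGN, GREEK_MU)


# The word from the report, written with escapes so the two variants
# cannot be confused: the first uses the micro sign, the second uses mu.
extracted = "\u03ba\u03bb\u03af\u00b5\u03b1\u03ba\u03b1\u03c2"
typed = "\u03ba\u03bb\u03af\u03bc\u03b1\u03ba\u03b1\u03c2"

if __name__ == "__main__":
    print(describe(MICRO_SIGN + GREEK_MU))
    print(extracted == typed)                  # the raw strings differ
    print(fix_micro_sign(extracted) == typed)  # they match after the replacement
```

Unicode compatibility normalization (`unicodedata.normalize("NFKC", text)`) also maps the micro sign to Greek mu, but it rewrites other compatibility characters as well, so the targeted replacement above is the more conservative choice. Neither approach addresses problem (a), where π comes out as the two Latin letters "pi".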

Because of this problem, Solr does not index the documents correctly.

Is there any configuration I can apply to fix this?

Attachments

PDFBOX770-3842.txt | 08/Mar/11 16:34 | 898 kB | Andreas Lehmkühler
3842.html | 04/Jul/10 18:33 | 2.41 MB | Manos Karampasis
3842.pdf | 04/Jul/10 18:33 | 508 kB | Manos Karampasis

People

Assignee: Andreas Lehmkühler

Reporter: Manos Karampasis

Votes: 0

Watchers: 0

Dates

Created: 04/Jul/10 17:40

Updated: 08/Mar/11 16:37

Resolved: 08/Mar/11 16:36