[SOLR-2460] Some European characters cannot be parsed correctly for some PDFs - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.4.1, 3.1
Fix Version/s: 3.1.1, 3.5
Component/s: contrib - Solr Cell (Tika extraction)
Labels: None
Environment: Tika, PDFBox

Description

The Norwegian characters æ, ø and å in the following PDF document are not displayed correctly after the document has been indexed with Solr Cell:
http://ridder.uio.no/dokument.pdf

If I manually change the version of PDFBox (one of Tika's dependencies) to 1.4.0, the document is parsed correctly.

I suggest that the next release of Solr ship with Tika 0.9, which has also updated its PDFBox dependency to 1.4.0.
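
To compare PDFBox versions on the same document, it can help to check the extracted text mechanically rather than by eye. The following is only a rough sketch, assuming the extracted text is already available as a string; the function name `norwegian_letter_report` and the heuristic itself are illustrative and not part of Solr, Tika or PDFBox.

```python
import unicodedata

# Letters the report expects to find in a correctly extracted Norwegian text.
NORWEGIAN_LETTERS = "æøåÆØÅ"


def norwegian_letter_report(text: str) -> dict:
    """Summarize whether Norwegian letters survived text extraction.

    Hypothetical helper: it only inspects a string and makes no calls
    into Solr, Tika or PDFBox.
    """
    # NFC composes sequences such as "a" + combining ring above into "å",
    # so decomposed output is not mistaken for a missing letter.
    normalized = unicodedata.normalize("NFC", text)
    return {
        "letters_found": {ch for ch in normalized if ch in NORWEGIAN_LETTERS},
        # U+FFFD is the Unicode replacement character, a common sign
        # that a character could not be decoded.
        "has_replacement_char": "\ufffd" in normalized,
    }
```

Running the helper on the text extracted with each PDFBox version gives a quick indication of whether the letters were lost: an empty `letters_found` set or a true `has_replacement_char` flag suggests that extraction failed for those characters.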

People

Assignee: Steven Rowe
Reporter: Erlend Garåsen
Votes: 0
Watchers: 2

Dates

Created: 07/Apr/11 10:50
Updated: 27/Nov/11 12:36
Resolved: 26/Oct/11 00:52