[SOLR-5124] Solr glues words when parsing PDFs under certain circumstances - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version/s: 4.4
Fix Version/s: None
Component/s: update
Labels: tika, text-extraction
Environment:

Windows 7 (I don't think this is relevant)

Description

For some PDF documents, Solr merges words at line breaks under certain circumstances (e.g., the last word of line 1 and the first word of line 2 are merged into one word).
Stand-alone Tika extracts the text correctly. Attached are one sample PDF and screenshots of the Tika output and of the corrupted content indexed by Solr.
(This issue does not occur with all PDF documents. I tried to reproduce it with new Word documents that I converted to PDF in several ways, without success.) The attached PDF document has a really weird internal structure, but Tika seems to do its work right, even with this weird document.
Our Solr indices contain a good number of these weird documents, which results in worse suggestions from the Suggester.
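
To illustrate the symptom only, here is a minimal Python sketch. It assumes that the line break between two extracted lines is dropped instead of being kept as whitespace. The helper names and the sample text are hypothetical, and this is not Solr's or Tika's actual code.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens on non-word characters."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def join_lines(lines, separator):
    """Concatenate extracted text lines using the given separator."""
    return separator.join(line.strip() for line in lines)


# Hypothetical two-line extract; not taken from the attached PDF.
lines = ["Solr indexes the last", "word of each line"]

# Line break dropped: "last" and "word" end up as a single token.
glued = tokenize(join_lines(lines, ""))

# Line break kept as whitespace: the two words stay separate.
spaced = tokenize(join_lines(lines, " "))

print(glued)
print(spaced)
```

A merged token such as the one in `glued` would then be indexed like any other term, which is consistent with the degraded Suggester output described above.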

Attachments

- 01_alz_2009_folge11_2009_05_28.pdf, 08/Aug/13 09:16, 518 kB, Christoph Straßer
- 02_PDF.png, 08/Aug/13 09:16, 150 kB, Christoph Straßer
- 03_TikaOutput_GUI_MainContent.png, 08/Aug/13 09:16, 81 kB, Christoph Straßer
- 03_TikaOutput_GUI_PlainText.png, 08/Aug/13 09:16, 50 kB, Christoph Straßer
- 03_TikaOutput_GUI_StructuredText.png, 08/Aug/13 09:16, 57 kB, Christoph Straßer
- 03_TikaOutput.png, 08/Aug/13 09:16, 74 kB, Christoph Straßer
- 04_Solr.png, 08/Aug/13 09:16, 95 kB, Christoph Straßer

Issue Links

duplicates

SOLR-4679 HTML line breaks (<br>) are removed during indexing; causes wrong search results

Closed

People

Assignee: Unassigned

Reporter: Christoph Straßer

Votes: 0

Watchers: 3

Dates

Created: 08/Aug/13 09:14

Updated: 09/Aug/13 13:27

Resolved: 08/Aug/13 09:58