[SOLR-4679] HTML line breaks (<br>) are removed during indexing; causes wrong search results - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.2
Fix Version/s: 4.5
Component/s: contrib - Solr Cell (Tika extraction)
Labels: None
Environment: Windows Server 2008 R2, Java 6, Tomcat 7

Description

HTML line breaks (<br> and its variants) seem to be removed when content is extracted from HTML files, so the words on either side of the break are joined. They need to be replaced with a space.

Test file:
<html>
<head>
<title>Test mit HTML-Zeilenschaltungen</title>
</head>

word1 word2 
Some other words, a special name like linz and another special name - vienna

</html>

The Solr content attribute contains the following text:
Test mit HTML-Zeilenschaltungen
word1word2
Some other words, a special name like linzand another special name - vienna

As a result, a search for the word "linz" finds nothing, because it was indexed as "linzand".
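
The gluing effect can be reproduced outside Solr. The following is a minimal Python sketch, not the Solr or Tika code: it uses the standard library `html.parser`, and it assumes the line breaks in the test file were `<br>` tags. `SAMPLE`, `TextExtractor` and `extract` are hypothetical names introduced for illustration. It compares an extractor that drops tags silently with one that emits a space at every tag boundary.

```python
from html.parser import HTMLParser

# Hypothetical sample; assumes the line breaks in the test file were <br> tags.
SAMPLE = (
    "<html><body>word1<br>word2<br/>"
    "a special name like linz<br>and another special name - vienna"
    "</body></html>"
)


class TextExtractor(HTMLParser):
    """Collects text nodes; optionally emits a space at every tag boundary."""

    def __init__(self, space_at_tags):
        super().__init__()
        self.space_at_tags = space_at_tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.space_at_tags:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if self.space_at_tags:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Join the collected pieces and collapse runs of whitespace.
        return " ".join("".join(self.parts).split())


def extract(html, space_at_tags):
    parser = TextExtractor(space_at_tags)
    parser.feed(html)
    parser.close()
    return parser.text()


glued = extract(SAMPLE, space_at_tags=False)  # tags dropped without a separator
fixed = extract(SAMPLE, space_at_tags=True)   # a space replaces each tag

print("linz" in glued.split())  # token lookup on the glued text
print("linz" in fixed.split())  # token lookup on the separated text
```

With the separator in place, "linz" survives as its own whitespace-delimited token, which is what a tokenizing analyzer needs in order to match the query.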

We use the ExtractingRequestHandler to put content into Solr. (wiki.apache.org/solr/ExtractingRequestHandler)
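
To see what the handler extracts without indexing anything, Solr Cell accepts the `extractOnly` and `extractFormat` parameters on `/update/extract`. The sketch below only builds such a request with the Python standard library; the base URL, core name (`collection1`) and the helper `build_extract_request` are assumptions for illustration and will differ per installation (the reporter's setup runs under Tomcat, so the port is likely not 8983).

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, html_bytes):
    """Build (but do not send) a POST to the ExtractingRequestHandler.

    solr_base is a hypothetical core URL such as
    http://localhost:8983/solr/collection1.
    """
    params = urlencode({
        "extractOnly": "true",    # return the extracted content, do not index
        "extractFormat": "text",  # plain text instead of XHTML
        "wt": "json",
    })
    return Request(
        url=f"{solr_base}/update/extract?{params}",
        data=html_bytes,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


req = build_extract_request(
    "http://localhost:8983/solr/collection1",  # assumed URL, adjust as needed
    b"<html><body>word1<br>word2</body></html>",
)
print(req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) against an affected version should show whether the extracted text contains "word1word2" or "word1 word2".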

Attachments

SOLR-4679__weird_TIKA-1134.patch (2 kB), 11/Jun/13 18:59, Chris M. Hostetter
Solr_HtmlLineBreak_Vienna.png (108 kB), 05/Apr/13 12:24, Christoph Straßer
Solr_HtmlLineBreak_Linz_NotFound.png (89 kB), 05/Apr/13 12:24, Christoph Straßer
external.htm (0.2 kB), 05/Apr/13 12:19, Christoph Straßer

Issue Links

is blocked by

TIKA-1134 ContentHandler gets ignorable whitespace for tags when parsing HTML (Closed)

is duplicated by

SOLR-4908 SolrContentHandler procuces glued words when extracting html (Resolved)
SOLR-5124 Solr glues word´s when parsing PDFs under certan circumstances (Closed)

People

Assignee: Uwe Schindler
Reporter: Christoph Straßer
Votes: 0
Watchers: 4

Dates

Created: 05/Apr/13 12:19
Updated: 05/Oct/13 10:19
Resolved: 09/Aug/13 13:28