[SOLR-4908] SolrContentHandler produces glued words when extracting html - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 4.3
Fix Version/s: None
Component/s: contrib - Solr Cell (Tika extraction)
Labels: None
Environment: Windows 7, 64-bit, Solr 4.3 example

Description

The SolrContentHandler produces glued words when extracting HTML documents like:

<html><head></head><body>glued<br/>words</body></html>

This was solved in Tika (TIKA-343), but the problem still occurs when using the extraction handler because the SolrContentHandler discards ignorableWhitespace events.
The Tika XHTMLContentHandler issues an ignorableWhitespace event with a newline in the character stream when a <br> tag is encountered.

The SolrContentHandler should be modified to add the ignorable whitespace to the content.
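
A minimal sketch of the proposed behaviour, assuming a plain SAX handler: the class name WhitespaceKeepingHandler is hypothetical and this is not the actual SolrContentHandler code. The event sequence in main is simulated by hand from the description above (text, then an ignorableWhitespace newline for <br>, then text) rather than produced by Tika.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a content handler; not Solr's real class.
class WhitespaceKeepingHandler extends DefaultHandler {
    private final StringBuilder content = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        content.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Keep the whitespace instead of discarding it, so that words
        // separated only by <br> do not end up glued together.
        content.append(ch, start, length);
    }

    String getContent() {
        return content.toString();
    }

    public static void main(String[] args) {
        WhitespaceKeepingHandler handler = new WhitespaceKeepingHandler();
        // Simulated events for: glued<br/>words
        handler.characters("glued".toCharArray(), 0, 5);
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1);
        handler.characters("words".toCharArray(), 0, 5);
        // Prints "glued" and "words" on separate lines.
        System.out.println(handler.getContent());
    }
}
```

If ignorableWhitespace were left at the DefaultHandler no-op, the same event sequence would yield "gluedwords", which matches the symptom reported here.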

Reproduction Steps:

POST the HTML example file from the attachments to http://localhost:8983/solr/update/extract?literal.id=html-test-1&commit=true (e.g. with curl or HTTP Requester Plugin in Firefox)
Query for the document: http://localhost:8983/solr/collection1/select?q=id%3A%22html-test-1%22&fl=content&wt=xml&indent=true
Look for the field content, which contains the word "Shouldnotbeglued"
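
The steps above could also be scripted. The following is a rough sketch using the JDK's java.net.http client (Java 11 or later), assuming the Solr 4.3 example is running on localhost:8983. The class name ReproduceGluedWords and its helper methods are hypothetical, the text/html content type is an assumption, and the inline HTML is the short example from the description, not the attached tika-test.html, so the glued word in the output would differ.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class for reproducing the report.
class ReproduceGluedWords {
    static final String BASE = "http://localhost:8983/solr";

    // Step 1: POST the HTML document to the extraction handler.
    static HttpRequest extractRequest(String html) {
        return HttpRequest.newBuilder(URI.create(
                BASE + "/update/extract?literal.id=html-test-1&commit=true"))
            .header("Content-Type", "text/html") // assumed content type
            .POST(HttpRequest.BodyPublishers.ofString(html))
            .build();
    }

    // Step 2: query for the indexed document and return only its content field.
    static HttpRequest queryRequest() {
        return HttpRequest.newBuilder(URI.create(
                BASE + "/collection1/select?q=id%3A%22html-test-1%22"
                     + "&fl=content&wt=xml&indent=true"))
            .GET()
            .build();
    }

    public static void main(String[] args) throws Exception {
        // Inline example from the description, used instead of the attachment.
        String html = "<html><head></head><body>glued<br/>words</body></html>";
        HttpClient client = HttpClient.newHttpClient();

        HttpResponse<String> post =
            client.send(extractRequest(html), HttpResponse.BodyHandlers.ofString());
        System.out.println("extract status: " + post.statusCode());

        // Step 3: inspect the content field in the response by eye.
        HttpResponse<String> query =
            client.send(queryRequest(), HttpResponse.BodyHandlers.ofString());
        System.out.println(query.body());
    }
}
```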

Attachments

tika-test.html
07/Jun/13 08:27
0.1 kB
Markus Schuch

Issue Links

duplicates

SOLR-4679 HTML line breaks (<br>) are removed during indexing; causes wrong search results

Closed

is related to

TIKA-343 some parsers produces glued words

Resolved

People

Assignee: Unassigned

Reporter: Markus Schuch

Votes: 0

Watchers: 2

Dates

Created: 07/Jun/13 08:26

Updated: 09/Aug/13 13:27

Resolved: 11/Jun/13 19:05