[NUTCH-1458] Support for raw HTML field added to Solr - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.5.1
Fix Version/s: None
Component/s: indexer, parser
Labels:
- html
- nutch
- raw
- solr

Description

At the moment, the “content” field holds only the parsed text of the page. It would be useful to have a separate field in the Solr document that holds the raw HTML of the crawled page.

Issue Links

is related to: NUTCH-1785 Ability to index raw content (Closed)

People

Assignee: Unassigned

Reporter: Max Dzyuba

Votes: 0

Watchers: 2

Dates

Created: 16/Aug/12 08:44

Updated: 28/Jan/21 14:03

Resolved: 11/Jun/14 15:24