[NUTCH-2435] New configuration allowing to choose whether to store 'parse_text' directory or not. - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.13
Fix Version/s: 1.14
Component/s: parser
Labels: None
Environment: Apache Nutch 1.13

Description

Whenever a page is parsed, one of the outputs is the 'parse_text' directory.
It is intended for the indexing phase, so that the page content can be searched through a search engine such as Solr.
In my particular crawling case, I don't need to index the page contents, so creating and filling 'parse_text' is unnecessary. To improve performance, I don't want the crawler to store this information on the filesystem.
I propose a new parameter, "parser.store.text", that controls whether the 'parse_text' directory is stored. Its default value would be "true", which keeps the current behavior.
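
As a rough illustration of the proposal, a crawl that does not need page text for indexing might override the parameter as follows. This is only a sketch: it assumes the property is read from conf/nutch-site.xml like other Nutch settings, and the description text is illustrative wording, not taken from the actual patch.

```xml
<!-- conf/nutch-site.xml: site-specific overrides (assumed location) -->
<configuration>
  <property>
    <name>parser.store.text</name>
    <!-- "false" skips writing 'parse_text'; the proposed default is "true" -->
    <value>false</value>
    <!-- Illustrative description, not the wording used in the patch -->
    <description>Whether to store the 'parse_text' directory when parsing.</description>
  </property>
</configuration>
```

With the value left at "true" or the property omitted, parsing would write 'parse_text' as before.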

Issue Links

links to GitHub Pull Request #225

People

Assignee: Sebastian Nagel

Reporter: Marcos Bori

Votes: 0

Watchers: 4

Dates

Created: 27/Sep/17 11:05

Updated: 13/Mar/24 14:50

Resolved: 19/Oct/17 21:29