Details
-
Improvement
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
0.9.0
-
None
-
None
Description
This property "indexer.max.tokens" has the following description in nutch-default.xml :
" The maximum number of tokens that will be indexed for a single field
in a document. This limits the amount of memory required for
indexing, so that collections with very large files will not crash
the indexing process by running out of memory.
Note that this effectively truncates large documents, excluding
from the index tokens that occur further in the document. If you
know your source documents are large, be sure to set this value
high enough to accomodate the expected size. If you set it to
Integer.MAX_VALUE, then the only limit is your memory, but you
should anticipate an OutOfMemoryError."
Apparently, "set it to Integer.MAX_VALUE" here means <<substitute the integer value of Integer.MAX_VALUE>>, and not <<put the text "Integer.MAX_VALUE" between the value tags>>. I think this is very confusing and the description should be improved.
I first put <value>Integer.MAX_VALUE</value> in my configuration, and it took a long time to figure out what was wrong, especially since Nutch rolled back on the default value of 10000 instead of giving an error.