  1. Nutch
  2. NUTCH-640

confusing description "set it to Integer.MAX_VALUE"


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: documentation
    • Labels: None

    Description

      The property "indexer.max.tokens" has the following description in nutch-default.xml:

      " The maximum number of tokens that will be indexed for a single field
      in a document. This limits the amount of memory required for
      indexing, so that collections with very large files will not crash
      the indexing process by running out of memory.

      Note that this effectively truncates large documents, excluding
      from the index tokens that occur further in the document. If you
      know your source documents are large, be sure to set this value
      high enough to accomodate the expected size. If you set it to
      Integer.MAX_VALUE, then the only limit is your memory, but you
      should anticipate an OutOfMemoryError."

      Apparently, "set it to Integer.MAX_VALUE" here means "substitute the numeric value of Integer.MAX_VALUE (2147483647)", not "put the literal text Integer.MAX_VALUE between the value tags". This is very confusing, and the description should be improved.

      I first put <value>Integer.MAX_VALUE</value> in my configuration, and it took a long time to figure out what was wrong, especially since Nutch silently fell back to the default value of 10000 instead of reporting an error.
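
To illustrate the behaviour described above, here is a minimal sketch in plain Java. It does not use Nutch's or Hadoop's actual configuration code; the class MaxTokensDemo and the helper parseIntOrDefault are hypothetical names that only mimic a lookup which falls back to a default when the configured text is not a valid integer. It shows why the literal text "Integer.MAX_VALUE" ends up as 10000, while the numeric value 2147483647 is accepted.

```java
public class MaxTokensDemo {
    // Default value of indexer.max.tokens as mentioned in this report.
    static final int DEFAULT_MAX_TOKENS = 10000;

    /**
     * Hypothetical helper (not the real Nutch/Hadoop API): parses an integer
     * property value and silently returns the default if parsing fails.
     */
    static int parseIntOrDefault(String raw, int defaultValue) {
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // No error is surfaced to the user; the default is used instead.
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        // <value>Integer.MAX_VALUE</value>: the symbolic name is not a number.
        int fromSymbol = parseIntOrDefault("Integer.MAX_VALUE", DEFAULT_MAX_TOKENS);

        // <value>2147483647</value>: the numeric value of Integer.MAX_VALUE.
        int fromNumber = parseIntOrDefault("2147483647", DEFAULT_MAX_TOKENS);

        System.out.println("symbolic name -> " + fromSymbol); // prints 10000
        System.out.println("numeric value -> " + fromNumber); // prints 2147483647
    }
}
```

Under these assumptions, the only visible symptom of the mistaken configuration is that documents are still truncated at 10000 tokens, which matches the experience reported here.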

      Attachments

        1. NUTCH-640.patch
          2 kB
          Dogacan Guney

        Activity

          People

            Assignee: Dogacan Guney
            Reporter: Stijn Vermeeren