  Nutch / NUTCH-2435

New configuration property to choose whether to store the 'parse_text' directory

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Apache Nutch 1.13

      Description

      Whenever a page is parsed, one of the outputs is the directory 'parse_text'.
      It is intended to be used at the indexing phase so the page can be searched from a search engine such as Solr.
      In my particular crawling case, I don't need to index the page contents, so creating and filling 'parse_text' is not required. To improve performance, I don't want the crawler to store this information on the filesystem.
      I propose a new parameter "parser.store.text" that controls whether the 'parse_text' directory is stored. Its default value is "true", which keeps the current behavior.
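
As an illustration of the proposal, the property could be switched off in a site-level override. This is only a sketch: it assumes the property is read from conf/nutch-site.xml like other Nutch properties, and the description text is illustrative wording, not taken from the issue.

```xml
<!-- conf/nutch-site.xml (assumed location for site-specific overrides) -->
<configuration>
  <property>
    <!-- Property name as proposed in this issue -->
    <name>parser.store.text</name>
    <!-- false: skip writing 'parse_text'; the proposed default is true -->
    <value>false</value>
    <!-- Illustrative description, not official wording -->
    <description>Whether the parser stores the 'parse_text' directory.</description>
  </property>
</configuration>
```

Leaving the property unset would keep the proposed default of "true", so crawls that index page contents would not need any configuration change.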


              People

              • Assignee:
                Sebastian Nagel (wastl-nagel)
              • Reporter:
                Marcos Bori (maborec)
              • Votes:
                0
              • Watchers:
                3
