  1. Nutch
  2. NUTCH-2435

New configuration option to choose whether to store the 'parse_text' directory


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: parser
    • Labels: None
    • Environment: Apache Nutch 1.13

    Description

      Whenever a page is parsed, one of the outputs is the 'parse_text' directory.
      It is intended for the indexing phase, so that the page can be searched from a search engine such as Solr.
      In my particular crawling case, I don't need to index the page contents, so creating and filling 'parse_text' is unnecessary. To improve performance, I don't want the crawler to write this information to the filesystem.
      I propose a new parameter "parser.store.text" that controls whether the 'parse_text' directory is stored. Its default value is "true", which preserves the current behavior.
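
      As an illustration of the proposal, a crawl that does not need 'parse_text' could switch the parameter off with an override like the one below. This is only a sketch: it assumes the parameter is read like other Nutch properties, from conf/nutch-site.xml inside the file's <configuration> element, and the description text is written here for illustration rather than copied from Nutch.

```xml
<!-- Sketch of an override in conf/nutch-site.xml (assumed location). -->
<!-- The description wording below is illustrative, not taken from Nutch. -->
<property>
  <name>parser.store.text</name>
  <value>false</value>
  <description>If false, the parse step does not store the 'parse_text'
  directory. Defaults to true, as proposed in this issue.</description>
</property>
```

      With the value left at "true", or the property omitted, 'parse_text' would still be written as before.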


            People

              snagel Sebastian Nagel
              maborec Marcos Bori