  1. Nutch
  2. NUTCH-1614

Plugin to exclude URLs matching a regex list from indexing (crawl the pages, but do not index them)

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: indexer
    • Labels:
    • Patch Info: Patch Available

      Description

      We need to crawl some pages (such as certain main pages and different views of a main page) to reach all the other pages, but we don't want to index those pages themselves. The URL filter approach cannot be used for this, because a URL rejected by a URL filter is not crawled at all.

      This plugin uses a file containing regex strings (see included sample file). If one of the regex strings matches an entire URL, that URL will be excluded from indexing.
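
      The "matches an entire URL" rule can be illustrated with a minimal Java sketch. This is not the code from the attached patch: the class name IndexExcludeMatcher, the method isExcluded, and the example URLs and patterns are all hypothetical. It assumes the patterns use java.util.regex syntax and that a full match is required, which is what Matcher.matches() checks (find() would also accept a match on part of the URL).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not taken from NUTCH-1614.patch.
class IndexExcludeMatcher {

    // Returns true when at least one pattern matches the ENTIRE URL.
    static boolean isExcluded(List<Pattern> patterns, String url) {
        for (Pattern p : patterns) {
            // matches() requires the whole input to match the pattern;
            // find() would also accept a match on a substring.
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up patterns: a site's main page and any "view" variant of a page.
        List<Pattern> patterns = Arrays.asList(
            Pattern.compile("https?://example\\.com/"),
            Pattern.compile(".*[?&]view=.*"));

        System.out.println(isExcluded(patterns, "http://example.com/"));
        System.out.println(isExcluded(patterns, "http://example.com/page?view=list"));
        System.out.println(isExcluded(patterns, "http://example.com/article/42"));
    }
}
```

      With these made-up patterns, the main page and the view variant would be skipped by the indexer while the article page would still be indexed. All three URLs remain crawlable, since URL filters are not involved.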

      The file to use is specified by the following property in nutch-site.xml:

      <property>
        <name>indexer.url.filter.exclude.regex.file</name>
        <value>regex-indexer-exclude-urls.txt</value>
        <description>
          Holds the file name containing the regex strings. Any URL matching one of these strings will be excluded from indexing.
          "#" indicates a comment line and will be ignored.
        </description>
      </property>
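
      As a rough sketch of how a file in this format might be read, the following Java snippet compiles one pattern per line and skips lines starting with "#". The class name ExcludeRegexFileLoader and the method load are hypothetical and not part of the patch. Trimming whitespace and skipping blank lines are assumptions; the description above only specifies the "#" comment rule. In Nutch, the reader would be opened on the file named by indexer.url.filter.exclude.regex.file; here a StringReader stands in for it so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical loader for illustration only; not taken from NUTCH-1614.patch.
class ExcludeRegexFileLoader {

    // Reads one regex per line and compiles it.
    // Lines starting with "#" are comments; blank lines are skipped (assumption).
    static List<Pattern> load(Reader source) throws IOException {
        List<Pattern> patterns = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    continue;
                }
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    public static void main(String[] args) throws IOException {
        // Made-up file content standing in for regex-indexer-exclude-urls.txt.
        String content =
            "# exclude listing pages from the index\n"
            + "https?://example\\.com/list/.*\n"
            + "\n"
            + ".*[?&]view=.*\n";
        List<Pattern> patterns = load(new StringReader(content));
        System.out.println(patterns.size() + " patterns loaded");
    }
}
```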

        Attachments

        1. RegexUtil.java
          5 kB
          Riyaz Shaik
        2. NUTCH-1614.patch
          13 kB
          Brian
        3. IndexerJob.java
          6 kB
          Riyaz Shaik

              People

              • Assignee: Unassigned
              • Reporter: Brian (brian44)
              • Votes: 0
              • Watchers: 3
