Nutch / NUTCH-2527

URL filter: provide rules to exclude localhost and private address spaces


    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: 2.3.1, 1.14
    • Fix Version/s: 2.4, 1.15
    • Component/s: None
    • Labels: None

      Description

While checking the log files of a large web crawl, I've found hundreds of (luckily failed) requests for local or private content:

      2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
      2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: fetch of http://127.0.0.42/ failed with: java.net.ConnectException: Connection refused (Connection refused)
      

In a wider web crawl, where links are not controlled, URLs pointing to localhost, loop-back addresses, and private address spaces should be blocked. Otherwise, information can be leaked through links or redirects that point to the web interfaces of services running on the crawling machines (e.g., HDFS, Hadoop YARN).

      Of course, this must be optional. For testing it's quite common to crawl your local machine.
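      A minimal sketch of such rules, as they could be added to conf/regex-urlfilter.txt (the urlfilter-regex plugin applies the first matching rule; lines starting with "-" exclude URLs, "#" marks a comment). The rules are shown commented out so they stay optional, and the patterns below are illustrative, not necessarily the ones to be committed:

      # exclude localhost and loop-back addresses (127.0.0.0/8, IPv6 ::1)
      #-^https?://(?:localhost|127(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])){3}|\[::1\])(?::\d+)?(?:/|$)
      # exclude private address spaces 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
      # (coarse pattern: would also block host names starting with "10.", etc.)
      #-^https?://(?:10|172\.(?:1[6-9]|2[0-9]|3[01])|192\.168)\.

      To activate the exclusions, one would remove the leading "#" so that the "-" rules take effect before the final catch-all "+." rule of the default filter file.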


              People

              • Assignee:
                snagel Sebastian Nagel
                Reporter:
                snagel Sebastian Nagel
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue
