Nutch / NUTCH-381

Ignore external links does not work as expected


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 0.8.1
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels: None

    Description

      Currently there is no way to properly limit the fetcher without regexp rules, so we use the ignore.external.links option, but it seems it doesn't work in all cases.
      Here are some example URLs I'm seeing, even though

      cat urls1 urls2 urls3 urls/urls | grep yahoo.com

      doesn't return any hits, i.e. none of these domains appear in our seed lists:

      fetching http://help.yahoo.com/help/sports
      fetching http://www.turkish-xxx.com/adult-traffic-trade.php
      fetching http://help.yahoo.com/help/us/astr/
      fetching http://www.polish-xxx.com/de-index.html
      fetching http://www.driversplanet.com/Articles/Software/SpareBackup2.4.aspx
      fetching http://help.yahoo.com/help/groups
      fetching http://help.yahoo.com/help/fin/
      fetching http://www.driversplanet.com/Articles/Software/WindowsStorageServer2003R2.aspx
      fetching http://help.yahoo.com/help/us/edit/
      fetching http://www.polish-xxx.com/es-index.html

      Has anyone else noticed this?

      I assume it has something to do with expired domains whose pages are generated randomly. But that still doesn't explain why URLs from other domains were added at all. Perhaps the +* (accept everything) rule in the regex URL filter is involved.
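For reference, here is a minimal sketch of the kind of host comparison an ignore-external-links check performs: an outlink counts as external when its host differs from the host of the page it was found on. This is my own illustration under that assumption, not Nutch's actual implementation; the class and method names are hypothetical. Note that if such a check is applied only to outlinks extracted during parsing, URLs that enter the crawl by other paths (redirects, for example) could bypass it entirely, which might explain the unexpected domains above.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ExternalLinkCheck {
    // Returns true when toUrl points at a different host than fromUrl.
    // Hosts are compared exactly, so help.yahoo.com and www.yahoo.com
    // are considered different hosts.
    public static boolean isExternal(String fromUrl, String toUrl) {
        try {
            String fromHost = new URL(fromUrl).getHost().toLowerCase();
            String toHost = new URL(toUrl).getHost().toLowerCase();
            return !fromHost.equals(toHost);
        } catch (MalformedURLException e) {
            // Treat unparsable URLs as external so they get dropped.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(
            isExternal("http://www.yahoo.com/", "http://help.yahoo.com/help/sports"));
    }
}
```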



People

    Assignee: Andrzej Bialecki (ab)
    Reporter: Uros Gruber (sekundek)
    Votes: 0
    Watchers: 1
