  1. Nutch
  2. NUTCH-1431

Introduce link 'distance' and add configurable max distance in the generator

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      This feature enables crawling URLs within a specific distance (shortest path) from the injected source URLs. This is where the db-updater of Nutchgora really shines: because every URL in the reducer has all of its inlinks present, it is easy to determine the shortest path to that URL. (I would not know how to cleanly implement this feature for trunk.)

      Injected URLs have distance 0. Outlinks on those pages have distance 1, outlinks on those pages have distance 2, and so on. Outlinks that already have a smaller distance keep that distance. Of all inlinks to a page, the smallest distance is always selected in order to maintain the shortest-path guarantee.
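
      The rule above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class name LinkDistance, the method updateDistance, and the use of -1 as an "unknown distance" marker are all assumptions made for the example.

```java
import java.util.List;

// Hypothetical helper illustrating the shortest-path rule; not taken from the patch.
final class LinkDistance {
    // Assumed marker meaning "no distance recorded yet".
    static final int UNKNOWN = -1;

    /**
     * Returns the new distance for a page, given its current distance and the
     * distances of all pages linking to it. The result is the minimum of the
     * current distance and (inlink distance + 1) over all inlinks.
     */
    static int updateDistance(int current, List<Integer> inlinkDistances) {
        int best = current;
        for (int inlinkDistance : inlinkDistances) {
            if (inlinkDistance == UNKNOWN) {
                continue; // an inlink without a known distance gives no information
            }
            int candidate = inlinkDistance + 1; // one extra hop through this inlink
            if (best == UNKNOWN || candidate < best) {
                best = candidate; // keep the shortest path seen so far
            }
        }
        return best;
    }
}
```

      An injected URL (distance 0) keeps 0 regardless of its inlinks, because every candidate is at least 1.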

      The generator now has a property 'generate.max.distance' (default -1) that specifies the maximum allowed distance of URLs to select for fetch.
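
      A sketch of how such a check might look in the generator. The class DistanceFilter and the method withinMaxDistance are hypothetical names; treating a negative value (such as the default -1) as "no limit" and treating the limit as inclusive are assumptions, not confirmed behaviour of the patch.

```java
// Hypothetical illustration of a max-distance check; not taken from the patch.
final class DistanceFilter {
    /**
     * Decides whether a URL at the given distance may be selected for fetch.
     *
     * @param distance    number of hops from the nearest injected URL
     * @param maxDistance value of 'generate.max.distance'
     */
    static boolean withinMaxDistance(int distance, int maxDistance) {
        // Assumption: a negative limit (e.g. the default -1) disables the check.
        if (maxDistance < 0) {
            return true;
        }
        // Assumption: the limit is inclusive.
        return distance <= maxDistance;
    }
}
```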

      Note that this is fundamentally different from the concept of crawl 'depth'. Depth refers to the number of crawl cycles. Distance allows crawling for an unlimited number of cycles while always staying within a certain number of 'hops' from the injected URLs.

      I will attach a patch. Will commit in a few days. (It does not change crawl behaviour unless otherwise configured). Let me know if you have comments.

      Attachments

        1. NUTCH-1431.patch
          8 kB
          Ferdy

          People

            Assignee: Unassigned
            Reporter: Ferdy (ferdy.g)
            Votes: 0
            Watchers: 3
