NUTCH-1331: limit crawler to defined depth

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.7
    • Component/s: generator, parser, storage
    • Labels: None

    Description

      There is a need to limit the crawler to a defined depth. The main value of this option is to avoid crawling the infinite loops of dynamically generated URLs that occur on some sites, and to help the crawler focus on the more important URLs.
      One option is to set an iteration limit on the generate/fetch/parse/updatedb cycle, but that works only if, in each cycle, all unfetched URLs actually become fetched (without recrawling them, and with some other considerations).
      Instead, we can define a new parameter in CrawlDatum named depth and, like the score-opic algorithm, compute the depth of each link after parsing; the generate step then selects only URLs with a valid depth.
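      The following is a minimal, self-contained sketch of that idea (plain Java, not the code in the attached patches): depth is kept as a metadata entry standing in for a CrawlDatum field, incremented when it is handed from a parent page to its outlinks after parsing, and checked at generate time. The key name "_depth_", the class, and its method names are hypothetical and used for illustration only.

      import java.util.HashMap;
      import java.util.Map;

      /**
       * Illustrative sketch only: depth travels as a metadata entry on each
       * record (standing in for CrawlDatum metadata), is incremented when
       * passed from a parent page to its outlinks after parsing, and is
       * checked at generate time so URLs beyond the limit are skipped.
       */
      public class DepthLimitSketch {

          private static final String DEPTH_KEY = "_depth_"; // hypothetical key name
          private final int maxDepth;

          public DepthLimitSketch(int maxDepth) {
              this.maxDepth = maxDepth;
          }

          /** Seed (injected) URLs start at depth 0. */
          public Map<String, String> injectedMetadata() {
              Map<String, String> meta = new HashMap<>();
              meta.put(DEPTH_KEY, "0");
              return meta;
          }

          /** After parsing, each outlink inherits the parent's depth plus one. */
          public Map<String, String> outlinkMetadata(Map<String, String> parentMeta) {
              int parentDepth = Integer.parseInt(parentMeta.getOrDefault(DEPTH_KEY, "0"));
              Map<String, String> meta = new HashMap<>();
              meta.put(DEPTH_KEY, String.valueOf(parentDepth + 1));
              return meta;
          }

          /** At generate time, only URLs within the depth limit are selected. */
          public boolean shouldGenerate(Map<String, String> meta) {
              int depth = Integer.parseInt(meta.getOrDefault(DEPTH_KEY, "0"));
              return depth <= maxDepth;
          }

          public static void main(String[] args) {
              DepthLimitSketch sketch = new DepthLimitSketch(2);
              Map<String, String> seed = sketch.injectedMetadata();        // depth 0
              Map<String, String> level1 = sketch.outlinkMetadata(seed);   // depth 1
              Map<String, String> level2 = sketch.outlinkMetadata(level1); // depth 2
              Map<String, String> level3 = sketch.outlinkMetadata(level2); // depth 3
              System.out.println(sketch.shouldGenerate(level2)); // true  -> generated
              System.out.println(sketch.shouldGenerate(level3)); // false -> skipped
          }
      }

      In Nutch itself this bookkeeping would most naturally live in a scoring-filter plugin, so that the depth value travels with each CrawlDatum through the generate/fetch/parse/updatedb cycle.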

      Attachments

        1. NUTCH-1331.patch
          5 kB
          behnam nikbakht
        2. NUTCH-1331-v2.patch
          11 kB
          Julien Nioche

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: behnam nikbakht
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:
