  1. Nutch
  2. NUTCH-547

Redirection handling: YahooSlurp's algorithm


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels: None

    Description

      After reading Yahoo's algorithm (the one Andrzej linked to:
      http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
      in the redirect/alias handling discussion, I had a bit of spare
      time, so I implemented it.

      Note that the patch I am attaching covers only the 'choosing' algorithm described in
      Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling.)

      E.g.,
      generate "http://www.milliyet.com.tr/"

      fetch "http://www.milliyet.com.tr/", which redirects to
      "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".

      Update the second page's datum metadata to indicate that
      "http://www.milliyet.com.tr/" is the representative form.

      Updatedb, invertlinks, etc...

      While indexing the second page, change its "url" field to
      "http://www.milliyet.com.tr/".

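The flow in the example above can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not the code in the attached patches: the `Datum` class, the `REPR_KEY` metadata key, and the `record_redirect` and `index_document` helpers are all hypothetical names. The sketch also takes the chosen representative URL as an input and does not implement the 'choosing' rules themselves.

```python
from dataclasses import dataclass, field

# Hypothetical metadata key; the real patch may use a different name.
REPR_KEY = "representative_url"


@dataclass
class Datum:
    """Stand-in for a crawl datum: a URL plus free-form metadata."""
    url: str
    metadata: dict = field(default_factory=dict)


def record_redirect(crawl_db, target_url, representative):
    """After a fetch redirects to target_url, note on the target's datum
    which URL was chosen as the representative form."""
    datum = crawl_db.setdefault(target_url, Datum(target_url))
    datum.metadata[REPR_KEY] = representative
    return datum


def index_document(datum):
    """Build the indexed fields; the "url" field uses the representative
    form when one was recorded, otherwise the datum's own URL."""
    return {"url": datum.metadata.get(REPR_KEY, datum.url)}


crawl_db = {}
source = "http://www.milliyet.com.tr/"
target = "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39"

# Assumption for this example: the choosing step picked the source URL.
record_redirect(crawl_db, target, representative=source)

doc = index_document(crawl_db[target])
print(doc["url"])  # prints http://www.milliyet.com.tr/
```

A page that was never the target of a redirect carries no representative entry, so it is indexed under its own URL.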
      Attachments

        1. redirect_draft.patch
          25 kB
          Dogacan Guney
        2. redirect_draft_v2.patch
          26 kB
          Dogacan Guney
        3. NUTCH-547-3.patch
          26 kB
          Dennis Kubes

            People

              Assignee: Dogacan Guney
              Reporter: Dogacan Guney