  1. Nutch
  2. NUTCH-547

Redirection handling: YahooSlurp's algorithm

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels: None

      Description

      After reading Yahoo's algorithm (the one Andrzej linked to:
      http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
      in the redirect/alias handling discussion, I had a bit of spare
      time, so I implemented it.

      Note that the patch I am attaching is for the 'choosing' algorithm described in
      Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).

      E.g.,
      generate "http://www.milliyet.com.tr/"

      fetch "http://www.milliyet.com.tr/", which redirects to
      "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".

      Update the metadata of the second page's datum to indicate that
      "http://www.milliyet.com.tr/" is the representative form.

      Updatedb, invertlinks, etc...

      While indexing second page, change its "url" field to
      "http://www.milliyet.com.tr/".

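      The flow in the example above can be sketched roughly as follows. This is
      only an illustration, not code from the attached patch: the class name,
      the "repr_url" metadata key, and the in-memory map standing in for the
      crawl datum's metadata are all hypothetical, and the choosing step itself
      (deciding which URL is the representative form) is assumed to have
      already happened.

```java
import java.util.HashMap;
import java.util.Map;

class ReprUrlSketch {
    // Hypothetical metadata key; the real patch may use a different name.
    static final String REPR_KEY = "repr_url";

    // Stand-in for per-URL crawl datum metadata (assumption, not Nutch's CrawlDb).
    static final Map<String, Map<String, String>> datumMetadata = new HashMap<>();

    // After a redirect, record on the target page's metadata which URL
    // was chosen as the representative form.
    static void recordRepresentative(String targetUrl, String chosenUrl) {
        datumMetadata
            .computeIfAbsent(targetUrl, k -> new HashMap<>())
            .put(REPR_KEY, chosenUrl);
    }

    // At indexing time, return the value to store in the "url" field:
    // the representative form if one was recorded, else the fetched URL.
    static String urlFieldFor(String fetchedUrl) {
        Map<String, String> meta = datumMetadata.get(fetchedUrl);
        if (meta == null) {
            return fetchedUrl;
        }
        return meta.getOrDefault(REPR_KEY, fetchedUrl);
    }

    public static void main(String[] args) {
        String source = "http://www.milliyet.com.tr/";
        String target = "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39";

        // Fetching `source` redirected to `target`; `source` was chosen
        // as the representative form.
        recordRepresentative(target, source);

        // While indexing the second page, its "url" field becomes the source URL.
        System.out.println(urlFieldFor(target));
    }
}
```

      Keeping the representative URL in the target page's own metadata means
      the later steps (updatedb, invertlinks, indexing) need no separate
      redirect table; a page that was never a redirect target simply keeps
      its own URL.
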
        Attachments

        1. redirect_draft.patch
          25 kB
          Dogacan Guney
        2. redirect_draft_v2.patch
          26 kB
          Dogacan Guney
        3. NUTCH-547-3.patch
          26 kB
          Dennis Kubes

              People

              • Assignee: Dogacan Guney
              • Reporter: Dogacan Guney