Nutch / NUTCH-603

Add more default url normalizations

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None
    • Environment: All

    • Patch Info: Patch Available

    Description

      By default the regex-urlnormalizers only remove PHPSESSID strings. I propose adding more default URL normalizers, including expressions for removing other types of session IDs, removing default pages, removing interpage links, and cleaning up URL strings. The point of these expressions is to reduce the number of duplicate URLs that are stored and scored in the crawl database and then fetched.
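      To illustrate the kind of rules proposed here, the following is a minimal Python sketch of an ordered list of regex substitutions. The patterns, the rule order, the parameter names other than PHPSESSID, and the `normalize` helper are all assumptions made for this example; they are not the expressions from the attached patches, and Nutch itself applies Java regular expressions from its own configuration.

```python
import re

# Hypothetical rule set for illustration only; NOT the patterns from the
# NUTCH-603 patches. Each rule is (compiled pattern, replacement), applied
# in order.
RULES = [
    # Drop the in-page fragment (everything from '#').
    (re.compile(r"#.*$"), ""),
    # Drop a ';jsessionid=...' path parameter.
    (re.compile(r";jsessionid=[^?#]*", re.IGNORECASE), ""),
    # Drop session-id query parameters (the names chosen here are assumptions).
    (re.compile(r"(?<=[?&])(?:PHPSESSID|sid)=[^&#]*&?", re.IGNORECASE), ""),
    # Clean up a dangling '?' or '&' left behind by the rule above.
    (re.compile(r"[?&]+$"), ""),
    # Replace a default page name with the bare directory.
    (re.compile(r"/(?:index|default)\.(?:html?|php|jsp|aspx?)(?=$|\?)",
                re.IGNORECASE), "/"),
    # Collapse repeated slashes, leaving the '//' after the scheme alone.
    (re.compile(r"(?<!:)/{2,}"), "/"),
]


def normalize(url: str) -> str:
    """Apply every rule in order and return the normalized URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


if __name__ == "__main__":
    for u in (
        "http://example.com/a.php?PHPSESSID=abc123&x=1",
        "http://example.com/shop;jsessionid=ABC123?item=5#reviews",
        "http://example.com//a//b/default.aspx?sid=9",
    ):
        print(u, "->", normalize(u))
```

      With rules like these, `http://example.com/docs/index.html#top` and `http://example.com/docs/` normalize to the same string, so only one entry would need to be stored, scored, and fetched.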

      Attachments

        1. NUTCH-603-1-20080205.patch
          9 kB
          Dennis Kubes
        2. NUTCH-603-2-20080212.patch
          9 kB
          Dennis Kubes

          People

            Assignee: Dennis Kubes (musepwizard)
            Reporter: Dennis Kubes (musepwizard)
            Votes: 0
            Watchers: 0