  1. Nutch
  2. NUTCH-255

Regular Expression for RegexUrlNormalizer to remove jsessionid


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Duplicate
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None
    • Environment: Windows XP Media Center 2005, 2 GB RAM, 3.0 GHz Pentium 4 Hyper-Threaded, Eclipse 3.2.0

    Description

      By default, the crawl URL filter rejects URLs that contain certain special characters. This includes URLs that carry a jsessionid, such as:

      http://www.somesite.com;jsessionid=A8D7D812B5EFD3099F099A760F779E3B?query=string

      We want to remove the jsessionid and keep everything else, so that the URL looks like this:

      http://www.somesite.com?query=string

      Below is a regular expression for the regex-normalize.xml file used by the RegexUrlNormalizer that successfully removes jsessionid strings while leaving the hostname and query string intact. I have also attached a patch for the regex-normalize.xml.template file that adds this expression.

      <regex>
      <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
      <substitution>$1$3</substitution>
      </regex>
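
To check the rule outside of Nutch, the pattern and substitution can be tried with plain java.util.regex. This is only a minimal sketch: the class name JsessionidSketch and the method normalize are hypothetical and not part of Nutch, and it assumes the normalizer applies the rule in roughly the way String-level replaceAll does.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the proposed rule; not Nutch code.
class JsessionidSketch {

    // Same pattern as in the <pattern> element above:
    // group 1 = everything before the session id,
    // group 2 = ";jsessionid=" plus exactly 32 alphanumeric characters,
    // group 3 = everything after it (for example the query string).
    private static final Pattern JSESSIONID =
            Pattern.compile("(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)");

    // Applies the substitution "$1$3", which drops group 2.
    static String normalize(String url) {
        return JSESSIONID.matcher(url).replaceAll("$1$3");
    }

    public static void main(String[] args) {
        String url = "http://www.somesite.com;jsessionid="
                + "A8D7D812B5EFD3099F099A760F779E3B?query=string";
        System.out.println(normalize(url));
    }
}
```

Because the pattern requires exactly 32 alphanumeric characters, a session id of another length or with other characters would not match, and the URL would pass through unchanged.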

      Attachments

        1. urlnormalize_jessionid.patch
          0.6 kB
          Dennis Kubes

            People

              Assignee: Unassigned
              Reporter: Dennis Kubes (musepwizard)