Nutch / NUTCH-255

Regular Expression for RegexUrlNormalizer to remove jsessionid

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Duplicate
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      Windows XP Media Center 2005, 2 Gigs RAM, 3.0 Ghz Pentium 4 Hyperthreaded, Eclipse 3.2.0

      Description

      By default, the crawl URL filter rejects URLs that contain certain special characters. This includes URLs that carry a jsessionid, such as:

      http://www.somesite.com;jsessionid=A8D7D812B5EFD3099F099A760F779E3B?query=string

      We want to remove the jsessionid and keep everything else, so that the URL looks like this:

      http://www.somesite.com?query=string

      Below is a regular expression for the regex-normalize.xml file used by the RegexUrlNormalizer. It successfully removes jsessionid strings while leaving the hostname and query string intact. I have also attached a patch for the regex-normalize.xml.template file that adds this expression.

      <regex>
      <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
      <substitution>$1$3</substitution>
      </regex>
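
      To check the expression outside of Nutch, it can be exercised with plain java.util.regex. This is only a minimal sketch, assuming the normalizer applies the pattern and substitution with replaceAll-style semantics; the class name JsessionidStripDemo and the helper strip are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

public class JsessionidStripDemo {
    // Same expression as in the <pattern> element above:
    // group 1 = text before the session id, group 2 = ";jsessionid=" plus
    // exactly 32 alphanumeric characters, group 3 = the rest of the URL.
    private static final Pattern JSESSIONID =
        Pattern.compile("(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)");

    // Hypothetical helper (not a Nutch API): keeps groups 1 and 3 and
    // drops group 2, mirroring the <substitution>$1$3</substitution> rule.
    static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("$1$3");
    }

    public static void main(String[] args) {
        String url = "http://www.somesite.com"
            + ";jsessionid=A8D7D812B5EFD3099F099A760F779E3B?query=string";
        // prints http://www.somesite.com?query=string
        System.out.println(strip(url));
    }
}
```

      Note that the expression only matches session ids of exactly 32 alphanumeric characters; a URL whose jsessionid has a different length or contains other characters passes through unchanged.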

            People

            • Assignee: Unassigned
            • Reporter: Dennis Kubes
            • Votes: 1
            • Watchers: 1
