Nutch / NUTCH-255

Regular Expression for RegexUrlNormalizer to remove jsessionid

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Duplicate
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      Windows XP Media Center 2005, 2 Gigs RAM, 3.0 Ghz Pentium 4 Hyperthreaded, Eclipse 3.2.0

      Description

      By default, the crawl URL filter rejects URLs that contain certain special characters. URLs that carry a jsessionid are one such case, for example:

      http://www.somesite.com;jsessionid=A8D7D812B5EFD3099F099A760F779E3B?query=string

      We want to remove the jsessionid and keep everything else, so that the URL looks like this:

      http://www.somesite.com?query=string

      Below is a regular expression for the regex-normalize.xml file used by the RegexUrlNormalizer that successfully removes jsessionid strings while leaving the hostname and query string intact. I have also attached a patch for the regex-normalize.xml.template file that adds the following expression.

      <regex>
      <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
      <substitution>$1$3</substitution>
      </regex>
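
      The effect of this rule can be checked outside Nutch with a minimal sketch. It assumes java.util.regex semantics and a replace-all style substitution; the class and method names are hypothetical, and the normalizer's own matching engine may behave differently.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; not part of Nutch.
class JsessionidNormalizeSketch {
    // Same pattern as the <pattern> element above:
    // group 1 = everything before the session id,
    // group 2 = ";jsessionid=" plus exactly 32 alphanumeric characters,
    // group 3 = everything after it (for example the query string).
    private static final Pattern JSESSIONID =
        Pattern.compile("(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)");

    // Mirrors <substitution>$1$3</substitution>: keep groups 1 and 3, drop group 2.
    static String normalize(String url) {
        return JSESSIONID.matcher(url).replaceAll("$1$3");
    }

    public static void main(String[] args) {
        String url = "http://www.somesite.com;jsessionid="
            + "A8D7D812B5EFD3099F099A760F779E3B?query=string";
        System.out.println(normalize(url));
    }
}
```

      URLs without a 32-character jsessionid do not match the pattern and pass through unchanged.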


          Activity

          Andrzej Bialecki added a comment -

          Duplicate of NUTCH-279.

          Stefan Neufeind added a comment -

          You might want to have a / right after the .com in the example, but that's not too important here.
          You can also omit the (.*) at the beginning and end of the expression, as it's not needed for this task.

          NUTCH-279 includes your patch modified in there.
          PS: Thanks for the contribution.
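
          The simplification suggested in this comment can be sketched as follows. It assumes java.util.regex semantics and that the matched text is replaced with an empty string; the class and method names are hypothetical, and this is not necessarily the form used in NUTCH-279.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; not part of Nutch.
class SimplifiedJsessionidSketch {
    // Without the surrounding (.*) groups, only the session id itself is matched.
    private static final Pattern JSESSIONID =
        Pattern.compile(";jsessionid=[a-zA-Z0-9]{32}");

    // Replacing the match with nothing leaves the rest of the URL untouched.
    static String normalize(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Example URL with the / after .com, as the comment recommends.
        String url = "http://www.somesite.com/;jsessionid="
            + "A8D7D812B5EFD3099F099A760F779E3B?query=string";
        System.out.println(normalize(url));
    }
}
```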

          Dennis Kubes added a comment -

          Patch file that adds a regular expression to the regex-normalize.xml.template file to remove jsessionid strings from URLs.


            People

            • Assignee:
              Unassigned
              Reporter:
              Dennis Kubes
            • Votes:
              1
              Watchers:
              1
