Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

      Description

      IMHO needed:
      1) Extend the normalization rules to cover commonly used session IDs etc.
      2) Ship a checker so that rules can easily be tested by hand

      1. regex-normalize2.patch
        4 kB
        Stefan Neufeind
      2. regex-normalize.patch
        4 kB
        Stefan Neufeind

          Activity

          Stefan Neufeind added a comment -

          1) Incorporates jsessionid-normalization from NUTCH-255
          2) Adds further normalizations
          3) Adds a commandline-checker. Start with:
          bin/nutch org.apache.nutch.net.RegexUrlNormalizerChecker
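
To make the idea concrete, here is a minimal sketch of a regex-based session-ID normalization together with a tiny hand checker. It is not the code from the patch and does not use any Nutch classes: the class name SessionIdNormalizerSketch, the method normalize, and the pattern itself are assumptions made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the RegexUrlNormalizer from the patch.
class SessionIdNormalizerSketch {
    // Assumed rule: drop a ";jsessionid=<alphanumeric token>" path parameter.
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[A-Za-z0-9]+", Pattern.CASE_INSENSITIVE);

    // Returns the URL with every match of the rule removed.
    static String normalize(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    // Tiny checker: normalizes the URLs given as arguments, or a sample URL if none are given.
    public static void main(String[] args) {
        String[] urls = args.length > 0
                ? args
                : new String[] {"http://example.com/shop;jsessionid=1A2B3C4D?item=7"};
        for (String url : urls) {
            System.out.println(url + " -> " + normalize(url));
        }
    }
}
```

Run without arguments, this prints the sample URL followed by http://example.com/shop?item=7. A real rule set would presumably hold several such pattern/substitution pairs and apply them in order.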

          Stefan Neufeind added a comment -

          New patch with just one session-ID regex extended (it now also accepts the characters ".", "-" and ","), since I came across those extra characters while using it on a common German website (www.bahn.de).
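
The effect of widening a character class like this can be sketched as follows. The comment does not say which rule was extended, so the jsessionid pattern here is only a stand-in, and the class name ExtendedSessionIdSketch, the two patterns, and the sample URL are all assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the patterns are stand-ins, not the ones from the patch.
class ExtendedSessionIdSketch {
    // Narrow rule: the session token may only contain letters and digits.
    static final Pattern NARROW = Pattern.compile(";jsessionid=[A-Za-z0-9]+");
    // Extended rule: the token may also contain '.', ',' and '-'.
    static final Pattern EXTENDED = Pattern.compile(";jsessionid=[A-Za-z0-9.,-]+");

    // Removes every match of the given rule from the URL.
    static String strip(Pattern rule, String url) {
        return rule.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String url = "http://example.com/a;jsessionid=AB12.node-3,x?q=1";
        // The narrow rule stops at the first '.', leaving part of the token behind.
        System.out.println(strip(NARROW, url));   // http://example.com/a.node-3,x?q=1
        // The extended rule removes the whole token.
        System.out.println(strip(EXTENDED, url)); // http://example.com/a?q=1
    }
}
```

The trade-off is that a wider class strips more of the URL, so a rule like this is worth checking by hand against real URLs before relying on it.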

          Andrzej Bialecki added a comment -

          Committed with some modifications. All patterns in this patch except one have been added in another commit; the remaining one (-S: ...) IMHO occurs too rarely, and the pattern would be too inclusive. The checking utility has been rewritten to follow a model similar to URLFilterChecker.

          Hudson added a comment -

          Integrated in Nutch-trunk #714 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/714/)
          Additions to urlnormalizer-regex (modified).


            People

            • Assignee:
              Andrzej Bialecki
              Reporter:
              Stefan Neufeind
            • Votes:
              3
              Watchers:
              0