  1. Nutch
  2. NUTCH-1062

Migrate BasicURLNormalizer from Apache ORO to java.util.regex

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10, 2.3.1
    • Component/s: None
    • Labels: None

    Description

      This issue tracks the migration from ORO to j.u.regex. There is a small problem, though. I began the migration mostly because of the double-slash issue: the fix needs a lookbehind, which ORO does not support, to prevent the "//" after the URL scheme from being reduced to a single slash. The current BasicURLNormalizer has this problem built in:

              // this pattern tries to find spots like "xx//yy" in the url,
              // which could be replaced by a "/"
              adjacentSlashRule = new Rule();
              adjacentSlashRule.pattern = (Perl5Pattern)      
                compiler.compile("/{2,}", Perl5Compiler.READ_ONLY_MASK);     
              adjacentSlashRule.substitution = new Perl5Substitution("/");
      

      This rule gives the wrong result, because it collapses the slashes after the scheme as well. What to do? Migrate to j.u.regex and keep this `feature` intact?

      Edit: on further reading, it looks like this is fixed at a later stage: a slash is added back for the URI schemes http and ftp.
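
      For illustration, here is a minimal sketch of how the adjacent-slash rule could look in j.u.regex with a negative lookbehind, next to the behaviour of the existing "/{2,}" pattern. The class name, method names, and the exact lookbehind pattern are assumptions made for this example; this is not the code that was committed for this issue.

```java
import java.util.regex.Pattern;

public class AdjacentSlashSketch {

    // Same expression as the ORO rule: any run of two or more slashes.
    private static final Pattern NAIVE = Pattern.compile("/{2,}");

    // Hypothetical alternative: a run of two or more slashes that is
    // not directly preceded by a colon, so the "//" in "http://" is kept.
    private static final Pattern GUARDED = Pattern.compile("(?<!:)/{2,}");

    // Collapses every slash run, including the one after the scheme.
    static String collapseNaive(String url) {
        return NAIVE.matcher(url).replaceAll("/");
    }

    // Collapses slash runs but leaves the scheme separator alone.
    static String collapseGuarded(String url) {
        return GUARDED.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        String url = "http://example.com//a///b";
        System.out.println(collapseNaive(url));   // http:/example.com/a/b
        System.out.println(collapseGuarded(url)); // http://example.com/a/b
    }
}
```

      Note that the guarded pattern treats every "://" the same way, so a URL embedded in a query string keeps its double slash too. Whether that is desirable depends on how the rest of the normalizer handles the query part.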

            People

              Assignee: markus17 Markus Jelsma
              Reporter: markus17 Markus Jelsma
              Votes: 0
              Watchers: 2
