Nutch / NUTCH-1013

Migrate RegexURLNormalizer from Apache ORO to java.util.regex

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4, nutchgora
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Apache ORO uses old Perl 5-style regular expressions; features such as the powerful lookbehind are not available. The ORO project has also been retired.
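
To illustrate what lookbehind makes possible once rules are compiled with java.util.regex, here is a minimal sketch. It is not the code from the patch: the class name `RegexNormalizerSketch`, the `Rule` holder, and both example rules are hypothetical and chosen only for illustration. It assumes a normalizer that applies an ordered list of pattern/substitution pairs to each URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual RegexURLNormalizer implementation.
public class RegexNormalizerSketch {

    /** One pattern/substitution pair (assumed rule shape for this sketch). */
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            // java.util.regex.Pattern takes the place of ORO's Perl5Compiler here.
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    private static final List<Rule> RULES = new ArrayList<>();

    static {
        // Negative lookbehind: collapse runs of two or more slashes,
        // but leave the "//" that directly follows "scheme:" untouched.
        RULES.add(new Rule("(?<!:)/{2,}", "/"));
        // Drop a dangling "?" left at the end of a URL with an empty query.
        RULES.add(new Rule("\\?$", ""));
    }

    /** Applies every rule in order and returns the rewritten URL. */
    static String normalize(String url) {
        String result = url;
        for (Rule rule : RULES) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "http://example.com/a/b/"
        System.out.println(normalize("http://example.com//a///b/"));
    }
}
```

The first rule relies on `(?<!:)`, a negative lookbehind, which is the kind of construct the description says ORO cannot express.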

          Activity

          Markus Jelsma added a comment -

          Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220

          Hudson added a comment -

          Integrated in Nutch-trunk #1536 (See https://builds.apache.org/job/Nutch-trunk/1536/)
          NUTCH-1013 Migrate RegexURLNormalizer from Apache ORO to java.util.regex

          markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1142687
          Files :

          • /nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java
          • /nutch/trunk/CHANGES.txt
          Markus Jelsma added a comment -

          Committed for trunk in rev. 1142687.
          Thanks for your comments Julien.

          Julien Nioche added a comment -

          Ahh, sorry, I'd read your comment too quickly. Since it compiles and passes the test, it is fine to commit to trunk.

          Markus Jelsma added a comment -

          I had confirmed that it compiles against 2.0 and that the 2.0 unit tests pass when I attached the 1.4 patch. The only thing I didn't do was execute a crawl cycle. Sorry for the confusion; my assumption above was that I could perhaps commit because it compiled and the tests passed.

          [junit] Running org.apache.nutch.net.TestURLNormalizers
          [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.248 sec

          Failing tests are all Gora related.

          Julien Nioche added a comment -

          I'd rather we REALLY checked that it compiles and passes the test. We can leave the issue open for now until someone has the time to commit to trunk after checking that it is fine.

          Markus Jelsma added a comment -

          Committed for 1.4 in rev 1142664.

          I can also commit this for trunk on just the assumption that it compiles well and unit tests pass and the normalizer code hasn't changed between versions. Is this assumption safe enough?

          Markus Jelsma added a comment -

          The commented-out line has been removed now. I'll commit this shortly for 1.4-dev. I'm unsure whether to commit it for trunk, as I don't have a full test environment here and not much time.

          Julien Nioche added a comment -

          +1 patch looks fine except maybe for :

          -    Perl5Compiler compiler = new Perl5Compiler();
          +    //Perl5Compiler compiler = new Perl5Compiler();
          

          where the second line should be removed

          Julien Nioche added a comment -

          Thanks for the details, Markus. I don't think that 17% justifies keeping ORO, especially since, as you pointed out, j.u.regex has more functionality and is maintained. I'll have a look at the patch tomorrow.

          Markus Jelsma added a comment -

          Yes, ORO is superior in terms of raw speed; on average ORO is ~17% faster. This was measured with a CrawlDB of roughly 2.2 million URLs. The generator was not limited with -topN.

          java.util.regex averages 310 seconds of run time, whereas ORO averages 263 seconds. This was on a dedicated machine without Hadoop.

          More interesting, in my opinion, is the reduced memory consumption. ORO uses almost three times more heap space than util.regex. The same generate cycles show about 12.4% for ORO, and util.regex never went higher than 4.8%.

          Is the performance penalty considered to be blocking?

          Julien Nioche added a comment -

          Any idea about how java.util.regex fares against ORO in terms of speed?

          Markus Jelsma added a comment -

          Agreed, although I would then definitely get rid of the XML configuration file. Anyway, I'd like to commit this issue for 1.4. Any objections?

          Ken Krugler added a comment -

          No comment directly related to this patch, but URL normalization seems like a great component to move into crawler-commons, since all web crawlers need to do the same thing.

          Markus Jelsma added a comment - - edited

          Patch for RegexURLNormalizer for 1.4. Seems to work fine. It also compiles against trunk. Unit tests pass as well.

          Are there objections? Things to take special care of?


            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma