Nutch / NUTCH-160

Use standard Java Regex library rather than org.apache.oro.text.regex

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels: None

      Description

      org.apache.oro.text.regex is based on Perl 5.003, which has some corner cases that perform poorly. The standard regular expression library for Java (1.4 and later) does not seem to have these issues.
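As an illustration of the proposed switch, here is a minimal sketch of a rule-based URL filter built only on java.util.regex. It is not the contents of the attached patch: the class name RegexUrlFilterSketch, the add and filter methods, and the "first matching rule decides" behaviour are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual RegexURLFilter from the patch.
class RegexUrlFilterSketch {

    // One rule: a compiled pattern plus whether a match accepts or rejects.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            // java.util.regex replaces the ORO compiler/matcher pair here.
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Appends a rule; rules are evaluated in insertion order.
    RegexUrlFilterSketch add(boolean accept, String regex) {
        rules.add(new Rule(accept, regex));
        return this;
    }

    // Returns the URL if the first matching rule accepts it, otherwise null.
    // A URL that matches no rule is rejected (an assumption of this sketch).
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null;
    }
}
```

A filter built as `new RegexUrlFilterSketch().add(false, "\\.(gif|jpg|png)$").add(true, "^http://")` would reject image URLs first and accept any remaining http URL.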

      Attachment: regex.patch (3 kB, Rod Taylor)

        Activity

        Rod Taylor added a comment -

        Patch for RegexURLFilter.java

        Rod Taylor added a comment -

        This patch also appears to eliminate the issue reported to the mailing list on November 18th under the subject "Urlfilter bug (doesn't return on long URLs)", in which abnormally long URLs caused a timeout in the URLFilter.

        Doug Cutting added a comment -

        +1

        I like this patch. I don't see a need for us to use oro anywhere, since Java now has good built-in regex support. And Java's regexes are faster in many cases, not just this one:

        http://tbray.org/ongoing/When/200x/2004/08/22/PJre

        There are a few places in which Java's regexes are incompatible with Perl 5 regexes, documented in the "Comparison to Perl 5" section of:

        http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

        So this change is not completely back-compatible.

        Any objections?
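Because the change is not completely back-compatible, existing filter rules could be checked by compiling each one with java.util.regex before relying on it. The sketch below shows one way to do that; the class name RegexCompatCheck and the compiles method are hypothetical and are not part of the patch.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper for checking rules written for the ORO (Perl 5) engine.
class RegexCompatCheck {

    // Returns true if java.util.regex accepts the expression's syntax.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // The exception message describes the offending position.
            return false;
        }
    }
}
```

This only catches constructs that java.util.regex rejects outright, such as the Perl conditional constructs listed as unsupported in the "Comparison to Perl 5" section. Expressions that compile under both engines but match differently would still need to be tested against sample URLs.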

        Doug Cutting added a comment -

        I just committed this patch. Thanks!

        Sami Siren added a comment -

        Closing issues for released versions.


          People

          • Assignee: Unassigned
          • Reporter: Rod Taylor
          • Votes: 0
          • Watchers: 0
