Nutch / NUTCH-1352

Improve regex urlfilters/normalizers synchronization

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: nutchgora, 1.6
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      I noticed that during fetching, the fetcher threads spend much of their time blocked on a monitor because of outlink normalizing/filtering. The cause: some of the regex plugins synchronize on a single lock.

      This patch improves throughput by removing those synchronization locks and replacing them with ThreadLocals where needed.

      It has been extensively tested in production. I will commit this later today if there are no objections.
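
      The general technique described above can be sketched as follows. This is only an illustration and not the contents of NUTCH-1352.patch: the class name RegexSync, the slash-collapsing pattern, and both method names are hypothetical. It assumes Java 8 or later (for ThreadLocal.withInitial). The idea is that a compiled Pattern can be shared freely, while the stateful Matcher is kept per thread so that no monitor is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: swapping a single shared lock for a ThreadLocal.
final class RegexSync {
    // A compiled Pattern is immutable and safe to share between threads.
    // This example pattern matches runs of slashes not preceded by ':'.
    private static final Pattern EXTRA_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Before: one shared, stateful Matcher guarded by one monitor.
    // Every fetcher thread has to wait its turn on this lock.
    private static final Matcher SHARED = EXTRA_SLASHES.matcher("");

    static synchronized String normalizeLocked(String url) {
        return SHARED.reset(url).replaceAll("/");
    }

    // After: each thread lazily gets its own Matcher, so no lock is needed.
    private static final ThreadLocal<Matcher> LOCAL =
        ThreadLocal.withInitial(() -> EXTRA_SLASHES.matcher(""));

    static String normalize(String url) {
        return LOCAL.get().reset(url).replaceAll("/");
    }
}
```

      Both methods return the same result for the same input; the difference is that calls to normalize from different threads never contend with each other. The cost is one Matcher instance per thread, which is usually small next to the time lost waiting on a lock.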

      1. NUTCH-1352.patch
        15 kB
        Ferdy Galema
      2. NUTCH-1352-1.6-1.patch
        15 kB
        Markus Jelsma

        Activity

        Markus Jelsma added a comment -

        Interesting. This should apply to trunk as well.

        Ferdy Galema added a comment -

        This indeed applies to trunk too. (Except for a minor patch segment about a logging statement... quite irrelevant).

        I'll commit it to trunk too.

        Ferdy Galema added a comment -

        On second thought, I will hold off on committing to trunk for now. (Feature freeze I guess?)

        Markus Jelsma added a comment -

        Yes, thanks

        Markus Jelsma added a comment -

        Slightly modified patch for trunk.

        Ferdy Galema added a comment -

        Thanks.

        Ferdy Galema added a comment -

        Committed to nutchgora.

        Hudson added a comment -

        Integrated in Nutch-nutchgora #248 (See https://builds.apache.org/job/Nutch-nutchgora/248/)
        NUTCH-1352 Improve regex urlfilters/normalizers synchronization (Revision 1335066)

        Result = SUCCESS
        ferdy :
        Files :

        • /nutch/branches/nutchgora/CHANGES.txt
        • /nutch/branches/nutchgora/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
        • /nutch/branches/nutchgora/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
        • /nutch/branches/nutchgora/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java
        • /nutch/branches/nutchgora/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
        • /nutch/branches/nutchgora/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java
        Lewis John McGibbney added a comment -

        Markus, feel free to commit this to trunk.

        Markus Jelsma added a comment -

        Committed for 1.6 in rev. 1349227.
        Thanks Ferdy.

        Hudson added a comment -

        Integrated in nutch-trunk-maven #310 (See https://builds.apache.org/job/nutch-trunk-maven/310/)
        NUTCH-1352 Improve regex urlfilters/normalizers synchronization (Revision 1349227)

        Result = SUCCESS
        markus :
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
        • /nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
        • /nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java
        • /nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
        • /nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java
        Hudson added a comment -

        Integrated in Nutch-trunk #1869 (See https://builds.apache.org/job/Nutch-trunk/1869/)
        NUTCH-1352 Improve regex urlfilters/normalizers synchronization (Revision 1349227)

        Result = SUCCESS
        markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349227
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
        • /nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
        • /nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java
        • /nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
        • /nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java

          People

          • Assignee:
            Unassigned
            Reporter:
            Ferdy Galema
          • Votes:
            0
            Watchers:
            3
