Nutch / NUTCH-1314

Impose a limit on the length of outlink target urls

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      In the past we have encountered situations where crawling specific broken sites resulted in ridiculously long URLs that stalled tasks. The regex plugins (normalizing/filtering) processed single URLs for hours, if not hanging indefinitely.

      My suggestion is to limit the outlink target URL length as early as possible. The limit is configurable, with a default of 3000. This should be long enough for most uses, yet strict enough to make sure the regex plugins do not choke on URLs that are too long. Please see the attached patch for the Nutchgora implementation.

      I'd like to hear what you think about this.
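
The proposed limit can be illustrated with a minimal sketch. This is not the code from the attached patch: the class name `OutlinkLengthLimit` and the method `dropOverlongTargets` are hypothetical, and treating a target of exactly 3000 characters as acceptable is an assumption. It only shows that a plain length comparison can discard overlong targets before any regex-based plugin sees them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not taken from NUTCH-1314.patch. */
final class OutlinkLengthLimit {

  /** Default limit mentioned in the issue description. */
  static final int DEFAULT_MAX_TARGET_LENGTH = 3000;

  /** Returns only the targets whose length does not exceed maxLength. */
  static List<String> dropOverlongTargets(List<String> targets, int maxLength) {
    List<String> kept = new ArrayList<>();
    for (String target : targets) {
      // String.length() is a constant-time check; no regex is evaluated here.
      if (target != null && target.length() <= maxLength) {
        kept.add(target);
      }
    }
    return kept;
  }
}
```

Only the targets that survive this check would then be handed to the normalizing and filtering plugins.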

      1. NUTCH-1314.patch
        2 kB
        Ferdy Galema
      2. NUTCH-1314-trunk.patch
        3 kB
        Lewis John McGibbney
      3. NUTCH-1314-v2.patch
        4 kB
        Lewis John McGibbney
      4. NUTCH-1314-v3.patch
        2 kB
        Canan Girgin

          Activity

          Markus Jelsma added a comment -

          This should then also work for the Tika parser and the OutlinkExtractor, I think. Parse-html is similar to parse-tika: if no outlinks are obtained by getOutlinks in DOMContentUtils, then the outlink extractor is used.

          Ferdy Galema added a comment -

          Good one, I overlooked those but they should definitely be treated the same way.

          Julien Nioche added a comment -

          What about doing this with a URLNormalizer (and make it the first to be called)?

          Ferdy Galema added a comment -

          I assume you mean a URLFilter? Or do you want to correct the length by cutting off the excess part? I think the URLs should be rejected, because they were probably malformed anyway.

          Julien Nioche added a comment -

          I was under the impression that the patch did not remove the URL but substituted it with a shorter version. If the idea is to remove the URL altogether (which makes perfect sense), then yes, it should be a URLFilter instead.

          Ferdy Galema added a comment -

          I understand. I think the problem with implementing it as a URLFilter is that some parts of Nutch run the normalizers first. This is the case in ParseUtil. Thus malformed outlinks (which is where the majority of new URLs are found) will still be problematic. It does make sense to run normalizers first: some URLs still have a chance to be fixed (normalized) before they are filtered out.

          Therefore the scope of this issue is to apply a very crude (but effective) filter before normalizing/filtering code is run.
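
The ordering described above can be sketched as follows. This is an illustration only: `OutlinkPipeline` is a hypothetical class, and the `UnaryOperator` and `Predicate` arguments are stand-ins for Nutch's normalizers and filters, not their real interfaces. The point is that the length guard runs first, so an overlong target never reaches the normalizer.

```java
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical pipeline; the functional interfaces stand in for Nutch's normalizers and filters. */
final class OutlinkPipeline {
  private final int maxTargetLength;
  private final UnaryOperator<String> normalizer;
  private final Predicate<String> filter;

  OutlinkPipeline(int maxTargetLength,
                  UnaryOperator<String> normalizer,
                  Predicate<String> filter) {
    this.maxTargetLength = maxTargetLength;
    this.normalizer = normalizer;
    this.filter = filter;
  }

  /** Returns the normalized URL if it passes every stage, otherwise empty. */
  Optional<String> process(String target) {
    // 1. Crude length guard: runs before any (possibly regex-based) code.
    if (target == null || target.length() > maxTargetLength) {
      return Optional.empty();
    }
    // 2. Normalize: the URL still gets a chance to be fixed.
    String normalized = normalizer.apply(target);
    if (normalized == null) {
      return Optional.empty();
    }
    // 3. Filter: reject whatever is still unwanted after normalization.
    return filter.test(normalized) ? Optional.of(normalized) : Optional.empty();
  }
}
```

With this ordering, the cost of steps 2 and 3 is bounded by the configured maximum length, whatever the page contains.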

          Julien Nioche added a comment -

          This makes a good case for merging URL filters and normalizers (I think there is a JIRA on this): we wouldn't need to worry about whether the normalizer is called first, etc.

          Ferdy Galema added a comment -

          Exactly. Until that merge is properly implemented we can rely on this quickfix.

          Lewis John McGibbney added a comment -

          Fresh patches for 2.x and trunk branches respectively. Markus Jelsma I tried to accommodate your suggestions but please let me know if there is something we can work on.
          If someone could test it would be great.
          Thanks
          Lewis

          Tejas Patil added a comment -

          Hi Lewis,
          I tried to test both the patches. NUTCH-1314-trunk.patch gave compilation errors:

              [javac] /home/tejas/Desktop/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:391: error: cannot find symbol
              [javac]                     fixEmbeddedParams(base, target) :  new URL(base, target);
              [javac]                     ^
              [javac]   symbol:   method fixEmbeddedParams(URL,String)
              [javac]   location: class DOMContentUtils
          

          For NUTCH-1314-v2.patch:
          I used this url and ran the HtmlParser parser.

          Before applying the patch:

          bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser about.html
          title: About Apache Nutch
          text: About Apache Nutch Apache > Nutch > Home   .................
          outlinks: [toUrl: file:skin/basic.css anchor: , toUrl: file:skin/screen.css anchor: , toUrl: file:skin/print.css anchor: , toUrl: file:skin/profile.css anchor: , toUrl: file:skin/getBlank.js anchor: , toUrl: file:skin/getMenu.js anchor: , toUrl: file:skin/fontsize.js anchor: , toUrl: file:images/favicon.ico anchor: , toUrl: http://www.apache.org/ anchor: Apache, toUrl: http://nutch.apache.org anchor: Nutch, toUrl: http://nutch.apache.org anchor: Home, toUrl: file:skin/breadcrumbs.js anchor: , toUrl: http://www.apache.org/ anchor: , toUrl: file:images/feather-small.gif anchor: , toUrl: http://nutch.apache.org/ anchor: , toUrl: file:images/nutch_logo_tm.gif anchor: , toUrl: file:index.html anchor: Main, toUrl: file:wiki.html anchor: Wiki, toUrl: http://issues.apache.org/jira/browse/NUTCH anchor: Jira, toUrl: file:index.html anchor: News, toUrl: file:credits.html anchor: Credits, toUrl: http://www.apache.org/foundation/thanks.html anchor: Thanks, toUrl: http://www.cafepress.com/nutch/ anchor: Buy Stuff, toUrl: http://www.apache.org/foundation/sponsorship.html anchor: Sponsorship, toUrl: http://www.apache.org/licenses/ anchor: License, toUrl: http://www.apache.org/security/ anchor: Security, toUrl: file:faq.html anchor: FAQ, toUrl: file:wiki.html anchor: Wiki, toUrl: file:tutorial.html anchor: Tutorial, toUrl: file:bot.html anchor: Robot, toUrl: file:apidocs-2.1/index.html anchor: API Docs (2.1), toUrl: file:apidocs-1.6/index.html anchor: API Docs (1.6), toUrl: https://builds.apache.org/job/Nutch-trunk/javadoc/ anchor: API Docs (trunk nightly), toUrl: https://builds.apache.org/job/Nutch-nutchgora/javadoc/ anchor: API Docs (2.x nightly), toUrl: file:downloads.html anchor: Download, toUrl: file:nightly.html anchor: Nightly builds, toUrl: file:sonar.html anchor: Sonar Analysis, toUrl: file:mailing_lists.html anchor: Mailing Lists, toUrl: file:issue_tracking.html anchor: Issue Tracking, toUrl: file:version_control.html anchor: Version Control, toUrl: 
file:old_downloads.html anchor: Older Downloads, toUrl: http://lucene.apache.org/java/ anchor: Lucene, toUrl: http://hadoop.apache.org/ anchor: Hadoop, toUrl: http://lucene.apache.org/solr/ anchor: Solr, toUrl: http://tika.apache.org/ anchor: Tika, toUrl: http://gora.apache.org anchor: Gora, toUrl: file:skin/images/rc-b-l-15-1body-2menu-3menu.png anchor: , toUrl: file:about.pdf anchor: PDF, toUrl: file:skin/images/pdfdoc.gif anchor: , toUrl: file:about.html#Overview anchor: Overview, toUrl: http://lucene.apache.org/java/ anchor: Apache Lucene, toUrl: http://lucene.apache.org/solr/ anchor: Apache Solr, toUrl: http://tika.apache.org/ anchor: Apache Tika, toUrl: http://hadoop.apache.org/ anchor: Hadoop cluster, toUrl: http://wiki.apache.org/nutch/ anchor: Nutch wiki., toUrl: http://www.apache.org/licenses/ anchor: The Apache Software Foundation. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation.]

          After applying the patch:

          bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser about.html
          title: About Apache Nutch
          text: About Apache Nutch Apache > Nutch > Home   .................
          outlinks: []

          Correct me if I am wrong: this patch would remove links of size > 3000. The outlinks are not super lengthy and that patch should not have removed those.

          Canan Girgin added a comment -

          I tested NUTCH-1314-v2.patch, but it removes links shorter than 3000 characters. In my opinion, the "if (target.length() > maxTargetLength)" lines in the patch file are not correct. They should read "if (target.length() < maxTargetLength)".
          NUTCH-1314-v2.patch also uses a new parameter ("parser.html.outlinks.max_target_length"). I think it must be defined in the nutch-default.xml file.

          I attached a new patch file. In the ParseUtil class, the target URL length is checked before the normalizers and filters run. Is that correct?
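
Whether `>` or `<` is the right comparison depends on what the guarded branch does, which the thread does not show. The sketch below is an assumption-laden illustration, not the patch code: `TargetLengthGuard` and its methods are hypothetical, and `java.util.Properties` stands in for Nutch's configuration object. Only the property name and the default of 3000 come from this issue.

```java
import java.util.Properties;

/** Hypothetical illustration; java.util.Properties stands in for Nutch's configuration. */
final class TargetLengthGuard {
  /** Property name quoted in the comment above. */
  static final String MAX_LENGTH_KEY = "parser.html.outlinks.max_target_length";
  /** Default from the issue description. */
  static final int DEFAULT_MAX_LENGTH = 3000;

  /** Reads the limit, falling back to the default when the property is absent. */
  static int maxTargetLength(Properties conf) {
    String raw = conf.getProperty(MAX_LENGTH_KEY);
    return raw == null ? DEFAULT_MAX_LENGTH : Integer.parseInt(raw.trim());
  }

  /** '>' is correct when it guards the branch that SKIPS the link. */
  static boolean keepWhenSkippingLong(String target, int max) {
    if (target.length() > max) {
      return false; // skip the overlong target
    }
    return true;
  }

  /** '<' is correct when it guards the branch that KEEPS the link. */
  static boolean keepWhenAddingShort(String target, int max) {
    if (target.length() < max) {
      return true; // keep the short target
    }
    return false;
  }
}
```

Both forms keep ordinary links and drop overlong ones; they differ only for a target whose length equals the limit. Pairing `>` with the keep branch, or `<` with the skip branch, would drop every ordinary link.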

          Otis Gospodnetic added a comment -

          BTW. we are using this now, too. +1 for committing, Ferdy Galema!

          Lewis John McGibbney added a comment -

          Hi Otis Gospodnetic, which patch are you using... NUTCH-1314-v3.patch? Can you commit, Otis Gospodnetic?

          Tien Nguyen Manh added a comment -

          Lewis John McGibbney We are using NUTCH-1314-v3.patch


            People

            • Assignee: Unassigned
            • Reporter: Ferdy Galema
            • Votes: 0
            • Watchers: 6
