Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      A two-way normalizer for Nutch that handles AJAX URLs, converting hashbang (#!) URLs to _escaped_fragment_ URLs and back to AJAX URLs.

      https://developers.google.com/webmasters/ajax-crawling/
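
The two directions of the conversion can be sketched roughly as follows. This is a minimal illustration of the scheme described at the link above, not the code from the attached patches; the class and method names (AjaxUrlSketch, escape, unescape) are hypothetical, and the sketch assumes the _escaped_fragment_ parameter is always the last query parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the AjaxURLNormalizer from the patch.
final class AjaxUrlSketch {
  static final String HASHBANG = "#!";
  static final String PARAM = "_escaped_fragment_=";

  /** Hashbang URL -> _escaped_fragment_ URL; other URLs are returned unchanged. */
  static String escape(String url) {
    int pos = url.indexOf(HASHBANG);
    if (pos < 0) return url;
    String head = url.substring(0, pos);
    String fragment = url.substring(pos + HASHBANG.length());
    // Append to an existing query string with '&', otherwise start one with '?'.
    char sep = head.indexOf('?') >= 0 ? '&' : '?';
    return head + sep + PARAM + encode(fragment);
  }

  /** _escaped_fragment_ URL -> hashbang URL; other URLs are returned unchanged. */
  static String unescape(String url) {
    int pos = url.indexOf(PARAM);
    if (pos < 1) return url;
    char sep = url.charAt(pos - 1);
    if (sep != '?' && sep != '&') return url;
    String head = url.substring(0, pos - 1);
    // Assumption: the parameter is last, so everything after it is the fragment.
    String fragment = url.substring(pos + PARAM.length());
    return head + HASHBANG + decode(fragment);
  }

  /** Percent-encodes control chars, space, non-ASCII bytes, and # % & +. */
  static String encode(String fragment) {
    StringBuilder sb = new StringBuilder();
    for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
      int c = b & 0xFF;
      if (c <= 0x20 || c >= 0x7F || c == '#' || c == '%' || c == '&' || c == '+') {
        sb.append('%').append(String.format("%02X", c));
      } else {
        sb.append((char) c);
      }
    }
    return sb.toString();
  }

  /** Reverses encode(); a literal '+' is protected so it is not decoded to a space. */
  static String decode(String fragment) {
    try {
      return URLDecoder.decode(fragment.replace("+", "%2B"), "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
  }

  public static void main(String[] args) {
    String escaped = escape("http://example.org/gallery.html#!mountain");
    System.out.println(escaped);
    System.out.println(unescape(escaped));
  }
}
```

The example.org URLs are placeholders. The sketch ignores normalizer scope entirely; it only shows the string transformation in each direction.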

      1. NUTCH-1323-1.8.patch
        16 kB
        Markus Jelsma
      2. NUTCH-1323-1.6-1.patch
        15 kB
        Markus Jelsma

          Activity

          markus17 Markus Jelsma added a comment -

          Patch for 1.6. See unit tests for examples. Please comment. There must be something wrong as all tests pass. Any tests to add?

          wastl-nagel Sebastian Nagel added a comment -

          After a small test crawl on http://si.draagle.com:

          1. usage is cumbersome because you have to carefully think about in which steps to normalize URLs. This is because AjaxNormalizer acts as a flip-flop: hashbang URLs are escaped, escaped ones are unescaped. If URLs are normalized during parsing and then during CrawlDb update, you get the hashbang URL again.
          2. relative hashbang links are not resolved correctly. The outlink of
             base: http://si.draagle.com/?_escaped_fragment_=browse/group/root/
             <a href="#!static/draagle_pogoji_uporabe.html">
            

            should be

            http://si.draagle.com/?_escaped_fragment_=static/draagle_pogoji_uporabe.html
            

            and certainly not

            http://si.draagle.com/?_escaped_fragment_=browse/group/root/&_escaped_fragment_=static/draagle_pogoji_uporabe.html
            
          3. the outlink set of one page with escaped base URL may contain escaped and unescaped URLs simultaneously as results of
            • a relative link without hashbang, e.g., <a href="#search">
            • a global link with hashbang
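
The expected resolution in point 2 could be sketched as below. This is only an illustration of the desired outcome under simplifying assumptions: the class and method names are hypothetical, the href is assumed to start with "#!", percent-encoding of the fragment is omitted, and the _escaped_fragment_ parameter is assumed to be the last one in the base URL.

```java
// Hypothetical sketch: resolve a relative hashbang link against an escaped base URL.
final class RelativeHashbangSketch {
  static final String HASHBANG = "#!";
  static final String PARAM = "_escaped_fragment_=";

  static String resolve(String escapedBase, String href) {
    if (!href.startsWith(HASHBANG)) {
      throw new IllegalArgumentException("not a hashbang link: " + href);
    }
    // Drop the base's own escaped fragment (and its '?' or '&') instead of keeping it.
    int pos = escapedBase.indexOf(PARAM);
    String head = pos > 0 ? escapedBase.substring(0, pos - 1) : escapedBase;
    // The new fragment replaces the old one rather than being appended to it.
    char sep = head.indexOf('?') >= 0 ? '&' : '?';
    return head + sep + PARAM + href.substring(HASHBANG.length());
  }

  public static void main(String[] args) {
    System.out.println(resolve(
        "http://si.draagle.com/?_escaped_fragment_=browse/group/root/",
        "#!static/draagle_pogoji_uporabe.html"));
  }
}
```

The point of the sketch is that the base's fragment is replaced, so the result carries exactly one _escaped_fragment_ parameter.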

          If I understand correctly:

          • URLs with escaped fragments are used
            • in crawlDb, segments, linkDb (URL acts as key)
            • for fetching
          • unescaped hashbang URLs
            • are used in the index (and shown to the user)
            • may appear in outlinks, redirects, and seeds

          Couldn't we bind the decision whether to (un)escape to the current normalizer scope:

          • if URL contains #!
            and scope is one of { inject, fetcher/redirect, outlink, ?crawldb/update? }

            => escape

          • if URL contains _escaped_fragment_=
            and scope is index
            => unescape
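
A scope-bound decision along these lines might look like the sketch below. It is an illustration of the proposal only, not the committed plugin: the class name is hypothetical, the scope names are plain strings assumed to correspond to Nutch's normalizer scopes, crawldb is included in the escaping set even though the proposal marks it with a question mark, and percent-encoding of the fragment is left out for brevity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of scope-dependent (un)escaping.
final class ScopedAjaxSketch {
  static final String HASHBANG = "#!";
  static final String PARAM = "_escaped_fragment_=";

  // Assumption: these strings mirror the scope names used by Nutch.
  static final Set<String> ESCAPE_SCOPES =
      new HashSet<>(Arrays.asList("inject", "fetcher", "outlink", "crawldb"));
  static final String INDEX_SCOPE = "indexer";

  static String normalize(String url, String scope) {
    int bang = url.indexOf(HASHBANG);
    if (bang >= 0 && ESCAPE_SCOPES.contains(scope)) {
      // Crawl-side scopes: always move towards the escaped form.
      String head = url.substring(0, bang);
      char sep = head.indexOf('?') >= 0 ? '&' : '?';
      return head + sep + PARAM + url.substring(bang + HASHBANG.length());
    }
    int param = url.indexOf(PARAM);
    if (param > 0 && INDEX_SCOPE.equals(scope)) {
      // Index scope: restore the hashbang form that is shown to the user.
      return url.substring(0, param - 1) + HASHBANG
          + url.substring(param + PARAM.length());
    }
    return url; // anything else is left alone, so repeated calls cannot flip back
  }

  public static void main(String[] args) {
    String escaped = normalize("http://example.org/#!login", "outlink");
    System.out.println(escaped);
    System.out.println(normalize(escaped, "crawldb")); // stays escaped
    System.out.println(normalize(escaped, "indexer")); // hashbang again
  }
}
```

Because each scope only ever converts in one direction, normalizing during parsing and again during the CrawlDb update no longer produces the flip-flop described in point 1.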
          behnam.nikbakht behnam nikbakht added a comment -

          Hi,
          when I crawl a dynamic URL like this:
          http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html#!mountain
          AjaxNormalizer must convert it to:
          http://d43.me/storage/examples/highslide-dynamic-urls/gallery.html?_escaped_fragment_=mountain
          But there is a problem:
          other normalizers remove the # from URLs, based on rules in regex-normalize.xml.
          Also, in
          src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
          there is a block that removes the ref:
          if (url.getRef() != null) {
          ...
          In my tests this had to be changed to:
          if (url.getRef() != null) { // keep the ref instead of removing it
            file = file + "#" + url.getRef();
            changed = true;
          }
          With that change, and with the rules removed from regex-normalize.xml, the plugin works correctly.
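
The effect of dropping the ref can be reproduced in isolation with java.net.URL. The sketch below is not BasicURLNormalizer; the class, the rebuild method, and the keepRef flag are hypothetical and only show why a hashbang URL loses its state when the ref is discarded before the AJAX normalizer sees it.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch: rebuild a URL with or without its ref (fragment).
final class RefRemovalSketch {
  static String rebuild(String urlString, boolean keepRef) {
    URL url;
    try {
      url = new URL(urlString);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
    String file = url.getFile(); // path plus query, without the ref
    if (keepRef && url.getRef() != null) {
      file = file + "#" + url.getRef(); // for "#!mountain" the ref is "!mountain"
    }
    return url.getProtocol() + "://" + url.getAuthority() + file;
  }

  public static void main(String[] args) {
    String ajax = "http://example.org/gallery.html#!mountain";
    System.out.println(rebuild(ajax, false)); // hashbang is lost
    System.out.println(rebuild(ajax, true));  // hashbang survives
  }
}
```

Once the ref is gone there is nothing left for a later normalizer to escape, which is why the order of normalizers matters here.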

          markus17 Markus Jelsma added a comment -

          @sebastian:
          yes, it should honor scoping rules.

          @behnam:
          you should work around this by changing the URL normalizer order depending on your scope.

          However, we could also make reference removal in the basic normalizer configurable, so that it can be disabled. Changing the order at fetch and index time to work around this is cumbersome.

          behnam.nikbakht behnam nikbakht added a comment -

          Yes, it works correctly. Thank you.

          markus17 Markus Jelsma added a comment -

          Updated patch for trunk.

          The normalizer now relies on SCOPE_INDEXER; in other scopes the other rules are tried. This solves the problem of cumbersome usage. This new patch does not solve the problem of relative URLs. As far as I know, relative URLs never make it to normalizers anyway. To confirm, I did a test crawl of that http://si.draagle.com/ homepage (with the crazy cookie thing, really, check it out!); here is the output of readdb.

          Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval seconds;Retry interval days;Score;Signature;Metadata
          "http://si.draagle.com/";6;"db_notmodified";Tue Apr 22 11:57:29 CEST 2014;Tue Mar 11 10:55:11 CET 2014;0;3628800.0;42.0;0.0;"c44af84abaf0042685a03bf2ecfd2927";"Content-Type:text/html|||_pst_:success(1), lastModified=0|||_rs_:25|||"
          "http://si.draagle.com/?_escaped_fragment_=/basket/show/";1;"db_unfetched";Tue Mar 11 10:57:32 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://si.draagle.com/?_escaped_fragment_=/browse/group/root/";1;"db_unfetched";Tue Mar 11 10:57:32 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://si.draagle.com/?_escaped_fragment_=/login/";1;"db_unfetched";Tue Mar 11 10:57:32 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://si.draagle.com/draagle_pogoji_uporabe.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://si.draagle.com/profiles.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://si.draagle.com/tvspot.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://www.apta-medica.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://www.draagle.si/bolezni/index.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://www.medicina-danes.si/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://www.novartisoncology.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://www.orlkotnik.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          "http://www.zobozdravstvolavtar.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
          

          I think this patch is nearly ready. Any other things to worry about?

          lewismc Lewis John McGibbney added a comment -

          Markus Jelsma I am +1 on this. If a requirement arises to handle relative URLs, we can address it in a subsequent patch. Good work Markus.

          chrismattmann Chris A. Mattmann added a comment -

          +1

          markus17 Markus Jelsma added a comment -

          Just in time for 1.10. Committed to trunk in revision 1659167.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Nutch-trunk #2971 (See https://builds.apache.org/job/Nutch-trunk/2971/)
          NUTCH-1323 AjaxNormalizer (markus: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1659167)

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/plugin/build.xml
          • /nutch/trunk/src/plugin/urlnormalizer-ajax
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/build.xml
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/ivy.xml
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/plugin.xml
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer/ajax
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer/ajax/AjaxURLNormalizer.java
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer/ajax
          • /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer/ajax/TestAjaxURLNormalizer.java

            People

            • Assignee:
              markus17 Markus Jelsma
              Reporter:
              markus17 Markus Jelsma
            • Votes:
              0
              Watchers:
              4
