Nutch / NUTCH-1142

Normalization and filtering in WebGraph

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The WebGraph program always performs URL normalization. Since outlinks are already normalized during parsing, normalization in WebGraph should become optional. The WebGraph program also has no URL filtering mechanism. When a CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible to remove it from the web graph as well.
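
      The requested behaviour can be illustrated with a minimal sketch. It is written in Python for brevity and is not the patch or Nutch's Java API: `normalize`, `accept`, `clean_url`, `clean_outlinks`, the two boolean flags, and the block list are all hypothetical names chosen for illustration. It only shows normalization and filtering as independent, optional steps, with a rejected URL dropped from the graph.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Toy normalizer (assumption): lowercase the scheme and host,
    use "/" for an empty path, and drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def accept(url: str, blocked_hosts=frozenset({"spam.example"})) -> bool:
    """Toy filter (assumption): reject URLs whose host is on a block list."""
    return urlsplit(url).hostname not in blocked_hosts


def clean_url(url: str, do_normalize: bool, do_filter: bool) -> Optional[str]:
    """Apply the optional steps; return None when the URL is filtered out."""
    if do_normalize:
        url = normalize(url)
    if do_filter and not accept(url):
        return None
    return url


def clean_outlinks(outlinks: Iterable[str], do_normalize: bool,
                   do_filter: bool) -> List[str]:
    """Clean every outlink and keep only the ones that survive filtering."""
    cleaned = (clean_url(u, do_normalize, do_filter) for u in outlinks)
    return [u for u in cleaned if u is not None]
```

      With both flags off, URLs pass through untouched, which corresponds to the case where the input segments were already normalized and filtered at parse time.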

      1. NUTCH-1142-1.5-3.patch
        8 kB
        Markus Jelsma
      2. NUTCH-1142-1.5-2.patch
        7 kB
        Markus Jelsma
      3. NUTCH-1142-1.4.patch
        6 kB
        Markus Jelsma

          Activity

          Hudson added a comment -

          Integrated in Nutch-trunk-ant #76 (See https://builds.apache.org/job/Nutch-trunk-ant/76/)
          NUTCH-1142 Normalization and filtering in WebGraph

          markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1200346
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
          Hudson added a comment -

          Integrated in Nutch-trunk #1659 (See https://builds.apache.org/job/Nutch-trunk/1659/)
          NUTCH-1142 Normalization and filtering in WebGraph

          markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1200346
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
          Hudson added a comment -

          Integrated in nutch-trunk-maven #16 (See https://builds.apache.org/job/nutch-trunk-maven/16/)
          NUTCH-1142 Normalization and filtering in WebGraph

          markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1200346
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
          Markus Jelsma added a comment -

          Committed for 1.5 in rev. 1200346.

          Markus Jelsma added a comment -

          I'll send this in today.

          Markus Jelsma added a comment -

          You are right, of course, although the segments we feed are usually already filtered and normalized by ParseOutputFormat. The same is true for the invertlinks program, which is analogous to parts of the webgraph program.

          I prefer a webgraph that represents the contents of a crawldb.

          Ah well, it's optional. Thanks for sharing your thoughts Andrzej.

          Andrzej Bialecki added a comment -

          +1, the patch looks good.

          (There is one philosophical aspect of this change, as with any situation where you calculate PageRank in the presence of URL filtering: does it matter that a page was linked to from other pages that you decided to filter out? I.e. in PageRank the relative page importance is a function of in-degree, and by filtering out incoming links you change the in-degree. This essentially means that you decide to ignore some evidence of a page possibly being more important, due to links from pages that may not be interesting to you but which still exist. OTOH the incoming links may have been spam, so one would expect that in the grand picture it evens out.)
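
          The effect described here can be seen on a toy graph. The sketch below uses a textbook power-iteration PageRank in Python, which is an assumption made for illustration and not the algorithm Nutch's own link analysis implements; the graph, the page names, and the helpers `pagerank` and `filter_graph` are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict {source: [targets]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's damped rank evenly over its outlinks.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank over all pages.
                share = damping * rank[node] / len(nodes)
                for n in nodes:
                    new_rank[n] += share
        rank = new_rank
    return rank


def filter_graph(links, rejected):
    """Drop rejected pages both as link sources and as link targets."""
    return {source: [t for t in targets if t not in rejected]
            for source, targets in links.items() if source not in rejected}


# Hypothetical graph: pages "x" and "y" link to "c" but would be filtered out.
full = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "x": ["c"], "y": ["c"]}
full_rank = pagerank(full)
filtered_rank = pagerank(filter_graph(full, {"x", "y"}))
```

          In the full graph "c" outranks "b" because of the two extra inlinks; once "x" and "y" are filtered out, "b" and "c" have identical in-degree and end up with the same rank, so that evidence is lost.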

          Markus Jelsma added a comment -

          The tests finished. Legacy URLs we had lying around (e.g. with an unnormalized null path) are now finally merged with the normalized null-path URLs, and the values add up! Please comment.
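
          A minimal sketch of what merging means here, assuming the unnormalized form is a URL with an empty path (`http://example.com`) and the normalized form ends in `/`. The comment does not say which values are summed, so the integer values below are hypothetical stand-ins, and `normalize_null_path` and `merge_by_normalized_url` are illustrative names, not Nutch code.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_null_path(url):
    """Rewrite an empty path to "/", leaving every other URL part untouched."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/",
                       parts.query, parts.fragment))


def merge_by_normalized_url(values):
    """Sum (url, value) pairs whose URLs share the same normalized form."""
    merged = defaultdict(int)
    for url, value in values:
        merged[normalize_null_path(url)] += value
    return dict(merged)


# Hypothetical per-URL values; the first two entries name the same page.
merged = merge_by_normalized_url([
    ("http://example.com", 3),
    ("http://example.com/", 4),
    ("http://example.com/a", 1),
])
```

          The two spellings of the root URL collapse into a single key whose value is the sum of both entries, while the unrelated URL keeps its own value.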

          Markus Jelsma added a comment -

          New patch with the ability to normalize and filter existing LinkDatum URLs. It also passes the normalized map input key to the reducer and avoids writing the _SUCCESS file for all three inner jobs.

          Markus Jelsma added a comment -

          New patch also filters collected outlinks instead of just map keys.

          Markus Jelsma added a comment -

          Here's a patch for trunk.


            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              0
              Watchers:
              0
