Nutch / NUTCH-1142

Normalization and filtering in WebGraph

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

      Description

      The WebGraph program always performs URL normalization. Since outlinks are already normalized during parsing, normalization in WebGraph should become optional. The WebGraph program also has no URL filtering mechanism. When a URL filter removes a CrawlDatum from the CrawlDB, it should be possible to remove the corresponding URL from the web graph as well.
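
      The behavior requested above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the attached patches: the class `OutlinkCleaner`, its two boolean switches, and the `Predicate`-based filter are hypothetical stand-ins. In Nutch itself, normalization and filtering would presumably go through the existing `URLNormalizers` and `URLFilters` classes, which are not used here so that the example runs without Nutch on the classpath.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

/** Hypothetical helper (not part of Nutch): optional URL normalization and filtering. */
final class OutlinkCleaner {
    private final boolean normalize;        // assumed switch, e.g. set from a command-line option
    private final boolean filter;           // assumed switch, e.g. set from a command-line option
    private final Predicate<String> accept; // stand-in for a URL filter chain

    OutlinkCleaner(boolean normalize, boolean filter, Predicate<String> accept) {
        this.normalize = normalize;
        this.filter = filter;
        this.accept = accept;
    }

    /** Returns the URL to keep, or null when the URL should be dropped from the graph. */
    String clean(String url) {
        String result = url;
        if (normalize) {
            result = normalizeUrl(result);
            if (result == null) {
                return null; // unparsable URL
            }
        }
        if (filter && !accept.test(result)) {
            return null; // rejected by the filter
        }
        return result;
    }

    /** Cleans a list of URLs, skipping every URL that clean() drops. */
    List<String> cleanAll(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String cleaned = clean(url);
            if (cleaned != null) {
                kept.add(cleaned);
            }
        }
        return kept;
    }

    /** Simplified normalization: lowercase scheme and host, resolve dot segments, drop fragment. */
    static String normalizeUrl(String url) {
        try {
            URI uri = new URI(url.trim()).normalize();
            if (uri.getScheme() == null || uri.getHost() == null) {
                return null;
            }
            StringBuilder sb = new StringBuilder();
            sb.append(uri.getScheme().toLowerCase(Locale.ROOT)).append("://");
            sb.append(uri.getHost().toLowerCase(Locale.ROOT));
            if (uri.getPort() != -1) {
                sb.append(':').append(uri.getPort());
            }
            sb.append(uri.getRawPath());
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }
            return sb.toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The "spam" rule is an arbitrary example filter.
        OutlinkCleaner cleaner = new OutlinkCleaner(true, true, u -> !u.contains("spam"));
        System.out.println(cleaner.cleanAll(Arrays.asList(
                "HTTP://Example.COM/a/../b#frag", "http://spam.example/x")));
    }
}
```

      With both switches off, every URL passes through unchanged, which matches the case where outlinks were already normalized during parsing. Running the filter after normalization means filter rules see the same URL form that would be stored in the graph.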

      1. NUTCH-1142-1.4.patch
        6 kB
        Markus Jelsma
      2. NUTCH-1142-1.5-2.patch
        7 kB
        Markus Jelsma
      3. NUTCH-1142-1.5-3.patch
        8 kB
        Markus Jelsma


            People

            • Assignee: Markus Jelsma
            • Reporter: Markus Jelsma
