  1. Nutch
  2. NUTCH-1142

Normalization and filtering in WebGraph


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      The WebGraph program performs URL normalization. Since outlinks are already normalized during parsing, normalization in WebGraph should become optional. The WebGraph program also has no URL filtering mechanism. When a CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible to remove it from the web graph as well.
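
      The requested behaviour could be sketched roughly as below. This is not the code from the attached patches: the class name WebGraphUrlGate, its two boolean switches, and the plain Java functions standing in for a normalizer and a filter are all hypothetical. In Nutch itself, the URLNormalizers and URLFilters classes would presumably fill those two roles.

```java
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/**
 * Hypothetical sketch: decides whether a URL stays in the web graph,
 * with normalization and filtering each switchable on or off.
 */
final class WebGraphUrlGate {
    private final boolean normalize;
    private final boolean filter;
    private final UnaryOperator<String> normalizer; // stand-in for a URL normalizer
    private final Predicate<String> accept;         // stand-in for a URL filter

    WebGraphUrlGate(boolean normalize, boolean filter,
                    UnaryOperator<String> normalizer, Predicate<String> accept) {
        this.normalize = normalize;
        this.filter = filter;
        this.normalizer = normalizer;
        this.accept = accept;
    }

    /** Returns the URL to keep in the graph, or empty if it should be dropped. */
    Optional<String> apply(String url) {
        if (url == null) {
            return Optional.empty();
        }
        String result = url;
        if (normalize) {
            // Skipped when outlinks were already normalized at parse time.
            result = normalizer.apply(result);
            if (result == null) {
                return Optional.empty();
            }
        }
        if (filter && !accept.test(result)) {
            // Rejected by the filter: remove the URL from the graph as well.
            return Optional.empty();
        }
        return Optional.of(result);
    }
}
```

      With both switches off, every non-null URL passes through unchanged, which matches the case where normalization has already happened during parsing and no filtering is wanted.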

      Attachments

        1. NUTCH-1142-1.4.patch
          6 kB
          Markus Jelsma
        2. NUTCH-1142-1.5-2.patch
          7 kB
          Markus Jelsma
        3. NUTCH-1142-1.5-3.patch
          8 kB
          Markus Jelsma

            People

              Assignee: markus17 Markus Jelsma
              Reporter: markus17 Markus Jelsma