Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      A DupeDB for Nutch and associated tools to create and read a database containing information on duplicates.

        Issue Links

          Activity

          Julien Nioche added a comment -

          Can't we achieve the same thing using the new status added in NUTCH-656?

          Markus Jelsma added a comment -

          Hi Julien, no, this is something else. The DupeDB is a <DupeDatum,Text> database where the DupeDatum is a compound type consisting of digest, URL path section, and domain. The Text is the host part of the URL. The DupeDB is generated by reading the CrawlDB. It is then ingested by NUTCH-1326 together with NUTCH-1325 to output rules for NUTCH-1319.
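
          To make the record layout concrete, here is a minimal Python sketch of how one <DupeDatum,Text> entry could be derived from a URL and its content digest. It is an illustration only, not the Nutch code: the names `DupeDatum`, `naive_domain` and `make_entry` are hypothetical, and the domain is approximated as the last two labels of the host, which is wrong for suffixes such as co.uk.

```python
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class DupeDatum:
    """Hypothetical compound key: content digest, URL path section, domain."""
    digest: str
    path: str
    domain: str


def naive_domain(host: str) -> str:
    # Assumption: the domain is the last two labels of the host.
    # This is a simplification and fails for suffixes such as co.uk.
    return ".".join(host.split(".")[-2:])


def make_entry(url: str, digest: str) -> Tuple[DupeDatum, str]:
    """Build one (key, value) pair: the DupeDatum key and the host as value."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return DupeDatum(digest, parts.path, naive_domain(host)), host
```

          Because the host is kept out of the key, two hosts serving the same content at the same path under the same domain produce identical keys and differ only in their value.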

          All of these tools are aimed at solving the duplicate host problem in the CrawlDB by using a HostNormalizer. We crawled the internet (without filtering rules) for over a year and quickly saw the fetcher fetching the same pages from the same domains over and over. The most typical host duplication is a website accessible over both http://www.example.org/ and http://example.org/. This means twice as many unique URLs for many domains. You cannot solve the problem with manual URL filters, nor can you manually edit the HostNormalizer on this scale.

          These tools make it happen automatically.

          Here's an example of two DupeDB entries for the common www-problem. The first three columns make up the DupeDatum, which is the key in M/R; the last column is the host:
          a218daf4a39ed75b24d977bb90394a11 /grande-bretagne-c-248.html annuaire-loisirs-seniors.fr annuaire-loisirs-seniors.fr
          a218daf4a39ed75b24d977bb90394a11 /grande-bretagne-c-248.html annuaire-loisirs-seniors.fr www.annuaire-loisirs-seniors.fr

          Here's a more interesting problem:
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz znacky.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz siku-farmer.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz impag.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz koleje.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz lifetime.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz penove-dekorace.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz grand.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz maxi.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz groovy-pets.katalog-hracek.cz
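
          As a rough illustration of why the DupeDatum is the key, the sketch below groups entries like the ones above by their first three columns and keeps only keys seen under more than one host. It is a plain-Python approximation of the grouping a reducer would see, not the actual NUTCH-1326 logic; the function name `group_hosts` and the whitespace-separated input format are assumptions.

```python
from collections import defaultdict


def group_hosts(lines):
    """Group hosts by (digest, path, domain) and keep keys with several hosts."""
    groups = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            # Skip lines that do not have the four expected columns.
            continue
        digest, path, domain, host = fields
        groups[(digest, path, domain)].add(host)
    # Only keys shared by more than one host indicate duplicate hosts.
    return {key: sorted(hosts) for key, hosts in groups.items() if len(hosts) > 1}
```

          Each surviving key maps to the set of hosts that served identical content at the same path, which is the kind of evidence a host normalization rule could be derived from.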


            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              0
              Watchers:
              2

              Dates

              • Created:
                Updated:

                Development