Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: None
    • Labels:
      None

      Description

      A DupeDB for Nutch and associated tools to create and read a database containing information on duplicates.

        Issue Links

          Activity

          Julien Nioche made changes -
          Fix Version/s 1.10 [ 12327187 ]
          Fix Version/s 1.9 [ 12324611 ]
          Markus Jelsma added a comment -

          Hi Julien, no, this is something else. The DupeDB is a <DupeDatum,Text> database where the DupeDatum is a compound type of digest, URL path section, domain. The Text is the host part of the URL. This is generated by reading the CrawlDB. This DupeDB is then ingested by NUTCH-1326 together with NUTCH-1325 to output rules for NUTCH-1319.
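A minimal sketch of the key/value shape described above, written in plain Python rather than as Hadoop code. The names `DupeDatum`, `naive_domain` and `to_dupe_entry` are illustrative assumptions, not the classes from the actual patch, and the domain is derived with a naive "last two labels" rule that a real tool would replace with a public-suffix lookup. It also assumes the "URL path section" is the full path of the URL.

```python
from typing import NamedTuple, Tuple
from urllib.parse import urlsplit


class DupeDatum(NamedTuple):
    """Hypothetical compound key: content digest, URL path section, domain."""
    digest: str
    path: str
    domain: str


def naive_domain(host: str) -> str:
    # Assumption: the domain is the last two labels of the host name.
    # This is wrong for suffixes such as .co.uk; it is only for illustration.
    return ".".join(host.split(".")[-2:])


def to_dupe_entry(url: str, digest: str) -> Tuple[DupeDatum, str]:
    """Turn one (URL, digest) pair, as read from a CrawlDB, into (DupeDatum, host)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return DupeDatum(digest, parts.path, naive_domain(host)), host


key, host = to_dupe_entry(
    "http://www.annuaire-loisirs-seniors.fr/grande-bretagne-c-248.html",
    "a218daf4a39ed75b24d977bb90394a11",
)
print(key, host)
```

Because the host is kept out of the key, two URLs that differ only in host but share digest, path and domain end up under the same key.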

          All these tools exist to solve the duplicate host problem in the CrawlDB by using a HostNormalizer. We crawled the internet (without filtering rules) for over a year and quickly saw the fetcher fetching the same pages from the same domains over and over. The most typical host duplication is a website accessible over both http://www.example.org/ and http://example.org/. This means twice as many unique URLs for many domains. You cannot use manual URL filters to solve the problem, nor can you manually edit the HostNormalizer on this scale.

          These tools make it happen automatically.

          Here's an example of two DupeDB entries for the common www-problem. The first three columns make up the DupeDatum, which is the key in M/R; the rightmost column is the host:
          a218daf4a39ed75b24d977bb90394a11 /grande-bretagne-c-248.html annuaire-loisirs-seniors.fr annuaire-loisirs-seniors.fr
          a218daf4a39ed75b24d977bb90394a11 /grande-bretagne-c-248.html annuaire-loisirs-seniors.fr www.annuaire-loisirs-seniors.fr

          Here's a more interesting problem:
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz znacky.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz siku-farmer.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz impag.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz koleje.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz lifetime.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz penove-dekorace.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz grand.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz maxi.katalog-hracek.cz
          c3b15e9f207aaf48dde67aa8fa6a53a3 /grand/ katalog-hracek.cz groovy-pets.katalog-hracek.cz
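Since the DupeDatum is the M/R key, all hosts sharing one digest, path and domain arrive together at the reduce step. The sketch below imitates that grouping in plain Python on a few of the entries listed above; `hosts_by_key` is a hypothetical helper, and how NUTCH-1326 actually turns such groups into normalizer rules is not shown here.

```python
from collections import defaultdict

# (digest, path, domain) -> host, copied from the entries listed above.
entries = [
    (("a218daf4a39ed75b24d977bb90394a11", "/grande-bretagne-c-248.html",
      "annuaire-loisirs-seniors.fr"), "annuaire-loisirs-seniors.fr"),
    (("a218daf4a39ed75b24d977bb90394a11", "/grande-bretagne-c-248.html",
      "annuaire-loisirs-seniors.fr"), "www.annuaire-loisirs-seniors.fr"),
    (("c3b15e9f207aaf48dde67aa8fa6a53a3", "/grand/",
      "katalog-hracek.cz"), "znacky.katalog-hracek.cz"),
    (("c3b15e9f207aaf48dde67aa8fa6a53a3", "/grand/",
      "katalog-hracek.cz"), "grand.katalog-hracek.cz"),
    (("c3b15e9f207aaf48dde67aa8fa6a53a3", "/grand/",
      "katalog-hracek.cz"), "maxi.katalog-hracek.cz"),
]


def hosts_by_key(pairs):
    """Group hosts per key and keep only keys seen on more than one host."""
    groups = defaultdict(set)
    for key, host in pairs:
        groups[key].add(host)
    return {k: sorted(h) for k, h in groups.items() if len(h) > 1}


duplicates = hosts_by_key(entries)
for key, hosts in duplicates.items():
    print(key[2], hosts)
```

Each resulting group is a set of candidate duplicate hosts within one domain.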

          Julien Nioche added a comment -

          Can't we achieve the same thing using the new status added in NUTCH-656?

          Julien Nioche made changes -
          Link This issue is related to NUTCH-656 [ NUTCH-656 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.9 [ 12324611 ]
          Fix Version/s 1.7 [ 12323281 ]
          Gavin made changes -
          Link This issue is depended upon by NUTCH-1326 [ NUTCH-1326 ]
          Gavin made changes -
          Link This issue blocks NUTCH-1326 [ NUTCH-1326 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 1.6 [ 12319941 ]
          Markus Jelsma made changes -
          Field Original Value New Value
          Link This issue blocks NUTCH-1326 [ NUTCH-1326 ]
          Markus Jelsma created issue -

            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              0
              Watchers:
              2

              Dates

              • Created:
                Updated:

                Development