Nutch / NUTCH-656

DeleteDuplicates based on crawlDB only


Details

    • Type: Wish
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: indexer
    • Labels: None

    Description

      The existing dedup functionality relies on Lucene indices and can't be used when indexing is delegated to Solr.
      I was wondering whether we could instead use the information in the crawlDB to detect the URLs to delete, then do the deletions in an indexer-neutral way. As far as I understand, the crawlDB contains all the elements we need for dedup, namely:

      • URL
      • signature
      • fetch time
      • score

      In map-reduce terms we would have two different jobs:

      • job 1: read the crawlDB and compare on URLs: keep only the most recent element; the older ones are stored in a file and will be deleted later
      • job 2: read the crawlDB with a map function that generates signatures as keys and URL + fetch time + score as values
      • the reduce function of job 2 would depend on which parameter is set (i.e. use signature or score) and would output a list of URLs to delete
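
      The two jobs can be sketched in plain Python, with the map, shuffle and reduce phases simulated in memory. This is only an illustration of the idea, not Nutch or Hadoop code: `CrawlEntry`, `group_by`, `dedup_by_url`, `urls_to_delete` and the `prefer` parameter are hypothetical names. The sketch also assumes that the configurable reduce criterion means "keep the entry with the highest score" or "keep the most recently fetched entry", which the description leaves open.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlEntry:
    """Hypothetical stand-in for one crawlDB record (not a Nutch class)."""
    url: str
    signature: str
    fetch_time: int  # assumed to be comparable, e.g. epoch seconds
    score: float


def group_by(entries, key):
    """Simulate the shuffle phase: collect entries under their map key."""
    groups = defaultdict(list)
    for entry in entries:
        groups[key(entry)].append(entry)
    return groups


def dedup_by_url(entries):
    """Job 1: per URL, keep the most recent entry; return (kept, stale)."""
    kept, stale = [], []
    for group in group_by(entries, lambda e: e.url).values():
        ordered = sorted(group, key=lambda e: e.fetch_time, reverse=True)
        kept.append(ordered[0])
        stale.extend(ordered[1:])
    return kept, stale


def urls_to_delete(entries, prefer="score"):
    """Job 2: per signature, keep one entry and list the URLs of the rest.

    `prefer` is an assumed parameter: "score" keeps the highest-scoring
    entry, anything else keeps the most recently fetched one.
    """
    if prefer == "score":
        rank = lambda e: (e.score, e.fetch_time)
    else:
        rank = lambda e: (e.fetch_time, e.score)
    doomed = []
    for group in group_by(entries, lambda e: e.signature).values():
        best = max(group, key=rank)
        doomed.extend(e.url for e in group if e is not best)
    return sorted(doomed)
```

      Feeding the entries kept by `dedup_by_url` into `urls_to_delete` chains the two jobs; a real implementation would run them as separate map-reduce jobs over the crawlDB.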

      This assumes that we can then use the URLs to identify documents in the indices.
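
      As a rough illustration of what "indexer-neutral" could mean here, the sketch below hides the index behind a small interface that deletes documents by URL. All names (`IndexDeleter`, `InMemoryIndex`, `apply_deletions`) are hypothetical, and the in-memory backend only stands in for a real Lucene or Solr index; it assumes the URL is the document key.

```python
from abc import ABC, abstractmethod


class IndexDeleter(ABC):
    """Hypothetical backend-neutral interface; not part of Nutch."""

    @abstractmethod
    def delete_by_url(self, url: str) -> bool:
        """Remove the document identified by url; True if one existed."""


class InMemoryIndex(IndexDeleter):
    """Toy backend keyed by URL, standing in for a real index."""

    def __init__(self, docs):
        self.docs = dict(docs)

    def delete_by_url(self, url: str) -> bool:
        if url in self.docs:
            del self.docs[url]
            return True
        return False


def apply_deletions(index: IndexDeleter, urls) -> int:
    """Delete each listed URL and return how many documents were removed."""
    return sum(index.delete_by_url(url) for url in urls)
```

      With this shape, the dedup jobs only need to produce a list of URLs, and each indexing backend supplies its own `delete_by_url`.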

      Any thoughts on this? Am I missing something?

      Julien

      Attachments

        1. NUTCH-656.patch (8 kB, Julien Nioche)
        2. NUTCH-656.v2.patch (13 kB, Julien Nioche)
        3. NUTCH-656.v3.patch (14 kB, Julien Nioche)


          People

            Assignee: Julien Nioche (jnioche)
            Reporter: Julien Nioche (jnioche)
            Votes: 0
            Watchers: 5
