  1. Nutch
  2. NUTCH-235

Duplicate Inlink values

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels: None

    Description

      Reading the code for LinkDb.reduce(): if we have page duplicates in input segments, or if we have two copies of the same input segment, we will create the same Inlink values (satisfying Inlink.equals()) multiple times. Since Inlinks is a facade for a List, and not a Set, we will get duplicate Inlink instances in Inlinks (if you know what I mean).
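
      The effect described above can be sketched in isolation. The classes below (SimpleInlink, ListInlinks, DuplicateInlinkDemo) are hypothetical stand-ins written for illustration, not the actual Nutch classes; the only assumption carried over from the description is that equality is by value and that the container is backed by a List.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for an inlink: two instances are equal when
// their source URL and anchor text are equal.
final class SimpleInlink {
    private final String fromUrl;
    private final String anchor;

    SimpleInlink(String fromUrl, String anchor) {
        this.fromUrl = fromUrl;
        this.anchor = anchor;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SimpleInlink)) return false;
        SimpleInlink other = (SimpleInlink) o;
        return fromUrl.equals(other.fromUrl) && anchor.equals(other.anchor);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fromUrl, anchor);
    }
}

// Hypothetical list-backed container: add() never checks for an
// existing equal value, so duplicates accumulate.
final class ListInlinks {
    private final List<SimpleInlink> inlinks = new ArrayList<>();

    void add(SimpleInlink inlink) {
        inlinks.add(inlink);
    }

    int size() {
        return inlinks.size();
    }
}

class DuplicateInlinkDemo {
    public static void main(String[] args) {
        ListInlinks inlinks = new ListInlinks();
        // Simulate two identical segments contributing the same link.
        for (int copy = 0; copy < 2; copy++) {
            inlinks.add(new SimpleInlink("http://example.com/a", "anchor text"));
        }
        System.out.println(inlinks.size()); // prints 2
    }
}
```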

      The problem is easy to test: create a new linkdb based on 2 identical segments. This problem also makes it more difficult to properly implement a LinkDB updating mechanism (i.e. incremental invertlinks).

      I propose to change Inlinks to use Set semantics, either explicitly by using a HashSet or implicitly by checking whether a value to be added already exists. If there are no objections I'll commit this change shortly.
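
      A minimal sketch of the two options named in the proposal. All class names here (DemoInlink, SetInlinks, CheckedListInlinks, SetSemanticsDemo) are hypothetical and do not reproduce the attached patches; the sketch only assumes that inlinks compare by value.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical value type; equals() and hashCode() agree, which a
// HashSet requires in order to detect duplicates.
final class DemoInlink {
    private final String fromUrl;
    private final String anchor;

    DemoInlink(String fromUrl, String anchor) {
        this.fromUrl = fromUrl;
        this.anchor = anchor;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof DemoInlink)) return false;
        DemoInlink other = (DemoInlink) o;
        return fromUrl.equals(other.fromUrl) && anchor.equals(other.anchor);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fromUrl, anchor);
    }
}

// Option 1: explicit Set semantics via a HashSet.
final class SetInlinks {
    private final Set<DemoInlink> inlinks = new HashSet<>();

    void add(DemoInlink inlink) {
        inlinks.add(inlink); // no effect if an equal value is present
    }

    int size() {
        return inlinks.size();
    }
}

// Option 2: implicit Set semantics, keeping a List but checking
// for an existing equal value before adding.
final class CheckedListInlinks {
    private final List<DemoInlink> inlinks = new ArrayList<>();

    void add(DemoInlink inlink) {
        if (!inlinks.contains(inlink)) {
            inlinks.add(inlink);
        }
    }

    int size() {
        return inlinks.size();
    }
}

class SetSemanticsDemo {
    public static void main(String[] args) {
        SetInlinks bySet = new SetInlinks();
        CheckedListInlinks byCheck = new CheckedListInlinks();
        // Simulate two identical segments contributing the same link.
        for (int copy = 0; copy < 2; copy++) {
            bySet.add(new DemoInlink("http://example.com/a", "anchor text"));
            byCheck.add(new DemoInlink("http://example.com/a", "anchor text"));
        }
        System.out.println(bySet.size());   // prints 1
        System.out.println(byCheck.size()); // prints 1
    }
}
```

      The HashSet variant depends on hashCode() being consistent with equals(). The checked-list variant preserves insertion order but scans the list on every add, so it costs more as the number of inlinks per page grows.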

      Attachments

        1. set-patch.txt
          4 kB
          Andrzej Bialecki
        2. patch.txt
          0.7 kB
          Andrzej Bialecki

        Activity

          People

            Assignee: Andrzej Bialecki
            Reporter: Andrzej Bialecki
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: