Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels:
      None

      Description

      Reading the code for LinkDb.reduce(): if we have page duplicates in input segments, or if we have two copies of the same input segment, we will create the same Inlink values (satisfying Inlink.equals()) multiple times. Since Inlinks is a facade for List, and not a Set, we will get duplicate Inlink-s in Inlinks (if you know what I mean .

      The problem is easy to test: create a new linkdb based on 2 identical segments. This problem also makes it more difficult to properly implement LinkDB updating mechanism (i.e. incremental invertlinks).

      I propose to change Inlinks to use a Set semantics, either explicitly by using a HashSet or implicitly by checking if a value to be added already exists. If there are no objections I'll commit this change shortly.

      1. set-patch.txt
        4 kB
        Andrzej Bialecki
      2. patch.txt
        0.7 kB
        Andrzej Bialecki

        Activity

        Hide
        Andrzej Bialecki added a comment -

        HashSet-based version of the patch applied.

        Show
        Andrzej Bialecki added a comment - HashSet-based version of the patch applied.
        Hide
        Doug Cutting added a comment -

        +1 This looks good. It will be a little slower for simple crawls, where each link is only processed once, but probably not noticeably. It will be significantly faster when re-crawling is performed, since the link db won't balloon.

        I note that the add() methods are actually unchanged: just reformatted.

        Show
        Doug Cutting added a comment - +1 This looks good. It will be a little slower for simple crawls, where each link is only processed once, but probably not noticeably. It will be significantly faster when re-crawling is performed, since the link db won't balloon. I note that the add() methods are actually unchanged: just reformatted.
        Hide
        Andrzej Bialecki added a comment -

        Same functionality, but using a HashSet.

        Show
        Andrzej Bialecki added a comment - Same functionality, but using a HashSet.
        Hide
        Doug Cutting added a comment -

        The iterator shouldn't be a problem. When we're indexing we also dedup them by domain, which is much more expensive than creating an iterator.

        Show
        Doug Cutting added a comment - The iterator shouldn't be a problem. When we're indexing we also dedup them by domain, which is much more expensive than creating an iterator.
        Hide
        Andrzej Bialecki added a comment -

        No problem, I can change this. However, going through every link will then require creation of an Iterator. We do this when constructing anchor text (which impacts indexing) or showing anchors in NutchBean.

        Show
        Andrzej Bialecki added a comment - No problem, I can change this. However, going through every link will then require creation of an Iterator. We do this when constructing anchor text (which impacts indexing) or showing anchors in NutchBean.
        Hide
        Doug Cutting added a comment -

        I'm concerned about all of the contains() calls this adds to an ArrayList. This is a linear scan, and makes the cost of building a set of links quadratic. If we're making this change, shouldn't we change the underlying set implementation too? A HashSet would probably work well here.

        Show
        Doug Cutting added a comment - I'm concerned about all of the contains() calls this adds to an ArrayList. This is a linear scan, and makes the cost of building a set of links quadratic. If we're making this change, shouldn't we change the underlying set implementation too? A HashSet would probably work well here.
        Hide
        Andrzej Bialecki added a comment -

        Proposed fix for this issue. If there are no objections I'll commit this shortly.

        Show
        Andrzej Bialecki added a comment - Proposed fix for this issue. If there are no objections I'll commit this shortly.

          People

          • Assignee:
            Andrzej Bialecki
            Reporter:
            Andrzej Bialecki
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development