Nutch / NUTCH-1907

Incorrect output of Outlinks to Hosts within HostDbUpdateReducer

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels: None

    Description

      While looking into the Nutch 2.x HostDb code, specifically
      HostDbUpdateReducer, I 'think' I've discovered a bug in the way we
      output Host data.
      https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/host/HostDbUpdateReducer.java#L87
      I feel that the following code

      host.getInlinks().put(new Utf8(outlink),
          new Utf8(Integer.toString(outlinkCount.getCount(outlink))));
      

      should be changed to the following

      host.getOutlinks().put(new Utf8(outlink),
          new Utf8(Integer.toString(outlinkCount.getCount(outlink))));
      

      Note the difference: the outlink counts are written to the Host's Outlinks map instead of being put into the Inlinks map a second time.
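
      To make the effect easier to see, here is a minimal standalone sketch of the two variants. It is not the actual reducer: the Host class below is a hypothetical stand-in for the Gora-generated bean, plain String replaces Avro's Utf8, and a Map<String, Integer> stands in for outlinkCount. The method names writeOutlinksBuggy and writeOutlinksFixed are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class HostOutlinkSketch {

    /** Hypothetical stand-in for the generated Host bean (assumption, not Nutch code). */
    static class Host {
        private final Map<String, String> inlinks = new HashMap<>();
        private final Map<String, String> outlinks = new HashMap<>();

        Map<String, String> getInlinks() { return inlinks; }
        Map<String, String> getOutlinks() { return outlinks; }
    }

    /** Mirrors the current code: outlink counts end up in the inlinks map. */
    static void writeOutlinksBuggy(Host host, Map<String, Integer> outlinkCount) {
        for (Map.Entry<String, Integer> e : outlinkCount.entrySet()) {
            host.getInlinks().put(e.getKey(), Integer.toString(e.getValue()));
        }
    }

    /** Mirrors the proposed change: outlink counts go to the outlinks map. */
    static void writeOutlinksFixed(Host host, Map<String, Integer> outlinkCount) {
        for (Map.Entry<String, Integer> e : outlinkCount.entrySet()) {
            host.getOutlinks().put(e.getKey(), Integer.toString(e.getValue()));
        }
    }

    public static void main(String[] args) {
        // Made-up outlink counts for a single host.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("example.org", 3);
        counts.put("example.net", 1);

        Host buggy = new Host();
        writeOutlinksBuggy(buggy, counts);

        Host fixed = new Host();
        writeOutlinksFixed(fixed, counts);

        System.out.println("buggy: inlinks=" + buggy.getInlinks().size()
            + " outlinks=" + buggy.getOutlinks().size());
        System.out.println("fixed: inlinks=" + fixed.getInlinks().size()
            + " outlinks=" + fixed.getOutlinks().size());
    }
}
```

      With the current code the outlinks map stays empty and the inlinks map picks up entries that are really outlinks; with the change each map only holds its own data.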

      Attachments

        1. NUTCH-1907.patch
          0.7 kB
          Lewis John McGibbney

          People

            Assignee: Lewis John McGibbney (lewismc)
            Reporter: Lewis John McGibbney (lewismc)
            Votes: 0
            Watchers: 2
