  1. Nutch
  2. NUTCH-1416

IndexerMapReduce can index older version of a document instead of latest one


Details

    • Type: Bug
    • Status: Reopened
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: indexer
    • Labels: None

    Description

      When the index is updated, there is no guarantee that the indexed content is the latest version. The reduce() method of the class IndexerMapReduce contains the following code:
      public void reduce(Text key, Iterator<NutchWritable> values,
          OutputCollector<Text, NutchDocument> output, Reporter reporter)
          throws IOException {
        ……
        else if (value instanceof ParseData) {
          parseData = (ParseData) value;
        } else if (value instanceof ParseText) {
          parseText = (ParseText) value;
        }
        ……
      }
      For example, I fetched web page A 30 days ago and now fetch it again. The key A then corresponds to two ParseData objects (located in different segments). But this code does not compare fetch times and simply overwrites the previous value, so the final value may be the old one.
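To make the problem concrete, here is a minimal, self-contained sketch of the "keep the newest value" idea. It does not use the real Nutch classes: `SegmentValue`, its `fetchTime` field, and `LatestValueSketch.pickLatest` are hypothetical stand-ins introduced only for illustration, and the sketch assumes each value can be associated with the time of the fetch that produced it. It is not the actual patch for this issue.

```java
import java.util.Iterator;

// Hypothetical stand-in for a per-segment value such as ParseData or ParseText.
// The fetchTime field is an assumption; the real Nutch classes are not used here.
final class SegmentValue {
  final String kind;     // e.g. "ParseData" or "ParseText"
  final long fetchTime;  // time of the fetch that produced this value (epoch millis)
  final String payload;  // placeholder for the parsed content

  SegmentValue(String kind, long fetchTime, String payload) {
    this.kind = kind;
    this.fetchTime = fetchTime;
    this.payload = payload;
  }
}

final class LatestValueSketch {
  // Returns the value of the requested kind with the greatest fetch time,
  // or null if no value of that kind is present. Unlike a plain overwrite,
  // the result does not depend on the order in which values arrive.
  static SegmentValue pickLatest(Iterator<SegmentValue> values, String kind) {
    SegmentValue latest = null;
    while (values.hasNext()) {
      SegmentValue value = values.next();
      if (!value.kind.equals(kind)) {
        continue; // a different value type for the same key
      }
      if (latest == null || value.fetchTime > latest.fetchTime) {
        latest = value; // keep only the most recently fetched value
      }
    }
    return latest;
  }
}
```

With a plain overwrite, a newer value followed by an older one leaves the older one in place; with the comparison above, the newer value is kept regardless of iteration order.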

            People

              snagel Sebastian Nagel
              hjy Jianyun He