  1. Nutch
  2. NUTCH-971

IndexMerger produces indexes that it cannot itself merge again


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.2
    • Fix Version/s: 1.3
    • Component/s: indexer
    • Labels:
    • Patch Info: Patch Available

      Description

      Here's what I do:

      1. Index the fetched segments
      $ rm -r $new_indexes $temp_indexes
      $ bin/nutch index $new_indexes $it_crawldb crawl/linkdb crawl/segments/*

      I examine the index with Luke and it's as expected.

      2. Merge the new index with the previous one
      $ bin/nutch merge $temp_indexes $new_indexes $indexes
      IndexMerger: starting at 2011-03-26 10:24:58
      IndexMerger: merging indexes to: crawl/temp_indexes
      Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
      IndexMerger: finished at 2011-03-26 10:24:59, elapsed: 00:00:01

      On the first iteration, when $indexes is empty, it works fine: it essentially duplicates $new_indexes into $temp_indexes.
      But on the second iteration, after I mv $temp_indexes $indexes, the merged index $temp_indexes contains only $new_indexes and nothing from $indexes, which still contains the data from the previous iteration. That is, it doesn't merge.
      This unexpected merge behavior is NOT symmetric, i.e.

      $ bin/nutch merge $temp_indexes $indexes $new_indexes
      IndexMerger: starting at 2011-03-26 10:32:15
      IndexMerger: merging indexes to: crawl/temp_indexes
      Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
      IndexMerger: finished at 2011-03-26 10:32:16, elapsed: 00:00:01

      The moral of the story is that a merged index cannot be merged with another, i.e. bin/nutch merge is meant to merge only indexes generated with bin/nutch index (or solrindex, perhaps).
      The only difference I can see between the two indexes is that the merged index doesn't contain the file index_done (and a hidden companion), but adding those to the merged index before merging it again doesn't solve the problem either.

      The way/workaround to make the merged index equivalent to the bin/nutch index generated index seems to be putting it in a "part" subdirectory:

      $ bin/nutch merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
      IndexMerger: starting at 2011-03-26 11:18:10
      IndexMerger: merging indexes to: crawl/temp_indexes/part-1
      Adding file:/Users/simpatico/nutch-1.2/crawl/indexes/part-1
      Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
      IndexMerger: finished at 2011-03-26 11:18:12, elapsed: 00:00:01
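The log above suggests that IndexMerger only picks up an input directory when the index sits in a part-* subdirectory of it. As a rough check before merging, the following sketch tests for that layout. The function name has_part_dirs is made up for illustration, and the "needs a part-* subdirectory" rule is inferred from the behavior in this report, not from Nutch documentation.

```shell
# has_part_dirs DIR
# Hypothetical helper: succeeds if DIR contains at least one part-* subdirectory.
# Assumption (inferred from this report): IndexMerger only adds such subdirectories.
has_part_dirs() {
    for d in "$1"/part-*; do
        # If nothing matches, the glob stays literal and the -d test fails.
        [ -d "$d" ] && return 0
    done
    return 1
}

# Example use (paths taken from the report):
#   has_part_dirs crawl/indexes || echo "crawl/indexes has no part-* subdirectory"
```

An index written directly into the output directory, as in the first merge attempts, would fail this check.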

      Where is this documented? Rather than documenting it, I'd recommend having IndexMerger handle it, as in the attached patch.
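Without the patch, the workaround can be wrapped in a small script so that every iteration writes the merged index into a part-1 subdirectory. This is only a sketch of the manual steps described in this report, not what the attached patch does: the function name merge_iteration, the NUTCH variable, and the choice to skip $indexes when it does not exist yet are all assumptions.

```shell
# Command used to invoke Nutch; override NUTCH if it lives elsewhere (assumed name).
NUTCH="${NUTCH:-bin/nutch}"

# merge_iteration INDEXES NEW_INDEXES TEMP_INDEXES
# Hypothetical wrapper: merges NEW_INDEXES into INDEXES via TEMP_INDEXES,
# always writing the result to TEMP_INDEXES/part-1 so it can be merged again.
merge_iteration() {
    indexes=$1
    new_indexes=$2
    temp_indexes=$3

    rm -rf "$temp_indexes"
    if [ -d "$indexes" ]; then
        # Later iterations: merge the previous result with the new index.
        $NUTCH merge "$temp_indexes/part-1" "$indexes" "$new_indexes" || return 1
    else
        # First iteration: nothing to merge with yet.
        $NUTCH merge "$temp_indexes/part-1" "$new_indexes" || return 1
    fi

    # Replace the previous merged index only after a successful merge.
    rm -rf "$indexes"
    mv "$temp_indexes" "$indexes"
}

# Example use (paths taken from the report):
#   merge_iteration crawl/indexes crawl/new_indexes crawl/temp_indexes
```

The old index is removed only after the merge command returns successfully, so a failed merge leaves $indexes untouched.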

        Attachments

        1. IndexMerger-part.diff
          0.5 kB
          Gabriele Kahlout

              People

              • Assignee: Unassigned
              • Reporter: Gabriele Kahlout (simpatico)
              • Votes: 0
              • Watchers: 1