Nutch / NUTCH-972

Mergedb doesn't merge with empty directory, as is the case with merge (for indexes)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version: 1.2
    • Fix Version: 1.3
    • Component: storage

    Description

      This is an issue of unexpected behavior. The following series of commands works with bin/nutch merge (to merge indexes) but not with bin/nutch mergedb (to merge crawldbs).

      allcrawldb="crawl/allcrawldb"
      temp_crawldb="crawl/temp_crawldb"
      merge_dbs="$it_crawldb $allcrawldb"

      # if [[ ! -d $allcrawldb ]]
      # then
      # merge_dbs="$it_crawldb"
      # fi
      # uncomment the above and mergedb will work fine.
      bin/nutch mergedb $temp_crawldb $merge_dbs
      rm -r $it_crawldb $allcrawldb crawl/segments crawl/linkdb
      mv $temp_crawldb $allcrawldb

      This is the exception that occurs:

      bin/nutch mergedb crawl/temp_crawldb crawl/crawldb crawl/allcrawldb
      CrawlDb merge: starting at 2011-03-27 10:13:06
      Adding crawl/crawldb
      Adding crawl/allcrawldb
      CrawlDb merge: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/simpatico/nutch-1.2/crawl/allcrawldb/current
      at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
      at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
      at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
      at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
      at org.apache.nutch.crawl.CrawlDbMerger.merge(CrawlDbMerger.java:126)
      at org.apache.nutch.crawl.CrawlDbMerger.run(CrawlDbMerger.java:187)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.nutch.crawl.CrawlDbMerger.main(CrawlDbMerger.java:159)

      Besides the scripting workaround, I've attached a patch that skips adding the empty folder to the collection of dbs to merge. I've also added a log of which dbs actually get added, consistent with the merge interface.
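
      For scripts that cannot pick up the patch, the same skip-and-log idea can be sketched on the shell side. This is only an illustration, not the attached patch (which changes CrawlDbMerger and is not reproduced here). The function name filter_dbs is hypothetical, and the sketch assumes a crawldb is mergeable only if it contains a current subdirectory, as the exception path above suggests. Like the original script, it builds a space-separated list, so it would not handle paths containing spaces.

```shell
# Hypothetical helper, not part of Nutch: keep only the crawldbs that have a
# "current" subdirectory and log which ones are kept or skipped.
# The result is left in the variable merge_dbs as a space-separated list.
filter_dbs() {
  merge_dbs=""
  for db in "$@"; do
    if [ -d "$db/current" ]; then
      echo "Adding $db"
      merge_dbs="$merge_dbs $db"
    else
      echo "Skipping $db (no current/ subdirectory)"
    fi
  done
  # Drop the leading space left by the first append.
  merge_dbs="${merge_dbs# }"
}
```

      With this in place, the script could call filter_dbs $it_crawldb $allcrawldb before running bin/nutch mergedb $temp_crawldb $merge_dbs, instead of testing each directory by hand.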

      Attachments

        1. check_empty.diff
          0.3 kB
          Gabriele Kahlout

            People

              Assignee: Unassigned
              Reporter: Gabriele Kahlout (simpatico)