  1. Nutch
  2. NUTCH-972

Mergedb doesn't merge with empty directory, as is the case with merge (for indexes)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.3
    • Component/s: storage
    • Labels:

      Description

      Just an issue of unexpected behavior. This series of commands works with bin/nutch merge to merge indexes, but not with bin/nutch mergedb to merge crawldbs.

      allcrawldb="crawl/allcrawldb"
      temp_crawldb="crawl/temp_crawldb"
      merge_dbs="$it_crawldb $allcrawldb"

      # if [[ ! -d $allcrawldb ]]
      # then
      # merge_dbs="$it_crawldb"
      # fi
      # uncomment the above and mergedb will work fine.
      bin/nutch mergedb $temp_crawldb $merge_dbs
      rm -r $it_crawldb $allcrawldb crawl/segments crawl/linkdb
      mv $temp_crawldb $allcrawldb

      This is the exception that occurs:

      bin/nutch mergedb crawl/temp_crawldb crawl/crawldb crawl/allcrawldb
      CrawlDb merge: starting at 2011-03-27 10:13:06
      Adding crawl/crawldb
      Adding crawl/allcrawldb
      CrawlDb merge: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/simpatico/nutch-1.2/crawl/allcrawldb/current
      at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
      at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
      at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
      at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
      at org.apache.nutch.crawl.CrawlDbMerger.merge(CrawlDbMerger.java:126)
      at org.apache.nutch.crawl.CrawlDbMerger.run(CrawlDbMerger.java:187)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.nutch.crawl.CrawlDbMerger.main(CrawlDbMerger.java:159)

      Besides the scripting workaround, I've attached a patch that skips adding the empty folder to the collection of dbs to merge. I've also added a log of which dbs actually get added, consistent with the merge interface.
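
      The scripting workaround can be generalized to any number of input dbs. The following is a minimal sketch, not the attached patch: it assumes a crawldb is only mergeable when it contains a current/ subdirectory (inferred from the path in the exception above), and existing_dbs is a hypothetical helper name that is not part of Nutch.

```shell
# Hypothetical helper: print only those crawldbs that contain a "current"
# subdirectory, the path the exception reports as missing.
# Progress messages go to stderr so stdout carries just the usable paths.
existing_dbs() {
  for db in "$@"; do
    if [ -d "$db/current" ]; then
      echo "Adding $db" >&2
      printf '%s\n' "$db"
    else
      echo "Skipping $db (no current/ directory)" >&2
    fi
  done
}

# Usage sketch with the paths from this report (assumes no spaces in paths):
#   dbs=$(existing_dbs crawl/crawldb crawl/allcrawldb)
#   bin/nutch mergedb crawl/temp_crawldb $dbs
```

      Checking for current/ rather than for the db directory itself also covers the case where the directory exists but is empty.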

        Attachments

        1. check_empty.diff
          0.3 kB
          Gabriele Kahlout

              People

              • Assignee: Unassigned
              • Reporter: Gabriele Kahlout (simpatico)