• Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.15
    • Component/s: segment
    • Environment:

      xubuntu 17.10, docker container of apache/nutch LATEST


      The problem probably occurs since commit

      How to reproduce:

      • create container from apache/nutch image (latest)
      • open terminal in that container
      • set
      • create crawldir and urls file
      • run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
      • run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
        • this results in a segment (e.g. 20180304134215)
      • run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
      • run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
        • ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
      • run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215)
      • run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter)
        • console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
        • resulting segment: 20180304134535
      • ls in mycrawl/MERGEDsegments/20180304134535 -> only existing folder: crawl_generate
      • run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments), which then fails with a follow-up error
        • console output: `LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
          LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
              at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(
              at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(
              at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(
              at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(
              at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(
              at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(
              at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
              at org.apache.hadoop.mapreduce.Job$
              at org.apache.hadoop.mapreduce.Job$
              at Method)
              at org.apache.hadoop.mapreduce.Job.submit(
              at org.apache.hadoop.mapreduce.Job.waitForCompletion(
              at org.apache.nutch.crawl.LinkDb.invert(
              at org.apache.nutch.crawl.LinkDb.main(`
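
      The reproduction steps above can be collected into a single shell script. This is a sketch only: the segment timestamp differs on every run (the hardcoded fallback is just the example name from this report), and DRY_RUN defaults to 1 so the script only prints the commands; set DRY_RUN=0 inside a real Nutch 1.15 runtime/local directory to execute them.

      ```shell
      #!/bin/sh
      # Sketch of the reproduction sequence from this report.
      # DRY_RUN=1 (default): print each command instead of running it.
      run() {
        if [ "${DRY_RUN:-1}" = 1 ]; then echo "bin/nutch $*"; else bin/nutch "$@"; fi
      }

      run inject mycrawl/crawldb urls/urls
      run generate mycrawl/crawldb mycrawl/segments 1
      # Pick the newest segment produced by generate; fall back to the
      # example timestamp from the report when running in dry-run mode.
      SEGMENT=$(ls -d mycrawl/segments/* 2>/dev/null | tail -1)
      SEGMENT=${SEGMENT:-mycrawl/segments/20180304134215}
      run fetch "$SEGMENT" -threads 2
      run parse "$SEGMENT" -threads 2
      run updatedb mycrawl/crawldb "$SEGMENT"
      run mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter
      # Per the report, the merged segment now contains only crawl_generate,
      # so the final step fails on the missing parse_data directory:
      run invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments
      ```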

      So it seems that the mapreduce job corrupts the merged segment folder during the mergesegs command: although SegmentMerger reports using all six segment subdirectories, only crawl_generate survives the merge.


      Note that this issue is not limited to merging a single segment as described above. As the attached screenshot shows, the problem also appears when multiple bin/nutch generate/fetch/parse/updatedb rounds are executed before mergesegs, i.e. with a segment count > 1.



      Attachments:

        1. Screenshot_2018-03-03_18-09-28.png (158 kB, Marco Ebbinghaus)
        2. Screenshot_2018-03-07_07-50-05.png (239 kB, Marco Ebbinghaus)

    • Assignee: Lewis John McGibbney (lewismc)
    • Reporter: Marco Ebbinghaus (mebbinghaus)
    • Votes: 0
    • Watchers: 5