Nutch / NUTCH-433

java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: generator, indexer
    • Labels: None
    • Environment: Both Linux/i686 and Mac OS X PPC/Intel, but platform independent

    Description

      The nightly builds have not been working at all for the past couple of weeks. Sami Siren has narrowed it down to HADOOP-331.

      To replicate: download the nightly, then:

      bin/nutch inject crawl/crawldb urls/ # a single URL is in urls/urls – http://apache.org
      bin/nutch generate crawl/crawldb crawl/segments
      bin/nutch fetch crawl/segments/2007...
      bin/nutch updatedb crawl/crawldb crawl/segments/2007...

      1. generate a new segment with 5 URIs
        bin/nutch generate crawl/crawldb crawl/segments -topN 5
        bin/nutch fetch crawl/segments/2007... # new segment
        bin/nutch updatedb crawl/crawldb crawl/segments/2007... # new segment
      2. merge the segments and index
        bin/nutch mergesegs crawl/merged -dir crawl/segments
        ..
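
The steps above can be sketched as a single script. This is only a rough sketch, not the reporter's actual script: the function names `newest_segment` and `replicate` are hypothetical, and it assumes segment directories are named by timestamp (as in crawl/segments/2007...), so the newest one sorts last. It stops at mergesegs, since the remaining steps are not spelled out above.

```shell
#!/bin/sh
# Hypothetical helper: print the newest segment directory under $1.
# Assumes timestamp-style names starting with "2", so lexical order
# matches creation order.
newest_segment() {
  ls -d "$1"/2* 2>/dev/null | sort | tail -n 1
}

# Hypothetical wrapper around the replication steps; run it from the
# root of an unpacked nightly. It is defined here but not called.
replicate() {
  bin/nutch inject crawl/crawldb urls/ || return 1

  # first segment
  bin/nutch generate crawl/crawldb crawl/segments || return 1
  seg=$(newest_segment crawl/segments)
  bin/nutch fetch "$seg" || return 1
  bin/nutch updatedb crawl/crawldb "$seg" || return 1

  # second segment with 5 URIs
  bin/nutch generate crawl/crawldb crawl/segments -topN 5 || return 1
  seg=$(newest_segment crawl/segments)
  bin/nutch fetch "$seg" || return 1
  bin/nutch updatedb crawl/crawldb "$seg" || return 1

  # merge the segments
  bin/nutch mergesegs crawl/merged -dir crawl/segments
}
```

Calling `replicate` after sourcing the script runs the whole sequence and returns the exit status of the first failing step.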

      The mergesegs step crashes with the exception below. The exact same script, with the same start URI, configuration, and plugins, does not crash on a nightly from early January.

      2007-01-18 14:57:11,411 INFO segment.SegmentMerger - Merging 2 segments to crawl/merged_07_01_18_14_56_22/20070118145711
      2007-01-18 14:57:11,482 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145628
      2007-01-18 14:57:11,489 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145641
      2007-01-18 14:57:11,495 INFO segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
      2007-01-18 14:57:11,594 INFO mapred.InputFormatBase - Total input paths to process : 12
      2007-01-18 14:57:11,819 INFO mapred.JobClient - Running job: job_5ug2ip
      2007-01-18 14:57:12,073 WARN mapred.LocalJobRunner - job_5ug2ip
      java.io.EOFException
      at java.io.DataInputStream.readFully(DataInputStream.java:178)
      at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
      at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
      at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
      at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
      at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)
      at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:61)
      at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:100)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spill(MapTask.java:427)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:385)
      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$200(MapTask.java:239)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:188)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)

          People

            Assignee: Sami Siren (siren)
            Reporter: Brian Whitman (bwhitman)