Nutch / NUTCH-2756

Segment part problem with HDFS in distributed mode

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 1.15
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      During parsing, part of the segment data on HDFS is sometimes missing once the parse job has finished.
      When I look at our HDFS, the affected part file has a size of 0 bytes (see attachments).

      After that, the CrawlDB complains about this specific (corrupted?) part:

      log_crawl

      2019-12-04 22:25:57,454 INFO mapreduce.Job: Task Id : attempt_1575479127636_0047_m_000017_2, Status : FAILED
      Error: java.io.EOFException: hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 not a SequenceFile
      at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
      at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
      at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1886)
      at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
      at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
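
Since the failure above is triggered by a part file that is empty, it may help to detect such files before the next job reads the segment. The following is a minimal sketch, not part of Nutch: it assumes the text output of `hdfs dfs -ls -R <segment dir>` (permissions, replication, owner, group, size, date, time, path) is available as a string, and the function name `empty_part_files` is made up for illustration.

```python
import re
from typing import List

# Assumed layout of one `hdfs dfs -ls -R` line:
# -rw-r--r--   3 hadoop supergroup          0 2019-12-04 22:25 /path/part-r-00004
LS_LINE = re.compile(
    r"^(?P<perms>[-d]\S+)\s+(?P<repl>\S+)\s+(?P<owner>\S+)\s+(?P<group>\S+)\s+"
    r"(?P<size>\d+)\s+(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<path>.+)$"
)


def empty_part_files(listing: str) -> List[str]:
    """Return paths of zero-length regular files that belong to a part-* output.

    A file counts as part output if any path component starts with "part-",
    so both plain part files (crawl_parse/part-r-00004) and the files inside
    a part directory (parse_data/part-r-00004/index) are covered.
    """
    empty = []
    for line in listing.splitlines():
        m = LS_LINE.match(line.strip())
        if not m:
            continue  # header lines such as "Found N items" or unrelated text
        if m.group("perms").startswith("d"):
            continue  # directories are always listed with size 0
        path = m.group("path")
        is_part = any(c.startswith("part-") for c in path.split("/"))
        if is_part and int(m.group("size")) == 0:
            empty.append(path)
    return empty
```

Feeding it the recursive listing of a segment directory returns the suspicious paths, which could then be checked by hand before running the CrawlDB step. A zero length does not prove corruption on its own, so this is only a first filter.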

      When I check the namenode logs, I don't see any error while the segment part is being written, but one hour later I get the following log:

      log_namenode

      2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 2], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
      2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index closed.
      2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 1], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
      2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 closed.
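
To relate these lease recoveries to the failing job, the lease holder name can be split into its task attempt fields. The sketch below is an illustration only: the regular expression is written against the two "Recovering [Lease..." lines quoted above, and the names `LeaseRecovery` and `parse_lease_recoveries` are hypothetical.

```python
import re
from typing import List, NamedTuple


class LeaseRecovery(NamedTuple):
    job: str        # e.g. "job_1575479127636_0046"
    task_type: str  # "m" for a map task, "r" for a reduce task
    task: int       # task number within the job
    attempt: int    # attempt number of that task (starts at 0)
    path: str       # file whose lease was recovered


# Matches the namenode "Recovering [Lease. Holder: DFSClient_attempt_..." lines.
RECOVERING = re.compile(
    r"Recovering \[Lease\.\s+Holder: DFSClient_attempt_"
    r"(?P<cluster>\d+)_(?P<job>\d+)_(?P<type>[mr])_(?P<task>\d+)_(?P<attempt>\d+)_"
    r".*?\], src=(?P<path>\S+)"
)


def parse_lease_recoveries(log_text: str) -> List[LeaseRecovery]:
    """Extract one LeaseRecovery per matching line of a namenode log."""
    return [
        LeaseRecovery(
            job=f"job_{m.group('cluster')}_{m.group('job')}",
            task_type=m.group("type"),
            task=int(m.group("task")),
            attempt=int(m.group("attempt")),
            path=m.group("path"),
        )
        for m in RECOVERING.finditer(log_text)
    ]
```

Applied to the lines above, both recoveries point to attempt 1 of reduce task 4 in job 0046, while the reader that failed with the EOFException belongs to job 0047. That is compatible with the writer never closing its output files, although the logs alone do not establish it.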

      This issue is hard to reproduce, and I can't figure out what the preconditions are. It seems to happen randomly.
      Maybe the problem comes from the way the file is closed.

        Attachments

        1. yarn-site.xml
          2 kB
          Lucas Pauchard
        2. yarn-env.sh
          6 kB
          Lucas Pauchard
        3. syslog
          137 kB
          Lucas Pauchard
        4. mapred-site.xml
          2 kB
          Lucas Pauchard
        5. hdfs-site.xml
          1 kB
          Lucas Pauchard
        6. hadoop-env.sh
          16 kB
          Lucas Pauchard
        7. 0_byte_file_screenshot.PNG
          22 kB
          Lucas Pauchard

            People

            • Assignee: Unassigned
            • Reporter: Lucas Pauchard (lucasp)
            • Votes: 0
            • Watchers: 2