YARN-3797 (Hadoop YARN; sub-task of YARN-5078 [Umbrella] NodeManager health checker improvements)

NodeManager not blacklisting the disk (shuffle) with errors


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

    Description

      In a multi-node environment, one of the disks (where map outputs are written) on a node went bad. The errors are given below.

      Info fld=0x9ad090a
      sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
      sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 ad 09 08 00 00 08 00
      end_request: critical medium error, dev sdf, sector 162334984
      mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
      sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
      Info fld=0x9af8892
      sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
      sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
      end_request: critical medium error, dev sdf, sector 162498704
      mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
      mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
      sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
      Info fld=0x9af8892
      sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
      sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
      end_request: critical medium error, dev sdf, sector 162498704
      

      The disk checker still passes, because the system can create and delete directories on the disk without issues. However, the data served from the disk can be corrupt, so fetchers fail during CRC verification, which causes unwanted delays and retries.
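
      The gap described above can be sketched as follows. This is a minimal Python illustration, not NodeManager code: `dir_check`, `file_crc32`, and `read_check` are hypothetical names, and the CRC32 comparison only stands in for the checksum verification that the fetchers perform. A directory create/delete probe touches only metadata, whereas a read check has to pull the data back and verify it.

```python
import os
import tempfile
import zlib


def dir_check(root):
    """Create and delete a directory under root; True if both succeed."""
    try:
        probe = tempfile.mkdtemp(prefix="probe-", dir=root)
        os.rmdir(probe)
        return True
    except OSError:
        return False


def file_crc32(path, chunk_size=64 * 1024):
    """Stream the file and return its CRC32; raises OSError on a read error."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def read_check(path, expected_crc):
    """True only if the file is readable and its CRC32 matches expected_crc."""
    try:
        return file_crc32(path) == expected_crc
    except OSError:
        return False
```

      A file whose contents have changed on disk still passes `dir_check` on its parent directory but fails `read_check`. Note that a real check would also have to avoid being served from the OS page cache, which this sketch does not address.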

      Ideally, the NodeManager should detect such errors and blacklist/remove those disks.
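
      One way such detection could work is to scan kernel log lines like those quoted in the description and flag any device that reports medium errors. The sketch below is an assumption, not an existing NodeManager feature: the regular expression is fitted only to the `end_request: critical medium error, dev sdf, sector ...` lines shown in this issue, and `count_medium_errors`, `devices_to_blacklist`, and the `threshold` parameter are hypothetical.

```python
import re
from collections import Counter

# Matches only the "end_request" line format quoted in this issue.
MEDIUM_ERROR = re.compile(
    r"end_request: critical medium error, dev (\w+), sector (\d+)"
)


def count_medium_errors(lines):
    """Return a Counter mapping device name to number of medium-error lines."""
    counts = Counter()
    for line in lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def devices_to_blacklist(lines, threshold=1):
    """Return sorted device names with at least `threshold` medium errors."""
    counts = count_medium_errors(lines)
    return sorted(dev for dev, n in counts.items() if n >= threshold)


sample = [
    "sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error",
    "end_request: critical medium error, dev sdf, sector 162334984",
    "end_request: critical medium error, dev sdf, sector 162498704",
    "end_request: critical medium error, dev sdf, sector 162498704",
]
flagged = devices_to_blacklist(sample, threshold=2)
```

      Mapping a device name such as `sdf` back to the NodeManager local directory that lives on it is a further step that this sketch does not cover.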


          People

            Assignee: Unassigned
            Reporter: Rajesh Balamohan (rajesh.balamohan)
            Votes: 0
            Watchers: 12
