Description
During recovery, the logger copied/sorted a recovery walog to HDFS. The copy completed without error, but replaying the data failed with a checksum error, and the system did not recover without manual intervention. The workaround was to find the datanode serving the bad block and stop it. I then removed the bad recovery file and restarted the master. The copy/sort ran again, this time using a different datanode, and recovery proceeded successfully.
We need to use a higher replication factor and/or a more sophisticated approach to verifying and restarting recoveries.
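
To make the "verify and restart" idea concrete, here is a minimal sketch of one possible approach. It is not the logger's actual code: `recover_with_verification`, `copy_attempt`, and `RecoveryFailed` are hypothetical names, and comparing a CRC32 of the source bytes against the copy is a simplifying assumption (a real copy/sort rewrites the data, so it would need its own verification step, such as reading the sorted file back end to end). The point is only that a bad copy gets discarded and retried automatically instead of waiting for manual cleanup.

```python
import zlib
from typing import Callable, Tuple


class RecoveryFailed(Exception):
    """Raised when no attempt produced a copy that passes verification."""


def crc32_of(data: bytes) -> int:
    # Mask to get the same unsigned value on every platform.
    return zlib.crc32(data) & 0xFFFFFFFF


def recover_with_verification(
    source: bytes,
    copy_attempt: Callable[[bytes, int], bytes],
    max_attempts: int = 3,
) -> Tuple[bytes, int]:
    """Copy `source`, verify the result, and retry on a checksum mismatch.

    `copy_attempt(source, attempt)` is a hypothetical callback that writes the
    data and returns what is read back. Returns the verified copy and the
    attempt number that produced it.
    """
    expected = crc32_of(source)
    for attempt in range(1, max_attempts + 1):
        copied = copy_attempt(source, attempt)
        if crc32_of(copied) == expected:
            return copied, attempt
        # Verification failed: drop this copy and try again, ideally
        # steering the next write away from the node that served bad data.
    raise RecoveryFailed(
        "copy failed verification after %d attempts" % max_attempts
    )
```

For example, a `copy_attempt` that returns corrupted bytes on attempt 1 and correct bytes afterwards would cause the function to return on attempt 2, which mirrors what happened here once the bad datanode was stopped and the copy/sort was rerun.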