Hadoop Common / HADOOP-3981

Need a distributed file checksum algorithm for HDFS


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed
    • Release Note:
      Implemented MD5-of-xxxMD5-of-yyyCRC32 which is a distributed file checksum algorithm for HDFS, where xxx is the number of CRCs per block and yyy is the number of bytes per CRC.

      Changed DistCp to use file checksum for comparing files if both source and destination FileSystem(s) support getFileChecksum(...).
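
      As a way to make the DistCp part of the release note concrete, the following is a minimal sketch of comparing a source and a destination file through the FileSystem API's getFileChecksum(...); the class name and the null-handling policy are illustrative assumptions, not the actual DistCp code.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileChecksum;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Sketch of a DistCp-style check: treat files as equal only when both
        // file systems return a checksum and the checksums match.
        public class ChecksumCompare {
          public static boolean sameChecksum(Path src, Path dst, Configuration conf)
              throws Exception {
            FileSystem srcFs = src.getFileSystem(conf);
            FileSystem dstFs = dst.getFileSystem(conf);
            FileChecksum srcSum = srcFs.getFileChecksum(src);   // null if the FileSystem has no checksum support
            FileChecksum dstSum = dstFs.getFileChecksum(dst);   // null if the FileSystem has no checksum support
            if (srcSum == null || dstSum == null) {
              return false;  // no checksum available; caller must fall back to another comparison
            }
            return srcSum.equals(dstSum);
          }
        }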

    Description

      Traditional message digest algorithms, such as MD5 and SHA-1, require reading the entire input sequentially at a single location. HDFS supports files that are multiple terabytes in size, so the overhead of reading an entire file in one place is prohibitive. A distributed file checksum algorithm is needed for HDFS.
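
      The approach described in the release note avoids reading the whole file in one place by combining digests of data that can be hashed where it lives. Below is a minimal local sketch of the MD5-of-xxxMD5-of-yyyCRC32 composition in plain Java; the class name, parameters, and the big-endian CRC serialization are illustrative assumptions, and the actual HDFS implementation computes the per-block digests on the datanodes from the CRCs they already store.

        import java.security.MessageDigest;
        import java.util.zip.CRC32;

        // Illustrative sketch only: composes a file checksum as
        // MD5( MD5(CRCs of block 1), MD5(CRCs of block 2), ... ).
        public class ChecksumSketch {
          public static byte[] fileChecksum(byte[] data, int bytesPerCrc, int blockSize)
              throws Exception {
            MessageDigest fileMd5 = MessageDigest.getInstance("MD5");
            for (int blockStart = 0; blockStart < data.length; blockStart += blockSize) {
              int blockEnd = Math.min(blockStart + blockSize, data.length);
              MessageDigest blockMd5 = MessageDigest.getInstance("MD5");
              for (int off = blockStart; off < blockEnd; off += bytesPerCrc) {
                int len = Math.min(bytesPerCrc, blockEnd - off);
                CRC32 crc = new CRC32();
                crc.update(data, off, len);               // yyy = bytesPerCrc bytes per CRC
                long v = crc.getValue();
                blockMd5.update(new byte[] {              // 4-byte big-endian CRC (assumed encoding)
                    (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v});
              }
              fileMd5.update(blockMd5.digest());          // MD5 over this block's xxx CRCs
            }
            return fileMd5.digest();                      // MD5 over all per-block MD5s
          }
        }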

      Attachments

        1. 3981_20080909.patch
          22 kB
          Tsz-wo Sze
        2. 3981_20080910.patch
          29 kB
          Tsz-wo Sze
        3. 3981_20080910b.patch
          29 kB
          Tsz-wo Sze
        4. 3981_20080912.patch
          28 kB
          Tsz-wo Sze


    People

      Assignee: szetszwo (Tsz-wo Sze)
      Reporter: szetszwo (Tsz-wo Sze)
      Votes: 0
      Watchers: 5

    Dates

      Created:
      Updated:
      Resolved: