[HADOOP-3981] Need a distributed file checksum algorithm for HDFS - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.19.0
Component/s: None
Labels: None

Hadoop Flags: Incompatible change, Reviewed
Release Note:

Implemented MD5-of-xxxMD5-of-yyyCRC32, a distributed file checksum algorithm for HDFS, where xxx is the number of CRCs per block and yyy is the number of bytes per CRC.
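
The layered structure that the algorithm name describes can be sketched in plain Java: a CRC32 per chunk of yyy bytes, an MD5 over the CRCs of each block, and an MD5 over the per-block MD5s. This is a minimal single-process sketch under stated assumptions, not the code from the attached patches. The class and method names are hypothetical, the CRCs are serialized as 4-byte big-endian integers by choice, and the whole file is held in memory, whereas in HDFS the per-block work is meant to stay with the nodes that store the blocks.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Hypothetical sketch of an MD5-of-MD5-of-CRC32 digest; not the HDFS implementation.
final class DistributedChecksumSketch {

    private static MessageDigest md5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // MD5 over the CRC32 values of one block, one CRC per bytesPerCrc bytes.
    // Only this block's bytes are read, so the step can run where the block lives.
    static byte[] blockDigest(byte[] data, int offset, int length, int bytesPerCrc) {
        if (bytesPerCrc <= 0) {
            throw new IllegalArgumentException("bytesPerCrc must be positive");
        }
        MessageDigest md = md5();
        int end = offset + length;
        for (int pos = offset; pos < end; pos += bytesPerCrc) {
            int n = Math.min(bytesPerCrc, end - pos);
            CRC32 crc = new CRC32();
            crc.update(data, pos, n);
            // Assumption of this sketch: each CRC is fed to MD5 as 4 big-endian bytes.
            md.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
        }
        return md.digest();
    }

    // MD5 over the per-block MD5s; the combining side only sees 16 bytes per block.
    static byte[] fileDigest(byte[] file, int bytesPerCrc, int crcsPerBlock) {
        if (bytesPerCrc <= 0 || crcsPerBlock <= 0) {
            throw new IllegalArgumentException("parameters must be positive");
        }
        int blockSize = bytesPerCrc * crcsPerBlock;
        MessageDigest md = md5();
        for (int off = 0; off < file.length; off += blockSize) {
            int len = Math.min(blockSize, file.length - off);
            md.update(blockDigest(file, off, len, bytesPerCrc));
        }
        return md.digest();
    }
}
```

Calling `DistributedChecksumSketch.fileDigest(bytes, 512, 4)` and `fileDigest(bytes, 256, 4)` on the same input yields different digests, which suggests why the algorithm name carries both parameters: two checksums are only comparable when xxx and yyy match.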

Changed DistCp to compare files by file checksum when both the source and destination FileSystem(s) support getFileChecksum(...).
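
The comparison rule in this note can be illustrated with a small, self-contained sketch. It is not DistCp's code: the `Checksum` type, the `Result` enum, and the use of `null` for "this file system provides no checksum" are assumptions made for illustration, and what a copy tool should do in the `UNKNOWN` case is deliberately left open.

```java
import java.util.Arrays;

// Hypothetical sketch of checksum-based file comparison; not DistCp's implementation.
final class ChecksumCompareSketch {

    enum Result { SAME, DIFFERENT, UNKNOWN }

    // Stand-in for a file checksum: an algorithm name plus the digest bytes.
    static final class Checksum {
        final String algorithm;
        final byte[] bytes;

        Checksum(String algorithm, byte[] bytes) {
            this.algorithm = algorithm;
            this.bytes = bytes.clone();
        }
    }

    // Checksums decide the comparison only when both sides supply one
    // and both were produced by the same algorithm with the same parameters.
    static Result compare(Checksum source, Checksum destination) {
        if (source == null || destination == null) {
            return Result.UNKNOWN; // at least one side has no checksum support
        }
        if (!source.algorithm.equals(destination.algorithm)) {
            return Result.UNKNOWN; // digests from different algorithms are not comparable
        }
        return Arrays.equals(source.bytes, destination.bytes)
                ? Result.SAME
                : Result.DIFFERENT;
    }
}
```

Treating a mismatch in algorithm name as `UNKNOWN` rather than `DIFFERENT` is a design choice of the sketch: identical files can produce different digests when the CRCs-per-block or bytes-per-CRC settings differ.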

Description

Traditional message digest algorithms, such as MD5 and SHA1, require reading the entire input message sequentially in a central location. HDFS supports large files of multiple terabytes, so the overhead of reading an entire file this way is huge. A distributed file checksum algorithm is needed for HDFS.

Attachments

3981_20080909.patch    10/Sep/08 00:12    22 kB    Tsz-wo Sze
3981_20080910.patch    10/Sep/08 20:40    29 kB    Tsz-wo Sze
3981_20080910b.patch   11/Sep/08 01:51    29 kB    Tsz-wo Sze
3981_20080912.patch    12/Sep/08 22:54    28 kB    Tsz-wo Sze

Issue Links

is blocked by

HADOOP-3941 Extend FileSystem API to return file-checksums/file-digests (Closed)

is related to

HADOOP-4197 Need to update DATA_TRANSFER_VERSION (Closed)
HADOOP-4176 Implement getFileChecksum(Path) in HftpFileSystem (Closed)

People

Assignee: Tsz-wo Sze
Reporter: Tsz-wo Sze
Votes: 0
Watchers: 5

Dates

Created: 20/Aug/08 18:05
Updated: 15/Aug/09 12:37
Resolved: 16/Sep/08 07:45