Issue Details (XML | Word | Printable)

Key: HADOOP-3941
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Tsz Wo (Nicholas), SZE
Reporter: Tsz Wo (Nicholas), SZE
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Extend FileSystem API to return file-checksums/file-digests

Created: 12/Aug/08 11:05 PM   Updated: 20/Nov/08 11:38 PM
Return to search
Component/s: fs
Affects Version/s: None
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 3941_20080818.patch 2008-08-18 08:32 PM Tsz Wo (Nicholas), SZE 10 kB
Text File Licensed for inclusion in ASF works 3941_20080819.patch 2008-08-19 06:54 PM Tsz Wo (Nicholas), SZE 15 kB
Text File Licensed for inclusion in ASF works 3941_20080819b.patch 2008-08-19 10:36 PM Tsz Wo (Nicholas), SZE 18 kB
Text File Licensed for inclusion in ASF works 3941_20080820.patch 2008-08-20 06:50 PM Tsz Wo (Nicholas), SZE 18 kB
Text File Licensed for inclusion in ASF works 3941_20080826.patch 2008-08-26 09:12 PM Tsz Wo (Nicholas), SZE 14 kB
Text File Licensed for inclusion in ASF works 3941_20080827.patch 2008-08-27 06:25 PM Tsz Wo (Nicholas), SZE 17 kB
Text File Licensed for inclusion in ASF works 3941_20080904.patch 2008-09-05 01:10 AM Tsz Wo (Nicholas), SZE 10 kB
Issue Links:
Blocker
 

Hadoop Flags: Reviewed
Release Note: Added new FileSystem APIs: FileChecksum and FileSystem.getFileChecksum(Path).
Resolution Date: 05/Sep/08 10:53 PM


 Description  « Hide
Suppose we have two files in two locations (may be two clusters) and these two files have the same size. How could we tell whether the content of them are the same?

Currently, the only way is to read both files and compare the content of them. This is a very expensive operation if the files are huge.

So, we would like to extend the FileSystem API to support returning file-checksums/file-digests.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Tsz Wo (Nicholas), SZE added a comment - 13/Aug/08 10:34 PM
How about we add the following optional method in the FileSystem API?
//a new optional method in FileSystem.java
public abstract FileChecksum getFileChecksum(String algorithm, Path p);

where FileChecksum is a new interface in hadoop.fs package

interface FileChecksum {
  String getAlgorithm();

  int getLength();

  byte[] getBytes();
}

Doug Cutting added a comment - 17/Aug/08 10:18 AM
How about making FileChecksum an abstract class, adding the method:

public abstract equals(Object other);


Doug Cutting added a comment - 17/Aug/08 10:44 AM
Sorry, that should have been something like:
public boolean equals(Object other) {
  if (!(other instanceof FileChecksum))
    return false;
  FileChecksum that = (FileChecksum)other;
  return this.getAlgorithm().equals(that.getAlgorithm())
    && this.getBytes().equals(that.getBytes());
}

Tsz Wo (Nicholas), SZE added a comment - 18/Aug/08 08:32 PM
3941_20080818.patch: API change preview
  • Added getFileChecksum(...) in FileSystem
  • Added abstract class FileChecksum
    • implemented equals(...) and hashCode()
    • renamed the method getAlgorithm() mentioned above to getAlgorithmName()

I am going to implement MD5FileChecksum for LocalFileSysetm.


Tsz Wo (Nicholas), SZE added a comment - 19/Aug/08 06:54 PM
3941_20080819.patch: implemented MD5FileChecksum for RawLocalFileSystem. Need unit tests.

Doug Cutting added a comment - 19/Aug/08 08:49 PM
Why not have the default implementation of getFileChecksum() throw the "unsupported operation" exception so that we don't have duplicated code in every subclass? Also, should this really throw an exception or return null? I would guess that most applications would want to handle this not as an exceptional condition somewhere higher on the stack, but rather explicitly where getFileChecksum() is called, so perhaps null would be better.

Do you intend to implement this for HDFS here, or as a separate issue?


Tsz Wo (Nicholas), SZE added a comment - 19/Aug/08 10:05 PM

Why not have the default implementation of getFileChecksum() throw the "unsupported operation" exception so that we don't have duplicated code in every subclass? Also, should this really throw an exception or return null? I would guess that most applications would want to handle this not as an exceptional condition somewhere higher on the stack, but rather explicitly where getFileChecksum() is called, so perhaps null would be better.

For other optional operaions (e.g. append), we declare an abstract method in FileSystem and let other FileSystem sub-classes throw "Not supported". Should we do the same for getFileChecksum()?

I think throwing IOException might be better than returning null. Otherwise, applications have to check null, or they may get NPE which is a RuntimeException.

The methods defined in java.security.MessageDigest, e.g. getInstance(String algorithm), throw NoSuchAlgorithmException. We might want to do something similar.

Do you intend to implement this for HDFS here, or as a separate issue?

Yes, it is because there are more works for implementing HDFS.


Tsz Wo (Nicholas), SZE added a comment - 19/Aug/08 10:36 PM
3941_20080819b.patch: added a unit test and fixed a Findbugs warning.

Tsz Wo (Nicholas), SZE added a comment - 19/Aug/08 10:45 PM
Below is a summary of the default getFileChecksum() implementation options. We mentioned the first three before. I added the fourth.
  1. no implementation, declare it as abstract
  2. returning null
  3. throwing "Not supported" IOException
  4. if algorithm is MD5, return a MD5FileChecksum. Otherwise, do #2 or #3.

However, MD5 in #4 may not be efficient for HDFS since it will read the entire file.


dhruba borthakur added a comment - 20/Aug/08 05:54 AM
This patch does not implement checksum on HDFS files, right?

Do you plan to generate MD5s for HDFS files too? For HDFS, does it make sense to create a checksum from the blk*.meta files because the size of the meta file will be much much lesser than the size of the data file?


Tsz Wo (Nicholas), SZE added a comment - 20/Aug/08 05:11 PM

This patch does not implement checksum on HDFS files, right?

You are correct. The patch only throws a "not supported" exception for HDFS.

Do you plan to generate MD5s for HDFS files too? For HDFS, does it make sense to create a checksum from the blk*.meta files because the size of the meta file will be much much lesser than the size of the data file?

No, the original MD5 algorithm may not be efficient for large files. I think we need a distributed file digest algorithm for HDFS. Yes, one way is to compute MD5 over the meta files. This will reduce the overhead dramatically. I probably will implement a MD5-over-CRC32 for HDFS.


Tsz Wo (Nicholas), SZE added a comment - 20/Aug/08 06:07 PM
Created HADOOP-3981 - Need a distributed file checksum algorithm for HDFS

Tsz Wo (Nicholas), SZE added a comment - 20/Aug/08 06:50 PM
3941_20080820.patch: fixed a bug.

All the patches up to now implements option #1, which is our usual approach.


Doug Cutting added a comment - 26/Aug/08 10:11 AM
> Below is a summary of the default getFileChecksum() implementation options [ ... ]

The default should minimize code duplication, if possible. An abstract method should only be used for mandatory methods. Since this is an optional method, a default implementation should be provided.

The choice of an exception or null depends on the expected use. An exception should be thrown for unusual situations that are best handled non-locally, somewhere above the call. The absence of a checksum should probably be handled at the site of the call, so returning null seems a better choice than an exception here. Another option might be to return a trivial checksum, e.g., the file's length.

Perhaps we should include a use of this new feature in the patch, to better guide its implementation. Should we extend distcp to use this? Or do you have another canonical application in mind? If we add features without applications of them, we risk a design that does not meet any needs.


Tsz Wo (Nicholas), SZE added a comment - 26/Aug/08 08:12 PM
> ... so returning null seems a better choice than an exception here. Another option might be to return a trivial checksum, e.g., the file's length.

I think returning null make sense. We cannot return a trivial checksum since an algorithm is specified in the call. We should only return the checksum generated by the specified algorithm.

> Should we extend distcp to use this?

Yes, the canonical application is distcp. I could also change distcp to use the new API.


Tsz Wo (Nicholas), SZE added a comment - 26/Aug/08 09:12 PM
3941_20080826.patch:
  • The default implementation is provided in the FileSystem class.
  • It returns null when the algorithm is not found.
  • Added FileLengthChecksum which uses file lengths as checksums

Tsz Wo (Nicholas), SZE added a comment - 27/Aug/08 06:25 PM
3941_20080827.patch: changed DistCp to use LengthFileChecksum

Tsz Wo (Nicholas), SZE added a comment - 28/Aug/08 02:35 AM
passed all tests locally. try Hudson

Hadoop QA added a comment - 28/Aug/08 09:40 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389024/3941_20080827.patch
against trunk revision 689733.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3134/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3134/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3134/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3134/console

This message is automatically generated.


Doug Cutting added a comment - 28/Aug/08 03:20 PM
I don't see the point in passing the checksum algorithm name to getFileChecksum(). Do we expect a FileSystem to actually checksum a file on demand? I assume not, that this feature is primarily for accessing pre-computed checksums, and that most filesystems will only support a single checksum algorithm.

There are two primary cases to consider:
1. Copying files between filesystems that have pre-computed checksums using the same algorithm.
2. Copying files between filesystems which either do not have pre-computed checksums or use different algorithms.

In (2) copies should use flie lengths or perhaps fail, and in (1) we should use checksums. Right?

In any case, hardwiring distcp to use FileLengthChecksum doesn't seem like an improvement.


Tsz Wo (Nicholas), SZE added a comment - 28/Aug/08 07:06 PM
> Do we expect a FileSystem to actually checksum a file on demand? I assume not, that this feature is primarily for accessing pre-computed checksums, ...

For HDFS, I am not sure whether sending all crcs to client is good enough since the size of all crcs is 1/128 of the file size, which is big for large files. We might want to reduce the network traffic (especially in the case of distcp) by computing a second level of checksums (e.g. compute a MD5 for all the crcs of a block). So, I think this feature is not only for accessing pre-computed checksums, but indeed a framework for supporting checksum algorithms.

> In (2) copies should use flie lengths or perhaps fail, ...

It should not fail. Otherwise, we cannot copy from local fs to hdfs. We are currently using file length as checksum. It is simply too easy to have false positive.

> In any case, hardwiring distcp to use FileLengthChecksum doesn't seem like an improvement.

It is only temporary. Once we have a distributed checksum implementation, we could change DistCp to use it. The distributed checksum implementation will optimize for HDFS, so that coping from HDFS to HDFS will be very efficient (which is the main purpose of distcp). If necessary, we could provide an option in distcp for users to specify the checksum algorithm.


Doug Cutting added a comment - 29/Aug/08 05:57 AM
Distcp should not hardwire any algorithm, but rather use the preferred algorithm of the filesystems involved. That way checksums will not be used just for HDFS->HDFS, but also for S3->S3, etc.

Tsz Wo (Nicholas), SZE added a comment - 29/Aug/08 11:28 PM
> Distcp should not hardwire any algorithm

That is true. We might need a method for getting the supported algorithms of a file system. Algorithms will be sorted by the preference. For example, if S3 supports {MD5, FileLength}, HDFS supports {HDFS-Checksum, FileLength} and LocalFS supports {MD5, HDFS-Checksum, FileLength}, then

  • S3 -> HDFS or HDFS -> S3 will use FileLength
  • S3 -> S3 will use MD5
  • S3 -> LocalFS will use MD5
  • LocalFS -> HDFS will use HDFS-Checksum

Doug Cutting added a comment - 04/Sep/08 09:12 PM
> We might need a method for getting the supported algorithms of a file system.

If we remove the "algorithm" parameter to getFileChecksum() then each FileSystem would simply return checksums using its native algorithm. When these match, cross-filesystem copies would be checksummed. Later, if we have filesystems that implement multiple checksum algorithms, we might consider something more elaborate, but that seems sufficient for now, no?


Tsz Wo (Nicholas), SZE added a comment - 04/Sep/08 09:25 PM

If we remove the "algorithm" parameter to getFileChecksum() then each FileSystem would simply return checksums using its native algorithm. When these match, cross-filesystem copies would be checksummed. Later, if we have filesystems that implement multiple checksum algorithms, we might consider something more elaborate, but that seems sufficient for now, no?

+1 Then, I will only add

public FileChecksum getFileChecksum(Path f) throws IOException

in this patch. If we need more checksum algorithms later, we should add these two methods:

public FileChecksum getFileChecksum(String algorithm, Path f) throws IOException

public String[] getSupportedChecksumAlgorithms()

Tsz Wo (Nicholas), SZE added a comment - 05/Sep/08 01:10 AM
3941_20080904.patch: Changed FileSystem API to getFileChecksum(Path).

Doug Cutting added a comment - 05/Sep/08 08:31 PM
+1 This looks good to me.

Hadoop QA added a comment - 05/Sep/08 10:06 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389550/3941_20080904.patch
against trunk revision 692492.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3190/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3190/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3190/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3190/console

This message is automatically generated.


Tsz Wo (Nicholas), SZE added a comment - 05/Sep/08 10:53 PM
I just committed this.

Tsz Wo (Nicholas), SZE added a comment - 05/Sep/08 10:59 PM
I forgot to mention that the test failed is nothing to do with this issue. See HADOOP-4078.

Hudson added a comment - 06/Sep/08 01:23 PM