Issue Details (XML | Word | Printable)

Key: HADOOP-4185
Type: Sub-task Sub-task
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Sharad Agarwal
Reporter: Tom White
Votes: 0
Watchers: 4
Operations

If you were logged in you would be able to see more operations.
Hadoop Common
HADOOP-3750

Add setVerifyChecksum() method to FileSystem

Created: 16/Sep/08 12:10 PM   Updated: 23/Apr/09 07:17 PM
Return to search
Component/s: fs
Affects Version/s: None
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 4185_v1.patch 2008-11-11 10:51 AM Sharad Agarwal 3 kB
Text File Licensed for inclusion in ASF works 4185_v2.patch 2008-11-13 08:40 AM Sharad Agarwal 5 kB
Text File Licensed for inclusion in ASF works 4185_v3.patch 2008-11-17 08:56 AM Sharad Agarwal 7 kB

Hadoop Flags: Reviewed
Resolution Date: 18/Nov/08 10:12 AM


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Sharad Agarwal added a comment - 11/Nov/08 10:51 AM
this patch:
adds setVerifyChecksum method to FileSystem, FilterFileSystem and ChecksumFileSystem.

Will appreciate if someone familiar with ChecksumFileSystem can review the changes.


Hairong Kuang added a comment - 11/Nov/08 10:08 PM
Why do you need the method setVerifyChecksum? A ChecksumFileSystem wraps around a raw file system. If a user needs to read a file without verifying its checksum, the user could use the raw file system handle instead.

Tom White added a comment - 11/Nov/08 10:46 PM
The motivation to add setVerifyChecksum to FileSystem is to break core's dependency on HDFS from FsShell by promoting the method from DistributedFileSystem to FileSystem. This change allows users using the shell to disable checksum verification for any filesystem, not just HDFS.

Tsz Wo (Nicholas), SZE added a comment - 11/Nov/08 11:20 PM
Instead of adding setVerifyChecksum(...) to FileSystem, how about putting it to a new interface, say Checksumable, so that DFS and other FS subclasses may implement it. Then, the codes in FsShell should be
if (srcFs instanceof Checksumable) {
      ((Checksumable)srcFs).setVerifyChecksum(verifyChecksum);
    }

Doug Cutting added a comment - 11/Nov/08 11:45 PM
> how about putting it to a new interface, say Checksumable

If any FileSystem might implement this interface then I don't see how this is functionally different than adding this to FileSystem.java. It just adds ugly code whenever you call setVerifyChecksum. Casting and 'instanceof' are indications of poor object modeling. Sometimes they're required, but we shouldn't use them except as a last resort.


Tsz Wo (Nicholas), SZE added a comment - 12/Nov/08 12:03 AM
I agree that we should avoid casting,

However, the current codes introduce a new empty method setVerifyChecksum(...) in FileSystem. Clients cannot tell whether calling this method is effective or not, i.e. clients cannot tell whether the fs supports checksum. In this design, we probably should add another method for telling whether the file system supports setVerifyChecksum(..).

BTW, the name setVerifyChecksum may be confusing. I suggest changing it to setVerifyData. File system may have mechanism other checksum for verifying data.


Hairong Kuang added a comment - 12/Nov/08 12:26 AM
Hadoop only supports two kinds of file systems that support checksum:
1. HDFS. We already have a way to bypass checksum verification in place.
2. ChecksumFileSystem: it wraps around a raw file system which has no idea of checksum. To bypass checksum verification, we could get a handle of the raw FS by calling ChecksumFileSystem.getRawFileSystem() and reading a file using the raw file system handle.

So we could add the following line to FsShell.getSrcFileSystem:

if (!verifyChecksum && srcFs instanceof ChecksumFileSystem) {
  srcFs = ((ChecksumFileSystem)srcFs).getRawFileSystem();
}

Sharad Agarwal added a comment - 12/Nov/08 09:52 AM
@Hairong Your solution will work with respect to ChecksumFileSystem as it exists in core. But the problem is with DistributedFileSystem. As Tom said that to break dependency of core on hdfs, we are promoting the setVerifyChecksum method from DistributedFileSystem to FileSystem. And once we choose to promote then ChecksumFileSystem should honor it.

Does the current patch make sense with respect to making ChecksumFileSystem honor setVerifyChecksum api ?


Doug Cutting added a comment - 12/Nov/08 05:26 PM
> clients cannot tell whether the fs supports checksum

We could have the method return a boolean.

/** @returns true if data verification is enabled. */
public boolean setVerifyData(boolean) { return false; }

But do clients actually care whether the fs supports verification? At this point the FsShell does not care. All that it requires is that one can say, "if you support checksums, disable them", which is what the current patch implements. If we someday in the future need to support such inquiries, then we can add support for that then. But right now, I don't see that that is needed.

I'm +1 for the patch as-is.


Hairong Kuang added a comment - 12/Nov/08 06:54 PM
Does it make more sense if we make "verifyChecksum" as a a parameter to open instead of setting it as an attribute to a file system? File system handles are cached. Setting "verifyChecksum" to be false may bring subtle side effect to later user of the file system handle.

Raghu Angadi added a comment - 12/Nov/08 10:15 PM
Regd the patch : Rather than overriding needChecksum(), verifyChecksum should be passed to FSInputChecker in ChecksumFSInputChecker. This avoids various issues ('sums' is checked for null in the implementation, for e.g.)

Hairong Kuang added a comment - 13/Nov/08 12:22 AM
I meant adding an open method that alllow the user to disable checksum verification to FileSystem.
public FSDataInputStream open(Path f, boolean verifyChecksum) throws IOException;

How does it sound?


Sharad Agarwal added a comment - 13/Nov/08 08:40 AM
Incorporated Raghu's comment. Passing verifyChecksum to FSInputChecker

Doug Cutting added a comment - 13/Nov/08 06:14 PM
> Does it make more sense if we make "verifyChecksum" as a a parameter to open [ ...]

It does make sense to be able to set this on a file-by-file basis, but I'm hesitant to add parameters, since the number of parameters could become unmanageable. Ideally open and create might take an 'options' parameter to pass things like blockSize, verifyData, bufferSize, replication, etc. What do you think of changing (in a separate issue) open and create to an options-style API?


Hairong Kuang added a comment - 14/Nov/08 07:03 PM
> What do you think of changing (in a separate issue) open and create to an options-style API?
This sounds like a great idea to me!

Hairong Kuang added a comment - 14/Nov/08 11:21 PM
The patch looks good to me. Could we get a junit testcase?

Sharad Agarwal added a comment - 17/Nov/08 08:56 AM
included test case

Hairong Kuang added a comment - 18/Nov/08 12:37 AM
+1. The patch looks good to me.

Sharad Agarwal added a comment - 18/Nov/08 09:21 AM
all core tests passed on my machine. Also ant test :
[exec] +1 overall.

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     +1 tests included.  The patch appears to include 4 new or modified tests.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.

     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

Devaraj Das added a comment - 18/Nov/08 10:12 AM
I just committed this. Thanks, Sharad!