Tom White added a comment - 06/May/08 07:17 PM
Here's a patch for a native S3 filesystem.
Second patch with the following changes:
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381601/hadoop-930-v2.patch against trunk revision 654265.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 159 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
-1 release audit. The applied patch generated 208 release audit warnings (more than the trunk's current 207 warnings).
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2420/testReport/
This message is automatically generated.

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381659/hadoop-930-v3.patch against trunk revision 654315.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 159 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2428/testReport/
This message is automatically generated.

Can someone validate that this code works for them?
– Any reason you didn't use the mime type to denote directory files (as jets3t does)?
    public static boolean isDirectory( S3Object object ) {
      return object.getContentType() != null && object.getContentType().equalsIgnoreCase( MIME_DIRECTORY );
    }
where
    public static final String MIME_DIRECTORY = "application/x-directory";
– I believe the MD5 checksum should be set on S3 put (via header) and verified on S3 get. I see plenty of read failures because of checksum failures (though, in retrospect, they could be side effects of stream reading timeouts). This is especially useful if non-Hadoop applications are dealing with the S3 data shared with Hadoop.
– Sometimes 'legacy' buckets have underscores; you might consider trying to survive them.
    String userInfo = uri.getUserInfo();
    // special handling for underscores in bucket names
    if( userInfo == null ) {
      String authority = uri.getAuthority();
      String split[] = authority.split( "[:@]" );
      if( split.length >= 2 )
        userInfo = split[ 0 ] + ":" + split[ 1 ];
    }
and
    String bucketName = uri.getAuthority();
    // handling for underscore in bucket name
    if( bucketName.contains( "@" ) )
      bucketName = bucketName.split( "@" )[ 1 ];

Thanks for the review Chris.
It's to do with efficiency of listing directories. If you use mime type then you can't tell the difference between files and directories when listing bucket keys. So you have to query each key in a directory which can be prohibitively slow. But if you use the _$folder$ suffix convention (which S3Fox uses too BTW) you can easily distinguish files and directories.
The code should be doing this. I agree that it's useful - in fact, the other s3 filesystem needs updating to do this too.
Thanks for the tip. The code does detect this condition, but it might be nice to try to work around it as you say (perhaps emitting a warning). Have you done this elsewhere?
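To illustrate the `_$folder$` suffix convention mentioned above, here is a minimal sketch of how a key listing alone can separate directories from files, with no per-key metadata request. The class and method names are hypothetical and are not taken from the patch; the keys are plain strings standing in for whatever the list-keys response returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the patch's code: classify keys from a bucket
// listing using only the key name, following the _$folder$ suffix convention.
public class FolderSuffixSketch {
    static final String FOLDER_SUFFIX = "_$folder$";

    // A key denotes a directory marker if it ends with the suffix.
    static boolean isDirectoryKey(String key) {
        return key.endsWith(FOLDER_SUFFIX);
    }

    // Strip the suffix to recover the directory path.
    static String directoryName(String key) {
        return key.substring(0, key.length() - FOLDER_SUFFIX.length());
    }

    // Collect the directory paths found in one listing.
    static List<String> directoriesIn(List<String> keys) {
        List<String> dirs = new ArrayList<>();
        for (String key : keys) {
            if (isDirectoryKey(key)) {
                dirs.add(directoryName(key));
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add("user/tom/data_$folder$");
        keys.add("user/tom/part-00000");
        System.out.println(directoriesIn(keys));
    }
}
```

The trade-off is the one raised in the thread: the mime-type approach needs object metadata, while the suffix is visible in the key itself.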
From what I can tell, s3service.listObjects returns an array of S3Object, where each instance already has any associated meta-data in a HashMap, Content-Type being one of them. So I think the penalty has been paid. Here is the jets3t code. Are you seeing different behavior, or disabling meta-data in jets3t for performance reasons? Sorry if I seem a little rusty on my jets3t API.
Sorry, I didn't see where the checksum was being validated on a read. I see it in NativeS3FsOutputStream but not NativeS3FsInputStream. Does Jets3t do this automatically? If so, cool.
I believe those are the only two values that can be munged due to an underscore in the authority.
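As a rough illustration of the underscore workaround discussed above, the sketch below recovers both values from the raw authority string when `uri.getUserInfo()` comes back null. The helper name `splitAuthority` and the `s3n://` example URI are assumptions made for illustration; this splits on the last `@` rather than using the regular expression quoted earlier in the thread.

```java
import java.net.URI;

// Hypothetical helper, not part of the patch: derive user info and bucket
// name from the raw authority, for bucket names that are not valid hostnames.
public class UnderscoreAuthoritySketch {

    // Returns { userInfo or null, bucketName } parsed from "id:secret@bucket".
    static String[] splitAuthority(String authority) {
        int at = authority.lastIndexOf('@');
        if (at < 0) {
            // No credentials embedded: the whole authority is the bucket.
            return new String[] { null, authority };
        }
        return new String[] { authority.substring(0, at), authority.substring(at + 1) };
    }

    public static void main(String[] args) {
        URI uri = URI.create("s3n://id:secret@my_bucket/path");
        String[] parts = splitAuthority(uri.getAuthority());
        System.out.println("userInfo=" + parts[0] + " bucket=" + parts[1]);
    }
}
```

Splitting on the last `@` is a design choice made here so that only the bucket portion is taken from the tail of the authority; whether that is preferable to the original regular expression is left open.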
I don't think all the fields in S3Object are populated - just those returned in the list keys response. See http://docs.amazonwebservices.com/AmazonS3/2006-03-01/ListingKeysResponse.html
I think Jets3t does validate MD5 checksums on reads - but I'll double check.
Good catch.
New patch that works with trunk.
This isn't true: Jets3t doesn't validate MD5 checksums on reads. In fact, the stream is sent straight to the client, so it's not possible in general to validate the MD5 checksum, particularly when doing seeks, which use range GETs. Contrast this with S3FileSystem, which retrieves data in blocks, so it would be easy to add checksum validation there (I've opened HADOOP-3494 for this). For this issue, I think we should just have write checksum validation. I've also created HADOOP-3495 to address supporting underscores in bucket names.
I just committed this. Thanks, Tom!
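For reference, the write-side checksum discussed in this thread can be sketched as follows: compute the MD5 of the bytes about to be uploaded and Base64-encode the digest, which is the form a Content-MD5 request header carries. This is a minimal sketch using only the JDK; the class and method names are hypothetical, and it does not show how the patch or Jets3t actually attaches the value to a PUT.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical sketch, not the patch's code: derive checksum values for an
// upload so the receiving side can reject a corrupted body.
public class WriteChecksumSketch {

    // Raw 16-byte MD5 digest of the payload.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Base64 form of the digest, as used in a Content-MD5 header.
    static String contentMd5Header(byte[] data) {
        return Base64.getEncoder().encodeToString(md5(data));
    }

    // Lowercase hex form of the digest, handy for logging or comparison.
    static String md5Hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : md5(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("Content-MD5: " + contentMd5Header(payload));
        System.out.println("hex: " + md5Hex(payload));
    }
}
```

A digest of the whole object cannot be checked against a partial range of it, which is the difficulty with seeks and range GETs noted above.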