Details
- Type: Improvement
- Priority: Major
- Status: Closed
- Resolution: Fixed
- Hadoop Flags: Reviewed
Description
In order to support HADOOP-8060, MD5MD5CRC32FileChecksum needs to be extended to carry information about the actual checksum type being used. Interoperability between the extended version and branch-1 must be guaranteed when FileSystem.getFileChecksum() is called over hftp, webhdfs, or httpfs.
Attachments
- hadoop-8239-after-hadoop-8240.patch.txt (9 kB) - Kihwal Lee
- hadoop-8239-after-hadoop-8240.patch.txt (6 kB) - Kihwal Lee
- hadoop-8239-before-hadoop-8240.patch.txt (9 kB) - Kihwal Lee
- hadoop-8239-before-hadoop-8240.patch.txt (6 kB) - Kihwal Lee
- hadoop-8239-branch-0.23.patch.txt (11 kB) - Kihwal Lee
- hadoop-8239-trunk-branch2.patch.txt (11 kB) - Kihwal Lee
- hadoop-8239-trunk-branch2.patch.txt (8 kB) - Kihwal Lee
- hadoop-8239-trunk-branch2.patch.txt (11 kB) - Kihwal Lee
Issue Links
- blocks: HDFS-3177 Allow DFSClient to find out and use the CRC type being used for a file. (Closed)
- is blocked by: HDFS-3176 JsonUtil should not parse the MD5MD5CRC32FileChecksum bytes on its own. (Closed)
- is part of: HADOOP-8060 Add a capability to discover and set checksum types per file. (Closed)
Activity
When you fix this jira, can you add public getter methods for bytesPerCRC, crcPerBlock, checksum and checksumType in MD5MD5CRC32FileChecksum? We have an internal tool/library that mimics the computation of the HDFS checksum for a file downloaded to local filesystems (NFS/Linux filesystem). This lets customers validate the checksum of a file after they have downloaded it from HDFS to other filesystems via http (hdfsproxy). Since those fields were not accessible, we wrote a wrapper that parses the MD5MD5CRC32FileChecksum bytes on its own, similar to JsonUtil (HDFS-3176). If public getter methods were available, it would help get rid of the wrapper class.
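For illustration only, accessors along these lines would be enough for our tool (a sketch; the method names are hypothetical and not an existing API):

// Hypothetical accessors inside MD5MD5CRC32FileChecksum; names are illustrative only.
public int getBytesPerCRC()  { return bytesPerCRC; }
public long getCrcPerBlock() { return crcPerBlock; }
public MD5Hash getChecksum() { return md5; }       // the stored MD5-of-MD5s
public DataChecksum.Type getChecksumType() { return crcType; }  // assumes a crcType field is added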
The attached patch updates MD5MD5CRC32FileChecksum to handle the checksum type. The change is backward compatible; e.g., getFileChecksum() calls between 1.x and 2.x clusters work over hftp or webhdfs. The actual implementation of the updated call is in HDFS-3177.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12541210/hadoop-8239-before-hadoop-8240.patch.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common:
org.apache.hadoop.ha.TestZKFailoverController
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1317//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1317//console
This message is automatically generated.
> -1 tests included. The patch doesn't appear to include any new or modified tests.
No new test is added in this patch. Part of the compatibility is checked by existing tests exercising getFileChecksum(). More tests are coming with HDFS-3177.
> -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common:
> org.apache.hadoop.ha.TestZKFailoverController
This is a known issue. See HADOOP-8591.
- We cannot use EOFException in readFields(..) since it won't work if the serialized bytes are not at the end of the stream. How about we add a magic number to the new serialization so that it can distinguish the new and old serializations? For example,
//MD5MD5CRC32FileChecksum.java
final int MAGIC = 0xABCDEF12;

public void readFields(DataInput in) throws IOException {
  final int firstInt = in.readInt();
  if (firstInt != MAGIC) {
    //old serialization: crcType is CRC32, firstInt is bytesPerCRC
    crcType = DataChecksum.Type.CRC32;
    bytesPerCRC = firstInt;
    crcPerBlock = in.readLong();
    md5 = MD5Hash.read(in);
  } else {
    //new serialization includes crcType
    crcType = DataChecksum.Type.valueOf(in.readInt());
    bytesPerCRC = in.readInt();
    crcPerBlock = in.readLong();
    md5 = MD5Hash.read(in);
  }
}

public void write(DataOutput out) throws IOException {
  if (crcType != DataChecksum.Type.CRC32) {
    //for non-CRC32 type, write magic and crcType
    out.writeInt(MAGIC);
    out.writeInt(crcType.id);
  }
  out.writeInt(bytesPerCRC);
  out.writeLong(crcPerBlock);
  md5.write(out);
}
If we decide to do it, we should do the same for XML and JSON.
- Need to update org.apache.hadoop.hdfs.web.JsonUtil. I guess you may already have a plan to do it separately.
I think XML is fine. XML parsing is done at the document level, so we can safely find out or ignore the existence of the extra parameter and not worry about the size of data. I tried calling getFileChecksum() over Hftp between a patched 0.23 cluster and a 1.0.x cluster, and it worked fine both ways.
The change you suggested does not solve the whole problem. The magic number acts like a simple binary length field: its presence or absence tells you how much data you need to read. So the read side of the patched version works even when reading from an unpatched version, but the reverse is not true; the unpatched version will always leave something unread in the stream. XML is nice in that it inherently has begin and end markers and is not sensitive to size changes.
Since JsonUtil depends on these serialization/deserialization methods, I don't think it can obtain bidirectional compatibility by modifying only one side. If it had used XML and did not do the length check, it would have no such problem. A fully JSON-ized approach could have worked as well.
One approach I can think of is to leave the current readFields()/write() methods unchanged. I think only WebHdfs is using them, and if that is true, we can make WebHdfs send and receive everything in JSON format and keep the current "bytes" JSON field as is. When it does not find the "new" fields from an old data source, it can do the old deserialization on "bytes". Similarly, it should send everything as individual JSON fields as well as the old serialized "bytes".
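Roughly, the read side of that idea could look like the following sketch ("crcType" here is just a placeholder for whatever new field names we choose; "bytes" is the existing field carrying the old serialized form):

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.fs.MD5MD5CRC32FileChecksum;
import org.apache.hadoop.util.StringUtils;

// Sketch only, not the actual WebHDFS schema or JsonUtil code.
static MD5MD5CRC32FileChecksum fromJson(Map<?, ?> json) throws IOException {
  final MD5MD5CRC32FileChecksum checksum = new MD5MD5CRC32FileChecksum();
  if (json.get("crcType") == null) {
    // Old data source: only the serialized "bytes" field is present, so fall
    // back to the existing readFields() deserialization.
    final byte[] bytes = StringUtils.hexStringToByte((String) json.get("bytes"));
    checksum.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
  } else {
    // New data source: the individual JSON fields are available and could be
    // used directly (construction from them is omitted in this sketch).
  }
  return checksum;
}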
It may be better to move the JSON util methods to MD5MD5CRC32FileChecksum.java, since they will have to know the internals of MD5MD5CRC32FileChecksum.
> It may be better to move the JSON util methods to MD5MD5CRC32FileChecksum.java, since they will have to know the internals of MD5MD5CRC32FileChecksum.
On a second thought, the JsonUtil is in the o.h.a.hdfs.web package, so I will leave it in there.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12541504/hadoop-8239-before-hadoop-8240.patch.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1328//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1328//console
This message is automatically generated.
> One approach I can think of is to leave the current readFields()/write() methods unchanged. I think only WebHdfs is using it and if that is true, we can make WebHdfs actually send and receive everything in JSON format and keep the current "bytes" Json field as is.
FileChecksum is designed to support different kinds of checksum algorithms, so it has the following abstract methods:

public abstract String getAlgorithmName();
public abstract int getLength();
public abstract byte[] getBytes();
WebHDFS FileChecksum schema has fields corresponding to these methods.
With FileChecksum, clients like WebHDFS could obtain the corresponding checksum by first getting the checksum algorithm name and then computing the bytes. If we add MD5MD5CRC32FileChecksum specific fields to the JSON format, then it is harder to support other algorithms and harder to specify the WebHDFS API since we have to specify the cases for each algorithm in the API.
For our task here, we are actually adding new algorithms, since we have to change the algorithm name for different CRC types. So we may as well add new classes to handle them instead of changing MD5MD5CRC32FileChecksum. BTW, the name "MD5MD5CRC32FileChecksum" is not suitable for the other CRC type because it has "CRC32" in it. Thoughts?
I think adding a new class is a good idea. Since DFS.getFileChecksum() is expected to return MD5MD5CRC32FileChecksum in a lot of places, subclassing MD5MD5CRC32FileChecksum for each variant could work.
We can regard "CRC32" in MD5MD5CRC32FileChecksum as a generic term for any 32 bit CRC algorithms. At least that is the case in current 2.0/trunk. If we go with this, subclassing MD5MD5CRC32FileChecksum for each variant makes sense.
The following is what I am thinking:
In MD5MD5CRC32FileChecksum:
- The constructor sets crcType to DataChecksum.Type.CRC32.
- Add the following methods:

  /**
   * getAlgorithmName() will use it to construct the name
   */
  private DataChecksum.Type getCrcType() {
    return crcType;
  }

  public ChecksumOpt getChecksumOpt() {
    return new ChecksumOpt(getCrcType(), bytesPerCRC);
  }
In subclass MD5MD5CRC32GzipFileChecksum:
- The constructor sets crcType to DataChecksum.Type.CRC32.

In subclass MD5MD5CRC32CastagnoliFileChecksum:
- The constructor sets crcType to DataChecksum.Type.CRC32C.
Interoperability & compatibility
- Any existing user/hadoop code that expects MD5MD5CRC32FileChecksum from DFS.getFileChecksum() will continue to work.
- Any new code that makes use of the new getChecksumOpt() will work as long as DFSClient#getFileChecksum() creates and returns the right object. This will be done in HDFS-3177; without it, everything will default to CRC32, which is the current behavior of branch-2/trunk.
- A newer client calling getFileChecksum() against an old cluster over hftp or webhdfs will work (always CRC32).
- An older client calling getFileChecksum() against a newer cluster: if the remote file on the newer cluster is in CRC32, both hftp and webhdfs work. If it is CRC32C or anything else, hftp will hit a checksum mismatch. Over webhdfs, the client will get an algorithm field that won't match anything the old MD5MD5CRC32FileChecksum can create, and WebHdfsFileSystem will throw an IOException, "Algorithm not matched:...".
I think this is reasonable. What do you think?
The new patch adds a separate class for each checksum type used in MD5MD5CRC32FileChecksum.
MD5MD5CRC32FileChecksum has the new getCrcType() and the subclasses override it.
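For reference, a minimal sketch of the subclass shape being described (the constructor signature is assumed to mirror the existing MD5MD5CRC32FileChecksum(int, long, MD5Hash) constructor; the Gzip variant is analogous but returns CRC32):

import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.util.DataChecksum;

// Sketch only; shared state stays in the super class.
public class MD5MD5CRC32CastagnoliFileChecksum extends MD5MD5CRC32FileChecksum {
  public MD5MD5CRC32CastagnoliFileChecksum(int bytesPerCRC, long crcPerBlock,
      MD5Hash md5) {
    super(bytesPerCRC, crcPerBlock, md5);  // no duplicated fields in the subclass
  }

  @Override
  public DataChecksum.Type getCrcType() {
    return DataChecksum.Type.CRC32C;       // Castagnoli polynomial (CRC32C)
  }
}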
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12541629/hadoop-8239-trunk-branch2.patch.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 6 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.ha.TestZKFailoverController
org.apache.hadoop.hdfs.TestDataTransferKeepalive
org.apache.hadoop.hdfs.TestDistributedFileSystem
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1329//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/1329//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1329//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12541651/hadoop-8239-trunk-branch2.patch.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1332//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1332//console
This message is automatically generated.
> -1 tests included. The patch doesn't appear to include any new or modified tests.
Additional test cases will be in HDFS-3177.
Hi Kihwal,
Forgot to add the new files? The new classes are not found in the patch.
The previous Jenkins build did not test your patch. It seems that it failed to download the patch file. The patch file shown in Build Artifacts had zero bytes. However, Jenkins still ran the build as usual. I filed HADOOP-8714 for this problem.
I looked at your previous two patches. Some comments:
- DataChecksum.MIXED is not used. Why do we need it? Could we add it later?
- MD5MD5CRC32GzipFileChecksum and MD5MD5CRC32CastagnoliFileChecksum should not have the following fields.
+ private int bytesPerCRC;
+ private long crcPerBlock;
+ private MD5Hash md5;
They should use the fields in the super class. Also the constructors should call the super class constructors.
> MD5MD5CRC32GzipFileChecksum and MD5MD5CRC32CastagnoliFileChecksum should not have the following fields.
The last patch is supposed to fix this, but the files were not added. Sorry about that.
> DataChecksum.MIXED is not used. Why do we need it? Could we add it later?
Any file system implementation that uses MD5MD5CRC32FileChecksum will need it: a file can contain blocks with different checksum types. This is not desired, but at least we should be able to detect it. So I think it belongs here and will be used by HDFS-3177.
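To make the intent concrete, a hypothetical helper in that spirit (not part of the patch) could reduce per-block CRC types to a per-file type and fall back to MIXED on disagreement:

import java.util.List;
import org.apache.hadoop.util.DataChecksum;

// Illustrative helper only: derive a file-level CRC type from the CRC types
// reported for each block, flagging inconsistent files as MIXED.
static DataChecksum.Type combineCrcTypes(List<DataChecksum.Type> blockTypes) {
  DataChecksum.Type fileType = null;
  for (DataChecksum.Type t : blockTypes) {
    if (fileType == null) {
      fileType = t;                      // first block sets the candidate type
    } else if (fileType != t) {
      return DataChecksum.Type.MIXED;    // blocks disagree; report it explicitly
    }
  }
  return fileType;                       // null only if the file has no blocks
}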
I will post the corrected patch.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12541715/hadoop-8239-trunk-branch2.patch.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.TestClientReportBadBlock
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/1338//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/1338//console
This message is automatically generated.
Integrated in Hadoop-Hdfs-trunk-Commit #2675 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2675/)
HADOOP-8239. Add subclasses of MD5MD5CRC32FileChecksum to support file checksum with CRC32C. Contributed by Kihwal Lee (Revision 1375450)
Result = SUCCESS
szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1375450
Files :
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32CastagnoliFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32GzipFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DataChecksum.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/JsonUtil.java
Integrated in Hadoop-Common-trunk-Commit #2611 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2611/)
HADOOP-8239. Add subclasses of MD5MD5CRC32FileChecksum to support file checksum with CRC32C. Contributed by Kihwal Lee (Revision 1375450)
Result = SUCCESS
szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1375450
Files :
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32CastagnoliFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32GzipFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DataChecksum.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/JsonUtil.java
Integrated in Hadoop-Mapreduce-trunk-Commit #2640 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2640/)
HADOOP-8239. Add subclasses of MD5MD5CRC32FileChecksum to support file checksum with CRC32C. Contributed by Kihwal Lee (Revision 1375450)
Result = FAILURE
szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1375450
Files :
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32CastagnoliFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32GzipFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DataChecksum.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/JsonUtil.java
Integrated in Hadoop-Hdfs-trunk #1141 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1141/)
HADOOP-8239. Add subclasses of MD5MD5CRC32FileChecksum to support file checksum with CRC32C. Contributed by Kihwal Lee (Revision 1375450)
Result = FAILURE
szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1375450
Files :
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32CastagnoliFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32GzipFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DataChecksum.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/JsonUtil.java
Integrated in Hadoop-Mapreduce-trunk #1173 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1173/)
HADOOP-8239. Add subclasses of MD5MD5CRC32FileChecksum to support file checksum with CRC32C. Contributed by Kihwal Lee (Revision 1375450)
Result = FAILURE
szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1375450
Files :
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32CastagnoliFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32GzipFileChecksum.java
- /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DataChecksum.java
- /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/JsonUtil.java
Integrated in Hadoop-Hdfs-0.23-Build #351 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/351/)
HADOOP-8239. Add subclasses of MD5MD5CRC32FileChecksum to support file checksum with CRC32C. (Kihwal Lee via szetszwo) (Revision 1375834)
Result = SUCCESS
tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1375834
Files :
- /hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common/CHANGES.txt
- /hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32CastagnoliFileChecksum.java
- /hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
- /hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32GzipFileChecksum.java
- /hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DataChecksum.java
- /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/JsonUtil.java
HDFS-3176 must be fixed first. Otherwise, webhdfs will break with this patch.