Hadoop Common / HADOOP-8060

Add a capability to discover and set checksum types per file.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0, 0.23.1, 0.24.0
    • Fix Version/s: 0.23.3, 2.0.2-alpha
    • Component/s: fs, util
    • Labels:
      None

      Description

      After the improved CRC32C checksum feature became the default, some use cases involving data movement are no longer supported. For example, when running DistCp to copy a file stored with the CRC32 checksum to a new cluster where CRC32C is the default checksum, the final data integrity check fails because of the mismatch in checksums.
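To illustrate the failure mode, the sketch below (plain JDK, not Hadoop code) checksums identical bytes with the CRC32 and CRC32C classes from java.util.zip. The two variants use different polynomials, so a naive source-vs-destination checksum comparison reports a mismatch even though the contents are identical:

```java
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public class ChecksumMismatch {
    static long value(Checksum c, byte[] data) {
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "identical file contents".getBytes();
        // Same bytes, different checksum algorithms: the values disagree,
        // so a post-copy integrity check comparing them fails.
        System.out.println(value(new CRC32(), data) == value(new CRC32C(), data));
    }
}
```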

        Issue Links

          Activity

          Kihwal Lee added a comment -

          The post-copy check is done by comparing the results from getFileChecksum(). The getFileChecksum() method is also used by some tools to check whether the destination copy needs to be updated. If a copy of the same content can have a different checksum type than the source, these checks can no longer be used. Staying with CRC32 is a workaround, but this precludes moving to the better-performing CRC32C checksum.

          One of the least invasive approaches is to follow one principle: allow the source checksum type to be used for the destination in a mixed-checksum environment. If the default is CRC32C, all newly created content will use CRC32C, but existing data with CRC32 will stay with CRC32 even after DistCp. This allows gradual migration to CRC32C.

          This approach requires the following capabilities:

          • Clients should be able to find out the checksum type of existing data.
          • Clients should be able to tell data nodes which checksum type to use for write.

          Without append, these operations can be done at the file level. But if append is used, a file can contain more than one checksum type (see HDFS-2130 for details), which forces the above operations to be performed for every block. However, exposing block-level detail is not desirable for the FileSystem abstraction.

          I propose we add a configurable feature to make append() follow the existing checksum method. For zero-byte files, the default is used. For non-zero-byte files, checking the first block is sufficient. Expose this information to clients so that they can use it to specify the write checksum type. There will be additional setup time at the beginning of append(). For this reason, we want to keep the existing append behavior as the default and add this new behavior as an option. Or maybe the other way around.

          As for exposing the checksum type information, we may add a getFileChecksum method that returns the checksum and type for the first n bytes of a file. For small n's, it requires contacting only one data node. This method can have other uses, such as a quick content-version check when the header of the file is guaranteed to be different for different versions.
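The prefix-checksum idea above can be modeled with plain JDK classes (the proposed getFileChecksum(n) method name is hypothetical; here we just compute a CRC32C over the first n bytes locally). Because the headers of the two versions differ within the prefix, the cheap prefix checksum distinguishes them:

```java
import java.util.zip.CRC32C;

public class PrefixChecksum {
    // Checksum of the first n bytes only; for a small n on HDFS, only the
    // first block's data node would need to be contacted.
    static long prefixChecksum(byte[] file, int n) {
        CRC32C c = new CRC32C();
        c.update(file, 0, Math.min(n, file.length));
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] v1 = "VERSION-1 payload...".getBytes();
        byte[] v2 = "VERSION-2 payload...".getBytes();
        // Headers differ inside the first 9 bytes, so the checksums differ.
        System.out.println(prefixChecksum(v1, 9) == prefixChecksum(v2, 9));  // false
    }
}
```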

          For create/writes, setting dfs.checksum.type works, but with the FileSystem cache on, the checksum type used for creating an FSDataOutputStream won't change. For data-copy apps that need to switch the checksum type, fs.<fs name>.impl.disable.cache may be set to get a unique instance every time. When dealing with a long list of files, call close() for each instance to avoid bloat and OOM.
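As a concrete illustration of the workaround above, a client-side configuration along these lines could be used (fs.hdfs.impl.disable.cache is the HDFS instantiation of the fs.<fs name>.impl.disable.cache pattern; whether these exact settings suffice depends on the Hadoop version in use):

```xml
<!-- Request CRC32 for newly created files instead of the CRC32C default. -->
<property>
  <name>dfs.checksum.type</name>
  <value>CRC32</value>
</property>
<!-- Bypass the FileSystem cache so each get() returns a fresh instance
     that actually picks up the conf above; remember to close() each one. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```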

          This is my rough idea, which I have implemented partially so far. An HDFS subtask may be created if the changes in common and hdfs are not interdependent. Any feedback is appreciated.

          Kihwal Lee added a comment -

          Sorry for the delay on this. I will get a set of initial patches up soon. But here is one design decision I have to make and would appreciate any input on this.

          We need a way to specify the checksum type for create(). Currently the checksum type used for creating DFSOutputStream is set based on dfs.checksum.type when a DFSClient object is created. If there is no file system cache, users can dictate the checksum type for a new file by setting dfs.checksum.type properly. This does not work when the file system cache is on. The following is why:

          • A DFSClient instance can be shared by many threads, so changing the shared class variable can result in unpredictable behaviors.
          • The FileSystem cache is only keyed on the scheme/authority and UGI. The DFSClient object that was created by a DFS instance in the cache will retain the conf that was used to instantiate it. If the same UGI is used, this DFSClient will be used for all threads that access the same HDFS cluster. In this case the threads cannot even change the behavior of DFSClient by changing conf settings, even if we modify DFSClient so that it reads dfs.checksum.type dynamically during create().

          Turning the cache off is not an option due to the potential resource exhaustion issues on various parts of the system.
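The cache behavior described above can be modeled with a small self-contained sketch (not Hadoop code; the key fields mirror what the thread says FileSystem.Cache.Key uses). The second lookup passes a different conf, but because conf is not part of the key, the first cached instance (and its checksum type) is returned:

```java
import java.util.HashMap;
import java.util.Map;

public class FsCacheModel {
    // Toy cache key: scheme/authority and UGI only; conf is deliberately absent.
    record Key(String scheme, String authority, String ugi) {}

    static class Fs {
        final String checksumType;
        Fs(String type) { this.checksumType = type; }
    }

    static final Map<Key, Fs> CACHE = new HashMap<>();

    static Fs get(String scheme, String authority, String ugi, Map<String, String> conf) {
        // conf is consulted only when the instance is first created.
        return CACHE.computeIfAbsent(new Key(scheme, authority, ugi),
                k -> new Fs(conf.getOrDefault("dfs.checksum.type", "CRC32C")));
    }

    public static void main(String[] args) {
        Fs a = get("hdfs", "nn:8020", "alice", Map.of("dfs.checksum.type", "CRC32"));
        Fs b = get("hdfs", "nn:8020", "alice", Map.of("dfs.checksum.type", "CRC32C"));
        System.out.println(a == b);          // true: same cached instance
        System.out.println(b.checksumType);  // CRC32: the second conf was ignored
    }
}
```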

          So far, this is the only way I came up with that does not involve a FileSystem API change: add checksum types to CreateFlag. The types are already defined in DataChecksum, so the changes are contained in common. I was initially very reluctant about this because I was comparing the flags to POSIX open flags. But it seems less objectionable once I realized the CreateFlag used for create() is nothing like the POSIX one.
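A toy sketch of the CreateFlag proposal (the CHECKSUM_* flag names and the checksumFor helper are illustrative, not Hadoop's API): the checksum type rides along in the existing EnumSet passed to create(), so no FileSystem method signature changes:

```java
import java.util.EnumSet;

public class CreateFlagSketch {
    // Hypothetical extension of the flag set with per-create checksum choices.
    enum CreateFlag { CREATE, OVERWRITE, APPEND, CHECKSUM_CRC32, CHECKSUM_CRC32C }

    static String checksumFor(EnumSet<CreateFlag> flags) {
        if (flags.contains(CreateFlag.CHECKSUM_CRC32))  return "CRC32";
        if (flags.contains(CreateFlag.CHECKSUM_CRC32C)) return "CRC32C";
        return "default";  // fall back to dfs.checksum.type
    }

    public static void main(String[] args) {
        EnumSet<CreateFlag> flags = EnumSet.of(CreateFlag.CREATE, CreateFlag.CHECKSUM_CRC32);
        System.out.println(checksumFor(flags));  // CRC32
    }
}
```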

          If I don't hear any other suggestion, I will prepare a set of patches based on this. There will be sub-tasks and a separate blocking jira.

          Kihwal Lee added a comment -

          > so the changes are contained in common

          Maybe this is confusing. The changes pertaining to the addition of checksum types to CreateFlag do not necessitate new dependencies external to Common. That is, they can be done in a separate jira and won't break anything on their own.

          Todd Lipcon added a comment -

          Hi Kihwal. What about making the checksum type part of the FileSystem cache key (like we do for UGI)? It seems like we would have similar problems with configurable timeouts, etc.

          Kihwal Lee added a comment -

          The ctor for Key already accepts conf, but it ignores it. It is easy to make conf a part of the key. If a user wants a new FS instance with some settings tweaked, s/he can clone/create a conf and do it. Will there be any significant negative effect from adding conf to the FS cache key elements?

          Kihwal Lee added a comment -

          What about making the checksum type part of the FileSystem cache key

          The checksum type is a dfs config item. We can't do that in FileSystem, which is in common. But FileSystem already has things like setVerifyChecksum() and getFileChecksum(). So we could make the checksum type a FileSystem-level config.

          To address the issue of dynamically configurable properties, we could introduce a file system config digest method, which is kind of like hashCode(). The tricky part will be to get the hdfs part of the formula added to Configuration when, say, HdfsConfiguration.init() is called. Or maybe having each file system implement a digest method is better.
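The config-digest idea can be sketched as follows (self-contained toy; the digest method and the idea that each file system contributes its own list of relevant keys are illustrative). The digest changes only when a key the file system cares about changes, so it could serve as a cache-key element without comparing whole conf objects:

```java
import java.util.Map;
import java.util.TreeMap;

public class ConfDigest {
    // Digest over only the keys a file system declares relevant,
    // conceptually similar to hashCode() but scoped and stable.
    static int digest(Map<String, String> conf, String... relevantKeys) {
        TreeMap<String, String> picked = new TreeMap<>();  // sorted for stability
        for (String k : relevantKeys) {
            if (conf.containsKey(k)) picked.put(k, conf.get(k));
        }
        return picked.hashCode();
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of("dfs.checksum.type", "CRC32",  "io.file.buffer.size", "4096");
        Map<String, String> b = Map.of("dfs.checksum.type", "CRC32C", "io.file.buffer.size", "4096");
        // Digests differ only when a relevant key differs.
        System.out.println(digest(a, "dfs.checksum.type") == digest(b, "dfs.checksum.type"));      // false
        System.out.println(digest(a, "io.file.buffer.size") == digest(b, "io.file.buffer.size"));  // true
    }
}
```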

          For this jira, I will just add conf as a part of the key. The equality check will be just a shallow comparison.

          Todd Lipcon added a comment -

          Doing shallow conf comparison as part of the FS key seems a bit dangerous – I'm guessing we'll end up with a lot of leakage issues in long running daemons like the NM/RM.

          Anyone else have some other ideas how to deal with this? I don't think the CreateFlag idea is bad – maybe better than futzing with the cache.

          Kihwal Lee added a comment -

          Created subtasks and specified dependency among them. HDFS-3176 should go in first to avoid any breakage until the commit of the hdfs portions.

          I will do more testing and add test cases. I would appreciate any feedback on the patches.

          Kihwal Lee added a comment -

          In HDFS-3177, Sanjay suggested that the one-checksum-type-per-file rule be enforced architecturally, rather than DFSClient enforcing it using existing facilities. The changes in HDFS-3177 still allow DistCp, etc., to discover and set checksum parameters so that the results of getFileChecksum() on copies can match. I will resolve this jira with a modified summary. I expect Sanjay to file a new jira when he has a proposal.


            People

            • Assignee:
              Kihwal Lee
              Reporter:
              Kihwal Lee
            • Votes:
              0
              Watchers:
              15
