Hadoop Common
  1. Hadoop Common
  2. HADOOP-7178

FileSystem should have an option to control the .crc file creations at Local.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: fs
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      When we copy the files from DFS to local, it is creating the .crc files in local filesystem for the verification of checksum. When user dont want to do any check sum verifications, this files will not be useful.

      Command line already has an option ignoreCrc, to control this.
      So, we should have the similar option on FileSystem also..... like fs.ignoreCrc().
      This should set the setVerifyChecksum to false and also should select the non CheckSumFileSystem as local fs.

      1. HADOOP-7178_COMMON.patch
        4 kB
        Uma Maheswara Rao G
      2. HADOOP-7178_HDFS.patch
        3 kB
        Uma Maheswara Rao G
      3. HADOOP-7178.1.patch
        4 kB
        Uma Maheswara Rao G
      4. HADOOP-7178.2.patch
        3 kB
        Uma Maheswara Rao G
      5. HADOOP-7178.3.patch
        3 kB
        Uma Maheswara Rao G
      6. HADOOP-7178.patch
        4 kB
        Uma Maheswara Rao G

        Activity

        Hide
        Uma Maheswara Rao G added a comment -

        FileSystem already has the API setVerifyChecksum. As per current implementation this flag will be set in DistributedFilesystem.Based on this flag, checksum verification will happen. Bydefault LocalFileSystem will be selected as local file system . But LocalFileSystem will create crc files in local disk in all the cases because it is CheckSumFileSystem.
        But when user explicitly set the verifyCheckSum flag to false, then LocalFileSystem will create the crc files in local disk.

        One Option to solve this issue is:
        We can provide an API in FileSystem which will be the getter method(getVerifyChecksum) for setVerifyChecksum. DistributedFileSystem can override this method.

        In Filesystem’s copyToLocalFile API, we can use this flag status.
        If this flag is disabled, then we can select RawLocalFileSystem as local file system. This RawLocalFileSystem will not create any of the CRC files.

        Show
        Uma Maheswara Rao G added a comment - FileSystem already has the API setVerifyChecksum. As per current implementation this flag will be set in DistributedFilesystem.Based on this flag, checksum verification will happen. Bydefault LocalFileSystem will be selected as local file system . But LocalFileSystem will create crc files in local disk in all the cases because it is CheckSumFileSystem. But when user explicitly set the verifyCheckSum flag to false, then LocalFileSystem will create the crc files in local disk. One Option to solve this issue is: We can provide an API in FileSystem which will be the getter method(getVerifyChecksum) for setVerifyChecksum. DistributedFileSystem can override this method. In Filesystem’s copyToLocalFile API, we can use this flag status. If this flag is disabled, then we can select RawLocalFileSystem as local file system. This RawLocalFileSystem will not create any of the CRC files.
        Hide
        Uma Maheswara Rao G added a comment -

        Implemented the getVerifyChecksum method in DistributedFileSystem. Provided the patch for the changes in HDFS. So, HADOOP-7178_HDFS.patch need to be updated in HDFS project.

        Show
        Uma Maheswara Rao G added a comment - Implemented the getVerifyChecksum method in DistributedFileSystem. Provided the patch for the changes in HDFS. So, HADOOP-7178 _HDFS.patch need to be updated in HDFS project.
        Hide
        Daryn Sharp added a comment -

        I'm not an expert on the filesystem code, but I have a few questions/concerns.

        Should the code really be casting to a ChecksumFileSystem w/o first checking instanceof?

        Should the default getVerifyChecksum() return instanceof ChecksumFileSystem instead of always true? That may help prevent changing other filesystem classes which may do not implement setVerifyChecksum.

        It looks like more fs classes may need to be changed than just Distributed. Other filesystem appear to have private booleans for verifyChecksum. The ChRoot will need to reach into its embedded fs.

        Show
        Daryn Sharp added a comment - I'm not an expert on the filesystem code, but I have a few questions/concerns. Should the code really be casting to a ChecksumFileSystem w/o first checking instanceof? Should the default getVerifyChecksum() return instanceof ChecksumFileSystem instead of always true? That may help prevent changing other filesystem classes which may do not implement setVerifyChecksum. It looks like more fs classes may need to be changed than just Distributed. Other filesystem appear to have private booleans for verifyChecksum. The ChRoot will need to reach into its embedded fs.
        Hide
        Uma Maheswara Rao G added a comment -

        Hi Daryn Sharp,
        Thanks for your comments,

         
        Should the code really be casting to a ChecksumFileSystem w/o first checking instanceof?
        

        For CopyToLocalFile api, we will select LocalFilesystem as local file system. Here gteLocal() api used to get that local file system.This API will always return LocalFileSystem which extends CheckSumFileFystem
        So, instanceOf check not required.

         
        Should the default getVerifyChecksum() return instanceof ChecksumFileSystem instead of always true? That may help prevent changing other filesystem classes which may do not implement setVerifyChecksum.
        

        Here By deafult local file system will be LocalFileSystem. This will create crc files. So, I put default as true. If user wanted to disable it, he will provide the implementation for setVerifyChecksum() and getVerifyChecksum() like DistributedFileSystem.

          
        It looks like more fs classes may need to be changed than just Distributed. Other filesystem appear to have private booleans for verifyChecksum. 
        

        Here we are not forcing implementation classes to implement this methods. If Concrete classes does not want to manage this checkSumVerification flag then default implemenation will work.

        Show
        Uma Maheswara Rao G added a comment - Hi Daryn Sharp, Thanks for your comments, Should the code really be casting to a ChecksumFileSystem w/o first checking instanceof? For CopyToLocalFile api, we will select LocalFilesystem as local file system. Here gteLocal() api used to get that local file system.This API will always return LocalFileSystem which extends CheckSumFileFystem So, instanceOf check not required. Should the default getVerifyChecksum() return instanceof ChecksumFileSystem instead of always true? That may help prevent changing other filesystem classes which may do not implement setVerifyChecksum. Here By deafult local file system will be LocalFileSystem. This will create crc files. So, I put default as true. If user wanted to disable it, he will provide the implementation for setVerifyChecksum() and getVerifyChecksum() like DistributedFileSystem. It looks like more fs classes may need to be changed than just Distributed. Other filesystem appear to have private booleans for verifyChecksum. Here we are not forcing implementation classes to implement this methods. If Concrete classes does not want to manage this checkSumVerification flag then default implemenation will work.
        Hide
        Daryn Sharp added a comment -

        The getlocal method is returning a LocalFileSystem, so if "fs" is declared as such, could the cast be removed?

        Regarding the getVerifyChecksum(), my concern is it's semantically wrong for all filesystems to claim (by default) that checksum files are supported. The default should be the truth. Granted, the implementation is backwards compatible, but future callers may operation on false assumptions of crc files.

        Would it be better to default as false in FileSystem, and ChecksumFileSystem overrides for true? I think this would also retain backwards compatibility?

        Show
        Daryn Sharp added a comment - The getlocal method is returning a LocalFileSystem, so if "fs" is declared as such, could the cast be removed? Regarding the getVerifyChecksum(), my concern is it's semantically wrong for all filesystems to claim (by default) that checksum files are supported. The default should be the truth. Granted, the implementation is backwards compatible, but future callers may operation on false assumptions of crc files. Would it be better to default as false in FileSystem, and ChecksumFileSystem overrides for true? I think this would also retain backwards compatibility?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12473555/HADOOP-7178_HDFS.patch
        against trunk revision 1094750.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/360//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12473555/HADOOP-7178_HDFS.patch against trunk revision 1094750. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/360//console This message is automatically generated.
        Hide
        Todd Lipcon added a comment -

        I don't agree with the intent of this issue. Adding a new flag to turn on/off the writing of checksums sounds reasonable – but disabling the writing of checksums based on whether you asked to verify it on the way out of HDFS seems like an unrelated choice.

        Show
        Todd Lipcon added a comment - I don't agree with the intent of this issue. Adding a new flag to turn on/off the writing of checksums sounds reasonable – but disabling the writing of checksums based on whether you asked to verify it on the way out of HDFS seems like an unrelated choice.
        Hide
        Todd Lipcon added a comment -

        Canceling patch available status for discussion.

        Show
        Todd Lipcon added a comment - Canceling patch available status for discussion.
        Hide
        Uma Maheswara Rao G added a comment -

        Hi Todd,

        Thanks for the comments.

        I don't agree with the intent of this issue. Adding a new flag to turn on/off the writing of checksums sounds reasonable – but disabling the writing of checksums based on whether you asked to verify it on the way out of HDFS seems like an unrelated choice.

        Agree . As you said, Using setVerifyChecksum api to control the checksum outs may not be correct choice.
        To maintain the consistancy, we can provide one API on FileSystem, i.e, fs.ignoreCRC().
        Internally this API should take care of setting the flag setVerifyChecksum to false and also should select the non CheckSumFileSystem as local file system. I think, this should be fine.

        Initially we thought like, why we need to create the checksum out when user don't want to do any checksum related verification. this may be little bit confusion when we look at the api setVerifyChecksum. I will update the issue title as well.

        Please give your opinion also.

        As i know this particular behavior already available with command line.
        This issue mainly for selecting the local side file system based on the flag. Copying the checksum outs will be depending on the fileSystem which it selects at that point.

        This is handled in command line options by providing the option 'ignoreCrc' this flag also will set the setVerifyChecksum to false on FileSystem.

        Show
        Uma Maheswara Rao G added a comment - Hi Todd, Thanks for the comments. I don't agree with the intent of this issue. Adding a new flag to turn on/off the writing of checksums sounds reasonable – but disabling the writing of checksums based on whether you asked to verify it on the way out of HDFS seems like an unrelated choice. Agree . As you said, Using setVerifyChecksum api to control the checksum outs may not be correct choice. To maintain the consistancy, we can provide one API on FileSystem, i.e, fs.ignoreCRC(). Internally this API should take care of setting the flag setVerifyChecksum to false and also should select the non CheckSumFileSystem as local file system. I think, this should be fine. Initially we thought like, why we need to create the checksum out when user don't want to do any checksum related verification. this may be little bit confusion when we look at the api setVerifyChecksum. I will update the issue title as well. Please give your opinion also. As i know this particular behavior already available with command line. This issue mainly for selecting the local side file system based on the flag. Copying the checksum outs will be depending on the fileSystem which it selects at that point. This is handled in command line options by providing the option 'ignoreCrc' this flag also will set the setVerifyChecksum to false on FileSystem.
        Hide
        Daryn Sharp added a comment -

        I'm not sure I understand the entire inheritance of the FileSystem classes, but... Is it unreasonable for a client app to use getRawFileSystem() to avoid crc files?

        Show
        Daryn Sharp added a comment - I'm not sure I understand the entire inheritance of the FileSystem classes, but... Is it unreasonable for a client app to use getRawFileSystem() to avoid crc files?
        Hide
        Uma Maheswara Rao G added a comment -

        Hi Daryn,

        Thanks a lot for taking a look on this issue.

        Client app will deal with FileSystem APIs only right?

        public void copyToLocalFile(boolean delSrc, Path src, Path dst)
            throws IOException {
            FileUtil.copy(this, src, getLocal(getConf()), dst, delSrc, getConf());
          }
        

        But here it will select LocalFileSystem always.

        What my idea is, if client app calls ignoreCrc on fs, then it sould internally set verifyCheckSum to false and also selects the RawLocalFileSystem instead of LocalFileSystem.

        Give me your opinion on this.

        Show
        Uma Maheswara Rao G added a comment - Hi Daryn, Thanks a lot for taking a look on this issue. Client app will deal with FileSystem APIs only right? public void copyToLocalFile( boolean delSrc, Path src, Path dst) throws IOException { FileUtil.copy( this , src, getLocal(getConf()), dst, delSrc, getConf()); } But here it will select LocalFileSystem always. What my idea is, if client app calls ignoreCrc on fs, then it sould internally set verifyCheckSum to false and also selects the RawLocalFileSystem instead of LocalFileSystem. Give me your opinion on this.
        Hide
        Uma Maheswara Rao G added a comment -

        Updated a patch for review!

        Show
        Uma Maheswara Rao G added a comment - Updated a patch for review!
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12488007/HADOOP-7178.patch
        against trunk revision 1151505.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.cli.TestCLI
        org.apache.hadoop.fs.TestFilterFileSystem
        org.apache.hadoop.fs.TestFSMainOperationsLocalFileSystem
        org.apache.hadoop.fs.viewfs.TestFSMainOperationsLocalFileSystem

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/773//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/773//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/773//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12488007/HADOOP-7178.patch against trunk revision 1151505. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.cli.TestCLI org.apache.hadoop.fs.TestFilterFileSystem org.apache.hadoop.fs.TestFSMainOperationsLocalFileSystem org.apache.hadoop.fs.viewfs.TestFSMainOperationsLocalFileSystem +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/773//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/773//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/773//console This message is automatically generated.
        Hide
        Uma Maheswara Rao G added a comment -

        Looks CLI test failures are due to HADOOP-7485 issue.

        Show
        Uma Maheswara Rao G added a comment - Looks CLI test failures are due to HADOOP-7485 issue.
        Hide
        Daryn Sharp added a comment -

        The unrelated HADOOP-7378 (other than it touched the same file) should have fixed the test failure.

        Show
        Daryn Sharp added a comment - The unrelated HADOOP-7378 (other than it touched the same file) should have fixed the test failure.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12488016/HADOOP-7178.1.patch
        against trunk revision 1151594.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/775//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/775//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/775//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12488016/HADOOP-7178.1.patch against trunk revision 1151594. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/775//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/775//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/775//console This message is automatically generated.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        Since only FileSystem.copyToLocalFile(..) uses it, how about adding it as a parameter?

        BTW, it may be better to rename it to useRawLocalFileSystem.

        i.e.

           public void copyToLocalFile(boolean delSrc, Path src, Path dst) {
             copyToLocalFile(delSrc, src, dst, false);
           }
        
           public void copyToLocalFile(boolean delSrc, Path src, Path dst, boolean useRawLocalFileSystem) {
             ...
           }
        
        Show
        Tsz Wo Nicholas Sze added a comment - Since only FileSystem.copyToLocalFile(..) uses it, how about adding it as a parameter? BTW, it may be better to rename it to useRawLocalFileSystem. i.e. public void copyToLocalFile( boolean delSrc, Path src, Path dst) { copyToLocalFile(delSrc, src, dst, false ); } public void copyToLocalFile( boolean delSrc, Path src, Path dst, boolean useRawLocalFileSystem) { ... }
        Hide
        Uma Maheswara Rao G added a comment -

        Hi Nicholas,

        Thanks alot for taking a look on this issue.
        If you suggest to add API, i am open for that!

        reason Why i added ignoreCrc api is, commandline already has the option to specify ignoreCrc. Just to make it sync i proposed ignoreCrc in FileSystem also.

        your idea also good. I will update the patch.

        --Thanks

        Show
        Uma Maheswara Rao G added a comment - Hi Nicholas, Thanks alot for taking a look on this issue. If you suggest to add API, i am open for that! reason Why i added ignoreCrc api is, commandline already has the option to specify ignoreCrc. Just to make it sync i proposed ignoreCrc in FileSystem also. your idea also good. I will update the patch. --Thanks
        Hide
        Uma Maheswara Rao G added a comment -

        Hi Nicholas,

        Updated the patch for review!

        --Thanks

        Show
        Uma Maheswara Rao G added a comment - Hi Nicholas, Updated the patch for review! --Thanks
        Hide
        Tsz Wo Nicholas Sze added a comment -

        The patch adds an empty line.

        @@ -2282,4 +2313,5 @@
                                  ": " + pair.getValue());
             }
           }
        +
         }
        

        It looks good other than that.

        Please run "ant test" and "ant test-patch" manually.

        Show
        Tsz Wo Nicholas Sze added a comment - The patch adds an empty line. @@ -2282,4 +2313,5 @@ ": " + pair.getValue()); } } + } It looks good other than that. Please run "ant test" and "ant test-patch" manually.
        Hide
        Uma Maheswara Rao G added a comment -

        Hi Nicholas,

        Thanks a lot for taking a look.

        Removed the empty line.

        Ant test patch results:
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        [exec]
        [exec] +1 system test framework. The patch passed system test framework compile.
        [exec]
        [exec]
        [exec] ======================================================================
        [exec] ======================================================================
        [exec] Finished build.
        [exec] ======================================================================
        [exec] ======================================================================
        [exec]

        I verified manually, Tests are passing.

        Show
        Uma Maheswara Rao G added a comment - Hi Nicholas, Thanks a lot for taking a look. Removed the empty line. Ant test patch results: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 system test framework. The patch passed system test framework compile. [exec] [exec] [exec] ====================================================================== [exec] ====================================================================== [exec] Finished build. [exec] ====================================================================== [exec] ====================================================================== [exec] I verified manually, Tests are passing.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good

        I have committed this. Thanks, Uma!

        Show
        Tsz Wo Nicholas Sze added a comment - +1 patch looks good I have committed this. Thanks, Uma!

          People

          • Assignee:
            Uma Maheswara Rao G
            Reporter:
            Uma Maheswara Rao G
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development