Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha2
    • Component/s: None
    • Labels:
      None

      Description

      ZStandard (https://github.com/facebook/zstd) has been in production use at Facebook for six months, and v1.0 was recently released. Create a codec for this library.

      1. HADOOP-13578.patch
        73 kB
        churro morales
      2. HADOOP-13578.v1.patch
        73 kB
        churro morales
      3. HADOOP-13578.v2.patch
        65 kB
        churro morales
      4. HADOOP-13578.v3.patch
        147 kB
        churro morales
      5. HADOOP-13578.v4.patch
        151 kB
        churro morales
      6. HADOOP-13578.v5.patch
        107 kB
        churro morales
      7. HADOOP-13578.v6.patch
        111 kB
        churro morales
      8. HADOOP-13578.v7.patch
        107 kB
        churro morales
      9. HADOOP-13578.v8.patch
        112 kB
        churro morales
      10. HADOOP-13578.v9.patch
        112 kB
        churro morales
      11. HADOOP-13578-branch-2.v9.patch
        110 kB
        churro morales

        Issue Links

          Activity

          churromorales churro morales added a comment -

          I was wondering whether people are interested in getting ZStandard integrated into hadoop-mapreduce-client-nativetask. That is my only remaining task for trunk (since that code only resides there). I have backports ready for the 2.7 and 2.6 branches as well. Maybe we can split that feature into a separate JIRA ticket? If so, I can have patches up very soon. What are everyone's thoughts?

          churromorales churro morales added a comment - - edited

          This is the patch for trunk only. I can provide the backports once this patch is approved. I also didn't include ZStandard support in the mapreduce-nativetask module; that can be a separate ticket and patch, since it didn't seem like people were interested in that feature.

          jlowe Jason Lowe added a comment -

          Thanks for posting a patch! Here are some comments after a quick look; I'll try to take a more thorough look later this week:

          The BUILDING.txt addition must have been copied from snappy, since a lot of it still refers to snappy.

          There's a TODO in the code that should get sorted out:

             // TODO:RG Figure out the overhead, right now keeping it similar to zlib
          

          Shouldn't this come from IO_COMPRESSION_CODEC_ZSTD_LEVEL_DEFAULT that's already in the config keys?

          @VisibleForTesting static final int DEFAULT_COMPRESSION_LEVEL = 2;
          

          Speaking of default compression level, could you elaborate on why level 2 was chosen? The zstd CLI defaults to level 3.

          churromorales churro morales added a comment - - edited

          Jason Lowe Thank you for taking a look.

          The BUILDING.txt addition must have been copied from snappy, since a lot of it still refers to snappy.

          Yes, I copied and pasted that section of BUILDING.txt; I will take a look and fix it.

          There's a TODO in the code that should get sorted out:

          Yes, I think I have it figured out; I had to read up a bit. Since we compress the entire buffer at once, it is transformed into a single ZStandard frame.
          The overhead is therefore a magic number (4 bytes) + frame header (14 bytes max) + optional checksum (4 bytes), so I believe it can be a constant of 22 bytes because of the way we compress in Hadoop. I'll update that as well.
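
          For reference, zstd itself exposes ZSTD_compressBound(), which returns the worst-case compressed size (frame overhead included) for a given input size, so the codec could also derive its bound from the library instead of a hand-computed constant. A minimal C sketch, not the patch's actual code:

             #include <zstd.h>

             /* Worst-case compressed size for a one-shot compression of
              * 'uncompressed_len' bytes; ZSTD_compressBound() already accounts
              * for the magic number, frame header and block overhead. */
             size_t compressed_buffer_size(size_t uncompressed_len) {
                 return ZSTD_compressBound(uncompressed_len);
             }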

          Shouldn't this come from IO_COMPRESSION_CODEC_ZSTD_LEVEL_DEFAULT that's already in the config keys?

          Correct! I will change it back to 3; it must have been a typo, since the constant is 3. I will use the constant.

          I'll incorporate your fixes. Comments much appreciated.

          churromorales churro morales added a comment -

          Jason Lowe, just wanted to let you know that I have incorporated your changes, but I was waiting for a more detailed review before putting up a new patch. I don't want to spam you with too many patches.

          jlowe Jason Lowe added a comment -

          Sorry for the delay in getting a more detailed review. Before I delved deep into the code I ran the codec through some basic tests and found a number of problems.

          The native code compiles with warnings that should be cleaned up:

          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c: In function ‘Java_org_apache_hadoop_io_compress_zstd_ZStandardDecompressor_decompressBytes’:
          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:110: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:110: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c: In function ‘Java_org_apache_hadoop_io_compress_zstd_ZStandardDecompressor_decompressBytes’:
          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:110: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:110: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
          

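          As an aside, warnings like these typically come from printing a size_t with %d; the usual fix is %zu (C99) or an explicit cast. A minimal illustration, not the actual Hadoop source:

             #include <stdio.h>
             #include <stddef.h>

             void report_remaining(size_t remaining) {
                 /* printf("remaining: %d\n", remaining);  -- triggers the warning above */
                 printf("remaining: %zu\n", remaining);                  /* C99 */
                 printf("remaining: %lu\n", (unsigned long) remaining);  /* portable cast */
             }
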
          The codec is not working as an intermediate codec for MapReduce jobs. Running a wordcount job with -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec works, but specifying -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec causes the reducers to fail while fetching map outputs, complaining about a premature EOF:

          2016-10-03 13:51:32,140 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 about to shuffle output of map attempt_1475501532481_0007_m_000000_0 decomp: 323113 len: 93339 to MEMORY
          2016-10-03 13:51:32,149 WARN [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to shuffle for fetcher#5
          java.io.IOException: Premature EOF from inputStream
          	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:209)
          	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.doShuffle(InMemoryMapOutput.java:90)
          	at org.apache.hadoop.mapreduce.task.reduce.IFileWrappedMapOutput.shuffle(IFileWrappedMapOutput.java:63)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:536)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
          

          The codec also has some issues with MapReduce jobs when reading input from a previous job's zstd-compressed output. For example, this sequence of steps generates the output one would expect, where we're effectively word counting the output of wordcount on /etc/services (just some sample input for wordcount):

          $ hadoop fs -put /etc/services wcin
          $ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec wcin wcout-gzip
          $ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount wcout-gzip wcout-gzip2
          

          But if we do the same with org.apache.hadoop.io.compress.ZStandardCodec there's an odd record consisting of about 25K of NULs (i.e.: 0x00 bytes) in the output of the second job.

          The output of the ZStandardCodec is not readable by the zstd CLI utility, nor is output generated by the zstd CLI utility readable by ZStandardCodec.

          churromorales churro morales added a comment -

          Jason Lowe, thanks for the update. I will look into the issues. Regarding the first issue: were you using the hadoop-mapreduce-native-task code for intermediate compression for MR jobs? If so, I didn't implement that feature because I didn't know whether enough people were interested. As for the warnings, I will clean them up first, then run a wordcount job to verify that compression and decompression work correctly. I'll see if I can reproduce your wordcount results; that is a bit concerning. Thanks for the review, I'll take a look at it today or tomorrow.

          jlowe Jason Lowe added a comment -

          As far as the first issue were you using the hadoop-mapreduce-native-task code for intermediate compression for MR jobs?

          No, I don't believe so, unless trunk uses that native code by default. This was a straightforward MapReduce wordcount job with just the two settings I mentioned above, mapreduce.map.output.compress=true and mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec.

          churromorales churro morales added a comment -

          Jason Lowe, great catch! I figured out the issue: I was returning the wrong buffer length in the decompressBytes function in ZStandardDecompressor.c. I made the change locally and everything works. I'll fix the warnings, make sure all your test cases pass, and incorporate your earlier comments as well. Thanks again for the review; I really appreciate you taking the time to take a look.

          churromorales churro morales added a comment - - edited

          Ran the MapReduce jobs with:

          hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec -Dmapreduce.map.output.compress=true -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec wcin wcout-zst 
          
          hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec  wcout-zst wcout-zst2
          
          
          • Fixed the warnings for ZStandardDecompressor.c
          • Updated BUILDING.txt
          • Used the constant IO_COMPRESSION_CODEC_ZSTD_LEVEL_DEFAULT and fixed the default compression level
          • Sorted out the TODO for the compression overhead.

          Jason Lowe, what do you think about adding a test that goes through all the codecs and runs the M/R job you used as your example on the MRMiniCluster? Do you think that would be worthwhile?

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          0 mvndep 1m 38s Maven dependency ordering for branch
          +1 mvninstall 9m 18s trunk passed
          +1 compile 8m 50s trunk passed
          +1 checkstyle 1m 42s trunk passed
          +1 mvnsite 12m 9s trunk passed
          +1 mvneclipse 1m 18s trunk passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 30s trunk passed
          +1 javadoc 5m 45s trunk passed
          0 mvndep 0m 22s Maven dependency ordering for patch
          +1 mvninstall 10m 18s the patch passed
          +1 compile 7m 35s the patch passed
          +1 cc 7m 35s the patch passed
          +1 javac 7m 35s the patch passed
          -0 checkstyle 1m 35s root: The patch generated 53 new + 82 unchanged - 1 fixed = 135 total (was 83)
          +1 mvnsite 10m 4s the patch passed
          +1 mvneclipse 1m 6s the patch passed
          +1 shellcheck 0m 13s There were no new shellcheck issues.
          +1 shelldocs 0m 9s There were no new shelldocs issues.
          -1 whitespace 0m 0s The patch has 7 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 5s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 23s the patch passed
          +1 javadoc 5m 8s the patch passed
          -1 unit 99m 50s root in the patch failed.
          -1 asflicense 0m 43s The patch generated 2 ASF License warnings.
          205m 48s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency
            hadoop.yarn.server.applicationhistoryservice.webapp.TestAHSWebServices
            hadoop.yarn.server.nodemanager.containermanager.queuing.TestQueuingContainerManager



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12831421/HADOOP-13578.v1.patch
          Optional Tests asflicense shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux bdcab7ad1a4b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 853d65a
          Default Java 1.8.0_101
          shellcheck v0.4.4
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10647/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/10647/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/10647/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10647/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/10647/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10647/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch!

          what do you think about adding a test that would go through all the codecs and run the M/R job you used as your example with the MRMiniCluster

          I think this would be overkill. Those tests would not be fast to execute, especially across all codecs. Also they won't run during precommit builds that just touch the codecs since they would necessarily have to be in the mapreduce project to pick up the MR cluster and MR job support code. If this was simply a case of returning the wrong buffer length then I'm wondering why the basic sanity tests of the codec don't catch this. Those tests should be updated to catch this without needing to run the full cluster end-to-end scenario.

          Most of the tests above now pass, but the codec is still not able to produce output that the zstd CLI utility can understand, nor can the codec parse files that were created by the zstd CLI utility. For example, trying to decode a file created by the zstd CLI utility blows up like this:

          $ zstd < /etc/services | hadoop fs -put - services.zst
          $ hadoop fs -text services.zst
          WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
          2016-10-05 13:21:36,236 INFO  [main] compress.CodecPool (CodecPool.java:getDecompressor(181)) - Got brand-new decompressor [.zst]
          -text: Fatal internal error
          java.lang.ArrayIndexOutOfBoundsException
          	at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.setInput(ZStandardDecompressor.java:109)
          	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:104)
          	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
          	at java.io.InputStream.read(InputStream.java:101)
          	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:91)
          	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:65)
          	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:126)
          	at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:105)
          	at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:100)
          	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:321)
          	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:293)
          	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:275)
          	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:259)
          	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
          	at org.apache.hadoop.fs.shell.Command.run(Command.java:166)
          	at org.apache.hadoop.fs.FsShell.run(FsShell.java:326)
          	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
          	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
          	at org.apache.hadoop.fs.FsShell.main(FsShell.java:384)
          

          Similarly, the zstd CLI utility just complains that any output produced by the codec is not in zstd format, e.g.:

          $ hadoop fs -cat wcout-zstandard/part-r-00000.zst | zstd -d
          WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
          zstd: stdin: not in zstd format 
          

          Looks like this codec is tacking on an extra 8 bytes; when I strip those off, the zstd CLI utility is able to decode the output. Similarly, when the codec tries to read this Hadoop-specific header from a file created by the zstd CLI utility, it blows up because that header is not there. The GzipCodec works without a Hadoop-specific header, so we can model this codec after it to help make it compatible with the standard .zst file format.

          The code still compiles with warnings:

          [INFO] Running make -j 2 VERBOSE=1
          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c: In function ‘Java_org_apache_hadoop_io_compress_zstd_ZStandardDecompressor_decompressBytes’:
          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:82: warning: unused variable ‘uncompressed_direct_buf_len’
          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c: In function ‘Java_org_apache_hadoop_io_compress_zstd_ZStandardDecompressor_decompressBytes’:
          [WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:82: warning: unused variable ‘uncompressed_direct_buf_len’
          

          Those warnings are a red flag. The code is not checking the amount of space available in the uncompressed-bytes buffer. If I'm reading the code correctly, it asks the zstd library how many bytes would be needed to fully decompress the data and passes that as the output buffer size rather than the actual buffer size. That's very bad, since corrupted or malicious input could cause the codec to overflow the buffer and corrupt the JVM itself. It seems we either need to pass the actual output buffer size to the decompress call and hope the data from the compressed frame fits (with the user bumping the output buffer size if it doesn't), or switch the codec to use the streaming mode.

          Error handling in the native decompressor code should be much more informative to the user. Currently it just complains about a size mismatch; it should use ZSTD_isError and ZSTD_getErrorName like the native compressor code does. Then, if the output buffer size is too small, the user would see that error message and hopefully discern what to do.
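
          A minimal sketch of that pattern, assuming the one-shot ZSTD_decompress API (the JNI plumbing is omitted, and report_error is a hypothetical stand-in for throwing a Java exception): pass the real output-buffer capacity and surface zstd's own error text.

             #include <zstd.h>

             /* Returns the number of decompressed bytes, or 0 after reporting an error. */
             size_t decompress_checked(void *dst, size_t dst_capacity,
                                       const void *src, size_t src_len,
                                       void (*report_error)(const char *msg)) {
                 size_t ret = ZSTD_decompress(dst, dst_capacity, src, src_len);
                 if (ZSTD_isError(ret)) {
                     /* e.g. "Destination buffer is too small" rather than a vague size mismatch */
                     report_error(ZSTD_getErrorName(ret));
                     return 0;
                 }
                 return ret;
             }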

          HADOOP-13684 looks relevant to this codec since it was heavily copied from Snappy. Looks like the init code error handling needs to differentiate between native libraries not being loaded and the native library not having ZStandard support included.

          Nits:

          • clazz has a suppress warnings for unchecked that looks like it should be rawtypes instead.
          • compressionLevel has a suppress warnings for unchecked that is unnecessary
          • "compressoin" s/b "compression"
          • There's no indentation in this block:
                if (uncompressed_bytes == 0) {
                return (jint)0;
                }
            
          • The indentation in general for the native code is very inconsistent. The decompressor file alone uses indents varying across 0, 1, 2, 4, and 6 spaces.
          churromorales churro morales added a comment - - edited

          Jason Lowe, thank you for the thorough review. The reason the zstd CLI and Hadoop can't read each other's compressed data is that ZStandardCodec uses the Block(Compressor|Decompressor)Stream. I had assumed we wanted to compress at the HDFS block level; when you use this stream, each block gets a header followed by compressed data. I believe the 8 bytes you are referring to are two ints (the sizes of the compressed and uncompressed block). If you remove these headers, the CLI is able to read the zstd blocks, and if you compress a file with the zstd CLI and prepend the size header, it will work in Hadoop.

          The snappy compressor/decompressor works the same way. I do not believe you can compress in snappy format using Hadoop, then transfer the file locally and call Snappy.uncompress() without removing the headers.

          If we do not want block-level compression, that is fine. Otherwise, we can add a utility to Hadoop to handle the block headers, as was done for hadoop-snappy and by some of the snappy CLI tools such as snzip.

          As far as the decompressed bytes, I agree. I will check that the size returned by the function that reports how many bytes are needed to decompress the buffer is not larger than our buffer size. I can also add ZSTD_isError and ZSTD_getErrorName to ZStandardDecompressor.c. The reason I explicitly checked whether the expected size equaled the desired size is that the error zstd provided was too vague, but I'll add it in case there are other errors.

          Yes, I will look at HADOOP-13684. The build for this codec is very similar to snappy's: because the license is BSD we can package it in the same way, so I basically cut and pasted the build/packaging code and adapted it for ZStandard.

          I can take care of the nits you described as well.

          Are we okay with the compression being at the block level? If so, this implementation will work just like the other block compression codecs, adding and requiring a header for each HDFS block.

          Thanks again for the review.

          jlowe Jason Lowe added a comment -

          I don't believe the block stuff has anything to do with HDFS blocks. Rather it describes compression occurring in chunks (blocks) of data at a time. Without the small header at the beginning of each block, it becomes difficult in a general way to know how much data is in the next compressed block when decompressing it. Using the Block codec streams doesn't inherently make the data splittable since one can't easily locate the codec block boundaries at an arbitrary split in the data stream (i.e.: HDFS block boundaries). IMHO if we want to chunk the data for splitting then we can just use a SequenceFile configured for block compression with this codec.

          Using the Block streams is a big drawback since it makes the format incompatible with the compression standard; this already causes problems with LZ4, see HADOOP-12990. Rather than compressing in blocks that require extra headers to decode, we can use the zstd streaming APIs to stream the data through the compressor and decompressor. That keeps the file format compatible and avoids error scenarios where the codec is configured with a buffer size that is too small to decompress one of the codec blocks. With the streaming API we decouple our buffer size from the size of the data to compress/decompress.
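
          For illustration, a sketch of the zstd streaming compression loop being suggested (zstd 1.x API); fill() and emit() are hypothetical callbacks standing in for the codec's input and output buffers, and completing the frame with ZSTD_endStream() is shown in a later sketch:

             #include <stdlib.h>
             #include <zstd.h>

             static int zstd_stream_compress(int level,
                                             size_t (*fill)(void *buf, size_t cap),
                                             void (*emit)(const void *buf, size_t len)) {
                 ZSTD_CStream *zcs = ZSTD_createCStream();
                 size_t in_cap = ZSTD_CStreamInSize();    /* library-recommended sizes */
                 size_t out_cap = ZSTD_CStreamOutSize();
                 void *ibuf = malloc(in_cap), *obuf = malloc(out_cap);
                 ZSTD_initCStream(zcs, level);

                 size_t got;
                 while ((got = fill(ibuf, in_cap)) != 0) {
                     ZSTD_inBuffer input = { ibuf, got, 0 };
                     while (input.pos < input.size) {             /* consume the whole chunk */
                         ZSTD_outBuffer output = { obuf, out_cap, 0 };
                         size_t ret = ZSTD_compressStream(zcs, &output, &input);
                         if (ZSTD_isError(ret)) return -1;
                         emit(obuf, output.pos);                  /* drain compressed bytes */
                     }
                 }
                 /* The frame must still be completed with ZSTD_endStream();
                  * see the flush/end sketch further below. */
                 ZSTD_freeCStream(zcs);
                 free(ibuf);
                 free(obuf);
                 return 0;
             }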

          churromorales churro morales added a comment -

          Here is the streaming version of the patch. Jason Lowe, it passes all of the scenarios that failed for you with the block-based approach. Thank you for the review.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 24s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 1s The patch appears to include 3 new or modified test files.
          0 mvndep 0m 58s Maven dependency ordering for branch
          +1 mvninstall 9m 23s trunk passed
          +1 compile 9m 22s trunk passed
          +1 checkstyle 1m 51s trunk passed
          +1 mvnsite 12m 37s trunk passed
          +1 mvneclipse 1m 22s trunk passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 29s trunk passed
          +1 javadoc 5m 35s trunk passed
          0 mvndep 0m 23s Maven dependency ordering for patch
          +1 mvninstall 10m 47s the patch passed
          +1 compile 7m 26s the patch passed
          +1 cc 7m 26s the patch passed
          +1 javac 7m 26s the patch passed
          -0 checkstyle 1m 37s root: The patch generated 56 new + 91 unchanged - 1 fixed = 147 total (was 92)
          +1 mvnsite 10m 8s the patch passed
          +1 mvneclipse 1m 12s the patch passed
          +1 shellcheck 0m 14s There were no new shellcheck issues.
          +1 shelldocs 0m 10s There were no new shelldocs issues.
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          -1 whitespace 0m 0s The patch 132 line(s) with tabs.
          +1 xml 0m 4s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 17s the patch passed
          +1 javadoc 4m 33s the patch passed
          -1 unit 11m 33s root in the patch failed.
          -1 asflicense 0m 31s The patch generated 3 ASF License warnings.
          118m 29s



          Reason Tests
          Failed junit tests hadoop.net.TestDNS



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12836628/HADOOP-13578.v2.patch
          Optional Tests asflicense shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux bd603994464b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 0dc2a6a
          Default Java 1.8.0_101
          shellcheck v0.4.4
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10962/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/10962/artifact/patchprocess/whitespace-eol.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/10962/artifact/patchprocess/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/10962/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10962/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/10962/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10962/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch! It's great to see compatibility between the zstd CLI and the Hadoop codec.

          I'm confused as to why the codec returns null for getCompressorType, createCompressor, getDecompressorType, and createDecompressor. This is going to break code that expects to wield the compressor/decompressor rather than the streams directly. I would expect it to work more like DefaultCodec or BZip2Codec, where the codec returns appropriate Compressor and Decompressor classes and leverages the existing CompressorStream, DecompressorStream, and CompressionCodec.Util classes to handle the stream interfaces of the codec. For example, the CompressorStream abstract class already handles the single-byte write method and protection from double-finish and double-close, DecompressorStream already implements skip, etc.

          The original approach used direct byte buffers, but the new patch no longer does. All the other codecs leverage direct byte buffers, so I'm curious about the reasoning for that change. I'm not a JVM expert, but I'm wondering if the *PrimitiveArrayCritical methods have unfavorable impacts on other threads in the JVM (e.g.: due to pausing GC or other effects). Given that GetPrimitiveArrayCritical could trigger an array copy in order to perform its task and we have to copy the bytes from the output buffer into the output stream anyway, I would expect direct byte buffers to be faster for at least the output buffer case.

          Speaking of double-finish, it looks like that could be problematic and lead to double-free's in the native code layer. In addition to leveraging CompressorStream/DecompressorStream as mentioned above to help with this, we could zero the stream field after we are finished and check for a non-null stream context before doing operations.
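          For illustration only, here is a minimal sketch of that guard (the class, field, and native method names below are hypothetical, not the patch code):

          // Sketch only: names are hypothetical, not the actual patch code.
          public class ZStandardEndGuardSketch {
            private long stream; // pointer to the native zstd stream context, 0 once freed

            public synchronized void end() {
              if (stream != 0) {
                endStream(stream); // native call that frees the context exactly once
                stream = 0;        // a second end()/finish() becomes a no-op instead of a double-free
              }
            }

            // every other native operation would call this first
            private void checkStream() {
              if (stream == 0) {
                throw new NullPointerException("Stream has already been finished");
              }
            }

            private static native void endStream(long stream);
          }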

          We should throw NPE if the caller passes a null for the source data buffer.

          Similarly the native code should throw an error if it cannot obtain pointers to the Java buffers. Currently it just silently returns no progress which will result in an infinite loop in practice as it tries to reach the end of the input and never gets there if the error keeps occurring on each JNI invocation.

          The documentation for the zstd streaming interface mentions that flushing or ending a stream may require multiple invocations in order to fully accomplish the task, but I don't see corresponding loops in the java code to handle that scenario:

          *  At any moment, it's possible to flush whatever data remains within buffer, using ZSTD_flushStream().
          *  `output->pos` will be updated.
          *  Note some content might still be left within internal buffer if `output->size` is too small.
          *  @return : nb of bytes still present within internal buffer (0 if it's empty)
          *            or an error code, which can be tested using ZSTD_isError().
          *
          *  ZSTD_endStream() instructs to finish a frame.
          *  It will perform a flush and write frame epilogue.
          *  The epilogue is required for decoders to consider a frame completed.
          *  Similar to ZSTD_flushStream(), it may not be able to flush the full content if `output->size` is too small.
          *  In which case, call again ZSTD_endStream() to complete the flush.
          

          IOUtils.closeQuietly is being called on the output stream which could lead to silent data corruption. If there was an issue writing out the last bits of data as a result of the close call then this will silently eat the error. The user is left with a "successful" operation that did not actually succeed.
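          For example, a rough sketch of a close() that lets such a failure surface (illustrative only, not the patch code):

          import java.io.FilterOutputStream;
          import java.io.IOException;
          import java.io.OutputStream;

          // Sketch only: the point is that close() must not swallow failures the way
          // IOUtils.closeQuietly() does.
          class FailFastCompressedOutputSketch extends FilterOutputStream {
            FailFastCompressedOutputSketch(OutputStream out) {
              super(out);
            }

            @Override
            public void close() throws IOException {
              try {
                flush();     // may throw if the final compressed bytes cannot be written
              } finally {
                out.close(); // also allowed to throw; either failure reaches the caller
              }
            }
          }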

          The getLibraryName native method should do something sane for the non-UNIX case, like returning the Unavailable string as some codecs do when they can't compute it in the UNIX case. Bonus points for adding a WINDOWS case, and it looks like we can model after the Windows implementations of getLibraryName in the other codecs.

          In the decompressor, why are we finished when we decode a frame? I thought it was valid for a ZStandard compression stream to be made up of one or more frames, so if there is more input after decoding a frame it seems like the prudent thing to do is try decoding another frame.

          It would be nice to allow the input and output buffer sizes to be configurable. When used as part of wide merge sorts, there are going to be a lot of codec instances. It may be necessary to use a smaller buffer size per codec to reduce the merge's memory footprint due to all those codec I/O buffers. It makes sense to use the library-recommended size as the default value, but it should be straightforward to allow the user to override this.
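          A minimal sketch of that override, assuming a hypothetical config key and using the library-recommended value as the default:

          import org.apache.hadoop.conf.Configuration;

          // Sketch only: the key name and the way the recommended size is obtained are illustrative.
          class ZStdBufferSizeSketch {
            static final String BUFFER_SIZE_KEY = "io.compression.codec.zstd.buffersize"; // hypothetical

            static int chooseBufferSize(Configuration conf, int libraryRecommendedSize) {
              // default to the zstd-recommended size, but let users shrink it for wide merges
              return conf.getInt(BUFFER_SIZE_KEY, libraryRecommendedSize);
            }
          }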

          Nit: srcSize is not really a size in the following code but rather a terminating offset. Renaming it to something like endOffset would be more clear, since we're comparing it against the inputBufferOffset. Same comment for dstSize in the decompressor.

              int srcSize = offset + len;
              inputBufferOffset = offset;
              while (inputBufferOffset < srcSize) {
          

          Nit: The whitespace was removed around this line in the ASF license header, and it would be good to restore it so it's consistent with the ASF headers in other files.

           * http://www.apache.org/licenses/LICENSE-2.0
          
          churromorales churro morales added a comment -

          I'm confused why the codec returns null for getCompressorType, createCompressor, getDecompressorType, and createDecompressor. This is going to break code that expects to wield the compressor/decompressor rather than streams directly. I would expect it to work more like the DefaultCodec or BZip2Codec where the codec returns appropriate Compressor and Decompressor classes and leverages the existing CompressorStream, DecompressorStream, and CompressionCodec.Util classes to handle the stream interfaces of Codec. For example, the CompressorStream abstract class already handles the single-byte write method, protection from double-finish and double-close, the DecompressorStream already implements skip, etc.

          The reason for this came out of testing the code, namely the decompressor logic. The smallest unit of work in zstandard is a block. I noticed many cases during decompression where I would fill up the output buffer and still have some data left in my input buffer. That leftover data was not a complete block of compressed data, so I ran into quite a few issues with the decompressor API, which I felt was really clunky: you have to return special values to signal that you need input, and you have no control over how much of the buffer you fill with compressed data. For example, if I have half a block of compressed data at the end of the input buffer, it was really difficult to state:

          I need more data in the input buffer, because I don't have a complete compressed block,
          I have to move that partial compressed block to the beginning of the input buffer
          I also have to ensure that needsInput now returns true, even though I have compressed data to decompress (but not enough for a block)

          The entire streaming decompressor API is built around Bzip2 and Zlib, which if you look have the exact same C APIs and even return the same values from their respective inflate / deflate methods. I would much prefer to have an InputStream / OutputStream and just have my library compress / decompress on that stream itself, because the zstd API does not match that of zlib or bzip2.

          The original approach used direct byte buffers, but the new patch no longer does. All the other codecs leverage direct byte buffers, so I'm curious about the reasoning for that change. I'm not a JVM expert, but I'm wondering if the *PrimitiveArrayCritical methods have unfavorable impacts on other threads in the JVM (e.g.: due to pausing GC or other effects). Given that GetPrimitiveArrayCritical could trigger an array copy in order to perform its task and we have to copy the bytes from the output buffer into the output stream anyway, I would expect direct byte buffers to be faster for at least the output buffer case.

          This is from the Oracle documentation:

          Here is the signature of GetPrimitiveArrayCritical: void * GetPrimitiveArrayCritical(JNIEnv *env, jarray array, jboolean *isCopy);
          Note that GetPrimitiveArrayCritical might still make a copy of the array if the VM internally represents arrays in a different format. Therefore we need to check its return value against NULL for possible out of memory situations.

          Looks like for most of the JVMs on Linux platforms it references a memory address, but we can still use direct byte buffers if you feel strongly about it, at least for the output buffer case as you had stated. I didn't notice any performance hit in my local test environment but can look further into this.
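          For reference, a rough sketch of what a direct-buffer output path could look like (the JNI wrapper, buffer size, and class name below are hypothetical):

          import java.nio.ByteBuffer;

          // Sketch only: a direct ByteBuffer on the output side avoids
          // GetPrimitiveArrayCritical on that path; the copy into the caller's
          // byte[] then happens on the Java side.
          abstract class DirectOutputBufferSketch {
            // size is illustrative; the real buffer would use the configured/recommended size
            private final ByteBuffer compressedDirectBuf = ByteBuffer.allocateDirect(128 * 1024);

            /** Hypothetical JNI call: compresses into the direct buffer, returns bytes produced. */
            protected abstract int deflateBytesDirect(ByteBuffer dst);

            int compress(byte[] b, int off, int len) {
              compressedDirectBuf.clear();
              int produced = deflateBytesDirect(compressedDirectBuf);
              compressedDirectBuf.position(0);
              compressedDirectBuf.limit(produced);
              int n = Math.min(produced, len);
              // copy out only what the caller asked for; a real implementation would
              // retain any remainder for the next call
              compressedDirectBuf.get(b, off, n);
              return n;
            }
          }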

          Speaking of double-finish, it looks like that could be problematic and lead to double-free's in the native code layer. In addition to leveraging CompressorStream/DecompressorStream as mentioned above to help with this, we could zero the stream field after we are finished and check for a non-null stream context before doing operations.

          We should throw NPE if the caller passes a null for the source data buffer.
          Similarly the native code should throw an error if it cannot obtain pointers to the Java buffers. Currently it just silently returns no progress which will result in an infinite loop in practice as it tries to reach the end of the input and never gets there if the error keeps occurring on each JNI invocation.

          totally agree, will fix this.

          The documentation for the zstd streaming interface mentions that flushing or ending a stream may require multiple invocations in order to fully accomplish the task, but I don't see corresponding loops in the java code to handle that scenario:

          Will make sure to take care of this scenario.
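          For what it's worth, the shape of the loop might be something like this (sketch; the JNI wrapper and buffer handling below are hypothetical):

          import java.io.IOException;
          import java.io.OutputStream;
          import java.nio.ByteBuffer;

          // Sketch only: ZSTD_endStream() may leave data in its internal buffer when the
          // output buffer is too small, so the wrapper has to be called again until it
          // reports 0 bytes remaining.
          abstract class EndStreamLoopSketch {
            protected long stream;             // native ZSTD_CStream pointer
            protected ByteBuffer outputBuffer; // direct buffer shared with the JNI layer
            protected OutputStream out;

            /** Hypothetical JNI wrapper: fills outputBuffer, returns bytes still buffered natively. */
            protected abstract int endStream(long stream, ByteBuffer output) throws IOException;

            void finishFrame() throws IOException {
              int remaining;
              do {
                outputBuffer.clear();
                remaining = endStream(stream, outputBuffer);
                outputBuffer.flip();
                byte[] chunk = new byte[outputBuffer.remaining()];
                outputBuffer.get(chunk);
                out.write(chunk);          // drain whatever this call produced
              } while (remaining > 0);     // call again until the frame epilogue is complete
            }
          }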

          IOUtils.closeQuietly is being called on the output stream which could lead to silent data corruption. If there was an issue writing out the last bits of data as a result of the close call then this will silently eat the error. The user is left with a "successful" operation that did not actually succeed.

          Great catch, will change this.

          The getLibraryName native method should do something sane for the non-UNIX case, like returning the Unavailable string as some codecs do when they can't compute it in the UNIX case. Bonus points for adding a WINDOWS case, and it looks like we can model after the Windows implementations of getLibraryName in the other codecs.

          Sure we can change this.

          In the decompressor, why are we finished when we decode a frame? I thought it was valid for a ZStandard compression stream to be made up of one or more frames, so if there is more input after decoding a frame it seems like the prudent thing to do is try decoding another frame.

          from the docs

          • @return : 0 when a frame is completely decoded and fully flushed, an error code, which can be tested using ZSTD_isError(), any other value > 0, which means there is still some work to do to complete the frame. The return value is a suggested next input size (just an hint, to help latency).

          Also, we can read multiple frames just fine with the decompressor. I tested this with the CLI: I compressed two separate files, concatenated them together, and then fed that file (which contained 2 frames) into the ZStandardDecompressorStream, and it decompressed just fine. That is because the finished field is only there to make sure we call into the zstd library to re-initialize the stream when we have reached the end of a frame. In the actual ZStandardDecompressorStream we still check whether there is more input to read even after we reach the end of a frame. The finished field is there to reinitialize the zstd stream rather than the actual Java decompressor stream.

          It would be nice to allow the input and output buffer sizes to be configurable. When used as part of wide merge sorts, there are going to be a lot of codec instances. It may be necessary to use a smaller buffer size per codec to reduce the merge's memory footprint due to all those codec I/O buffers. It makes sense to use the library-recommended size as the default value, but it should be straightforward to allow the user to override this.

          That shouldn't be a problem, but the zstd-suggested buffer size, especially for the decompressor, is the minimum buffer size needed to decompress one block. So we could let the user configure it and take the larger of the two buffer sizes in case they configure a buffer size that is too small.

          Nit: srcSize is not really a size in the following code but rather a terminating offset. Renaming it to something like endOffset would be more clear, since we're comparing it against the inputBufferOffset. Same comment for dstSize in the decompressor.

          Sure, I can change that to endOffset and inputBufferOffset.

          Nit: The whitespace was removed around this line in the ASF license header, and it would be good to restore it so it's consistent with the ASF headers in other files.

          Sure thing no problem.

          Thank you for such a thorough review. I hope implementing my own streams is not too big of an issue. Looking at the code, it didn't seem like it would be.

          jlowe Jason Lowe added a comment -

          The problem with ignoring the Codec interface is that code using the Compressor/Decompressor interfaces will break if it happens to get the zstd codec but work for all others. That's not really adding a full Codec for zstd. I agree it's not the most ideal interface for the way zstd wants to be called, but that doesn't change the fact that existing code that should work as-is on a Hadoop-provided codec will not if we don't implement the Compressor/Decompressor interfaces.

          It looks like the zstd code already includes a wrapper that translates the zlib API into the zstd API (see https://github.com/facebook/zstd/tree/dev/zlibWrapper) which we can maybe leverage or at least use as a model for the Compressor/Decompressor implementations. To be clear, I think it's a good thing that the compressor/decompressor streams are using zstd more natively, but I also think it's important to not ignore the non-stream Codec APIs. I'm also OK if the initial implementation of the zstd Compressor/Decompressor is not super optimal. One tactic would be to always copy the input data to an internal input buffer and track the amount of data zstd wants. Then the needsInput method is easy, just return true when the amount of data in the internal input buffer is less than what zstd last requested. The decompress method would return 0 if needInput would return true. Not super efficient since we make a copy of all the input data before passing it to zstd, but it should be relatively straightforward to implement.
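          A rough sketch of that tactic (all names below are hypothetical, and the native call is left as a placeholder):

          import java.nio.ByteBuffer;

          // Sketch only: illustrates "copy all input, track how much zstd last asked for".
          class NeedsInputSketch {
            private final ByteBuffer internalInput = ByteBuffer.allocateDirect(128 * 1024);
            private long lastRequestedByZstd = 1; // "suggested next input size" from the last native call

            void setInput(byte[] b, int off, int len) {
              internalInput.put(b, off, len); // always copy; simple but not optimal (no growth handling here)
            }

            boolean needsInput() {
              // ask for more only while we hold less than the library last requested
              return internalInput.position() < lastRequestedByZstd;
            }

            int decompress(byte[] b, int off, int len) {
              if (needsInput()) {
                return 0; // per the tactic above: no progress until more input arrives
              }
              // ... hand internalInput to the ZSTD_decompressStream wrapper here, refresh
              // lastRequestedByZstd from its return value, and fill b[off..off+len) ...
              return 0; // placeholder
            }
          }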

          I'm also not so sure about the minimum buffer assertion. I saw in the zstd unit tests there is a byte-by-byte streaming decompression test where it tries to decompress a buffer with a 1-byte input buffer and a 1-byte output buffer. I also verified this independently by applying the following changes to the streaming_decompression.c example:

          $ diff -u streaming_decompression.c.orig streaming_decompression.c
          --- streaming_decompression.c.orig	2016-11-14 18:39:43.423069894 +0000
          +++ streaming_decompression.c	2016-11-14 19:08:36.395286748 +0000
          @@ -63,10 +63,10 @@
           static void decompressFile_orDie(const char* fname)
           {
               FILE* const fin  = fopen_orDie(fname, "rb");
          -    size_t const buffInSize = ZSTD_DStreamInSize();
          +    size_t const buffInSize = 1;
               void*  const buffIn  = malloc_orDie(buffInSize);
               FILE* const fout = stdout;
          -    size_t const buffOutSize = ZSTD_DStreamOutSize();  /* Guarantee to successfully flush at least one complete compressed block in all circumstances. */
          +    size_t const buffOutSize = 1;
               void*  const buffOut = malloc_orDie(buffOutSize);
           
               ZSTD_DStream* const dstream = ZSTD_createDStream();
          @@ -80,6 +80,7 @@
                   while (input.pos < input.size) {
                       ZSTD_outBuffer output = { buffOut, buffOutSize, 0 };
                       toRead = ZSTD_decompressStream(dstream, &output , &input);  /* toRead : size of next compressed block */
          +	    toRead = (toRead > buffInSize) ? buffInSize : toRead;
                       if (ZSTD_isError(toRead)) { fprintf(stderr, "ZSTD_decompressStream() error : %s \n", ZSTD_getErrorName(toRead)); exit(12); }
                       fwrite_orDie(buffOut, output.pos, fout);
                   }
          

          With those changes the streaming_decompression example was able to decompress zstd files even though it was using an extremely small input and output buffer. So even though I'm not giving the zstd library a full block of input it's able to compensate.

          The finished field is more to reinitialize the zstd stream than the actual java decompressor stream.

          Ah, I see that now. Thanks for the explanation.

          So we could allow the user to configure and take the larger of the 2 buffer sizes in case they configure it with a buffer size that is too small.

          See input/output buffer discussion above. I don't think we need to apply safety rails to what the user specifies because the zstd library apparently can accommodate.

          churromorales churro morales added a comment -

          Jason Lowe

          Thanks for the feedback. As for your concerns:

          The problem with ignoring the Codec interface is that code using the Compressor/Decompressor interfaces will break if it happens to get the zstd codec but work for all others. That's not really adding a full Codec for zstd. I agree it's not the most ideal interface for the way zstd wants to be called, but that doesn't change the fact that existing code that should work as-is on a Hadoop-provided codec will not if we don't implement the Compressor/Decompressor interfaces.

          As much as it pains me, I agree with your sentiments. I will implement the Compressor / Decompressor interfaces.

          It looks like the zstd code already includes a wrapper that translates the zlib API into the zstd API (see https://github.com/facebook/zstd/tree/dev/zlibWrapper) which we can maybe leverage or at least use as a model for the Compressor/Decompressor implementations.

          Good point, I did look at using the zlib wrapper originally, but in v1.0.0 (https://github.com/facebook/zstd/tree/v1.0.0/zlibWrapper) they didn't have support for deflateReset and inflateReset, which would have errored out and broken our code. No worries though, I think I have it figured out using the Compressor / Decompressor logic without having to reference the zlibWrapper.

          To be clear, I think it's a good thing that the compressor/decompressor streams are using zstd more natively, but I also think it's important to not ignore the non-stream Codec APIs.

          I think the reason we need to implement the compressor / decompressor is more for the reuse from the CodecPool. Each compressor / decompressor / codec is very much tied to whether the compression library has been implemented using a stream-based or block-based approach. From what I can see, the API is called as follows:

           
          Compressor compressor =  CodecPool.getCompressor();
          codec.createOutputStream(OutputStream, compressor);
          // do work
          

          And when we look at specific implementations of codec.createOutputStream(), codecs like Snappy always return a Block(Compressor|Decompressor)Stream, and it seems to me that the Snappy(Compressor|Decompressor) can only work on these types of streams. It looks like the way it is used everywhere (including HBase) is that we always get a stream and work on it; the compressor / decompressor are passed in so we have a pool and don't have to constantly re-initialize these native libs.

          For example I wrote this test, and if you use the Compression(Input|Output)Streams that are native to the codec the test will pass. But if you create a Compressor/Decompressor and use a Compression(Input|Output)Stream that is not specific to this codec the test will fail.

           @Test
            public void testHandlingWithStreams() throws Exception {
              byte[] bytes = BytesGenerator.get(1024*64);
              ByteArrayInputStream inputStream = new ByteArrayInputStream(bytes);
              SnappyCodec codec = new SnappyCodec();
              codec.setConf(new Configuration());
              Compressor compressor = codec.createCompressor();
              ByteArrayOutputStream baos = new ByteArrayOutputStream();
              CompressionOutputStream compressorStream = new CompressorStream(baos, compressor);
              // if you replace compressorStream with 
              // CompressionOutputStream compressorStream = codec.createOutputStream(baos, compressor);
              // things work fine.
              byte[] buffer = new byte[100];
              int result;
          
              while ((result = inputStream.read(buffer, 0, buffer.length)) != -1) {
                compressorStream.write(buffer, 0, result);
              }
              compressorStream.flush();
              compressorStream.finish();
          
              // lets make sure the compressed bytes are able to be decompressed to read
              byte[] compressedBytes = baos.toByteArray();
              ByteArrayInputStream bais = new ByteArrayInputStream(compressedBytes);
              baos = new ByteArrayOutputStream();
              Decompressor decompressor = codec.createDecompressor();
              CompressionInputStream decompressorStream = new DecompressorStream(bais, decompressor);
              // if you replace decompressorStream with 
              // CompressionInputStream decompressorStream = codec.createInputStream(bais, decompressor);
              // things work fine. 
              byte[] decompressionBuffer = new byte[100];
              while ((result = decompressorStream.read(decompressionBuffer, 0, buffer.length)) != -1) {
                baos.write(decompressionBuffer, 0, result);
              }
              decompressorStream.close();
              byte[] decompressBytes = baos.toByteArray();
              assertArrayEquals(bytes, decompressBytes);
            }
          

          I'm also not so sure about the minimum buffer assertion. I saw in the zstd unit tests there is a byte-by-byte streaming decompression test where it tries to decompress a buffer with a 1-byte input buffer and a 1-byte output buffer.

          You are absolutely correct. I will allow the user to specify the buffer size and default to the one zstandard recommends.

          I almost have a patch ready that implements a Compressor / Decompressor. Unfortunately this API makes things a bit clunky, but it is needed for the pooling so we don't have to reinitialize the shared libraries. I will have a new patch up soon. Thank you for taking the time to comment and review this work; it is much appreciated.

          jlowe Jason Lowe added a comment -

          Cool, thanks for the update and looking into things.

          I was hoping that, given the discovery that the zstd streaming interface can seemingly take arbitrary input and output buffer sizes, we could avoid the headaches of copying leftover input data, etc. Instead we keep feeding input data to ZSTD_decompressStream in a loop until either we run out of input data or the output buffer is full. As long as there is still input data remaining (i.e. the output buffer filled up), we don't need more input and therefore don't need to shuffle data around in the input buffer. Hopefully that makes implementing the Decompressor less clunky than before.
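          In rough sketch form (the decompressStream wrapper and class below are hypothetical), the loop would be:

          import java.nio.ByteBuffer;

          // Sketch only: loop structure, not the patch code. decompressStream() stands in
          // for a JNI wrapper around ZSTD_decompressStream() that advances both buffer
          // positions and throws on error, so the loop cannot spin without progress.
          abstract class DecompressLoopSketch {
            protected long dStream; // native ZSTD_DStream pointer

            protected abstract int decompressStream(long dStream, ByteBuffer input, ByteBuffer output);

            void drain(ByteBuffer input, ByteBuffer output) {
              // keep feeding zstd until the input is exhausted or the output buffer is full;
              // only then does the caller need to supply more input or consume the output
              while (input.hasRemaining() && output.hasRemaining()) {
                decompressStream(dStream, input, output);
              }
            }
          }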

          churromorales churro morales added a comment -

          Here is the patch with the Compressor Decompressor approach. Thanks again for the review.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
          0 mvndep 0m 17s Maven dependency ordering for branch
          +1 mvninstall 7m 25s trunk passed
          -1 compile 6m 30s root in trunk failed.
          +1 checkstyle 1m 40s trunk passed
          +1 mvnsite 10m 3s trunk passed
          +1 mvneclipse 1m 7s trunk passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 1m 56s trunk passed
          +1 javadoc 4m 43s trunk passed
          0 mvndep 0m 21s Maven dependency ordering for patch
          +1 mvninstall 8m 35s the patch passed
          +1 compile 9m 38s the patch passed
          -1 cc 9m 38s root generated 5 new + 2 unchanged - 0 fixed = 7 total (was 2)
          -1 javac 9m 38s root generated 513 new + 179 unchanged - 0 fixed = 692 total (was 179)
          -0 checkstyle 1m 44s root: The patch generated 68 new + 91 unchanged - 1 fixed = 159 total (was 92)
          +1 mvnsite 10m 13s the patch passed
          +1 mvneclipse 1m 15s the patch passed
          +1 shellcheck 0m 13s There were no new shellcheck issues.
          +1 shelldocs 0m 10s There were no new shelldocs issues.
          -1 whitespace 0m 0s The patch has 2385 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 6s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 44s the patch passed
          +1 javadoc 5m 46s the patch passed
          -1 unit 47m 49s root in the patch failed.
          -1 asflicense 0m 47s The patch generated 3 ASF License warnings.
          147m 43s



          Reason Tests
          Failed junit tests hadoop.crypto.key.kms.server.TestKMS



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12839066/HADOOP-13578.v3.patch
          Optional Tests asflicense shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux 5a19c632ff79 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / f121d0b
          Default Java 1.8.0_101
          compile https://builds.apache.org/job/PreCommit-HADOOP-Build/11070/artifact/patchprocess/branch-compile-root.txt
          shellcheck v0.4.4
          findbugs v3.0.0
          cc https://builds.apache.org/job/PreCommit-HADOOP-Build/11070/artifact/patchprocess/diff-compile-cc-root.txt
          javac https://builds.apache.org/job/PreCommit-HADOOP-Build/11070/artifact/patchprocess/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11070/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11070/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11070/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11070/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/11070/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11070/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          churromorales churro morales added a comment - - edited

          Found a small issue with the latest patch: it has a memory leak and also closes a stream that doesn't exist.

          I will fix this and put up another patch in a few hours.

          jlowe Jason Lowe added a comment -

          Looking forward to the updated patch with the memory fixes. Some comments on the latest patch. I didn't get time to look at the native code or unit test changes in detail, but here are the comments I had thus far:

          io.compression.codec.zst.buffersize should be io.compression.codec.zstd.buffersize to have a consistent prefix with io.compression.codec.zstd.level

          In ZStandardCodec#createOutputStream and ZStandardCodec#createInputStream I'm curious why we're not honoring the user's specified buffer size and just using the default buffer size. Many codecs use io.file.buffer.size here, but others use a codec-specific config property.

          In ZStandardCompressor and ZStandardDecompressor, the buffers should simply be declared as ByteBuffer instead of Buffer. That way the ugly casting when doing bulk get/put with them is unnecessary.

          In ZStandardDecompressor it's a little confusing that some buffer length variables like userBufLen reflect the number of bytes remaining to be consumed in the buffer, while compressedDirectBufLen reflects the number of bytes originally placed in the buffer, so the real length of data to be consumed is (compressedDirectBufLen - compressedDirectBufOffset). Maybe a variable name change would help notify future maintainers of the semantic difference with similar variables for other buffers, or we should change the semantics so that variable is consistent with the others?

          I'm a little wary of the public getBytesWritten and getBytesRead methods in ZStandardDecompressor. getBytesWritten doesn't appear to be all that useful since the caller can trivially compute it based on the amount of data they read from the input stream the decompressor is writing. Also both of these methods reset values to zero at frame boundaries in the input stream, which may confuse callers who are comparing these values to the entire compressed input data size. I'd recommend removing getBytesWritten which is unused and making getBytesRead package-private just for unit testing.

          It would be nice to change the logging from org.apache.commons.logging to org.slf4j. Sorry I didn't catch this earlier. We're supposed to be slowly migrating to SLF4J, and new code should be using it. Fortunately it's an easy change, and we can remove the conditional checks for debug logging.
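
          For reference, the SLF4J style is mechanical (class name used here only as an example):

            import org.slf4j.Logger;
            import org.slf4j.LoggerFactory;

            private static final Logger LOG =
                LoggerFactory.getLogger(ZStandardCompressor.class);

            // Parameterized messages defer string building, so no isDebugEnabled() guard is needed:
            LOG.debug("Compressed {} input bytes into {} output bytes", bytesRead, bytesWritten);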

          Do we need to have test_file.txt? I'm wondering if simply generating random data (maybe with a constant seed so the test is deterministic) into a byte buffer would be sufficient input for testing. Typically we want to avoid test files in the source tree unless there's something very specific about the file's contents necessary to exercise a unit test case.
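
          Deterministic in-memory input could be generated along these lines (seed and size are arbitrary):

            // A fixed seed keeps the generated input identical on every test run.
            byte[] uncompressed = new byte[1024 * 1024];
            new java.util.Random(0xDEADBEEFL).nextBytes(uncompressed);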

          Typos and coding style minor nits:

          • ZStandardCodec
            • conf should be private in ZStandardCodec since it isn't accessed outside of the class.
            • "objec." s/b "object." and "ZStandard Decompressor" s/b "ZStandardDecompressor"
          • ZStandardCompressor
            • Comment refers to ZLIB format
            • Missing whitespace in "(b== null)" in setInput
            • Missing braces around "keepUncompressedBuf && uncompressedDirectBufLen > 0" conditional block in needsInput
            • 'else' should be on same line as closing brace in compress
            • 'atmost' s/b 'at most'
            • Missing braces around conditional block in checkStream
          • ZStandardDecompressor
            • Annotation should appear on line before getDirectBufferSize method.
          churromorales churro morales added a comment -

          Hi Jason, thanks for the review. I figured out the memory issue and will submit a patch Monday; it was quite a bit of fun trying to figure out why the JVM was segfaulting in particular scenarios.

          io.compression.codec.zst.buffersize should be io.compression.codec.zstd.buffersize to have a consistent prefix with io.compression.codec.zstd.level
          In ZStandardCodec#createOutputStream and ZStandardCodec#createInputStream I'm curious why we're not honoring the user's specified buffer size and just using the default buffer size. Many codecs use io.file.buffer.size here, but others use a codec-specific config property.

          I totally agree; I'm going to just use io.file.buffer.size as the buffer size and honor it.

          In ZStandardCompressor and ZStandardDecompressor, the buffers should simply be declared as ByteBuffer instead of Buffer. That way the ugly casting when doing bulk get/put with them is unnecessary.

          done

          In ZStandardDecompressor it's a little confusing that some buffer length variables like userBufLen reflect the number of bytes remaining to be consumed in the buffer, while compressedDirectBufLen reflects the number of bytes originally placed in the buffer, so the real length of data to be consumed is (compressedDirectBufLen - compressedDirectBufOffset). Maybe a variable name change would help notify future maintainers of the semantic difference with similar variables for other buffers, or we should change the semantics so that variable is consistent with the others?

          Yeah, that is confusing; I'll change the variable names. I was just following how other Decompressors did their naming.

          I'm a little wary of the public getBytesWritten and getBytesRead methods in ZStandardDecompressor. getBytesWritten doesn't appear to be all that useful since the caller can trivially compute it based on the amount of data they read from the input stream the decompressor is writing. Also both of these methods reset values to zero at frame boundaries in the input stream, which may confuse callers who are comparing these values to the entire compressed input data size. I'd recommend removing getBytesWritten which is unused and making getBytesRead package-private just for unit testing.

          I originally had that for testing purposes and totally forgot to remove it. I'll get rid of those methods entirely.

          It would be nice to change the logging from org.apache.commons.logging to org.slf4j. Sorry I didn't catch this earlier. We're supposed to be slowly migrating to SLF4J, and new code should be using it. Fortunately it's an easy change, and we can remove the conditional checks for debug logging.

          sure thing

          Do we need to have test_file.txt? I'm wondering if simply generating random data (maybe with a constant seed so the test is deterministic) into a byte buffer would be sufficient input for testing. Typically we want to avoid test files in the source tree unless there's something very specific about the file's contents necessary to exercise a unit test case.

          Sorry, I forgot to commit the other file that went along with it: test_file.txt.zst. That file was compressed using the command line utility, and I wanted to make sure we could go back and forth between the command line and Hadoop. I think that is useful to have as a test case: if anyone changes that behavior down the line, the tests will break and it should be pretty clear why.

          I will fix all the coding style issues and other minor nits, no problem.

          Again thank you so much for taking the time to review this patch. I will get the hopefully final patch up Monday.

          churromorales churro morales added a comment -

          Jason Lowe, hope you had a good Thanksgiving. Attached is the latest patch; thanks for taking the time to review.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 18s Maven dependency ordering for branch
          +1 mvninstall 7m 0s trunk passed
          +1 compile 9m 41s trunk passed
          +1 checkstyle 1m 39s trunk passed
          +1 mvnsite 9m 48s trunk passed
          +1 mvneclipse 1m 10s trunk passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 1m 55s trunk passed
          +1 javadoc 4m 29s trunk passed
          0 mvndep 0m 20s Maven dependency ordering for patch
          +1 mvninstall 8m 21s the patch passed
          +1 compile 9m 30s the patch passed
          +1 cc 9m 30s the patch passed
          +1 javac 9m 30s the patch passed
          -0 checkstyle 1m 46s root: The patch generated 73 new + 159 unchanged - 1 fixed = 232 total (was 160)
          +1 mvnsite 9m 45s the patch passed
          +1 mvneclipse 1m 16s the patch passed
          +1 shellcheck 0m 11s There were no new shellcheck issues.
          +1 shelldocs 0m 9s There were no new shelldocs issues.
          -1 whitespace 0m 0s The patch has 2383 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 5s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 18s the patch passed
          +1 javadoc 4m 32s the patch passed
          -1 unit 13m 59s root in the patch failed.
          -1 asflicense 0m 44s The patch generated 3 ASF License warnings.
          113m 44s



          Reason Tests
          Failed junit tests hadoop.net.TestDNS



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12840717/HADOOP-13578.v4.patch
          Optional Tests asflicense shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux 18fab6925437 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / a2b1ff0
          Default Java 1.8.0_111
          shellcheck v0.4.5
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11149/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11149/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11149/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11149/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/11149/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11149/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch!

          I'm torn on the buffer size, whether we should use the generic io.file.buffer.size or have a zst-specific config. If the codec can significantly speed up with a larger default buffer size than the file buffer size then maybe we should consider having a separate, zst-specific buffer size config. I'd recommend a zst-specific config key with a default value of 0 that means use whatever buffer size the zst library wants to use, but the user can override it to a custom size. That way users can change the zst codec buffer size (again, useful for keeping memory usage reasonable for very wide sort factor merges) but not change the buffer sizes of anything else. Similarly, someone changing the default buffer size to something relatively small (e.g.: 4K) for some unrelated use case could unknowingly hamper the performance of the zst codec.
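
          Something along these lines (the key name and the helper that asks the library are placeholders):

            // 0 (the default) means defer to whatever buffer size the zstd library
            // recommends, e.g. a native wrapper around ZSTD_CStreamOutSize();
            // any positive value set by the user wins.
            int configured = conf.getInt("io.compression.codec.zstd.buffersize", 0);
            int bufferSize = configured > 0 ? configured : getRecommendedBufferSize();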

          Regarding the test input file, is it copyrighted or do we otherwise have the redistribution rights? Also the .zst binary test file is missing from the patch.

          ZStandardCodec.java
          Nit: should import the specific things needed rather than wildcard imports.

          ZStandardCompressor.java
          reinit calls reset then init, but reset already is calling init.

          LOG should be private rather than public.

          The LOG.isDebugEnabled check is unnecessary.

          Nit: The VisibleForTesting annotation should be on a separate line like the other annotations.

          ZStandardCompressor.c:
          Shouldn't we throw if GetDirectBufferAddress returns null?

          From the zstd documentation, ZSTD_compressStream may not always fully consume the input. Therefore if the finish flag is set in deflateBytesDirect I think we need to avoid calling ZSTD_endStream if there are still bytes left in the input, or we risk dropping some of the input. We can just return without setting the finished flag and expect to get called again from the outer compression loop logic.

          Similarly the documentation for ZSTD_endStream says that it may not be able to flush the full contents of the data in one call, and if it indicates there is more data left to flush then we should call it again. So if we tried to end the stream and there's more data left then we shouldn't throw an error. Instead we simply leave the finished flag unset and return so the outer logic loop will empty the output buffer and call compress() again. We will then again call ZSTD_endStream to continue attempting to finish the compression. Bonus points for adding a test case that uses a 1-byte output buffer to demonstrate the extreme case is working properly.
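
          To illustrate why that works, the surrounding stream logic is roughly the following loop (a paraphrase, not the exact Hadoop code):

            // Once finish() is requested, the stream keeps draining the compressor until
            // it reports finished; each pass hands the native side an emptied output buffer.
            compressor.finish();
            while (!compressor.finished()) {
              int len = compressor.compress(buffer, 0, buffer.length);
              if (len > 0) {
                out.write(buffer, 0, len);
              }
            }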

          Is it necessary to call ZSTD_flushStream after every ZSTD_compressStream call? Seems to me we can skip it entirely, but I might be missing something.

          ZStandardDecompressor.java:
          LOG should be private rather than public.

          Should there be a checkStream call at the beginning of decompress()? This would mirror the same check in the compressor for the corresponding compress() method.

          In inflateDirect, the setting of preSliced = dst in the conditional block is redundant since it was just done before the block. Actually I think it would be faster and create less garbage to avoid creating a temporary ByteBuffer and have the JNI code look up the output buffer's position and limit and adjust accordingly. That way we aren't creating an object each time.

          The swapping of variables in inflateDirect is cumbersome. It would be much cleaner if inflateBytesDirect took arguments so we can pass what the JNI needs directly to it rather than shuffling the parameters into the object's fields before we call it. That also cleans up the JNI code a bit since it doesn't need to do many of the field lookups. For example:

            private native int inflateBytesDirect(ByteBuffer src, int srcOffset, int srcLen, ByteBuffer dst, int dstOffset, int dstLen);
          

          We could simplify the variables even more if the JNI code called the ByteBuffer methods to update the positions and limits directly rather than poking the values into the Java object and having Java do it when the JNI returns. I'll leave the details up to you, just pointing out that we can clean this up and avoid a lot of swapping. We could do a similar thing on the compressor side.

          I'm a little confused about the endOfInput flag for the direct decompressor. I assume it's for handling the case where there are multiple frames within the input buffer. The JNI code will set finished = true once it hits the end of the frame, and it appears it will remain that way even if more input is sent (i.e.: finished will go from false to true after the first frame in the input buffer and remain true throughout all subsequent frames). It makes me wonder if we should be setting finished to false if we didn't see the end of a frame this iteration?

          ZStandardDecompressor.c
          The handling of result codes from ZSTD_freeCStream is different between the compressor and decompressor. Not sure a non-zero result means unflushed data, and I'd recommend updating the handling here to match what is done in the compressor JNI code.

          Shouldn't we throw if GetDirectBufferAddress returns null?

          The x ^ ((x ^ y) & -(x < y)) expression is clever, but it's probably too smart for many compilers to realize what this is and generate the optimal code for max() on that platform. I tested this on gcc for x64 and it generates smaller, better code if written as a naive implementation than with this expression. Therefore I recommend we just do the simple thing here (i.e.: ternary operator or plain ol' if statement) which will likely be faster in practice (because compilers can easily detect and optimize for it) and easier to understand.

          TestZStandardCompressorDecompressor.java:
          org.junit.Ignore import is unnecessary

          Nit: Should not do wildcard imports but only import what is necessary

          uncompressedFile and compressedFile lookups only need to be done once, so we can do this in the default constructor rather than before every test in the Before method (and those fields could be marked final).

          jlowe Jason Lowe added a comment -

          I forgot to mention it would also be good to address at least some of the checkstyle issues, such as the indentation levels and line-length.

          churromorales churro morales added a comment -

          Hi Jason Lowe, first of all thank you for taking the time to review this code. I have attached another patch and addressed your comments below.

          I'm torn on the buffer size, whether we should use the generic io.file.buffer.size or have a zst-specific config. If the codec can significantly speed up with a larger default buffer size than the file buffer size then maybe we should consider having a separate, zst-specific buffer size config. I'd recommend a zst-specific config key with a default value of 0 that means use whatever buffer size the zst library wants to use, but the user can override it to a custom size. That way users can change the zst codec buffer size (again, useful for keeping memory usage reasonable for very wide sort factor merges) but not change the buffer sizes of anything else. Similarly, someone changing the default buffer size to something relatively small (e.g.: 4K) for some unrelated use case could unknowingly hamper the performance of the zst codec.

          That's fine. I created a new configuration option for the buffer size (codec specific, with a default value of 0); if the value is 0 we use the library's recommended buffer size, otherwise we honor the user's override.

          Regarding the test input file, is it copyrighted or do we otherwise have the redistribution rights? Also the .zst binary test file is missing from the patch.

          I updated the file to one that I know is not copyrighted and added a license header. Also added the binary.

          ZStandardCodec.java
          Nit: should import the specific things needed rather than wildcard imports.

          Fixed

          ZStandardCompressor.java
          reinit calls reset then init, but reset already is calling init.

          removed the call to init from reinit

          LOG should be private rather than public.

          done

          The LOG.isDebugEnabled check is unnecessary.

          done

          Nit: The VisibleForTesting annotation should be on a separate line like the other annotations.

          done

          ZStandardCompressor.c:
          Shouldn't we throw if GetDirectBufferAddress returns null?

          added a check to throw when we can't resolve a memory address for the buffer and we get back null.

          From the zstd documentation, ZSTD_compressStream may not always fully consume the input. Therefore if the finish flag is set in deflateBytesDirect I think we need to avoid calling ZSTD_endStream if there are still bytes left in the input, or we risk dropping some of the input. We can just return without setting the finished flag and expect to get called again from the outer compression loop logic.

          Similarly the documentation for ZSTD_endStream says that it may not be able to flush the full contents of the data in one call, and if it indicates there is more data left to flush then we should call it again. So if we tried to end the stream and there's more data left then we shouldn't throw an error. Instead we simply leave the finished flag unset and return so the outer logic loop will empty the output buffer and call compress() again. We will then again call ZSTD_endStream to continue attempting to finish the compression. Bonus points for adding a test case that uses a 1-byte output buffer to demonstrate the extreme case is working properly.

          Sure thing. I think not throwing when there is data remaining to flush after endStream is called, and instead leaving the finished flag unset, will resolve both issues. I added a test called testCompressingWithOneByteOutputBuffer().

          Is it necessary to call ZSTD_flushStream after every ZSTD_compressStream call? Seems to me we can skip it entirely, but I might be missing something.

          To stay consistent with the API I unfortunately have to call flushStream. The "pos" fields of the zstandard library's input and output buffers are only updated when the stream is flushed, so that call is necessary with the Compressor / Decompressor framework.

          ZStandardDecompressor.java:
          LOG should be private rather than public.

          done

          Should there be a checkStream call at the beginning of decompress()? This would mirror the same check in the compressor for the corresponding compress() method.

          done

          In inflateDirect, the setting of preSliced = dst in the conditional block is redundant since it was just done before the block. Actually I think it would be faster and create less garbage to avoid creating a temporary ByteBuffer and have the JNI code look up the output buffer's position and limit and adjust accordingly. That way we aren't creating an object each time.

          Yes, that is not necessary; passing parameters in will fix that.

          The swapping of variables in inflateDirect is cumbersome. It would be much cleaner if inflateBytesDirect took arguments so we can pass what the JNI needs directly to it rather than shuffling the parameters into the object's fields before we call it. That also cleans up the JNI code a bit since it doesn't need to do many of the field lookups. For example:

          private native int inflateBytesDirect(ByteBuffer src, int srcOffset, int srcLen, ByteBuffer dst, int dstOffset, int dstLen);

          inflateBytesDirect is now parameterized.

          We could simplify the variables even more if the JNI code called the ByteBuffer methods to update the positions and limits directly rather than poking the values into the Java object and having Java do it when the JNI returns. I'll leave the details up to you, just pointing out that we can clean this up and avoid a lot of swapping. We could do a similar thing on the compressor side.

          We could, but then we have to do things like getting the object class and looking up method IDs, which just pushes the work and complexity into the C code. I fixed the variable swapping in the direct decompressor and also parameterized the inflate and deflate methods to make things cleaner.

          I'm a little confused about the endOfInput flag for the direct decompressor. I assume it's for handling the case where there are multiple frames within the input buffer. The JNI code will set finished = true once it hits the end of the frame, and it appears it will remain that way even if more input is sent (i.e.: finished will go from false to true after the first frame in the input buffer and remain true throughout all subsequent frames). It makes me wonder if we should be setting finished to false if we didn't see the end of a frame this iteration?

          Agreed, you are correct: we should be setting finished to false if we don't see the end of a frame for an iteration. That would allow multiple frames to be handled correctly.

          ZStandardDecompressor.c
          The handling of result codes from ZSTD_freeCStream is different between the compressor and decompressor. Not sure a non-zero result means unflushed data, and I'd recommend updating the handling here to match what is done in the compressor JNI code.

          Good catch, I will fix it to check for the error and throw.

          Shouldn't we throw if GetDirectBufferAddress returns null?

          I will throw instead of returning 0 if we can't get a memory address; I agree with your sentiments.

          The x ^ ((x ^ y) & -(x < y)) expression is clever, but it's probably too smart for many compilers to realize what this is and generate the optimal code for max() on that platform. I tested this on gcc for x64 and it generates smaller, better code if written as a naive implementation than with this expression. Therefore I recommend we just do the simple thing here (i.e.: ternary operator or plain ol' if statement) which will likely be faster in practice (because compilers can easily detect and optimize for it) and easier to understand.

          Sounds good, you make a good point. I'll change it.

          TestZStandardCompressorDecompressor.java:
          org.junit.Ignore import is unnecessary

          Nit: Should not do wildcard imports but only import what is necessary

          done.

          uncompressedFile and compressedFile lookups only need to be done once, so we can do this in the default constructor rather than before every test in the Before method (and those fields could be marked final).

          I will put them in a static @BeforeClass method.
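
          Roughly like this (the resource lookup shown here is illustrative):

            private static File uncompressedFile;
            private static File compressedFile;

            @BeforeClass
            public static void beforeClass() {
              // Resolve the test resources once for the whole class instead of before every test.
              uncompressedFile = new File(TestZStandardCompressorDecompressor.class
                  .getResource("/zstd/test_file.txt").getFile());
              compressedFile = new File(TestZStandardCompressorDecompressor.class
                  .getResource("/zstd/test_file.txt.zst").getFile());
            }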

          I forgot to mention it would also be good to address at least some of the checkstyle issues, such as the indentation levels and line-length.

          Fixed everything Java related. The C code wraps quite a bit if we keep to the standards and then becomes unreadable. Many of the JNI method names are themselves longer than 80 characters and cannot be wrapped. I looked at other C code in the compress package and none of it follows the line-length guidelines either; I assume that's because at a line length of 80 everything would become wrapped and many things simply can't be.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 18s Maven dependency ordering for branch
          +1 mvninstall 7m 14s trunk passed
          +1 compile 10m 10s trunk passed
          +1 checkstyle 1m 40s trunk passed
          +1 mvnsite 10m 13s trunk passed
          +1 mvneclipse 1m 35s trunk passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 8s trunk passed
          +1 javadoc 5m 10s trunk passed
          0 mvndep 0m 22s Maven dependency ordering for patch
          +1 mvninstall 9m 46s the patch passed
          +1 compile 10m 44s the patch passed
          +1 cc 10m 44s the patch passed
          +1 javac 10m 44s the patch passed
          -0 checkstyle 1m 50s root: The patch generated 64 new + 159 unchanged - 1 fixed = 223 total (was 160)
          +1 mvnsite 10m 31s the patch passed
          +1 mvneclipse 1m 17s the patch passed
          +1 shellcheck 0m 14s There were no new shellcheck issues.
          +1 shelldocs 0m 10s There were no new shelldocs issues.
          -1 whitespace 0m 0s The patch has 5 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 6s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 58s the patch passed
          +1 javadoc 5m 47s the patch passed
          -1 unit 15m 10s root in the patch failed.
          -1 asflicense 0m 51s The patch generated 2 ASF License warnings.
          121m 38s



          Reason Tests
          Failed junit tests hadoop.io.compress.zstd.TestZStandardCompressorDecompressor



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12842015/HADOOP-13578.v5.patch
          Optional Tests asflicense shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux 35142cf729fb 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / df983b5
          Default Java 1.8.0_111
          shellcheck v0.4.5
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11205/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11205/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11205/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11205/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/11205/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11205/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          churromorales churro morales added a comment -

          Sorry, the tests failed because git diff won't include binary files by default. Since we are testing that CLI compression works with Hadoop, this file is necessary for the tests.

          Same patch as v5 just adds the binary files.

          churromorales churro morales added a comment -

          I'll also take care of the Java checkstyle issues and upload a new patch, provided you are good with this one.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 19m 21s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 1m 40s Maven dependency ordering for branch
          +1 mvninstall 7m 9s trunk passed
          +1 compile 9m 34s trunk passed
          +1 checkstyle 1m 38s trunk passed
          +1 mvnsite 9m 47s trunk passed
          +1 mvneclipse 1m 18s trunk passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 1m 52s trunk passed
          +1 javadoc 4m 25s trunk passed
          0 mvndep 0m 45s Maven dependency ordering for patch
          +1 mvninstall 8m 22s the patch passed
          +1 compile 9m 21s the patch passed
          +1 cc 9m 21s the patch passed
          +1 javac 9m 21s the patch passed
          -0 checkstyle 1m 43s root: The patch generated 64 new + 159 unchanged - 1 fixed = 223 total (was 160)
          +1 mvnsite 10m 4s the patch passed
          +1 mvneclipse 1m 16s the patch passed
          +1 shellcheck 0m 11s There were no new shellcheck issues.
          +1 shelldocs 0m 9s There were no new shelldocs issues.
          -1 whitespace 0m 0s The patch has 5 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 5s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 36s the patch passed
          +1 javadoc 5m 20s the patch passed
          -1 unit 114m 31s root in the patch failed.
          -1 asflicense 0m 51s The patch generated 2 ASF License warnings.
          234m 38s



          Reason Tests
          Failed junit tests hadoop.yarn.server.timeline.webapp.TestTimelineWebServices



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12842043/HADOOP-13578.v6.patch
          Optional Tests asflicense shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux ddee4ef0f97e 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / a7288da
          Default Java 1.8.0_111
          shellcheck v0.4.5
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11207/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11207/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11207/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11207/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/11207/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11207/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch! Sorry for the delay in re-review, as I was travelling last week. The ASF license warnings appear to be unrelated, as does the unit test failure. Comments on the patch:

          IO_COMPRESSION_CODEC_ZSTD_BUFFER_SIZE should be IO_COMPRESSION_CODEC_ZSTD_BUFFER_SIZE_DEFAULT

          The ZStandardCompressor(level, bufferSize) constructor should be implemented in terms of ZStandardCompressor(level, inputBufferSize, outputBufferSize) to reduce the code redundancy and improve maintainability.
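
          For illustration, a minimal sketch of that delegation (the class name and fields below are placeholders, not the actual ZStandardCompressor code): the two-argument constructor simply forwards to the three-argument one, so the setup logic lives in a single place.

              public class CompressorSketch {
                private final int level;
                private final int inputBufferSize;
                private final int outputBufferSize;

                // Two-argument form delegates to the three-argument form.
                public CompressorSketch(int level, int bufferSize) {
                  this(level, bufferSize, bufferSize);
                }

                public CompressorSketch(int level, int inputBufferSize, int outputBufferSize) {
                  this.level = level;
                  this.inputBufferSize = inputBufferSize;
                  this.outputBufferSize = outputBufferSize;
                }
              }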

          If inflateBytesDirect took a dest position then we wouldn't need to slice the destination buffer, avoiding a temporary object allocation.
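
          As a tiny standalone illustration of that point (plain ByteBuffer behaviour, not the codec code itself), an absolute write at a destination index avoids the temporary ByteBuffer that slice() allocates on every call:

              import java.nio.ByteBuffer;

              public class SliceVsOffset {
                public static void main(String[] args) {
                  ByteBuffer dst = ByteBuffer.allocateDirect(16);
                  dst.position(4);

                  // Slicing allocates a new ByteBuffer object on every call...
                  ByteBuffer tmp = dst.slice();
                  tmp.put((byte) 1);

                  // ...whereas an absolute put at a destination index does not.
                  dst.put(4, (byte) 2);

                  System.out.println(dst.get(4)); // prints 2
                }
              }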

          I'm not sure about this snippet in ZStandardDirectDecompressor#decompress:

                endOfInput = !src.hasRemaining();
                super.finished = !src.hasRemaining();
          

          After adding the super.finished change it appears the endOfInput and finished variables will simply reflect each other, so endOfInput would be redundant. Instead I think endOfInput may still be necessary, and the finished variable handling will already be taken care of by the JNI code. In other words I think we don't need the super.finished line that was added. However maybe I'm missing something here.

          I think we need to return after these THROW macros or risk passing null pointers to the zstd library:

              void * uncompressed_bytes = (*env)->GetDirectBufferAddress(env, uncompressed_direct_buf);
              if (!uncompressed_bytes) {
                  THROW(env, "java/lang/InternalError", "Undefined memory address for uncompressedDirectBuf");
              }
          
              // Get the output direct buffer
              void * compressed_bytes = (*env)->GetDirectBufferAddress(env, compressed_direct_buf);
              if (!compressed_bytes) {
                  THROW(env, "java/lang/InternalError", "Undefined memory address for compressedDirectBuf");
              }
          

          Same comments for similar code on the decompressor JNI side.

          Related to this code:

                  if (remaining_to_flush) {
                      (*env)->SetBooleanField(env, this, ZStandardCompressor_finished, JNI_FALSE);
                  } else {
                      (*env)->SetBooleanField(env, this, ZStandardCompressor_finished, JNI_TRUE);
                  }
          

          What I meant by my previous review was to have the JNI layer clear the finished flag in any case we didn't set it, including the case where the finish flag isn't asking us to wrap things up. Actually thinking about this further, I'm not sure we need the case where we set it to false. We just need to set it to true in the JNI layer when the end flush indicates there aren't any more bytes to flush. In every other case finished should already be false, and the user will need to call reset() to clear both the finish and finished flags to continue with more input. So I think we can put this back to the way it was. Sorry for the confusion on my part.

          If there are still bytes left to consume in the uncompressed buffer then I don't think we want to call ZSTD_endStream. We should only be calling ZSTD_endStream when the uncompressed buffer has been fully consumed, correct? Otherwise we may fail to compress the final chunk of input data when the user sets the finish flag and the zstd library doesn't fully consume the input on the ZSTD_compressStream call. That could result in a "successful" compression that drops data.

          churromorales churro morales added a comment -

          Hi Jason Lowe,

          Thanks for taking the time to review. I agree with all of the above comments and will correct those issues. Your last question was about ZSTD_endStream(). ZSTD_endStream() finishes the frame and writes the epilogue only if the uncompressed buffer has been fully consumed; otherwise it basically does the same thing as ZSTD_compressStream().

          You are correct: if the output buffer is too small, it may not be able to flush. There is a check in ZSTD_endStream() which does this:

           
                  size_t const notEnded = ZSTD_compressStream_generic(zcs, ostart, &sizeWritten, &srcSize, &srcSize, zsf_end);  
                  size_t const remainingToFlush = zcs->outBuffContentSize - zcs->outBuffFlushedSize;
                  op += sizeWritten;
                  if (remainingToFlush) {
                      output->pos += sizeWritten;
                      return remainingToFlush + ZSTD_BLOCKHEADERSIZE /* final empty block */ + (zcs->checksum * 4);
                  }
                 // Create the epilogue and flush the epilogue
          
          

          So if there is still data to be consumed, the library won't finish the frame, which makes it safe to call repeatedly from our framework because we never set the finished flag until the epilogue has been written successfully.

          The code in CompressorStream.java which calls our codec simply does this:

           
            @Override
            public void finish() throws IOException {
              if (!compressor.finished()) {
                compressor.finish();
                while (!compressor.finished()) {
                  compress();
                }
              }
            }
          

          So I believe we won't drop any data with the way things are done. Please let me know if I am missing something obvious here.

          jlowe Jason Lowe added a comment -

          The thing I'm worried about is that when we call ZSTD_compressStream we are passing descriptors for both the input buffer and the output buffer. When we call ZSTD_endStream we are only passing the descriptor for the output buffer. Therefore I don't know how ZSTD_endStream is supposed to finish consuming any input that ZSTD_compressStream didn't get to if it doesn't have access to that input buffer descriptor.

          Looking at the zstd code you'll see that when it does call ZSTD_compressStream inside ZSTD_endStream, it's calling it with srcSize == 0. That means there is no more source to consume. So if the last call of the JNI code to ZSTD_compressStream did not fully consume the input buffer's data (i.e.: input pos is not moved to the end of the data) then it looks like calling ZSTD_endStream will simply flush out what input data did make it and then end the frame. That matches what the documentation for ZSTD_endStream says. So I still think we need to make sure we do not call ZSTD_endStream if input.pos is not at the end of the input buffer after we call ZSTD_compressStream, or we risk losing the last chunk of data if the zstd library for some reason cannot fully consume the input buffer when we try to finish.

          churromorales churro morales added a comment -

          Fair enough, that's an easy check to make. Instead of checking if (finish) in ZStandardCompressor.c we can just do if (finish && input.size == input.pos), which will ensure we have compressed everything in the input buffer before writing the final frame.

          Here is what is included in the updated patch:

          Fixed the zstd buffer constant naming.
          Fixed the constructor for ZStandardCompressor.
          Removed finished from the ZStandardDirectDecompressor.
          Reverted the finished logic in the ZStandardCompressor back to its original state.
          Return after we throw in any C code.
          Parameterized the inflateBytes method so that we don't have to slice the buffer anymore.
          Took care of the endStream case when data is still left in the input buffer.
          Also fixed the remaining checkstyle issues.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 16s Maven dependency ordering for branch
          +1 mvninstall 6m 57s trunk passed
          +1 compile 9m 45s trunk passed
          +1 checkstyle 1m 38s trunk passed
          +1 mvnsite 9m 46s trunk passed
          +1 mvneclipse 1m 10s trunk passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 1m 55s trunk passed
          +1 javadoc 4m 29s trunk passed
          0 mvndep 0m 19s Maven dependency ordering for patch
          +1 mvninstall 8m 24s the patch passed
          +1 compile 9m 22s the patch passed
          +1 cc 9m 22s the patch passed
          +1 javac 9m 22s the patch passed
          -0 checkstyle 1m 43s root: The patch generated 3 new + 159 unchanged - 1 fixed = 162 total (was 160)
          +1 mvnsite 10m 30s the patch passed
          +1 mvneclipse 1m 20s the patch passed
          +1 shellcheck 0m 12s There were no new shellcheck issues.
          +1 shelldocs 0m 8s There were no new shelldocs issues.
          -1 whitespace 0m 0s The patch has 5 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 5s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 18s the patch passed
          +1 javadoc 4m 38s the patch passed
          -1 unit 14m 10s root in the patch failed.
          -1 asflicense 0m 42s The patch generated 2 ASF License warnings.
          112m 47s



          Reason Tests
          Failed junit tests hadoop.ha.TestZKFailoverController
            hadoop.security.token.delegation.web.TestWebDelegationToken
            hadoop.io.compress.zstd.TestZStandardCompressorDecompressor
            hadoop.net.TestDNS



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12843264/HADOOP-13578.v7.patch
          Optional Tests asflicense shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux 437168d715f9 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 72bff19
          Default Java 1.8.0_111
          shellcheck v0.4.5
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11277/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11277/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11277/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11277/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/11277/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11277/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          churromorales churro morales added a comment -

          Forgot to do a git diff --binary, so it excluded the file used by the ZStandard tests. No other difference from v7.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch! It's really close now.

          We need to check for an error if ZSTD_endStream returns a failure, otherwise we could end up looping forever on a persistent error being returned from ZSTD_endStream. The outer compress logic will keep calling compress with finish=true trying to wrap things up and the JNI code will keep avoiding setting the finished flag because ZSTD_endStream isn't returning 0.

          There's no return after the THROW macros here, so we can fall through and call ZSTD_flushStream after ZSTD_compressStream returns an error. That means the error being reported from compress might be clobbered by the error being returned from flush if they are different. Also we can just move the error check on "size" up inside the flush block since it only applies to that case, or change "remaining_to_flush" to "size" and let the previous point be fixed by falling through to the common error check on "size".

              size_t size = dlsym_ZSTD_compressStream(stream, &output, &input);
              if (dlsym_ZSTD_isError(size)) {
                  THROW(env, "java/lang/InternalError", dlsym_ZSTD_getErrorName(size));
              }
              if (finish && input.pos == input.size) {
                  // end the stream, flush and  write the frame epilogue
                  size_t const remaining_to_flush = dlsym_ZSTD_endStream(stream, &output);
                  if (!remaining_to_flush) {
                      (*env)->SetBooleanField(env, this, ZStandardCompressor_finished, JNI_TRUE);
                  }
              }
              else {
                  // need to flush the output buffer
                  // this also updates the output buffer position.
                  size = dlsym_ZSTD_flushStream(stream, &output);
              }
              if (dlsym_ZSTD_isError(size)) {
                  THROW(env, "java/lang/InternalError", dlsym_ZSTD_getErrorName(size));
              }
          
          jlowe Jason Lowe added a comment -

          One more thing I just thought of – if we end up having to loop around and call endStream more than once, is it OK that we called ZSTD_compressStream between the two calls to ZSTD_endStream? The call will be degenerate with input size == 0, but I'm wondering if that's OK from the zstd lib API perspective.

          Actually I think we're OK here since ZSTD_endStream will do a similar degenerate ZSTD_compressStream call internally if it isn't in the final compression stage, and it looks like ZSTD_compressStream explicitly does nothing if the compressor is in the final stage. So I think we're good, but it'd be great if you could double-check.

          churromorales churro morales added a comment -

          Yes you are correct, ZSTD_compressStream_generic does nothing if the stage is final, so that part looks good to me.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 15s Maven dependency ordering for branch
          +1 mvninstall 6m 55s trunk passed
          +1 compile 9m 42s trunk passed
          +1 checkstyle 1m 39s trunk passed
          +1 mvnsite 9m 47s trunk passed
          +1 mvneclipse 1m 9s trunk passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 15s trunk passed
          +1 javadoc 5m 36s trunk passed
          0 mvndep 0m 22s Maven dependency ordering for patch
          +1 mvninstall 9m 42s the patch passed
          +1 compile 10m 11s the patch passed
          +1 cc 10m 11s the patch passed
          +1 javac 10m 11s the patch passed
          -0 checkstyle 1m 58s root: The patch generated 3 new + 159 unchanged - 1 fixed = 162 total (was 160)
          +1 mvnsite 10m 52s the patch passed
          +1 mvneclipse 1m 16s the patch passed
          +1 shellcheck 0m 12s There were no new shellcheck issues.
          +1 shelldocs 0m 9s There were no new shelldocs issues.
          -1 whitespace 0m 0s The patch has 5 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 5s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 37s the patch passed
          +1 javadoc 4m 55s the patch passed
          -1 unit 13m 58s root in the patch failed.
          -1 asflicense 0m 42s The patch generated 2 ASF License warnings.
          117m 16s



          Reason Tests
          Failed junit tests hadoop.security.token.delegation.web.TestWebDelegationToken



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12843284/HADOOP-13578.v8.patch
          Optional Tests asflicense shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux 939e19eb2e0e 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 6ba9587
          Default Java 1.8.0_111
          shellcheck v0.4.5
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11279/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11279/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11279/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11279/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/11279/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11279/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          churromorales churro morales added a comment - - edited

          Jason Lowe, regarding your previous comment about not returning after throwing, I can make the following change:

           
              size_t size = dlsym_ZSTD_compressStream(stream, &output, &input);
              if (dlsym_ZSTD_isError(size)) {
                  THROW(env, "java/lang/InternalError", dlsym_ZSTD_getErrorName(size));
                  return (jint) 0;
              }
              if (finish && input.pos == input.size) {
                  // end the stream, flush and  write the frame epilogue
                  size = dlsym_ZSTD_endStream(stream, &output);
                  if (!size) {
                      (*env)->SetBooleanField(env, this, ZStandardCompressor_finished, JNI_TRUE);
                  }
              }
              else {
                  // need to flush the output buffer
                  // this also updates the output buffer position.
                  size = dlsym_ZSTD_flushStream(stream, &output);
              }
              if (dlsym_ZSTD_isError(size)) {
                  THROW(env, "java/lang/InternalError", dlsym_ZSTD_getErrorName(size));
                  return (jint) 0;
              }
          

          If you are okay with this I can put up a new patch containing this fix.

          jlowe Jason Lowe added a comment -

          Yep, that looks good. If you could put the else on the same line as the previous closing bracket to match the coding style that'd be nice as well.

          churromorales churro morales added a comment - - edited

          No problem, I will do that and upload the patch. Thanks again for taking the time to review. I know we are interested in backports of this feature to the 2.x branch; would you be interested in a backport? Should I create a new ticket for that?

          jlowe Jason Lowe added a comment -

          We should use this JIRA for the branch-2 patch. Just attach the branch-2 version according to the patch naming guidelines and the precommit build will run against branch-2 instead of trunk, e.g.: HADOOP-13578-branch-2.v9.patch can be the branch-2 version of the HADOOP-13578.v9.patch. I'd attach the trunk patch first, and if the precommit looks good go ahead and attach the branch-2 version so it can run on that. If you attach both at the same time then precommit will only run on one of them (the first attached, IIRC).

          churromorales churro morales added a comment -

          Great, thank you Jason Lowe for your patience and for taking the time to do so many reviews. The 2.x branch patch will be mostly the same code-wise, except the code to build and package the libraries is quite different. I'll work on getting a patch ready today.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 16s Maven dependency ordering for branch
          +1 mvninstall 6m 54s trunk passed
          +1 compile 9m 37s trunk passed
          +1 checkstyle 1m 39s trunk passed
          +1 mvnsite 9m 42s trunk passed
          +1 mvneclipse 1m 9s trunk passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 1m 55s trunk passed
          +1 javadoc 4m 32s trunk passed
          0 mvndep 0m 20s Maven dependency ordering for patch
          +1 mvninstall 8m 25s the patch passed
          +1 compile 9m 23s the patch passed
          +1 cc 9m 23s the patch passed
          +1 javac 9m 23s the patch passed
          -0 checkstyle 1m 42s root: The patch generated 3 new + 159 unchanged - 1 fixed = 162 total (was 160)
          +1 mvnsite 9m 53s the patch passed
          +1 mvneclipse 1m 13s the patch passed
          +1 shellcheck 0m 11s There were no new shellcheck issues.
          +1 shelldocs 0m 8s There were no new shelldocs issues.
          -1 whitespace 0m 0s The patch has 5 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 5s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 21s the patch passed
          +1 javadoc 5m 0s the patch passed
          -1 unit 111m 34s root in the patch failed.
          -1 asflicense 0m 44s The patch generated 2 ASF License warnings.
          209m 40s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
            hadoop.hdfs.server.namenode.TestFileTruncate
            hadoop.yarn.server.timeline.webapp.TestTimelineWebServices



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12843308/HADOOP-13578.v9.patch
          Optional Tests asflicense shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux 4c72bf3a58c7 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / f5e0bd3
          Default Java 1.8.0_111
          shellcheck v0.4.5
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11280/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11280/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11280/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11280/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/11280/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11280/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Latest patch looks good. The test failures and ASF warnings are unrelated. I filed HDFS-11251 to track the TestDataNodeHotSwapVolumes failure and HDFS-11252 to track the TestFileTruncate failure. YARN-5934 is tracking the TestTimelineWebServices failure.

          Please attach an equivalent branch-2 patch, and if that looks good then I'll commit this.

          churromorales churro morales added a comment -

          branch-2 version of the patch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 8s HADOOP-13578 does not apply to branch-2. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12843619/HADOOP-13578-branch-2.v9.patch
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11289/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          churromorales churro morales added a comment -

          Sorry, I attached the wrong branch-2 patch. My fault.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 6s HADOOP-13578 does not apply to branch-2. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12843629/HADOOP-13578-branch-2.v9.patch
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11291/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          churromorales churro morales added a comment -

          Rebased the branch and here is the branch-2 patch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 13m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 1m 12s Maven dependency ordering for branch
          +1 mvninstall 6m 30s branch-2 passed
          +1 compile 6m 5s branch-2 passed with JDK v1.8.0_111
          +1 compile 6m 50s branch-2 passed with JDK v1.7.0_121
          +1 checkstyle 1m 28s branch-2 passed
          +1 mvnsite 9m 37s branch-2 passed
          +1 mvneclipse 0m 56s branch-2 passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 1m 39s branch-2 passed
          +1 javadoc 5m 10s branch-2 passed with JDK v1.8.0_111
          +1 javadoc 6m 7s branch-2 passed with JDK v1.7.0_121
          0 mvndep 0m 15s Maven dependency ordering for patch
          +1 mvninstall 7m 36s the patch passed
          +1 compile 5m 43s the patch passed with JDK v1.8.0_111
          -1 cc 5m 43s root-jdk1.8.0_111 with JDK v1.8.0_111 generated 1 new + 6 unchanged - 1 fixed = 7 total (was 7)
          +1 javac 5m 43s the patch passed
          +1 compile 6m 38s the patch passed with JDK v1.7.0_121
          -1 cc 6m 38s root-jdk1.7.0_121 with JDK v1.7.0_121 generated 1 new + 16 unchanged - 1 fixed = 17 total (was 17)
          +1 javac 6m 38s the patch passed
          -0 checkstyle 1m 34s root: The patch generated 4 new + 161 unchanged - 1 fixed = 165 total (was 162)
          +1 mvnsite 10m 1s the patch passed
          +1 mvneclipse 0m 53s the patch passed
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 1s The patch has no ill-formed XML file.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 2m 6s the patch passed
          +1 javadoc 5m 13s the patch passed with JDK v1.8.0_111
          +1 javadoc 6m 22s the patch passed with JDK v1.7.0_121
          -1 unit 12m 11s root in the patch failed with JDK v1.7.0_121.
          -1 asflicense 0m 27s The patch generated 2 ASF License warnings.
          153m 8s



          Reason Tests
          JDK v1.8.0_111 Failed junit tests hadoop.io.compress.zstd.TestZStandardCompressorDecompressor
          JDK v1.7.0_121 Failed junit tests hadoop.io.compress.zstd.TestZStandardCompressorDecompressor



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:b59b8b7
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12843633/HADOOP-13578-branch-2.v9.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux 963cae40ad5c 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2 / 0cb99db
          Default Java 1.7.0_121
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_111 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_121
          findbugs v3.0.0
          cc https://builds.apache.org/job/PreCommit-HADOOP-Build/11293/artifact/patchprocess/diff-compile-cc-root-jdk1.8.0_111.txt
          cc https://builds.apache.org/job/PreCommit-HADOOP-Build/11293/artifact/patchprocess/diff-compile-cc-root-jdk1.7.0_121.txt
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11293/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11293/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11293/artifact/patchprocess/patch-unit-root-jdk1.7.0_121.txt
          JDK v1.7.0_121 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11293/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/11293/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11293/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          churromorales churro morales added a comment -

          I'm sorry, I forgot to include the binary test file. Long day, my fault.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 0m 52s Maven dependency ordering for branch
          +1 mvninstall 6m 31s branch-2 passed
          +1 compile 5m 30s branch-2 passed with JDK v1.8.0_111
          +1 compile 6m 28s branch-2 passed with JDK v1.7.0_121
          +1 checkstyle 1m 28s branch-2 passed
          +1 mvnsite 9m 2s branch-2 passed
          +1 mvneclipse 0m 48s branch-2 passed
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 1m 37s branch-2 passed
          +1 javadoc 4m 26s branch-2 passed with JDK v1.8.0_111
          +1 javadoc 6m 26s branch-2 passed with JDK v1.7.0_121
          0 mvndep 0m 18s Maven dependency ordering for patch
          +1 mvninstall 7m 31s the patch passed
          +1 compile 5m 30s the patch passed with JDK v1.8.0_111
          +1 cc 5m 30s the patch passed
          +1 javac 5m 30s the patch passed
          +1 compile 6m 31s the patch passed with JDK v1.7.0_121
          +1 cc 6m 31s the patch passed
          +1 javac 6m 31s the patch passed
          -0 checkstyle 1m 29s root: The patch generated 4 new + 161 unchanged - 1 fixed = 165 total (was 162)
          +1 mvnsite 9m 10s the patch passed
          +1 mvneclipse 0m 52s the patch passed
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 xml 0m 1s The patch has no ill-formed XML file.
          0 findbugs 0m 1s Skipped patched modules with no Java source: hadoop-project hadoop-project-dist .
          +1 findbugs 1m 55s the patch passed
          +1 javadoc 5m 5s the patch passed with JDK v1.8.0_111
          +1 javadoc 6m 54s the patch passed with JDK v1.7.0_121
          -1 unit 134m 47s root in the patch failed with JDK v1.7.0_121.
          -1 asflicense 0m 30s The patch generated 2 ASF License warnings.
          338m 11s



          Reason Tests
          JDK v1.8.0_111 Failed junit tests hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits
            hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRegression
          JDK v1.7.0_121 Failed junit tests hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRegression
            hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
            hadoop.yarn.server.resourcemanager.TestRMRestart



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:b59b8b7
          JIRA Issue HADOOP-13578
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12843664/HADOOP-13578-branch-2.v9.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit xml cc findbugs checkstyle
          uname Linux 85e20a9f7369 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2 / 0cb99db
          Default Java 1.7.0_121
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_111 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11295/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/11295/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11295/artifact/patchprocess/patch-unit-root-jdk1.7.0_121.txt
          JDK v1.7.0_121 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11295/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/11295/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-project-dist hadoop-common-project/hadoop-common . U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11295/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Sorry for the long delay, as I just got back from vacation.

          +1 for the v9 trunk and branch-2 patches. The branch-2 test failures and ASF warnings are unrelated.

          I'll commit this tomorrow if there are no objections.

          churromorales churro morales added a comment -

          Jason Lowe, thank you for taking the time to review this patch.

          jlowe Jason Lowe added a comment -

          Thanks to churro morales for the contribution! I committed this to trunk and branch-2.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11072 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11072/)
          HADOOP-13578. Add Codec for ZStandard Compression. Contributed by churro (jlowe: rev a0a276162147e843a5a4e028abdca5b66f5118da)

          • (add) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/zstd/TestZStandardCompressorDecompressor.java
          • (edit) hadoop-project/pom.xml
          • (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/NativeLibraryChecker.java
          • (add) hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c
          • (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/Decompressor.java
          • (add) hadoop-common-project/hadoop-common/src/test/resources/zstd/test_file.txt.zst
          • (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
          • (add) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/ZStandardCodec.java
          • (edit) hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/NativeCodeLoader.c
          • (add) hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/org_apache_hadoop_io_compress_zstd.h
          • (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/TestCompressionStreamReuse.java
          • (edit) hadoop-common-project/hadoop-common/src/main/resources/META-INF/services/org.apache.hadoop.io.compress.CompressionCodec
          • (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/pom.xml
          • (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/TestCodec.java
          • (edit) hadoop-common-project/hadoop-common/src/CMakeLists.txt
          • (add) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java
          • (add) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/package-info.java
          • (edit) dev-support/bin/dist-copynativelibs
          • (add) hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c
          • (edit) hadoop-common-project/hadoop-common/pom.xml
          • (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/NativeCodeLoader.java
          • (add) hadoop-common-project/hadoop-common/src/test/resources/zstd/test_file.txt
          • (edit) hadoop-project-dist/pom.xml
          • (add) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.java
          • (edit) hadoop-common-project/hadoop-common/src/config.h.cmake
          • (edit) BUILDING.txt
          • (edit) hadoop-common-project/hadoop-common/src/site/markdown/NativeLibraries.md.vm
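          For anyone picking this up from the file list above: the commit adds org.apache.hadoop.io.compress.ZStandardCodec and registers it in the CompressionCodec service file, so it resolves like any other Hadoop codec. The snippet below is a minimal usage sketch, not part of the patch; it assumes the native zstd library was available when libhadoop was built and that the codec is discoverable (via the service file or an explicit io.compression.codecs entry).

          import java.io.OutputStream;
          import java.nio.charset.StandardCharsets;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.compress.CompressionCodec;
          import org.apache.hadoop.io.compress.CompressionCodecFactory;

          public class ZStandardCodecExample {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Resolve the codec added by this patch by its class name; it can also
              // be picked up by file extension through getCodec(new Path("file.zst")).
              CompressionCodec codec = new CompressionCodecFactory(conf)
                  .getCodecByClassName("org.apache.hadoop.io.compress.ZStandardCodec");

              // Write a small compressed file using the codec's default extension.
              FileSystem fs = FileSystem.get(conf);
              Path out = new Path("/tmp/zstd-example" + codec.getDefaultExtension());
              try (OutputStream stream = codec.createOutputStream(fs.create(out))) {
                stream.write("hello zstd".getBytes(StandardCharsets.UTF_8));
              }
            }
          }

          The same codec class name can be plugged into the usual MapReduce compression settings (for example mapreduce.map.output.compress.codec) once native zstd support is present on the cluster nodes.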
          redsky_luan James_Luan added a comment -

          Hi there. Is there any schedule for backporting zstd to 2.7? I'd be glad to test zstd on version 2.7.

          churromorales churro morales added a comment -

          Hi James_Luan - I believe I can put up a patch for 2.7; it should be pretty easy.

          Let's create another JIRA ticket for this and move it out.

          adamkennedy Adam Kennedy added a comment -

          Is this impacted by LEGAL-303?

          andrew.wang Andrew Wang added a comment -

          I believe it's fine; see this mailing list thread:

          https://lists.apache.org/thread.html/55e4ebf60ef0d109c9f74eac92f58a01b2efc551ae321e5444e05772@%3Ccommon-dev.hadoop.apache.org%3E

          From Jason:

          I think we are OK to leave support for the zstd codec in the Hadoop code base. I asked Chris Mattman for clarification, noting that the support for the zstd codec requires the user to install the zstd headers and libraries and then configure it to be included in the native Hadoop build. The Hadoop releases are not shipping any zstd code (e.g.: headers or libraries) nor does it require zstd as a mandatory dependency. Here's what he said:

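          The practical consequence for the codebase is that zstd stays an optional, build-time native dependency: libhadoop only gains zstd support when the headers and library are present at compile time, and availability can be checked at runtime (the same information hadoop checknative reports). Below is a rough sketch of such a check; NativeCodeLoader.isNativeCodeLoaded() is existing API, while the buildSupportsZstd() call is assumed here by analogy with the existing buildSupportsSnappy().

          import org.apache.hadoop.util.NativeCodeLoader;

          public class ZstdAvailabilityCheck {
            public static void main(String[] args) {
              // True only when the native libhadoop library was found and loaded.
              boolean nativeLoaded = NativeCodeLoader.isNativeCodeLoaded();
              // Assumed method (mirrors buildSupportsSnappy()): reports whether
              // libhadoop was compiled against the zstd headers and library.
              boolean zstdSupported = nativeLoaded && NativeCodeLoader.buildSupportsZstd();
              System.out.println("hadoop native loaded: " + nativeLoaded);
              System.out.println("zstd supported: " + zstdSupported);
            }
          }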

            People

            • Assignee:
              churromorales churro morales
            • Reporter:
              churromorales churro morales
            • Votes:
              0
            • Watchers:
              19
