Hadoop Common / HADOOP-14535

wasb: implement high-performance random access and seek of block blobs

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-beta1
    • Component/s: fs/azure
    • Labels:
      None
    • Target Version/s:
    • Release Note:
      Random access and seek improvements for the wasb:// (Azure) file system.
    • Flags:
      Patch

      Description

      This change adds a seek-able stream for reading block blobs to the wasb:// file system.

      If seek() is not used or if only forward seek() is used, the behavior of read() is unchanged.
      That is, the stream is optimized for sequential reads by reading chunks (over the network) in
      the size specified by "fs.azure.read.request.size" (default is 4 megabytes).

      If reverse seek() is used, the behavior of read() changes in favor of reading the actual number
      of bytes requested in the call to read(), with some constraints. If the size requested is smaller
      than 16 kilobytes and cannot be satisfied by the internal buffer, the network read will be 16
      kilobytes. If the size requested is greater than 4 megabytes, it will be satisfied by sequential
      4 megabyte reads over the network.
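
      A minimal sketch of that read-size selection logic, using hypothetical constant and method names
      (illustrative only, not the patch's actual identifiers):

        // Illustrative only: names are assumptions, not the patch's identifiers.
        private static final int MIN_NETWORK_READ_SIZE = 16 * 1024;       // 16 KB floor after a reverse seek
        private int maxNetworkReadSize = 4 * 1024 * 1024;                 // "fs.azure.read.request.size", default 4 MB

        private int chooseNetworkReadSize(int bytesRequested, int bytesBuffered) {
          if (bytesRequested <= bytesBuffered) {
            return 0;                                           // satisfied from the internal buffer, no network read
          }
          if (bytesRequested < MIN_NETWORK_READ_SIZE) {
            return MIN_NETWORK_READ_SIZE;                       // small requests are rounded up to 16 KB
          }
          return Math.min(bytesRequested, maxNetworkReadSize);  // large requests proceed in 4 MB chunks
        }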

      This change improves the performance of FSInputStream.seek() by not closing and re-opening the
      stream, which for block blobs also involves a network operation to read the blob metadata. Now
      NativeAzureFsInputStream.seek() checks if the stream is seek-able and moves the read position.
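
      For reference, exercising the stream through the standard FileSystem API might look like the fragment
      below; the wasb:// URI is a placeholder and the configuration value simply restates the default:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        Configuration conf = new Configuration();
        conf.set("fs.azure.read.request.size", "4194304");     // sequential read chunk size in bytes (4 MB default)

        // Placeholder URI; substitute a real container, account, and blob.
        Path path = new Path("wasb://container@account.blob.core.windows.net/data/sample.bin");
        FileSystem fs = path.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(path)) {
          byte[] buf = new byte[4096];
          in.seek(1024 * 1024);     // forward seek: read() behavior is unchanged
          in.readFully(buf);        // sequential read at the new position
          in.seek(100);             // reverse seek: the stream is repositioned without close/re-open
          in.readFully(buf);        // subsequent reads fetch roughly the requested size
        }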


        Issue Links

          Activity

          tmarquardt Thomas Marquardt added a comment -

          Attaching patch 0001-Random-access-and-seek-imporvements-to-azure-file-system.patch.

          liuml07 Mingliang Liu added a comment -

          Thomas Marquardt, I have added you as a Hadoop contributor in the Apache issue system and assigned this JIRA to you.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 13m 49s trunk passed
          +1 compile 0m 19s trunk passed
          +1 checkstyle 0m 15s trunk passed
          +1 mvnsite 0m 21s trunk passed
          +1 findbugs 0m 30s trunk passed
          +1 javadoc 0m 15s trunk passed
          +1 mvninstall 0m 18s the patch passed
          +1 compile 0m 18s the patch passed
          +1 javac 0m 18s the patch passed
          -0 checkstyle 0m 11s hadoop-tools/hadoop-azure: The patch generated 11 new + 107 unchanged - 7 fixed = 118 total (was 114)
          +1 mvnsite 0m 19s the patch passed
          -1 whitespace 0m 0s The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          -1 findbugs 0m 36s hadoop-tools/hadoop-azure generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0)
          +1 javadoc 0m 12s the patch passed
          +1 unit 1m 21s hadoop-azure in the patch passed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          20m 28s



          Reason Tests
          FindBugs module:hadoop-tools/hadoop-azure
            Dead store to numberOfBytesRead in org.apache.hadoop.fs.azure.BlockBlobInputStream.read() At BlockBlobInputStream.java:[line 212]
            Inconsistent synchronization of org.apache.hadoop.fs.azure.BlockBlobInputStream$MemoryOutputStream.writePosition; locked 83% of time. Unsynchronized access at BlockBlobInputStream.java:[line 268]
            Should org.apache.hadoop.fs.azure.BlockBlobInputStream$MemoryOutputStream be a static inner class? At BlockBlobInputStream.java:[lines 251-299]



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HADOOP-14535
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12873347/0001-Random-access-and-seek-imporvements-to-azure-file-system.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux b71679b993f2 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 82bbcbf
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/12551/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-azure.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/12551/artifact/patchprocess/whitespace-eol.txt
          findbugs https://builds.apache.org/job/PreCommit-HADOOP-Build/12551/artifact/patchprocess/new-findbugs-hadoop-tools_hadoop-azure.html
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/12551/testReport/
          modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/12551/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          tmarquardt Thomas Marquardt added a comment -

          Adding updated patch with fixes for whitespace and findbugs.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 19s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 15m 55s trunk passed
          +1 compile 0m 18s trunk passed
          +1 checkstyle 0m 15s trunk passed
          +1 mvnsite 0m 21s trunk passed
          +1 findbugs 0m 28s trunk passed
          +1 javadoc 0m 14s trunk passed
          +1 mvninstall 0m 17s the patch passed
          +1 compile 0m 16s the patch passed
          +1 javac 0m 16s the patch passed
          -0 checkstyle 0m 12s hadoop-tools/hadoop-azure: The patch generated 11 new + 107 unchanged - 7 fixed = 118 total (was 114)
          +1 mvnsite 0m 18s the patch passed
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 findbugs 0m 32s the patch passed
          +1 javadoc 0m 11s the patch passed
          +1 unit 1m 19s hadoop-azure in the patch passed.
          +1 asflicense 0m 19s The patch does not generate ASF License warnings.
          22m 30s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HADOOP-14535
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12873363/0003-Random-access-and-seek-imporvements-to-azure-file-system.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 18f4aaf639cf 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 6460df2
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/12553/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-azure.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/12553/artifact/patchprocess/whitespace-eol.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/12553/testReport/
          modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/12553/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          tmarquardt Thomas Marquardt added a comment -

          Adding updated patch with fixes for whitespace.

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 13m 9s trunk passed
          +1 compile 0m 19s trunk passed
          +1 checkstyle 0m 14s trunk passed
          +1 mvnsite 0m 21s trunk passed
          +1 findbugs 0m 27s trunk passed
          +1 javadoc 0m 14s trunk passed
          +1 mvninstall 0m 17s the patch passed
          +1 compile 0m 16s the patch passed
          +1 javac 0m 16s the patch passed
          -0 checkstyle 0m 12s hadoop-tools/hadoop-azure: The patch generated 11 new + 107 unchanged - 7 fixed = 118 total (was 114)
          +1 mvnsite 0m 18s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 0m 32s the patch passed
          +1 javadoc 0m 11s the patch passed
          +1 unit 1m 19s hadoop-azure in the patch passed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          19m 31s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HADOOP-14535
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12873365/0004-Random-access-and-seek-imporvements-to-azure-file-system.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux ea4b5e638758 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 6460df2
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/12554/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-azure.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/12554/testReport/
          modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/12554/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          stevel@apache.org Steve Loughran added a comment -

          Seek optimisation is the key way to speed up data input, especially for columnar storage formats.

          The lessons from our work on S3AInputStream (which I recommend you look at) were:

          • ignore no-ops
          • lazy-seek, where you don't seek() until a read() call, really boosts performance when the IO is a series of readFully(pos, ....) calls (see the sketch after this list).
          • any forward seek you can do by discarding data is cost effective for tens to hundreds of KB (> 512KB long haul). It looks like you do this.
          • optimisations you may do for random IO can be pathologically bad for full document reads (e.g. scanning an entire .csv.gz file). You have to measure performance on these formats as well as random IO. We rely on Amazon serving some 20MB public CSV.gz files in S3, which we can abuse for this.
          • it's really useful to instrument your streams to count how many bytes you discard in closing streams, skip in seeks, how many forward & backward seeks, etc. In S3A we track this in the stream (and its toString() prints), then we pull it back to the FS instrumentation (which was actually copied out of the Azure code originally). This helps us diagnose our tests and make some assertions. Have a look at ITestS3AInputStreamPerformance for this.
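
          A minimal sketch of the lazy-seek idea from the list above, with illustrative names (this is
          neither the S3A nor the WASB implementation):

            // Illustrative only: seek() records the target; repositioning happens on the next read().
            private long nextReadPos;   // position the caller asked for
            private long streamPos;     // position of the underlying blob stream

            public synchronized void seek(long pos) throws IOException {
              if (pos < 0) {
                throw new EOFException("Cannot seek to negative position " + pos);
              }
              nextReadPos = pos;        // no network activity here; no-op seeks are free
            }

            private void lazySeek() throws IOException {
              if (nextReadPos != streamPos) {
                repositionTo(nextReadPos);   // hypothetical helper: skip forward or reopen the stream
                streamPos = nextReadPos;
              }
            }

            public synchronized int read() throws IOException {
              lazySeek();
              int b = wrappedStream.read();  // 'wrappedStream' is the assumed underlying stream field
              if (b >= 0) {
                streamPos++;
                nextReadPos++;
              }
              return b;
            }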

          At a quick glance at this patch, I can see you are starting on this. I'd recommend starting off with that instrumentation and setting up a test which anyone can use to measure performance by working with a public (free, read-only) file. Why? It saves setup time, eliminates cost, and is very useful later on. With the measurements, we can then look at what's best to target in improving seek.

          How about you create/own an uber JIRA on this topic, say "speed up Azure input", and add the tasks underneath?

          tmarquardt Thomas Marquardt added a comment -

          Thanks for the feedback. I looked at ITestS3AInputStreamPerformance and will do something similar. I do not have an Azure account with which I can share a file publicly, but I can write a test to generate the source data for the test. I am currently working on a few other things, so I won't be able to jump on this immediately. Would you like to hold off on this change until the instrumentation and unit tests are complete, or would end-to-end test results be sufficient motivation to move forward on this task while I continue to work on the other tasks?

          By the way, this work was done to address https://issues.apache.org/jira/browse/HADOOP-14478, which has a dependency on a change in the Azure Storage SDK for Java. The ask was for the SDK to use InputStream.mark(readLimit) as a hint to disregard the default network read size and use readLimit instead. Since this is not the intended use of mark, rather than pursue unusual dependencies between these two projects I provided the implementation in the patch as a solution.

          stevel@apache.org Steve Loughran added a comment -

          in HADOOP-14553 I've added a new test which can create a large file and then do seeks around it. I'd like that test to be used as the regression test here.

          stevel@apache.org Steve Loughran added a comment -

          I've cut the HADOOP-14553 dependency as getting that test running is harder than expected

          stevel@apache.org Steve Loughran added a comment -

          Having delved into the Azure codebase, I think a test could be fitted into TestReadAndSeekPageBlobAfterWrite, hopefully just by re-using the file generated. Is that the same kind of blob you want to work with?

          I don't see any uses of readFully() in that test, BTW. Rather than seek/read sequences, a sequence of readFully() operations is more representative of column-store access. Doing something there to mimic seek-near-end and then some near the start would match that and line up for any other optimisations of readFully.

          FWIW, here's a trace of some TPC-DS benchmark IO:

          https://raw.githubusercontent.com/rajeshbalamohan/hadoop-aws-wrapper/master/stream_access_query_27_tpcds_200gb.log

          a line like

          .../000098_0,readFully,17113131,0,0,17111727,342,44181435
          

          means "in file 000098_0, 17113131, readFully(offset=17111727, bytes=342), duration = 44,181,435 ns".

          That's the seek pattern that this optimisation is clearly targeting, the regression we need to avoid is "byte 0 to EOF", which is what .gz processing involves.

          I'll set up some of my downstream tests in https://github.com/hortonworks-spark/cloud-integration to do this in Spark, going from .gz to ORC & Parquet and then scanning; as this uses the actual libraries, it's a full integration test of the seek() code.

          tmarquardt Thomas Marquardt added a comment -

          Thanks for the feedback. I will do the following and resubmit the patch in a few days:

          1) Add logic so a reverse seek followed by a sequential read performs very well. I will add internal buffering and remove the dependency on BufferedFSInputStream so the read buffer can grow back to "fs.azure.read.request.size" (default 4 MB) after a reverse seek (sketched after this list).

          2) Add tests to ensure functional and performance coverage of seek, read, and skip.

          3) If required for testing, I will add metrics/instrumentation.
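
          A rough sketch of the buffering behaviour described in 1); the names and the doubling growth
          policy are assumptions made purely for illustration, not the actual patch code:

            // Illustrative only: after a reverse seek the read buffer restarts small and grows
            // back towards "fs.azure.read.request.size" as sequential reading resumes.
            private static final int MIN_BUFFER_SIZE = 16 * 1024;          // 16 KB
            private final int maxBufferSize = 4 * 1024 * 1024;             // fs.azure.read.request.size, default 4 MB
            private int currentBufferSize = maxBufferSize;

            private void onReverseSeek() {
              currentBufferSize = MIN_BUFFER_SIZE;                          // keep the first read after the seek cheap
            }

            private int nextBufferSize() {
              int size = currentBufferSize;
              currentBufferSize = Math.min(currentBufferSize * 2, maxBufferSize);  // grow back to the configured size
              return size;
            }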

          tmarquardt Thomas Marquardt added a comment - - edited

          I am attaching the updated patch (0005-Random-access-and-seek-improvements-to-azure-file-system.patch). Random access is as much as 90% faster for block blobs without any regressions. There are unit tests demonstrating the performance improvement for random access (see TestBlockBlobInputStream.java) and unit tests demonstrating that there are no performance regressions in sequential reads after reverse seeks.

          However, please note that unit tests and various developer machines are not an appropriate environment for measuring performance. The performance tests in TestBlockBlobInputStream.java merely demonstrate the behavior and prevent regressions. There are many things which can impact performance measurements over short periods of time, such as but not limited to fluctuations in network traffic and routing, fluctuations in activity of other processes running on the client, fluctuations in load on the shared stamp that hosts your Azure Storage account, and throttling sometimes performed by enterprise IT departments. The performance tests included with this change are written to execute quickly and work around these fluctuations, and prevent regressions in the code. In the process of implementing and running these unit tests, I also validated the performance improvements by running variations of the code for longer periods and the results looked favorable.

          My team plans to review and improve the instrumentation (Hadoop Metrics) for the wasb:// file system. Although this change does not include new metrics, we will be looking into this in the future.

          ALL tests in "hadoop-tools/hadoop-azure" are passing with the patch (0005-Random-access-and-seek-improvements-to-azure-file-system.patch).

          tmarquardt Thomas Marquardt added a comment -

          Canceling and then will submit again to trigger Jenkins.

          tmarquardt Thomas Marquardt added a comment -

          Submitting again to trigger Jenkins for 0005-Random-access-and-seek-imporvements-to-azure-file-system.patch.

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 15s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
          0 mvndep 1m 26s Maven dependency ordering for branch
          +1 mvninstall 15m 34s trunk passed
          +1 compile 17m 21s trunk passed
          +1 checkstyle 1m 56s trunk passed
          +1 mvnsite 1m 38s trunk passed
          +1 findbugs 2m 17s trunk passed
          +1 javadoc 1m 6s trunk passed
          0 mvndep 0m 15s Maven dependency ordering for patch
          +1 mvninstall 1m 0s the patch passed
          +1 compile 10m 56s the patch passed
          +1 javac 10m 56s the patch passed
          -0 checkstyle 1m 54s root: The patch generated 28 new + 135 unchanged - 7 fixed = 163 total (was 142)
          +1 mvnsite 1m 34s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 5m 37s the patch passed
          +1 javadoc 1m 11s the patch passed
          +1 unit 7m 50s hadoop-common in the patch passed.
          +1 unit 1m 38s hadoop-azure in the patch passed.
          +1 asflicense 0m 32s The patch does not generate ASF License warnings.
          95m 27s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HADOOP-14535
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12875336/0005-Random-access-and-seek-imporvements-to-azure-file-system.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux edf1f3040ee2 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / fa1aaee
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/12696/artifact/patchprocess/diff-checkstyle-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/12696/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-azure U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/12696/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          stevel@apache.org Steve Loughran added a comment -

          This is getting pretty close to going in; most of my feedback is test related. One more iteration should be enough. Once this and HADOOP-14598 are in, I can do some downstream testing with real data.

          AzureNativeFileSystemStore

          • How about naming the key of KEY_INPUT_STREAM_VERSION "fs.azure.experimental.stream.version"? That'd be consistent with the "fs.s3a.experimental" term.
          • log the choice of algorithm at debug, to aid diagnostics
          • retrieve() L2066: the PageBlobInputStream constructor already wraps StorageException with IOE, so retrieve() doesn't need to catch and translate them; it should catch and then rethrow IOEs as is.

          BlockBlobInputStream

          • a seek to the current position can be downgraded to a no-op; no need to close & reopen the stream
          • you don't need to use this. when referencing fields. We expect our IDEs to colour-code fields these days.
          • can you put the else and catch statements on the same line as the previous clause's closing "}"
          • read(byte[] buffer, ..): use FSInputStream.validatePositionedReadArgs for validation, or at least as much of it as is relevant. FWIW, the order of checks matches that in InputStream.
          • closeBlobInputStream: should blobInputStream = null be done in a finally clause so that it is guaranteed to be set (so making the call non-reentrant)? Roughly as sketched below.
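
          For that last point, the shape being suggested is roughly the following (illustrative only, not the
          patch code):

            private void closeBlobInputStream() throws IOException {
              if (blobInputStream != null) {
                try {
                  blobInputStream.close();
                } finally {
                  blobInputStream = null;   // cleared even if close() throws, so a second call is a no-op
                }
              }
            }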

          NativeAzureFileSystem

          • L625: accidental typo in comment

          ContractTestUtils.java

          revert move of elapsedTime() to a single line method, use multiline style for the new elapsedTimeMs().

          TestBlockBlobInputStream

          1. I like the idea of using the ratio as a way of comparing performance; it makes it independent of bandwidth.
          2. And I agree, you can't reliably assess real-world perf. But it would seem faster.
          3. Once HADOOP-14553 is in, this test would be uprated to a scale test; only executed with the -Dscale option,
            and configurable for larger sizes of data. No need to worry about it. I think the tests could perhaps even be moved into the ITestAzureHugeFiles test, which forces a strict ordering of tests in junit, so can have one test to upload a file, one to delete, and some in between to play with reading and seeking.

          for now

          • TestBlockBlobInputStream to extend AbstractWasbTestBase. This will aid migration to the parallel test runner of HADOOP-14553
          • TestBlockBlobInputStream teardown only closes one of the input streams.
          • toMbps(): would it be better or worse to do the *8 before the / 1000.0? Or, given these are floating point, moot?
          • split testMarkSupported() into a separate test for each separate stream; assertion in validateMarkSupported to include some text.
          • same for testSkipAndAvailableAndPosition
          • testSequentialReadPerformance: are we confident that the v2ElapsedMs read time will always be >0? Otherwise that division will fail.
          • testRandomRead and testSequentialRead to always close the output stream. Or save a reference to the stream into a field and have the @After teardown close it (quietly).
          • validateMarkAndReset, validateSkipBounds to use GenericTestUtils.assertExceptionContains to validate caught exception, or
            LambdaTestUtils.intercept to structure expected failure. Have a look at other uses in the code for details. +Same in other tests.
              try {
                seekCheck(in, dataLen + 3);
                Assert.fail("Seek after EOF should fail.");
              } catch (IOException e) {
                GenericTestUtils.assertExceptionContains("Cannot seek after EOF", e);
              }
          

          LambdaTestUtils may seem a bit more convoluted

              
              intercept(IOException.class, expected,
                  new Callable<S3AEncryptionMethods>() {
                    @Override
                    public S3AEncryptionMethods call() throws Exception {
                      return getAlgorithm(alg, key);
                    }
                  });
          

          But it really comes out to play in Java 8:

              
              intercept(IOException.class, expected,
                  () -> getAlgorithm(alg, key));
          

          That's why I'd recommend adopting it now.

          Other

          AzureBlobStorageTestAccount

          • L96; think some tabs have snuck in.
          • I have a problem in that every test run leaks wasb containers. Does this patch continue or even worsen the tradition?
          stevel@apache.org Steve Loughran added a comment -

          Patch 006. This is patch 005 with all the changes I suggested, particularly the tests.

          The original test suite has a couple of operational flaws:

          1. it's slow
          2. it leaves 128MB files around. This can be expensive.

          I've reworked it to use the same style as AbstractSTestS3AHugeFiles: using ordered names to guarantee the test cases run in sequence, with the final test deleting the file. I've also downsized the file.
          This is lined up for HADOOP-14553, which ports a copy of the same test into Azure and runs tests in parallel. The tests in this method should be something which can be merged into that test, making it a scale test for a configurable size of dataset.

          Tested: new suite, yes. Remainder: in progress

          -------------------------------------------------------
          Running org.apache.hadoop.fs.azure.TestBlockBlobInputStream
          Tests run: 19, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 212.423 sec - in org.apache.hadoop.fs.azure.TestBlockBlobInputStream
          
          Results :
          
          Tests run: 19, Failures: 0, Errors: 0, Skipped: 0
          
          [INFO] ------------------------------------------------------------------------
          [INFO] BUILD SUCCESS
          [INFO] ------------------------------------------------------------------------
          [INFO] Total time: 03:37 min (Wall Clock)
          [INFO] Finished at: 2017-07-10T21:46:59+01:00
          [INFO] Final Memory: 46M/820M
          [INFO] ------------------------------------------------------------------------
          
          stevel@apache.org Steve Loughran added a comment -

          Full test run completed.

          Results :
          
          Tests run: 694, Failures: 0, Errors: 0, Skipped: 121
          

          I will note that test run performance is awful, even worse with this, and one run timed out. After this patch is in, I'm not going to approve anything else until HADOOP-14553 has the test process under control. Sorry.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 12s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 5 new or modified test files.
                trunk Compile Tests
          0 mvndep 0m 14s Maven dependency ordering for branch
          +1 mvninstall 13m 11s trunk passed
          +1 compile 13m 30s trunk passed
          +1 checkstyle 1m 52s trunk passed
          +1 mvnsite 1m 54s trunk passed
          +1 findbugs 1m 59s trunk passed
          +1 javadoc 1m 14s trunk passed
                Patch Compile Tests
          0 mvndep 0m 17s Maven dependency ordering for patch
          +1 mvninstall 1m 0s the patch passed
          +1 compile 10m 13s the patch passed
          +1 javac 10m 13s the patch passed
          -0 checkstyle 1m 58s root: The patch generated 30 new + 135 unchanged - 7 fixed = 165 total (was 142)
          +1 mvnsite 1m 58s the patch passed
          -1 whitespace 0m 0s The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 findbugs 2m 21s the patch passed
          +1 javadoc 1m 21s the patch passed
                Other Tests
          -1 unit 8m 3s hadoop-common in the patch failed.
          +1 unit 1m 36s hadoop-azure in the patch passed.
          +1 asflicense 0m 39s The patch does not generate ASF License warnings.
          85m 20s



          Reason Tests
          Failed junit tests hadoop.security.TestKDiag



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HADOOP-14535
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12876496/HADOOP-14535-006.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 4e83f5515aba 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 11:05:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 34f113d
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/12761/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/12761/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/12761/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/12761/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-azure U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/12761/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          tmarquardt Thomas Marquardt added a comment -

          Thanks for moving this forward, Steve! I've provided my comments in response to yours below. Please let me know if I need to do anything, as it looks like you have already made the changes you requested.

          1. I agree that we need to improve how Jenkins runs the azure tests. Let's clarify the requirements in HADOOP-14553 and assign it to either me or Georgi, unless you were planning to take it on. On a side note, it takes me ~12 minutes to run all 717 hadoop-azure tests. My development environment (a Linux virtual machine) and my storage account are both in the West US region, so I am fortunate to have them in the same data center. You mention that it takes a long time to run the tests, and I suspect this is due to the network path between your development environment and the storage account. Are you using an Azure storage account that is located in a region near you?

          2. BlockBlobInputStream.seek is only called for a reverse seek, due to the implementation of NativeAzureFsInputStream.seek. Since BlockBlobInputStream.seek is never called for a forward or no-op seek, and there is no good way to exercise such a code path in the unit tests, I don't think BlockBlobInputStream.seek should be implemented to handle these cases (a rough sketch of this delegation follows the list below). In any case, it does no harm if you have already made the change.

          3. TestBlockBlobInputStream intentionally leaves the 128 MB file in place to speed up the next test run; because the 128 MB file is created only once, subsequent runs are considerably faster (see the second sketch below). Earlier, you asked for a permanent shared file for testing, but I don't have a way to do that. Creating the file once and re-using it has similar benefits.
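
          For readers skimming the discussion, here is a minimal sketch of the seek delegation described in point 2 above. It is illustrative only, not the code from the attached patch: the class and helper names are made up, and only the split between forward/no-op seeks (handled by skipping on the open stream) and reverse seeks (delegated to the wrapped stream's own seek) reflects the behavior being discussed.

          import java.io.EOFException;
          import java.io.IOException;
          import java.io.InputStream;

          // Sketch only: a wrapper that resolves no-op and forward seeks itself and
          // only falls back to the wrapped stream's own seek for a reverse seek.
          class SeekableWrapperSketch {
            private final InputStream in;  // stands in for the underlying blob stream
            private long pos;              // current read position tracked by the wrapper

            SeekableWrapperSketch(InputStream in) {
              this.in = in;
            }

            // Hypothetical hook standing in for the wrapped stream's own seek; in the
            // real stream this repositions without closing and re-opening the blob.
            protected void seekUnderlyingStream(long target) throws IOException {
              throw new UnsupportedOperationException("sketch only");
            }

            public void seek(long target) throws IOException {
              if (target < 0) {
                throw new EOFException("Cannot seek to a negative offset: " + target);
              }
              if (target == pos) {
                return;                        // no-op seek: nothing to do
              }
              if (target > pos) {
                long toSkip = target - pos;    // forward seek: skip on the open stream
                while (toSkip > 0) {
                  long skipped = in.skip(toSkip);
                  if (skipped <= 0) {
                    throw new EOFException("Attempted to seek past the end of the stream");
                  }
                  toSkip -= skipped;
                }
              } else {
                seekUnderlyingStream(target);  // reverse seek: delegate downwards
              }
              pos = target;
            }
          }

          A second sketch shows the create-once test file mentioned in point 3. Again this is only an illustration: the helper name is hypothetical, and the real test writes the large file in smaller blocks rather than from a single in-memory array.

          import java.io.IOException;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.fs.contract.ContractTestUtils;

          // Sketch only: create the large test file if it is missing, otherwise reuse
          // it, so repeated runs against the same container skip the expensive upload.
          class TestFileReuseSketch {
            static Path ensureTestFile(FileSystem fs, Path path, int sizeBytes)
                throws IOException {
              if (!fs.exists(path)) {
                byte[] data = ContractTestUtils.dataset(sizeBytes, 0, 256);
                ContractTestUtils.createFile(fs, path, true, data);
              }
              return path;
            }
          }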

          Hide
          stevel@apache.org Steve Loughran added a comment -

          I think, with the tests reworked and Jenkins happy, it's good to go in

          +1

          1. HADOOP-14553 has a patch; it'll need updating on top of this one. Like you note, I'm further away, and RTT hurts as much as bandwidth. Parallel execution of all the low-bandwidth tests will speed things up, and the scale tests (like this one) will be configurable for size, so you can run full multi-GB tests when you want to (see the sketch after this list).
          2. That big file upload is something to measure and test too, which the new test will do.
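
          As a concrete illustration of the "configurable for size" idea in point 1 above, a scale test might read its target file size from the configuration, along these lines. This is only a sketch: the property name fs.azure.scale.test.size.mb is an assumption for illustration, not an option defined by this patch or by HADOOP-14553.

          import org.apache.hadoop.conf.Configuration;

          // Sketch only: let a scale test pick up its target size from configuration,
          // so a full multi-GB run can be requested without a code change.
          class ScaleTestSizeSketch {
            static long scaleTestSizeBytes(Configuration conf) {
              // default to 128 MB, the file size already used by TestBlockBlobInputStream
              int sizeMB = conf.getInt("fs.azure.scale.test.size.mb", 128);
              return sizeMB * 1024L * 1024L;
            }
          }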
          Hide
          stevel@apache.org Steve Loughran added a comment -

          Committed to branch-2 & trunk. Thanks!

          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11989 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11989/)
          HADOOP-14535 wasb: implement high-performance random access and seek of (stevel: rev d670c3a4da7dd80dccf6c6308603bb3bb013b3b0)

          • (edit) hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystemHelper.java
          • (edit) hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/TestAzureConcurrentOutOfBandIo.java
          • (add) hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/BlockBlobInputStream.java
          • (edit) hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/SecureStorageInterfaceImpl.java
          • (edit) hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/MockStorageInterface.java
          • (edit) hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/AzureBlobStorageTestAccount.java
          • (edit) hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java
          • (edit) hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/StorageInterfaceImpl.java
          • (edit) hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/AzureNativeFileSystemStore.java
          • (edit) hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/StorageInterface.java
          • (edit) hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeFileSystemStore.java
          • (add) hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/TestBlockBlobInputStream.java
          • (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/contract/ContractTestUtils.java
          Hide
          tmarquardt Thomas Marquardt added a comment -

          HADOOP-14535 and HADOOP-14680 depend on each other.


            People

            • Assignee: tmarquardt Thomas Marquardt
            • Reporter: tmarquardt Thomas Marquardt
            • Votes: 0
            • Watchers: 9
