  1. Hadoop Common
  2. HADOOP-15407

Support Windows Azure Storage - Blob file system in Hadoop

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: fs/azure
    • Labels:
      None
    • Target Version/s:
    • Release Note:
      The abfs connector in the hadoop-azure module supports Microsoft Azure Datalake (Gen 2), which at the time of writing (September 2018) was in preview and expected to reach GA soon. As with all cloud connectors, corner cases will inevitably surface. If you encounter one, please file a bug report.

      Description
      This JIRA adds a new file system implementation, ABFS, for running Big Data and Analytics workloads against Azure Storage. This is a complete rewrite of the previous WASB driver with a heavy focus on optimizing both performance and cost.
       
      High level design
      At a high level, the code here extends the FileSystem class to provide an implementation for accessing blobs in Azure Storage. The abfs scheme is used for access over HTTP, and abfss for access over HTTPS. Individual paths are addressed with the following URI format:
       
      abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>
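To make the URI template concrete, here is a minimal sketch that splits an ABFS URI into the parts named above using only java.net.URI. The class AbfsUriSketch, its Parts holder, and the sample account and file system names are hypothetical and are not part of the hadoop-azure module. The sketch only handles URIs that match the template exactly, and the driver's own parsing may differ.

```java
import java.net.URI;

/** Illustrative helper (NOT part of hadoop-azure) that splits an ABFS URI into its parts. */
final class AbfsUriSketch {

    /** Plain holder for the components named in the URI template. */
    static final class Parts {
        final boolean secure;     // true for abfss (HTTPS), false for abfs (HTTP)
        final String fileSystem;  // <filesystem>
        final String account;     // <account>
        final String path;        // /<path>

        Parts(boolean secure, String fileSystem, String account, String path) {
            this.secure = secure;
            this.fileSystem = fileSystem;
            this.account = account;
            this.path = path;
        }
    }

    // Endpoint suffix taken from the URI template in the description.
    private static final String HOST_SUFFIX = ".dfs.core.windows.net";

    /** Parses abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>. */
    static Parts parse(String text) {
        URI uri = URI.create(text); // throws IllegalArgumentException on bad syntax
        String scheme = uri.getScheme();
        if (!"abfs".equals(scheme) && !"abfss".equals(scheme)) {
            throw new IllegalArgumentException("Not an ABFS URI: " + text);
        }
        // java.net.URI exposes the part before '@' as the user-info component.
        String fileSystem = uri.getUserInfo();
        String host = uri.getHost();
        if (fileSystem == null || host == null || !host.endsWith(HOST_SUFFIX)) {
            throw new IllegalArgumentException("URI does not match the template: " + text);
        }
        String account = host.substring(0, host.length() - HOST_SUFFIX.length());
        if (account.isEmpty()) {
            throw new IllegalArgumentException("Missing account name: " + text);
        }
        return new Parts("abfss".equals(scheme), fileSystem, account, uri.getPath());
    }

    public static void main(String[] args) {
        // Sample names below are made up for illustration.
        Parts p = parse("abfss://myfs@myaccount.dfs.core.windows.net/data/part-0000.csv");
        System.out.println(p.secure + " " + p.fileSystem + " " + p.account + " " + p.path);
    }
}
```

Running main with the sample URI prints `true myfs myaccount /data/part-0000.csv`. A URI with any other scheme, or one without the `<filesystem>@` part, is rejected with an IllegalArgumentException.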
       
      ABFS is intended as a replacement for WASB. WASB is not deprecated, but it is in pure maintenance mode, and customers should upgrade to ABFS once it reaches General Availability later in CY18.
      Benefits of ABFS include:
      ·         Supporting higher-scale (capacity, throughput, and IOPS) Big Data and Analytics workloads by allowing higher limits on storage accounts
      ·         Removing any ramp-up time with Storage backend partitioning; blocks are now automatically sharded across partitions in the Storage backend
                .         This avoids the need for temporary/intermediate files, which increase cost (and framework complexity around committing jobs/tasks)
      ·         Enabling much higher read and write throughput on single files (tens of Gbps by default)
      ·         Still retaining all of the Azure Blob features customers are familiar with and expect, and gaining the benefits of future Blob features as well
      ABFS incorporates Hadoop Filesystem metrics to monitor the file system throughput and operations. Ambari metrics are not currently implemented for ABFS, but will be available soon.
       
      Credits and history
      Credit for this work goes to (hope I don't forget anyone): Shane Mainali, Thomas Marquardt, Zichen Sun, Georgi Chalakov, Esfandiar Manii, Amit Singh, Dana Kaban, Da Zhou, Junhua Gu, Saher Ahwal, Saurabh Pant, and James Baker. 
       
      Test
      ABFS has gone through many test procedures, including Hadoop file system contract tests, unit testing, functional testing, and manual testing. All the JUnit tests provided with the driver can run either sequentially or in parallel; running them in parallel reduces testing time.
      Besides unit tests, we have used ABFS as the default file system in Azure HDInsight, which will very soon offer ABFS as a storage option. (HDFS is also used, but not as the default file system.) Various customer and test workloads have been run against clusters with such configurations for quite some time. Benchmarks such as Tera*, TPC-DS, Spark Streaming, Spark SQL, and others have been run for scenario, performance, and functional testing. Third parties and customers have also tested ABFS.
      The current version reflects the version of the code tested and used in our production environment.

        Attachments

        1. HADOOP-15407-001.patch (1.20 MB, Esfandiar Manii)
        2. HADOOP-15407-002.patch (1.20 MB, Esfandiar Manii)
        3. HADOOP-15407-003.patch (1.20 MB, Esfandiar Manii)
        4. HADOOP-15407-004.patch (571 kB, Esfandiar Manii)
        5. HADOOP-15407-008.patch (523 kB, Steve Loughran)
        6. HADOOP-15407-HADOOP-15407.006.patch (545 kB, Thomas Marquardt)
        7. HADOOP-15407-HADOOP-15407.007.patch (547 kB, Thomas Marquardt)
        8. HADOOP-15407-HADOOP-15407.008.patch (522 kB, Da Zhou)
        9. HADOOP-15407-HADOOP-15407-008.patch (523 kB, Steve Loughran)

          Issue Links

          1. AzureBlobFS - Base package classes and configuration files (Sub-task, Resolved, Esfandiar Manii)
          2. AzureBlobFS - Contracts (Sub-task, Resolved, Esfandiar Manii)
          3. AzureBlobFS - Constants (Sub-task, Resolved, Esfandiar Manii)
          4. AzureBlobFS - Diagnostics and Utils (Sub-task, Resolved, Esfandiar Manii)
          5. AzureBlobFS - Services (Sub-task, Resolved, Esfandiar Manii)
          6. AzureBlobFS - Tests (Sub-task, Resolved, Esfandiar Manii)
          7. ABFS: Commit of core codebase (Sub-task, Resolved, Da Zhou)
          8. ABFS initialize() throws string out of bounds exception of the URI isn't fully qualified (Sub-task, Resolved, Steve Loughran)
          9. ABFS: tune imports & javadocs; stabilise tests (Sub-task, Resolved, Thomas Marquardt)
          10. ABFS: removed dependency injection and unnecessary dependencies (Sub-task, Resolved, Da Zhou)
          11. ABFS: TestAbfsConfigurationFieldsValidation breaks if FS is configured in core-site (Sub-task, Resolved, Steve Loughran)
          12. ABFS: Code changes for bug fix and new tests (Sub-task, Resolved, Da Zhou)
          13. ABFS: Add support for OAuth (Sub-task, Closed, Da Zhou)
          14. ABFS: Add support for ACL (Sub-task, Closed, Da Zhou)
          15. ABFS: Simplify configuration (Sub-task, Resolved, Da Zhou)
          16. ABFS: Reduce test run time via parallelization and grouping (Sub-task, Resolved, Da Zhou)
          17. ABFS: Compatibility tests can fail (Sub-task, Resolved, Unassigned)
          18. ABFS: Improve HTTPS Performance (Sub-task, Resolved, Vishwajeet Dusane)
          19. ABFS: Add support for StreamCapabilities. Fix javadoc and checkstyle (Sub-task, Resolved, Thomas Marquardt)
          20. ABFS: InputStream wrapped in FSDataInputStream twice (Sub-task, Closed, Sean Mackrory)
          21. ABFS: extensible support for custom oauth (Sub-task, Resolved, Da Zhou)
          22. ABFS: Allow OAuth credentials to not be tied to accounts (Sub-task, Resolved, Sean Mackrory)
          23. ABFS - Implement client-side throttling (Sub-task, Resolved, Thomas Marquardt)
          24. Mark ABFS extension package and interfaces as LimitedPrivate/Unstable (Sub-task, Resolved, Steve Loughran)
          25. ABFS: Failure in OpenSSLProvider should fall back to JSSE (Sub-task, Resolved, Vishwajeet Dusane)
          26. tune abfs/wasb parallel the sequential test execution (Sub-task, Resolved, Da Zhou)
          27. ITestAzureBlobFileSystemE2E timing out with non-scale timeout of 10 min (Sub-task, Resolved, Da Zhou)
          28. Fail-fast when using OAuth over http (Sub-task, Resolved, Da Zhou)
          29. ABFS: Ranger Support (Sub-task, Resolved, Yuan Gao)
          30. ABFS: Add backward compatibility to handle Unsupported Operation for storage account with no namespace feature (Sub-task, Resolved, Da Zhou)
          31. ABFS: remove unused maven dependencies and add used undeclared dependencies (Sub-task, Resolved, Da Zhou)
          32. ABFS: Check variable names during initialization of AbfsClientThrottlingIntercept (Sub-task, Resolved, Sneha Varma)
          33. Add ABFS configuration to ConfigRedactor (Sub-task, Resolved, Sean Mackrory)
          34. ABFS: support path "abfs://mycluster/file/path" (Sub-task, Resolved, Da Zhou)
          35. ABFS: remove dependency on common-codec Base64 (Sub-task, Resolved, Da Zhou)
          36. AbstractContractAppendTest fails against HDFS on HADOOP-15407 branch (Sub-task, Resolved, Steve Loughran)
          37. ABFS: distcp tests are always skipped (Sub-task, Resolved, Steve Loughran)
          38. Merge HADOOP-15407 to trunk (Sub-task, Resolved, Sean Mackrory)
          39. ABFS: Fix issues raised by Yetus (Sub-task, Resolved, Sean Mackrory)
          40. ABFS: Fix client side throttling for read (Sub-task, Resolved, Sneha Varma)

              People

              • Assignee: Da Zhou (DanielZhou)
              • Reporter: Esfandiar Manii (esmanii)