Hadoop Common / HADOOP-15407

Support Windows Azure Storage - Blob file system in Hadoop


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: fs/azure
    • Labels:
      None
    • Target Version/s:
    • Release Note:
      The abfs connector in the hadoop-azure module supports Microsoft Azure Datalake (Gen 2), which at the time of writing (September 2018) was in preview, soon to go GA. As with all cloud connectors, corner-cases will inevitably surface. If you encounter one, please file a bug report.

      Description
      This JIRA adds a new file system implementation, ABFS, for running Big Data and Analytics workloads against Azure Storage. This is a complete rewrite of the previous WASB driver with a heavy focus on optimizing both performance and cost.
       
      High level design
      At a high level, the code here extends the FileSystem class to provide an implementation for accessing blobs in Azure Storage. The scheme abfs is used for accessing it over HTTP, and abfss for accessing over HTTPS. The following URI scheme is used to address individual paths:
       
      abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>
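
      As an illustration of the URI layout above, here is a minimal sketch that builds and takes apart such a URI using only java.net.URI. The class AbfsUriSketch and its method names are hypothetical and are not part of the hadoop-azure module; the sketch assumes the filesystem name sits in the user-info part of the authority and the account name is the host prefix, as the template suggests.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not part of hadoop-azure.
final class AbfsUriSketch {
    // Host suffix taken from the URI template in the description.
    static final String HOST_SUFFIX = ".dfs.core.windows.net";

    /** Builds abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>. */
    static URI build(String filesystem, String account, String path, boolean secure) {
        String scheme = secure ? "abfss" : "abfs";
        // A URI with an authority needs an absolute path.
        String absolutePath = path.startsWith("/") ? path : "/" + path;
        try {
            // The multi-argument constructor quotes illegal characters per component.
            return new URI(scheme, filesystem, account + HOST_SUFFIX, -1, absolutePath, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build ABFS URI", e);
        }
    }

    /** Maps the scheme to the transport named in the description. */
    static String transport(URI uri) {
        String scheme = uri.getScheme();
        if ("abfss".equals(scheme)) {
            return "https";
        }
        if ("abfs".equals(scheme)) {
            return "http";
        }
        throw new IllegalArgumentException("Not an ABFS URI: " + uri);
    }

    /** Returns the <filesystem> part, carried as the URI's user-info. */
    static String filesystem(URI uri) {
        return uri.getUserInfo();
    }

    /** Returns the <account> part, i.e. the host without the fixed suffix. */
    static String account(URI uri) {
        String host = uri.getHost();
        if (host == null || !host.endsWith(HOST_SUFFIX)) {
            throw new IllegalArgumentException("Unexpected host in: " + uri);
        }
        return host.substring(0, host.length() - HOST_SUFFIX.length());
    }

    public static void main(String[] args) {
        // "myfs" and "myaccount" are placeholder names.
        URI uri = build("myfs", "myaccount", "data/part-0000", true);
        System.out.println(uri);
        System.out.println(filesystem(uri) + " on " + account(uri) + " via " + transport(uri));
    }
}
```

      In an actual Hadoop job the resulting URI would be handed to the FileSystem API rather than parsed by hand; the sketch only shows how the components of the address fit together.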
       
      ABFS is intended as a replacement for WASB. WASB is not deprecated, but it is in pure maintenance mode, and customers should upgrade to ABFS once it reaches General Availability later in CY18.
      Benefits of ABFS include:
      ·         Higher scale (capacity, throughput, and IOPS) for Big Data and Analytics workloads, through higher limits on storage accounts
      ·         No ramp-up time for Storage backend partitioning; blocks are now automatically sharded across partitions in the Storage backend
                ·         This avoids the need for temporary/intermediate files, which increase cost (and framework complexity around committing jobs/tasks)
      ·         Enabling much higher read and write throughput on single files (tens of Gbps by default)
      ·         Still retaining all of the Azure Blob features customers are familiar with and expect, and gaining the benefits of future Blob features as well
      ABFS incorporates Hadoop Filesystem metrics to monitor the file system throughput and operations. Ambari metrics are not currently implemented for ABFS, but will be available soon.
       
      Credits and history
      Credit for this work goes to (hope I don't forget anyone): Shane Mainali, Thomas Marquardt, Zichen Sun, Georgi Chalakov, Esfandiar Manii, Amit Singh, Dana Kaban, Da Zhou, Junhua Gu, Saher Ahwal, Saurabh Pant, and James Baker. 
       
      Test
      ABFS has gone through many test procedures, including Hadoop file system contract tests, unit testing, functional testing, and manual testing. All the JUnit tests provided with the driver can run either sequentially or in parallel, which reduces testing time.
      Besides unit tests, we have used ABFS as the default file system in Azure HDInsight. Azure HDInsight will very soon offer ABFS as a storage option. (HDFS is also used, but not as the default file system.) Various customer and test workloads have been run against clusters with such configurations for quite some time. Benchmarks such as Tera*, TPC-DS, Spark Streaming, Spark SQL, and others have been run for scenario, performance, and functional testing. Third parties and customers have also tested ABFS.
      The current version reflects the version of the code tested and used in our production environment.

        Attachments

        1. HADOOP-15407-001.patch
          1.20 MB
          Esfandiar Manii
        2. HADOOP-15407-002.patch
          1.20 MB
          Esfandiar Manii
        3. HADOOP-15407-003.patch
          1.20 MB
          Esfandiar Manii
        4. HADOOP-15407-004.patch
          571 kB
          Esfandiar Manii
        5. HADOOP-15407-008.patch
          523 kB
          Steve Loughran
        6. HADOOP-15407-HADOOP-15407.006.patch
          545 kB
          Thomas Marqardt
        7. HADOOP-15407-HADOOP-15407.007.patch
          547 kB
          Thomas Marqardt
        8. HADOOP-15407-HADOOP-15407.008.patch
          522 kB
          Da Zhou
        9. HADOOP-15407-HADOOP-15407-008.patch
          523 kB
          Steve Loughran

        Issue Links

        1. AzureBlobFS - Base package classes and configuration files (Sub-task, Resolved, Esfandiar Manii)
        2. AzureBlobFS - Contracts (Sub-task, Resolved, Esfandiar Manii)
        3. AzureBlobFS - Constants (Sub-task, Resolved, Esfandiar Manii)
        4. AzureBlobFS - Diagnostics and Utils (Sub-task, Resolved, Esfandiar Manii)
        5. AzureBlobFS - Services (Sub-task, Resolved, Esfandiar Manii)
        6. AzureBlobFS - Tests (Sub-task, Resolved, Esfandiar Manii)
        7. ABFS: Commit of core codebase (Sub-task, Resolved, Da Zhou)
        8. ABFS initialize() throws string out of bounds exception of the URI isn't fully qualified (Sub-task, Resolved, Steve Loughran)
        9. ABFS: tune imports & javadocs; stabilise tests (Sub-task, Resolved, Thomas Marqardt)
        10. ABFS: removed dependency injection and unnecessary dependencies (Sub-task, Resolved, Da Zhou)
        11. ABFS: TestAbfsConfigurationFieldsValidation breaks if FS is configured in core-site (Sub-task, Resolved, Steve Loughran)
        12. ABFS: Code changes for bug fix and new tests (Sub-task, Resolved, Da Zhou)
        13. ABFS: Add support for OAuth (Sub-task, Closed, Da Zhou)
        14. ABFS: Add support for ACL (Sub-task, Closed, Da Zhou)
        15. ABFS: Simplify configuration (Sub-task, Resolved, Da Zhou)
        16. ABFS: Reduce test run time via parallelization and grouping (Sub-task, Resolved, Da Zhou)
        17. ABFS: Compatibility tests can fail (Sub-task, Resolved, Unassigned)
        18. ABFS: Improve HTTPS Performance (Sub-task, Resolved, Vishwajeet Dusane)
        19. ABFS: Add support for StreamCapabilities. Fix javadoc and checkstyle (Sub-task, Resolved, Thomas Marqardt)
        20. ABFS: InputStream wrapped in FSDataInputStream twice (Sub-task, Closed, Sean Mackrory)
        21. ABFS: extensible support for custom oauth (Sub-task, Resolved, Da Zhou)
        22. ABFS: Allow OAuth credentials to not be tied to accounts (Sub-task, Resolved, Sean Mackrory)
        23. ABFS - Implement client-side throttling (Sub-task, Resolved, Thomas Marqardt)
        24. Mark ABFS extension package and interfaces as LimitedPrivate/Unstable (Sub-task, Resolved, Steve Loughran)
        25. ABFS: Failure in OpenSSLProvider should fall back to JSSE (Sub-task, Resolved, Vishwajeet Dusane)
        26. tune abfs/wasb parallel the sequential test execution (Sub-task, Resolved, Da Zhou)
        27. ITestAzureBlobFileSystemE2E timing out with non-scale timeout of 10 min (Sub-task, Resolved, Da Zhou)
        28. Fail-fast when using OAuth over http (Sub-task, Resolved, Da Zhou)
        29. ABFS: Ranger Support (Sub-task, Resolved, Yuan Gao)
        30. ABFS: Add backward compatibility to handle Unsupported Operation for storage account with no namespace feature (Sub-task, Resolved, Da Zhou)
        31. ABFS: remove unused maven dependencies and add used undeclared dependencies (Sub-task, Resolved, Da Zhou)
        32. ABFS: Check variable names during initialization of AbfsClientThrottlingIntercept (Sub-task, Resolved, Sneha Varma)
        33. Add ABFS configuration to ConfigRedactor (Sub-task, Resolved, Sean Mackrory)
        34. ABFS: support path "abfs://mycluster/file/path" (Sub-task, Resolved, Da Zhou)
        35. ABFS: remove dependency on common-codec Base64 (Sub-task, Resolved, Da Zhou)
        36. AbstractContractAppendTest fails against HDFS on HADOOP-15407 branch (Sub-task, Resolved, Steve Loughran)
        37. ABFS: distcp tests are always skipped (Sub-task, Resolved, Steve Loughran)
        38. Merge HADOOP-15407 to trunk (Sub-task, Resolved, Sean Mackrory)
        39. ABFS: Fix issues raised by Yetus (Sub-task, Resolved, Sean Mackrory)
        40. ABFS: Fix client side throttling for read (Sub-task, Resolved, Sneha Varma)
        41. ABFS: Skip unsupported test cases when non namespace enabled in ITestAzureBlobFileSystemAuthorization (Sub-task, Resolved, Yuan Gao)
        42. ABFS: Fixing skipUserGroupMetadata in AzureBlobFileSystemStore (Sub-task, Resolved, Da Zhou)
        43. ABFS: better exception handling when making getAccessToken call (Sub-task, Resolved, Da Zhou)
        44. Backporting ABFS driver from trunk to branch 2.0 (Sub-task, Resolved, Yuan Gao)

            People

            • Assignee: Da Zhou (DanielZhou)
            • Reporter: Esfandiar Manii (esmanii)
