[HADOOP-15407] Support Windows Azure Storage - Blob file system in Hadoop - ASF JIRA


Details

Type: New Feature
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 3.2.0
Fix Version/s: 3.2.0
Component/s: fs/azure
Labels: None
Target Version/s: 3.3.0
Release Note:

The abfs connector in the hadoop-azure module supports Microsoft Azure Datalake (Gen 2), which at the time of writing (September 2018) was in preview, soon to go GA. As with all cloud connectors, corner-cases will inevitably surface. If you encounter one, please file a bug report.


Description

This JIRA adds a new file system implementation, ABFS, for running Big Data and Analytics workloads against Azure Storage. This is a complete rewrite of the previous WASB driver with a heavy focus on optimizing both performance and cost.

High level design
At a high level, the code here extends the FileSystem class to provide an implementation for accessing blobs in Azure Storage. The scheme abfs is used for access over HTTP, and abfss for access over HTTPS. Individual paths are addressed with the following URI format:

abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>
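As an illustration of how the parts of this URI map onto scheme, file system, account, and path, here is a minimal sketch using only java.net.URI. The class AbfsUriSketch and its field names are hypothetical and are not part of the hadoop-azure module; the sample file system and account names are placeholders.

```java
import java.net.URI;

/** Hypothetical helper (not part of hadoop-azure) that splits an ABFS URI into its parts. */
public final class AbfsUriSketch {
    // Endpoint suffix taken from the URI format shown above.
    private static final String ENDPOINT_SUFFIX = ".dfs.core.windows.net";

    public final boolean secure;     // true for abfss (HTTPS), false for abfs (HTTP)
    public final String fileSystem;  // the <filesystem> part before '@'
    public final String account;     // the <account> part of the host name
    public final String path;        // the <path> part, including the leading '/'

    private AbfsUriSketch(boolean secure, String fileSystem, String account, String path) {
        this.secure = secure;
        this.fileSystem = fileSystem;
        this.account = account;
        this.path = path;
    }

    /** Parses abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>. */
    public static AbfsUriSketch parse(String text) {
        URI uri = URI.create(text); // throws IllegalArgumentException on malformed input
        String scheme = uri.getScheme();
        if (!"abfs".equals(scheme) && !"abfss".equals(scheme)) {
            throw new IllegalArgumentException("Not an abfs/abfss URI: " + text);
        }
        // java.net.URI exposes the text before '@' as the user-info component.
        String fileSystem = uri.getUserInfo();
        String host = uri.getHost();
        if (fileSystem == null || host == null || !host.endsWith(ENDPOINT_SUFFIX)) {
            throw new IllegalArgumentException("Missing file system or unexpected host: " + text);
        }
        String account = host.substring(0, host.length() - ENDPOINT_SUFFIX.length());
        if (account.isEmpty()) {
            throw new IllegalArgumentException("Missing account name: " + text);
        }
        return new AbfsUriSketch("abfss".equals(scheme), fileSystem, account, uri.getPath());
    }

    public static void main(String[] args) {
        AbfsUriSketch u = parse("abfss://myfs@myaccount.dfs.core.windows.net/data/part-0000");
        System.out.println(u.secure + " " + u.fileSystem + " " + u.account + " " + u.path);
    }
}
```

The sketch only validates the shape of the URI. The actual driver resolves these URIs through the Hadoop FileSystem API and may accept forms this helper rejects.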

ABFS is intended as a replacement for WASB. WASB is not deprecated, but it is in pure maintenance mode, and customers should upgrade to ABFS once it reaches General Availability later in CY18.
Benefits of ABFS include:
·         Higher scale (capacity, throughput, and IOPS) for Big Data and Analytics workloads by allowing higher limits on storage accounts
·         Removing any ramp up time with Storage backend partitioning; blocks are now automatically sharded across partitions in the Storage backend. This avoids the need for temporary/intermediate files, which increase cost (and framework complexity around committing jobs/tasks)
·         Enabling much higher read and write throughput on single files (tens of Gbps by default)
·         Still retaining all of the Azure Blob features customers are familiar with and expect, and gaining the benefits of future Blob features as well
ABFS incorporates Hadoop Filesystem metrics to monitor the file system throughput and operations. Ambari metrics are not currently implemented for ABFS, but will be available soon.

Credits and history
Credit for this work goes to (hope I don't forget anyone): Shane Mainali, Thomas Marquardt, Zichen Sun, Georgi Chalakov, Esfandiar Manii, Amit Singh, Dana Kaban, Da Zhou, Junhua Gu, Saher Ahwal, Saurabh Pant, and James Baker.

Test
ABFS has gone through many test procedures including Hadoop file system contract tests, unit testing, functional testing, and manual testing. All the JUnit tests provided with the driver can be run either sequentially or in parallel in order to reduce the testing time.
Besides unit tests, we have used ABFS as the default file system in Azure HDInsight. Azure HDInsight will very soon offer ABFS as a storage option. (HDFS is also used but not as the default file system.) Various customer and test workloads have been run against clusters with such configurations for quite some time. Benchmarks such as Tera*, TPC-DS, Spark Streaming and Spark SQL, and others have been run to do scenario, performance, and functional testing. Third parties and customers have also done various testing of ABFS.
The current version reflects the version of the code tested and used in our production environment.

Attachments


HADOOP-15407-HADOOP-15407-008.patch	14/Jun/18 17:04	523 kB	Steve Loughran
HADOOP-15407-008.patch	14/Jun/18 17:03	523 kB	Steve Loughran
HADOOP-15407-HADOOP-15407.008.patch	12/Jun/18 19:38	522 kB	Da Zhou
HADOOP-15407-HADOOP-15407.007.patch	03/Jun/18 22:53	547 kB	Thomas Marqardt
HADOOP-15407-HADOOP-15407.006.patch	31/May/18 21:28	545 kB	Thomas Marqardt
HADOOP-15407-004.patch	23/May/18 23:45	571 kB	Esfandiar Manii
HADOOP-15407-003.patch	08/May/18 17:40	1.20 MB	Esfandiar Manii
HADOOP-15407-002.patch	24/Apr/18 01:28	1.20 MB	Esfandiar Manii
HADOOP-15407-001.patch	23/Apr/18 22:20	1.20 MB	Esfandiar Manii

Issue Links

is depended upon by
HADOOP-15763	Über-JIRA: abfs phase II: Hadoop 3.3 features & fixes	Resolved

is related to
HADOOP-15566	Support OpenTelemetry	Patch Available

Sub-Tasks

1.	AzureBlobFS - Base package classes and configuration files	Resolved	Esfandiar Manii
2.	AzureBlobFS - Contracts	Resolved	Esfandiar Manii
3.	AzureBlobFS - Constants	Resolved	Esfandiar Manii
4.	AzureBlobFS - Diagnostics and Utils	Resolved	Esfandiar Manii
5.	AzureBlobFS - Services	Resolved	Esfandiar Manii
6.	AzureBlobFS - Tests	Resolved	Esfandiar Manii
7.	ABFS: Commit of core codebase	Resolved	Da Zhou
8.	ABFS initialize() throws string out of bounds exception if the URI isn't fully qualified	Resolved	Steve Loughran
9.	ABFS: tune imports & javadocs; stabilise tests	Resolved	Thomas Marqardt
10.	ABFS: removed dependency injection and unnecessary dependencies	Resolved	Da Zhou
11.	ABFS: TestAbfsConfigurationFieldsValidation breaks if FS is configured in core-site	Resolved	Steve Loughran
12.	ABFS: Code changes for bug fix and new tests	Resolved	Da Zhou
13.	ABFS: Add support for OAuth	Closed	Da Zhou
14.	ABFS: Add support for ACL	Closed	Da Zhou
15.	ABFS: Simplify configuration	Resolved	Da Zhou
16.	ABFS: Reduce test run time via parallelization and grouping	Resolved	Da Zhou
17.	ABFS: Compatibility tests can fail	Resolved	Unassigned
18.	ABFS: Improve HTTPS Performance	Resolved	Vishwajeet Dusane
19.	ABFS: Add support for StreamCapabilities. Fix javadoc and checkstyle	Resolved	Thomas Marqardt
20.	ABFS: InputStream wrapped in FSDataInputStream twice	Closed	Sean Mackrory
21.	ABFS: extensible support for custom oauth	Resolved	Da Zhou
22.	ABFS: Allow OAuth credentials to not be tied to accounts	Resolved	Sean Mackrory
23.	ABFS - Implement client-side throttling	Resolved	Thomas Marqardt
24.	Mark ABFS extension package and interfaces as LimitedPrivate/Unstable	Resolved	Steve Loughran
25.	ABFS: Failure in OpenSSLProvider should fall back to JSSE	Resolved	Vishwajeet Dusane
26.	tune abfs/wasb parallel and sequential test execution	Resolved	Da Zhou
27.	ITestAzureBlobFileSystemE2E timing out with non-scale timeout of 10 min	Resolved	Da Zhou
28.	Fail-fast when using OAuth over http	Resolved	Da Zhou
29.	ABFS: Ranger Support	Resolved	Yuan Gao
30.	ABFS: Add backward compatibility to handle Unsupported Operation for storage account with no namespace feature	Resolved	Da Zhou
31.	ABFS: remove unused maven dependencies and add used undeclared dependencies	Resolved	Da Zhou
32.	ABFS: Check variable names during initialization of AbfsClientThrottlingIntercept	Resolved	Sneha Varma
33.	Add ABFS configuration to ConfigRedactor	Resolved	Sean Mackrory
34.	ABFS: support path "abfs://mycluster/file/path"	Resolved	Da Zhou
35.	ABFS: remove dependency on common-codec Base64	Resolved	Da Zhou
36.	AbstractContractAppendTest fails against HDFS on HADOOP-15407 branch	Resolved	Steve Loughran
37.	ABFS: distcp tests are always skipped	Resolved	Steve Loughran
38.	Merge HADOOP-15407 to trunk	Resolved	Sean Mackrory
39.	ABFS: Fix issues raised by Yetus	Resolved	Sean Mackrory
40.	ABFS: Fix client side throttling for read	Resolved	Sneha Varma
41.	ABFS: Skip unsupported test cases when non namespace enabled in ITestAzureBlobFileSystemAuthorization	Resolved	Yuan Gao
42.	ABFS: Fixing skipUserGroupMetadata in AzureBlobFileSystemStore	Resolved	Da Zhou
43.	ABFS: better exception handling when making getAccessToken call	Resolved	Da Zhou
44.	Backporting ABFS driver from trunk to branch 2.0	Resolved	Yuan Gao


People

Assignee: Da Zhou
Reporter: Esfandiar Manii
Votes: 1
Watchers: 30

Dates

Created: 23/Apr/18 22:15
Updated: 25/Sep/18 20:03
Resolved: 25/Sep/18 20:03