Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.5.4
Fix Version/s: None
Component/s: None
Labels: None
Description
Today’s HFile on S3 support lays out "files" in the S3 bucket exactly as it does in HDFS, and this is going to be a problem. S3 throttles IO to buckets based on prefix. See https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an Amazon S3 bucket. There are no limits to the number of prefixes that you can have in your bucket. LIST and GET objects don’t share the same limit. The performance of LIST calls depends on the number of delete markers present at the top of an object version for a given prefix.
Today this looks like:
/hbase/data/<namespace>/<table>/<region>/<store>/hfile1
/hbase/data/<namespace>/<table>/<region>/<store>/hfile2
...
Unfortunately by-prefix partitioning is performed by S3 in a black box manner with no API provided to hint it. Customary file separator characters like '/' are not specially considered.
The situation we want to avoid is where the load accounted to one or more hot stores aggregates up to an inopportune choke point where S3 may have auto partitioned at the region, table, or namespace level. Paths to any given store should avoid sharing a common path prefix with those of another.
We can continue to represent metadata in a hierarchical manner. Metadata is infrequently accessed compared to data because it is cached, or can be made to be cached, given that the size of metadata is a tiny fraction of the size of all data. So a resulting layout might look like:
/hbase/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
where file.list is our current manifest-based HFile tracking, managed by FileBasedStoreFileTracker. This is simply a relocation of stores to a different path construction while maintaining all of the other housekeeping as-is. The file-based store file tracker manifest allows us to make this change easily, and iteratively, supporting in-place migration. It seems straightforward to implement as a new version of FileBasedStoreFileTracker with an automatic path to migration. Adapting the HBCK2 support for rebuilding the store file list should also be straightforward if we can version the FileBasedStoreFileTracker and teach it about the different versions.
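As a rough sketch of how a versioned tracker could resolve paths (the class, record, and field names below, including layoutVersion and hashedPrefix, are illustrative assumptions for discussion, not the actual FileBasedStoreFileTracker API), a v2 manifest could record a hashed top-level prefix per store and pick the layout based on the manifest version, which is what makes in-place, iterative migration possible:

import java.util.List;

// Illustrative sketch only: these names (layoutVersion, StoreEntry, resolve) are
// assumptions for discussion, not the actual FileBasedStoreFileTracker API.
public class ManifestLayoutSketch {

  /** One store's entry in the region-level file.list manifest. */
  record StoreEntry(String storeName, String hashedPrefix, List<String> fileNames) {}

  /** A minimal stand-in for the file.list manifest, versioned per this proposal. */
  record Manifest(int layoutVersion, String regionDir, List<StoreEntry> stores) {}

  /**
   * Resolve full bucket paths for one store. A v1 manifest keeps files under the
   * store directory as today; a v2 manifest records the hashed top-level prefix
   * the store was relocated to.
   */
  static List<String> resolve(Manifest manifest, StoreEntry store) {
    String base = manifest.layoutVersion() >= 2
        ? "/" + store.hashedPrefix()                      // v2: /<hash-of-store>/
        : manifest.regionDir() + "/" + store.storeName(); // v1: /hbase/data/<ns>/<table>/<region>/<store>/
    return store.fileNames().stream().map(f -> base + "/" + f).toList();
  }

  public static void main(String[] args) {
    StoreEntry cf = new StoreEntry("cf",
        "f572d396fae9206628714fb2ce00f72e94f2258f",
        List.of("hfile1", "hfile2"));
    Manifest v1 = new Manifest(1, "/hbase/data/default/t1/region1", List.of(cf));
    Manifest v2 = new Manifest(2, "/hbase/data/default/t1/region1", List.of(cf));
    resolve(v1, cf).forEach(System.out::println); // today's layout
    resolve(v2, cf).forEach(System.out::println); // relocated layout
  }
}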
Bucket layouts for the HFile archive should also take the same approach. Snapshots are based on archiving so tackling one takes care of the other.
/hbase/archive/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
e.g.
/f572d396fae9206628714fb2ce00f72e94f2258f/7f900f36ebc78d125c773ac0e3a000ad355b8ba1.hfile
/f572d396fae9206628714fb2ce00f72e94f2258f/988881adc9fc3655077dc2d4d757d480b5ea0e11.hfile
This is still not entirely ideal, but it is the best we can do. Using cryptographic hashes of store metadata as prefixes distributes the placement of any given store into any potential S3 partition randomly. The probability of a client accessing any particular point in the HBase keyspace has a similar random distribution. These will not be the same distribution, but one should be a reasonable approximation of the other. The expected result is that hotspots will only impact clients accessing the specific store and region that is hotspotting, as today, and the blast radius is not significantly wider.
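To illustrate the distribution argument (a sketch only; the exact hash input and the ".hfile" naming are assumptions based on the example above), hashing store coordinates with SHA-1 places stores of the same table under unrelated top-level prefixes, so a hot store's traffic no longer aggregates under a shared table- or region-level prefix that S3 may have partitioned on:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: SHA-1 of store coordinates scatters stores of the same table across
// unrelated top-level prefixes in the bucket.
public class PrefixDistributionSketch {

  static String sha1Hex(String s) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-1");
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    // Three stores of the same namespace and table: under today's layout they
    // all share the prefix /hbase/data/default/usertable/...
    String[] stores = {
        "default/usertable/region-aaaa/cf",
        "default/usertable/region-bbbb/cf",
        "default/usertable/region-cccc/cf"
    };
    for (String store : stores) {
      // Hashed layout: each store gets its own unrelated top-level prefix.
      System.out.println("/" + sha1Hex(store) + "/<hfile>.hfile");
    }
  }
}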
Issue Links
- relates to HBASE-26584 Further improvements on StoreFileTracker (Open)
- relates to HBASE-27842 FileBasedStoreFileTracker v2 (Open)