Hive / HIVE-19719

Adding metastore batch API for partitions


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0, 4.0.0
    • Fix Version/s: None
    • Component/s: Metastore
    • Labels: None

    Description

      Hive Metastore provides APIs for fetching collections of objects (usually tables or partitions). These APIs fetch all available objects at once, so the size of the response is O(N), where N is the number of objects. These calls have several problems:

      • All objects (and there may be thousands or even millions) must be fetched from the database, serialized into a Java list of Thrift objects, and then serialized again into a byte array for sending over the network. This creates huge spikes of memory pressure, especially since in some cases multiple copies of the same data are present in memory at once (e.g. the deserialized and serialized versions).
      • Even though HMS tries to avoid string duplication by interning strings in Java, duplicated strings must still be serialized into the output array.
      • Java has a 2 GB limit on the maximum size of a byte array, and the server crashes with an Out of Memory error if this size is exceeded.
      • Fetching a huge number of objects blows up the DB caches and memory caches in the system.
      • Receiving such huge messages also creates memory pressure on the receiver side (usually HS2), which can crash with an Out of Memory error as well.
      • Such requests have very high latency, since the server must collect all objects, serialize them, and write them all to the network before the client can do anything with the result.

      To prevent Out of Memory errors, the server now has a configurable limit on the maximum number of objects returned. This helps to avoid crashes, but doesn’t allow for correct query execution, since the result will be an arbitrary, incomplete set of K objects (where K is the limit).

      Currently this is addressed on the client side by simulating batching: the client first gets the list of table or partition names, then requests object details for slices of this list. Still, the list of names can be big as well, and this method requires locking to ensure that objects are not added or removed between the calls, especially if this is done outside of HS2.
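      The current workaround can be sketched as follows. This is a simplified illustration, assuming hypothetical stand-ins for the existing name-list and by-names HMS calls; only the chunking logic is shown.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the current client-side workaround: fetch all partition names
// first, then request full partition objects in fixed-size chunks of names.
// The real flow would call getPartitionNames() and then
// getPartitionsByNames() per chunk; chunk() below is the core of it.
public class NameChunking {
  static List<List<String>> chunk(List<String> names, int chunkSize) {
    List<List<String>> chunks = new ArrayList<>();
    for (int i = 0; i < names.size(); i += chunkSize) {
      chunks.add(names.subList(i, Math.min(i + chunkSize, names.size())));
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < 10; i++) names.add("ds=" + i);
    System.out.println(chunk(names, 4).size()); // 3 chunks: 4 + 4 + 2
  }
}
```

      Note that nothing prevents partitions from being added or removed between the per-chunk calls, which is exactly the consistency gap described above.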

      Instead, we can make a simple modification to the existing APIs which allows for batch, iterator-style operations without keeping any server-side state. The main idea is to have a unique, incrementing ID for each object. The IDs only need to be unique within their container (e.g. table IDs should be unique within a database and partition IDs should be unique within a table).
      Such an ID can be easily generated using the database auto-increment mechanism, or we can simply reuse the existing ID column that is already maintained by DataNucleus.
      The request is then modified to include:

      • Starting ID i0
      • Batch size (B)

      The server fetches up to B objects starting from i0, serializes them, and sends them to the client. The client then requests the next batch using the ID of the last received object plus one. It is possible to construct an SQL query (either via DataNucleus JDOQL or in the DirectSQL code) which selects only the needed objects, avoiding big reads from the database. The client iterates until it has fetched all the objects, and the memory footprint of each request is bounded by the batch size.
      If we extend the API a little, providing a way to get the minimum and maximum ID values (either via a separate call or piggybacked onto the normal reply), clients can request batches concurrently, which also reduces latency. Clients can estimate the number of batches from the total span of IDs; while this isn’t precise, it is good enough to divide the work.
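      The (startId, batchSize) protocol can be sketched as below. The store and fetchBatch() are stand-ins for the proposed HMS call, since HIVE-19719 does not yet fix an API shape; all names here are hypothetical. The key property is that the server keeps no state between calls: the client resumes purely from the last ID it saw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of stateless, iterator-style batch fetching over object IDs.
public class BatchScan {
  // Simulated backing table: partition ID -> partition name (IDs may be sparse).
  static final TreeMap<Long, String> STORE = new TreeMap<>();

  // "Server" side: return up to batchSize objects whose ID >= startId.
  static SortedMap<Long, String> fetchBatch(long startId, int batchSize) {
    SortedMap<Long, String> out = new TreeMap<>();
    for (Map.Entry<Long, String> e : STORE.tailMap(startId).entrySet()) {
      if (out.size() == batchSize) break;
      out.put(e.getKey(), e.getValue());
    }
    return out;
  }

  // "Client" side: iterate batches, resuming from lastSeenId + 1 each time.
  static List<String> fetchAll(int batchSize) {
    List<String> all = new ArrayList<>();
    long next = 0;
    while (true) {
      SortedMap<Long, String> batch = fetchBatch(next, batchSize);
      if (batch.isEmpty()) break;      // no objects left past `next`
      all.addAll(batch.values());
      next = batch.lastKey() + 1;      // no server-side cursor needed
    }
    return all;
  }

  public static void main(String[] args) {
    for (long i = 1; i <= 10; i++) STORE.put(i * 3, "part_" + i); // gappy IDs
    System.out.println(fetchAll(4)); // all 10 partitions in 3 round trips
  }
}
```

      On the real server, fetchBatch() would translate to something like `WHERE PART_ID >= ? ORDER BY PART_ID LIMIT ?`, so each round trip touches only a bounded slice of the database.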
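      Given the min/max IDs, dividing the work might look like the sketch below. Each chunk covers an equal slice of the ID *range*, not necessarily an equal number of objects (IDs may be sparse), which matches the "good enough to divide the work" estimate above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split the ID space [minId, maxId] into nChunks contiguous ranges
// so that clients can fetch the corresponding batches concurrently.
public class IdRangeSplit {
  static List<long[]> split(long minId, long maxId, int nChunks) {
    List<long[]> ranges = new ArrayList<>();
    long span = maxId - minId + 1;
    for (int i = 0; i < nChunks; i++) {
      long lo = minId + span * i / nChunks;
      long hi = minId + span * (i + 1) / nChunks - 1;
      if (lo <= hi) ranges.add(new long[] {lo, hi}); // skip empty ranges
    }
    return ranges;
  }

  public static void main(String[] args) {
    for (long[] r : split(1, 100, 4)) {
      System.out.println(r[0] + ".." + r[1]); // 1..25, 26..50, 51..75, 76..100
    }
  }
}
```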

      It is also possible to wrap this in a way similar to PartitionIterator and asynchronously fetch the next batch while we are processing the current one.
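      The prefetching wrapper can be sketched as follows. fetchBatch() again simulates the proposed (and still hypothetical) HMS batch call; the point of the sketch is the overlap of fetching batch N+1 with processing batch N.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of a PartitionIterator-style wrapper: while the caller consumes
// the current batch, the next batch is already being fetched in the
// background, hiding network latency behind processing time.
public class PrefetchIterator {
  static final TreeMap<Long, String> STORE = new TreeMap<>();

  static SortedMap<Long, String> fetchBatch(long startId, int batchSize) {
    SortedMap<Long, String> out = new TreeMap<>();
    for (Map.Entry<Long, String> e : STORE.tailMap(startId).entrySet()) {
      if (out.size() == batchSize) break;
      out.put(e.getKey(), e.getValue());
    }
    return out;
  }

  static List<String> consumeAll(int batchSize) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    List<String> all = new ArrayList<>();
    try {
      Future<SortedMap<Long, String>> pending =
          pool.submit(() -> fetchBatch(0, batchSize));
      while (true) {
        SortedMap<Long, String> batch = pending.get(); // wait for prefetch
        if (batch.isEmpty()) break;
        long next = batch.lastKey() + 1;
        pending = pool.submit(() -> fetchBatch(next, batchSize)); // prefetch next
        all.addAll(batch.values()); // "process" the current batch meanwhile
      }
    } finally {
      pool.shutdown();
    }
    return all;
  }

  public static void main(String[] args) throws Exception {
    for (long i = 1; i <= 7; i++) STORE.put(i * 2, "part_" + i);
    System.out.println(consumeAll(3)); // all 7 partitions
  }
}
```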

      Consistency considerations

      • HMS only provides consistency guarantees for a single call. The set of objects that should be returned may change while we are iterating over it. In some cases this is not an issue, since HS2 may hold ZooKeeper locks on the table to prevent modifications, but in other cases it may be (for example, for calls that originate from external systems). We should consider additions and removals separately.
      • New objects are added during iteration. New objects are always added at the ‘end’ of the ID space (we assume IDs are always incrementing), so they will always be picked up by the iterator.
      • Some objects are removed during iteration. Removal of objects that have not yet been consumed is not a problem. It is possible that some objects which were already consumed are removed, so the final result includes objects that no longer exist. Although this results in an inconsistent list of objects, the situation is indistinguishable from these objects being removed immediately after we got all objects in one atomic call, so it doesn’t seem to be a practical issue.

      Attachments

        Issue Links

          Activity

            People

              Assignee: Unassigned
              Reporter: Alex Kolbasov (akolb)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated: