[HIVE-10165] Improve hive-hcatalog-streaming extensibility and support updates and deletes. - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.2.0
Fix Version/s: 2.0.0
Component/s: HCatalog
Labels:
- streaming_api

Release Note:
Expanded streaming API to include update and delete operations and support merge type processes.

Description

Overview

I'd like to extend the hive-hcatalog-streaming API so that it also supports the writing of record updates and deletes in addition to the already supported inserts.

Motivation

We have many Hadoop processes outside of Hive that merge changed facts into existing datasets. Traditionally we achieve this by reading in a ground-truth dataset and a modified dataset, grouping by a key, sorting by a sequence, and then applying a function to determine the inserted, updated, and deleted rows. However, in our current scheme we must rewrite every partition that may potentially contain changes. In practice the number of mutated records is very small compared with the number of records contained in a partition. This approach results in a number of operational issues:

- An excessive amount of write activity is required for small data changes.
- Downstream applications cannot robustly read these datasets while they are being updated.
- Due to the scale of the updates (hundreds of partitions) the scope for contention is high.

I believe we can address this problem by instead writing only the changed records to a Hive transactional table. This should drastically reduce the amount of data that we need to write and also provide a means for managing concurrent access to the data. Our existing merge processes can read and retain each record's ROW_ID/RecordIdentifier and pass this through to an updated form of the hive-hcatalog-streaming API which will then have the required data to perform an update or insert in a transactional manner.
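
The merge step described above (group by key, order by sequence, classify each change) can be sketched roughly as follows. This is only an illustration of the idea and not the hive-hcatalog-streaming API: the class and field names (`MergeSketch`, `ExistingRow`, `Change`, `Mutation`) are hypothetical, a plain `long` stands in for the retained ROW_ID/RecordIdentifier, and a `null` value is assumed to mark a deletion. Only the changed records come out of the classification, each update or delete carrying the identifier of the row it targets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these types belong to Hive.
class MergeSketch {

    enum MutationType { INSERT, UPDATE, DELETE }

    /** Row already in the table; rowId stands in for the retained ROW_ID/RecordIdentifier. */
    static final class ExistingRow {
        final String key; final String value; final long rowId;
        ExistingRow(String key, String value, long rowId) {
            this.key = key; this.value = value; this.rowId = rowId;
        }
    }

    /** Incoming change; a null value marks a deletion, sequence orders changes for one key. */
    static final class Change {
        final String key; final long sequence; final String value;
        Change(String key, long sequence, String value) {
            this.key = key; this.sequence = sequence; this.value = value;
        }
    }

    /** Resulting mutation; rowId is null for inserts because no existing row is addressed. */
    static final class Mutation {
        final MutationType type; final String key; final String value; final Long rowId;
        Mutation(MutationType type, String key, String value, Long rowId) {
            this.type = type; this.key = key; this.value = value; this.rowId = rowId;
        }
    }

    static List<Mutation> classify(List<ExistingRow> existing, List<Change> changes) {
        // Group the ground-truth rows by key.
        Map<String, ExistingRow> existingByKey = new TreeMap<>();
        for (ExistingRow row : existing) {
            existingByKey.put(row.key, row);
        }
        // For each key keep only the change with the highest sequence.
        Map<String, Change> latestByKey = new TreeMap<>();
        for (Change c : changes) {
            latestByKey.merge(c.key, c, (old, cur) -> cur.sequence >= old.sequence ? cur : old);
        }
        // Emit a mutation only where the latest change differs from the existing row.
        List<Mutation> mutations = new ArrayList<>();
        for (Change c : latestByKey.values()) {
            ExistingRow current = existingByKey.get(c.key);
            if (current == null) {
                if (c.value != null) {
                    mutations.add(new Mutation(MutationType.INSERT, c.key, c.value, null));
                }
            } else if (c.value == null) {
                mutations.add(new Mutation(MutationType.DELETE, c.key, null, current.rowId));
            } else if (!c.value.equals(current.value)) {
                mutations.add(new Mutation(MutationType.UPDATE, c.key, c.value, current.rowId));
            }
        }
        return mutations;
    }

    public static void main(String[] args) {
        List<ExistingRow> existing = Arrays.asList(
            new ExistingRow("a", "1", 100L),
            new ExistingRow("b", "2", 101L));
        List<Change> changes = Arrays.asList(
            new Change("a", 1L, "9"),
            new Change("b", 1L, null),
            new Change("c", 1L, "3"));
        for (Mutation m : classify(existing, changes)) {
            System.out.println(m.type + " key=" + m.key + " value=" + m.value + " rowId=" + m.rowId);
        }
    }
}
```

In a real process the resulting mutations would be handed to whatever writer the extended streaming API provides, rather than printed.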

Benefits

- Enables the creation of large-scale dataset merge processes.
- Opens up Hive transactional functionality in an accessible manner to processes that operate outside of Hive.

Attachments

- HIVE-10165.10.patch (30/Jun/15 14:11, 213 kB, Elliot West)
- HIVE-10165.9.patch (22/Jun/15 15:42, 202 kB, Elliot West)
- HIVE-10165.7.patch (15/Jun/15 13:51, 184 kB, Elliot West)
- HIVE-10165.6.patch (05/Jun/15 19:52, 183 kB, Elliot West)
- mutate-system-overview.png (27/May/15 11:46, 106 kB, Elliot West)
- HIVE-10165.5.patch (26/May/15 10:00, 185 kB, Elliot West)
- HIVE-10165.4.patch (25/May/15 20:37, 183 kB, Elliot West)
- HIVE-10165.0.patch (24/May/15 14:54, 159 kB, Elliot West)

Issue Links

depends upon

- HIVE-11078 Enhance DbLockManger to support multi-statement txns (Open)

is related to

- HIVE-11030 Enhance storage layer to create one delta file per write (Resolved)
- HIVE-11228 Mutation API should use semi-shared locks. (Closed)

People

Assignee: Elliot West
Reporter: Elliot West
Votes: 2
Watchers: 10

Dates

Created: 31/Mar/15 13:54
Updated: 01/Oct/19 22:07
Resolved: 30/Jun/15 22:02