HIVE-4196: Support for Streaming Partitions in Hive

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.10.1
    • Fix Version/s: None
    • Component/s: Database/Schema, HCatalog
    • Labels: None
    • Tags: Streaming, HCatalog

      Description

      Motivation: Allow Hive users to immediately query data streaming in through clients such as Flume.

      Currently Hive partitions must be created after all the data for the partition is available. Thereafter, data in the partitions is considered immutable.

      This proposal introduces the notion of a streaming partition, into which new files can be committed periodically and made available for queries before the partition is closed and converted into a standard partition.

      The admin enables a streaming partition on a table using DDL, providing the following pieces of information:

      • Name of the partition in the table on which streaming is enabled
      • Frequency at which the streaming partition should be closed and converted into a standard partition.

      Tables with streaming partitions enabled will be partitioned by one and only one column. It is assumed that this column will contain a timestamp.
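
      Purely as an illustrative sketch (the proposal does not pin down the DDL syntax, and the draft patch attached later in this issue uses temporary REST calls instead), enabling streaming might look like setting hypothetical table properties, borrowing the 'sdb' database and 'log' table from the usage examples further down:

      # Hypothetical example only: the 'streaming.*' property names are invented for illustration.
      hcat -e "use sdb; alter table log set tblproperties ('streaming.partition.column'='date', 'streaming.roll.frequency.seconds'='3600');"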

      Closing the current streaming partition converts it into a standard partition. Based on the specified frequency, the current streaming partition is closed and a new one created for future writes. This is referred to as 'rolling the partition'.

      A streaming partition's life cycle is as follows:

      • A new streaming partition is instantiated for writes.
      • Streaming clients request (via webhcat) an HDFS file name into which they can write a chunk of records for a specific table.
      • Streaming clients write a chunk (via webhdfs) to that file and commit it (via webhcat). Committing merely indicates that the chunk has been written completely and is ready for serving queries. (A curl sketch of this flow follows the list.)
      • When the partition is rolled, all committed chunks are swept into a single directory and a standard partition pointing to that directory is created. The streaming partition is closed and a new streaming partition is created. Rolling the partition is atomic, and streaming clients are agnostic of partition rolling.
      • Hive queries will be able to query the partition that is currently open for streaming. Only committed chunks will be visible, and read consistency will be ensured so that repeated reads of the same partition are idempotent for the lifespan of the query.
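
      A rough sketch of this client flow, using the webhcat streaming endpoints from the draft patch attached later in this issue (host, port, paths, and parameters are placeholders; the exact API was never finalized):

      # 1) Ask webhcat for a chunk file to write into (parameters trimmed for brevity)
      curl "http://localhost:50111/templeton/v1/streaming/chunkget?database=sdb&table=log"

      # 2) Write the chunk of records to the returned HDFS file via webhdfs (or any HDFS API)

      # 3) Commit the chunk so it becomes visible to queries
      curl "http://localhost:50111/templeton/v1/streaming/chunkcommit?database=sdb&table=log&chunkfile=<path returned in step 1>"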

      Partition rolling requires an active agent/thread running to check when it is time to roll and to trigger the roll. This could be achieved either by using an external agent such as Oozie (preferred) or an internal agent.
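
      For illustration only (assuming the partitionroll endpoint from the draft patch attached later in this issue), an external trigger could be as simple as a cron entry, with an Oozie coordinator being the preferred production equivalent:

      # Roll the streaming partition once an hour; the partition value here is just a placeholder.
      0 * * * * curl "http://localhost:50111/templeton/v1/streaming/partitionroll?database=sdb&table=log&partition_column=date&partition_value=<new partition value>"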

          Activity

          Brock Noland added a comment -

          Hi Roshan,

          Looks like a good proposal and a great place for Flume to integrate with Hive! In the proposal, how come we have the clients using webhdfs to write a chunk of data? Couldn't the client use any HDFS API?

          Brock

          Alan Gates added a comment -

          The client could certainly use any HDFS API to talk with HDFS. The webhcat API will return a URL that works with webhdfs, but nothing prevents a user from stripping the http://.../webhdfs/v1 stuff off and invoking HDFS directly. We're optimizing assuming a REST user, but not excluding others.

          eric baldeschwieler added a comment -

          Maybe we should just return both?

          Roshan Naik added a comment -

          Adding phase 1 design document to solicit feedback.

          Roshan Naik added a comment -

          Updating design doc.

          Roshan Naik added a comment -

          Draft patch for review, based on the phase mentioned in the design doc. It deviates slightly:

          1) Adds a couple of (temporary) REST calls to enable/disable streaming on a table. Later these will be replaced with DDL support.

          2) Also, all HTTP methods are GET, for easy testing with a web browser.

          3) Authentication is disabled on the new streaming HTTP methods.

          Usage examples on a db named 'sdb' and a table named 'log':

          1) Setup db & table with single partition column 'date':
          hcat -e "create database sdb; use sdb; create table log(msg string, region string) partitioned by (date string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE; "

          2) To check streaming status:
          http://localhost:50111/templeton/v1/streaming/status?database=sdb&table=log

          3) Enable Streaming:
          http://localhost:50111/templeton/v1/streaming/enable?database=sdb&table=log&col=date&value=1000

          4) Get Chunk File to write to (the chunk data itself is then written to this file via webhdfs; see the note after these examples):
          http://localhost:50111/templeton/v1/streaming/chunkget?database=sdb&table=log&schema=blah&format=blah&record_separator=blah&field_separator=blah

          5) Commit Chunk File:
          http://localhost:50111/templeton/v1/streaming/chunkcommit?database=sdb&table=log&chunkfile=/user/hive/streaming/tmp/sdb/log/2

          6) Abort Chunk File:
          http://localhost:50111/templeton/v1/streaming/chunkabort?database=sdb&table=log&chunkfile=/user/hive/streaming/tmp/sdb/log/3

          7) Roll Partition:
          http://localhost:50111/templeton/v1/streaming/partitionroll?database=sdb&table=log&partition_column=date&partition_value=3000
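
          The write of the chunk data itself (between steps 4 and 5) does not go through a streaming endpoint; the client writes to the returned chunk file via webhdfs (or any HDFS API, per the discussion above). A minimal webhdfs sketch, with host and file name as placeholders, using the standard two-step create (redirect to a datanode, then upload):

          curl -i -X PUT "http://<namenode>:50070/webhdfs/v1/user/hive/streaming/tmp/sdb/log/2?op=CREATE"
          curl -i -X PUT -T chunk.csv "<datanode redirect location returned by the previous call>"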

          Alan Gates added a comment -

          Roshan,

          Could you post a pdf version of the doc so users without MS Word can read it? Also, could you post a version of the patch without the thrift generated code (anything under src/gen or src-gen) so it's easier for the reviewers to determine what to review?

          Roshan Naik added a comment -

          PDF version of the design & spec doc.

          Alan Gates added a comment -

          Some comments:

          According to the Hive coding conventions lines should be bounded at 100 characters. Many lines in this patch exceed that.

          In ObjectStore.java:

          • I'm surprised to see that streamingStatus sets the chunk id for the table. This seems to be a status call. Why should it be setting chunk id?
          • The logic at the end of these functions doesn't look right. Take getNextChunkID for example. If commitTransaction fails (line 2132), rollback will be called but the next chunk id will still be returned. It seems you need a check on success after the commit. I realize many of the calls in the class follow this pattern, but it doesn't seem right.

          In HiveMetaStoreClient.java, is assert what you want? Are you ok with the validity of the arguments not being checked most of the time?

          I'm trying to figure out whether the chunk files are moved, deleted, or left alone during the partition rolling. From examining the code and playing with Hive it looks like the files will be left alone. But have you tested this?

          Which leads to, I don't see any tests in this patch. This code needs a lot of tests.

          Roshan Naik added a comment -

          According to the Hive coding conventions lines should be bounded at 100 characters. Many lines in this patch exceed that.

          Will fix the ones which are not in the thrift generated files.

          I'm surprised to see that streamingStatus sets the chunk id for the table.

          Seems like a bug. Will fix.

          The logic at the end of of these functions doesn't look right. Take getNextChunkID for example. If commitTransaction fails (line 2132) rollback will be called but the next chunk id will still be returned. It seems you need a check on success after commit. I realize many of the calls in the class follow this, but it doesn't seem right.

          Good catch. At the time I thought commitTxn() would only fail with an exception and never return false. But on closer inspection there is indeed a corner case (if rollback was called) where it also returns false. It's a bizarre thing for a function to fail both with and without exceptions, but for now I will fix my code to live with it.

          In HiveMetaStoreClient.java, is assert what you want? Are you ok with the validity of the arguments not being checked most of the time?

          Not all checks are in place. Some checks will happen at lower layers, some at higher layers. Will be adding more checks.

          I'm trying to figure out whether the chunk files are moved, deleted, or left alone during the partition rolling.

          That would depend on whether the table is defined to be an external or internal table. It is essentially an add_partition of the new partition. It calls HiveMetastore.add_partition_core_notxn() inside a transaction.

          Ashutosh Chauhan added a comment -

          Few high level comments:

          • We should try to eliminate the need of intermediate staging area while rolling on new partitions. Seems like there should not be any gotchas while moving data from streaming dir to partition dir directly.
          • We should make the Thrift APIs in the metastore forward compatible. One way to do that is to use a struct (which contains all parameters) instead of passing in a list of arguments.
          • We should try to leave the TBLS table untouched in the backend db. That will simplify the upgrade story. One way to do that is to have all new columns in a new table and then add constraints for this new table.
          Roshan Naik added a comment -

          Thanks, Ashutosh. Since your recommendations apply to subtask HIVE-5138, I have copied your comments over to it. I will address them there.

          Roshan Naik added a comment -

          In view of HIVE-5317, which brings insert/update/delete support to Hive, introducing streaming partitions is no longer necessary. Streaming support can be provided with far less complexity by leveraging HIVE-5317.

          Roshan Naik added a comment -

          Moving the streaming work to a new JIRA, HIVE-5687, since it will be based on a different design.


            People

            • Assignee: Roshan Naik
            • Reporter: Roshan Naik
            • Votes: 1
            • Watchers: 9
