[HIVE-7296] big data approximate processing at a very low cost based on hive sql - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

For big data analysis, we often need to run the following kinds of queries and statistics:

1. Cardinality Estimation: count the number of distinct elements in a collection, such as Unique Visitors (UV).

Currently we can use this Hive query:
select count(distinct id) from TestTable;

2. Frequency Estimation: estimate how many times an element is repeated, such as the number of site visits by one user.

Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy Hitters (top-k elements): for example, the top 100 shops.

Hive query: select count(1), name from TestTable group by name; this also needs a UDF……

4. Range Query: for example, find the number of users aged between 20 and 30.

Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership Query: for example, is a given user name already registered?
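No Hive query is given for the membership case, so here is a rough illustration of the kind of approximate structure that could answer it: a minimal Bloom filter sketch in Python, outside Hive. The class name, method names, and the sizes (`num_bits`, `num_hashes`) are hypothetical choices for this example and are not an existing Hive API. A Bloom filter can report a false positive but never a false negative.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; names and default sizes are assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly registered"; False means "definitely not".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


registered = BloomFilter()
for user_name in ("wangmeng", "alice", "bob"):
    registered.add(user_name)

print(registered.might_contain("wangmeng"))
```

The filter above uses 8 KB of memory regardless of how long the user names are; the price is a small, tunable false-positive rate.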

Because of the way Hive executes these queries, they consume a large amount of memory and take a long time to run.

However, in many cases we do not need exact results, and a small error can be tolerated. In such cases, we can use approximate processing to greatly improve time and space efficiency.
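To make the trade-off concrete for the cardinality case (item 1), here is a minimal sketch of one well-known approximate technique, a K-Minimum-Values estimator, written in Python outside Hive. It is only an illustration under stated assumptions: the function names and the parameter `k` are hypothetical, and this is not how Hive implements anything. It keeps just the `k` smallest hash values instead of every distinct id, and its relative error is roughly 1/sqrt(k).

```python
import hashlib
import heapq


def _hash01(value):
    # Map a value to a pseudo-uniform float in [0, 1).
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(values, k=1024):
    """Estimate the number of distinct values using O(k) memory."""
    heap = []      # max-heap (via negation) of the k smallest hashes
    kept = set()   # the same hashes, for duplicate detection
    for value in values:
        h = _hash01(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the current largest kept hash with the smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)


# Hypothetical visit log: 10000 distinct ids, each seen three times.
visits = [i % 10000 for i in range(30000)]
print(approx_distinct(visits))
```

Memory use depends only on `k`, not on the number of rows or distinct ids, which is where the space saving comes from.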

I have studied some theoretical material on these techniques, and I would very much like to contribute to these new features if possible.

So, is there anything I can do? Many thanks.

People

Assignee: Unassigned

Reporter: WangMeng

Votes: 0

Watchers: 6

Dates

Created: 26/Jun/14 04:47

Updated: 05/Jul/14 11:07