  1. Hive
  2. HIVE-7296

Approximate big data processing at a very low cost, based on Hive SQL

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      For big data analysis, we often need to run the following kinds of queries and statistics:

      1. Cardinality Estimation: count the number of distinct elements in a collection, such as unique visitors (UV).

      Currently we can use a Hive query:
      select count(distinct id) from TestTable;

      2. Frequency Estimation: estimate how many times an element is repeated, such as the number of site visits by a user.

      Hive query: select count(1) from TestTable where name = 'wangmeng';

      3. Heavy Hitters (top-k elements): such as the top 100 shops.

      Hive query: select count(1), name from TestTable group by name; needs a UDF……

      4. Range Query: for example, find the number of users aged between 20 and 30.

      Hive query: select count(1) from TestTable where age > 20 and age < 30;

      5. Membership Query: for example, check whether a user name is already registered.

      Because of how Hive executes such queries, they consume too much memory and take a long time to run.

      However, in many cases we do not need exact results, and a small error can be tolerated. In such cases, we can use approximate processing to greatly improve time and space efficiency.
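As an illustration of the kind of approximate structure meant here, frequency estimation (item 2) could be served by a Count-Min Sketch, which keeps a fixed number of counters no matter how many rows are scanned. The following is only a minimal Python sketch, not Hive code and not a proposed implementation. The class name, the `width` and `depth` parameters, and the sample names other than "wangmeng" are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator (illustrative, hypothetical class).

    Estimates can overcount because of hash collisions, but never undercount.
    """

    def __init__(self, width=1024, depth=4):
        # width * depth counters in total, independent of the input size.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, obtained by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over rows is the tightest estimate.
        return min(
            self.rows[row][self._index(row, item)] for row in range(self.depth)
        )


def build_demo_sketch():
    # Hypothetical sample data standing in for the name column of TestTable.
    sketch = CountMinSketch()
    for name in ["wangmeng"] * 5 + ["alice"] * 2 + ["bob"]:
        sketch.add(name)
    return sketch


if __name__ == "__main__":
    demo = build_demo_sketch()
    # Approximate counterpart of: select count(1) from TestTable where name = 'wangmeng'
    print(demo.estimate("wangmeng"))
```

A larger `width` lowers the chance of overcounting and a larger `depth` lowers the chance that every row collides, so memory is traded directly for accuracy. Similar fixed-size summaries exist for the other items in the list, for example HyperLogLog for cardinality and Bloom filters for membership.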

      Based on some theoretical analysis material, I would very much like to work on these new features if possible.

      So, is there anything I can do? Many thanks.

          People

            Assignee: Unassigned
            Reporter: sjtufighter WangMeng
            Votes: 0
            Watchers: 6
