Hive
  1. Hive
  2. HIVE-7604

Add Metastore API to fetch one or more partition names

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Metastore
    • Labels:
      None

      Description

      We need a new API in Metastore to address the following use cases. Both use cases arise from having tables with hundreds of thousands or in some cases millions of partitions.

      1. It should be quick and easy to obtain distinct values of a partition. Eg: Obtain all dates for which partitions are available. This can be used by tools/frameworks programmatically to understand gaps in partitions before reprocessing them. Currently one has to run Hive queries (JDBC or CLI) to obtain this information which is unfriendly and heavy weight. And for tables which have large number of partitions, it takes a long time to run the queries and it also requires large heap space.

      2. Typically users would like to know the list of partitions available and would run queries that would only involve partition keys (select distinct partkey1 from table) Or to obtain the latest date partition from a dimension table to join against another fact table (select * from fact_table join select max(dt) from dimension_table). Those queries (metadata only queries) can be pushed to metastore and need not be run even locally in Hive. If the queries can be converted into database based queries, the clients can be light weight and need not fetch all partition names. The results can be obtained much faster with less resources.

      1. Design_HIVE_7604.txt
        2 kB
        Thiruvel Thirumoolan
      2. Design_HIVE_7604.1.txt
        3 kB
        Thiruvel Thirumoolan

        Activity

        Thiruvel Thirumoolan created issue -
        Hide
        Ashutosh Chauhan added a comment -

        Thiruvel Thirumoolan This sounds useful. We have similar use case to figure out number of distinct values for a given partition column. That is for query planning purposes in new cost based optimizer. Would love to see what api your are proposing to see if we can use those in our use case as well.

        Show
        Ashutosh Chauhan added a comment - Thiruvel Thirumoolan This sounds useful. We have similar use case to figure out number of distinct values for a given partition column. That is for query planning purposes in new cost based optimizer. Would love to see what api your are proposing to see if we can use those in our use case as well.
        Hide
        Thiruvel Thirumoolan added a comment -

        Ashutosh Chauhan Thanks. I will post an API signature today.

        Show
        Thiruvel Thirumoolan added a comment - Ashutosh Chauhan Thanks. I will post an API signature today.
        Hide
        Sergey Shelukhin added a comment -

        Note that there are already APIs to send filters on partition cols to metastore to return limited number of partitions... perhaps similar APIs can be added for names/cols

        Show
        Sergey Shelukhin added a comment - Note that there are already APIs to send filters on partition cols to metastore to return limited number of partitions... perhaps similar APIs can be added for names/cols
        Hide
        Thiruvel Thirumoolan added a comment -

        Attaching file that describes the API and rationality behind them.

        I have an alpha implementation which obtains distinct values of partition keys. To start with, this is only ORM and its approach is very similar to ExpressionTree.java (using substring and indexOf string functions). Tested this with a table containing about a million partitions, partitioned by 6 keys and using Oracle as backend. It takes 2-4 seconds to obtain unique values of a partition. Hope this provides a rough idea of latency for large tables.

        Show
        Thiruvel Thirumoolan added a comment - Attaching file that describes the API and rationality behind them. I have an alpha implementation which obtains distinct values of partition keys. To start with, this is only ORM and its approach is very similar to ExpressionTree.java (using substring and indexOf string functions). Tested this with a table containing about a million partitions, partitioned by 6 keys and using Oracle as backend. It takes 2-4 seconds to obtain unique values of a partition. Hope this provides a rough idea of latency for large tables.
        Thiruvel Thirumoolan made changes -
        Field Original Value New Value
        Attachment Design_HIVE_7604.txt [ 12661515 ]
        Hide
        Thiruvel Thirumoolan added a comment -

        Thanks Sergey Shelukhin. I will reuse as much as possible.

        Show
        Thiruvel Thirumoolan added a comment - Thanks Sergey Shelukhin . I will reuse as much as possible.
        Hide
        Thiruvel Thirumoolan added a comment -

        Ashutosh Chauhan Do you have any comments on the API?

        Show
        Thiruvel Thirumoolan added a comment - Ashutosh Chauhan Do you have any comments on the API?
        Hide
        Ashutosh Chauhan added a comment -

        Looks good to me. I couldn't understand PartitionValuesResponse completely. Can you add a small description of it in your design doc.

        Show
        Ashutosh Chauhan added a comment - Looks good to me. I couldn't understand PartitionValuesResponse completely. Can you add a small description of it in your design doc.
        Hide
        Thiruvel Thirumoolan added a comment -

        Thanks Ashutosh Chauhan, uploading revised document with additional information for return values. Lemme know if its unclear.

        Show
        Thiruvel Thirumoolan added a comment - Thanks Ashutosh Chauhan , uploading revised document with additional information for return values. Lemme know if its unclear.
        Thiruvel Thirumoolan made changes -
        Attachment Design_HIVE_7604.1.txt [ 12664523 ]
        Hide
        Ashutosh Chauhan added a comment -

        +1 Proposed api LGTM

        Show
        Ashutosh Chauhan added a comment - +1 Proposed api LGTM
        Gunther Hagleitner made changes -
        Fix Version/s 0.14.0 [ 12326450 ]

          People

          • Assignee:
            Thiruvel Thirumoolan
            Reporter:
            Thiruvel Thirumoolan
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development