Hive
HIVE-1788

Add more calls to the metastore thrift interface

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      For administrative purposes the following calls to the metastore thrift interface would be very useful:

      1. Get the table metadata for all the tables owned by a particular user
      2. Ability to iterate over this set of tables
      3. Ability to change a particular key value property of the table

      1. HIVE-1788_3.txt
        310 kB
        Ashish Thusoo
      2. HIVE-1788_2.txt
        310 kB
        Ashish Thusoo
      3. HIVE-1788.txt
        261 kB
        Ashish Thusoo

        Activity

        Ashish Thusoo added a comment -

        There is still one more error that I need to resolve in the tests before this patch works, but any early review comments are welcome. Specifically, there are some issues with the IN operator in the JDO code in this patch.

        Ashish Thusoo added a comment -

        There were some errors in the previous patch that I had overlooked.

        Ashish Thusoo added a comment -

        Changed the API so that we now have two calls:

        List<String> get_tables_by_owner(String owner) - this returns the table names

        List<Table> get_tables_by_names(List<String> names) - this returns the table objects corresponding to the names.

        Am currently testing this.
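
A minimal sketch of how a client might combine the two calls above: list the names once, then fetch the table objects in batches. Only the two method signatures come from this comment; `TableStub`, `FakeMetastore`, `OwnerTableFetcher`, `batchSize`, and `rpcCount` are hypothetical names used for illustration, and skipping names that no longer exist is an assumption, not confirmed behavior of the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the metastore Table object; only the fields needed here.
final class TableStub {
    final String name;
    final String owner;
    TableStub(String name, String owner) { this.name = name; this.owner = owner; }
}

// In-memory fake exposing the two proposed calls; this is not the real metastore client.
final class FakeMetastore {
    private final Map<String, TableStub> tables = new LinkedHashMap<>();
    int rpcCount = 0; // counts simulated round trips

    void addTable(String name, String owner) { tables.put(name, new TableStub(name, owner)); }

    List<String> get_tables_by_owner(String owner) {
        rpcCount++;
        List<String> names = new ArrayList<>();
        for (TableStub t : tables.values()) {
            if (t.owner.equals(owner)) names.add(t.name);
        }
        return names;
    }

    List<TableStub> get_tables_by_names(List<String> names) {
        rpcCount++;
        List<TableStub> result = new ArrayList<>();
        for (String n : names) {
            TableStub t = tables.get(n);
            if (t != null) result.add(t); // assumption: names dropped since the listing are skipped
        }
        return result;
    }
}

final class OwnerTableFetcher {
    // One call for the names, then one call per batch of batchSize names (batchSize > 0).
    static List<TableStub> fetchAll(FakeMetastore ms, String owner, int batchSize) {
        List<String> names = ms.get_tables_by_owner(owner);
        List<TableStub> all = new ArrayList<>();
        for (int i = 0; i < names.size(); i += batchSize) {
            int end = Math.min(i + batchSize, names.size());
            all.addAll(ms.get_tables_by_names(names.subList(i, end)));
        }
        return all;
    }
}
```

With this shape the number of round trips is one plus the number of batches, rather than one per table, and the batch size lets the caller avoid a single very large response.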

        Paul Yang added a comment -

        Talked with Ashish offline about this - a batch retrieval method for tables is definitely required, as the RTT would be excessive with hundreds of tables. Another approach that may work is to have an API that accepts a list of table names as input. That could then be used in conjunction with a method that returns matching table names to deal with the inconsistent offset issues. This will require N+1 RPCs versus N for the proposed method.

        It's preferable to keep the metastore API simple, so I'm a little hesitant to put in both approaches. If the currently proposed method is fast enough, then it'd be okay to leave it the way it is.

        Ashish Thusoo added a comment -

        In my particular application I need to get the entire table object. If I get only the names and then call back to the metastore to get one table object at a time, it will be very slow. I have not measured the speed of the call with many tables; however, the hope is that with the offset and limit fields the application can stream those tables across multiple calls as opposed to getting them in one single gigantic batch. I will do the measurement though to find out how bad this is.

        The offsets would not be consistent if new tables are created in between; however, this level of consistency is not needed by the application. Think of an application that displays all the tables and their associated columns for a particular user in a paginated way. Whether new tables appear or disappear during pagination is not critical from the application's point of view. The only other ways of ensuring that things are consistent would be to run the whole query without offsets and limits or to do all this in a long transaction. The latter would be bad because it would hold locks on tables while pagination is happening, which would hurt other clients. Thoughts?
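
To make the offset inconsistency concrete, here is a small simulation of offset/limit paging over a table list that changes between calls. It is only a sketch: `OffsetPagingDemo` and its methods are hypothetical, and ordering results by table name is an assumption, since the thread does not say how the paged call orders its results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class OffsetPagingDemo {
    // Return one page of the name-sorted set, as an offset/limit query would.
    static List<String> page(TreeSet<String> names, int offset, int limit) {
        List<String> all = new ArrayList<>(names);
        int from = Math.min(offset, all.size());
        int to = Math.min(from + limit, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    // Page through the tables two at a time while a new table is created after the first page.
    static List<String> readWithConcurrentCreate() {
        TreeSet<String> tables = new TreeSet<>(Arrays.asList("t1", "t2", "t3", "t4"));
        List<String> seen = new ArrayList<>();
        int limit = 2;
        for (int offset = 0; ; offset += limit) {
            List<String> p = page(tables, offset, limit);
            if (p.isEmpty()) break;
            seen.addAll(p);
            if (offset == 0) tables.add("t0"); // created between calls; sorts before every existing name
        }
        return seen;
    }
}
```

In this run the reader sees t2 twice and never sees t0, which is the kind of drift a paginated display can tolerate but an exact export cannot.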

        Paul Yang added a comment -

        Instead of returning a list of Table objects, could we return a list of the matching table names? Then, the user would be responsible for getting the necessary table objects. Also, have you tried measuring the speed of the call where there are many (1000+) tables? It might be very slow, similar to how get_partitions() performs poorly compared to get_partition_names().

        Also, with the current approach, won't the offsets be inconsistent if new tables are created in between calls?

        Ashish Thusoo added a comment -

        Patch attached.

        This is also available at

        https://reviews.apache.org/r/409/

        Ashish Thusoo added a comment -

        Item 3 can be addressed by the current thrift interface, so only an iterator over tables owned by a particular user is needed.


          People

          • Assignee:
            Ashish Thusoo
            Reporter:
            Ashish Thusoo
          • Votes:
            0
            Watchers:
            4
