Hive / HIVE-493

Automatically infer existing partitions of a table from HDFS files.

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.3.0, 0.4.0
    • Fix Version/s: None
    • Component/s: Metastore, Query Processor
    • Labels:
      None

      Description

      Initially, the partition list for a table was inferred from the HDFS directory structure instead of being looked up in the metastore (partitions are created using 'alter table ... add partition'). But this automatic inference was removed in favor of the latter approach while the metastore checker feature was being checked in, and also to facilitate external partitions.

      Joydeep and Frederick mentioned that it would be simpler for users to create the HDFS directory and let Hive infer the partition rather than explicitly add one. But doing so raises the following issues:

      1) External partitions – we would have to mix both approaches, so the partition list becomes a merged list of inferred and registered partitions, and duplicates have to be resolved.
      2) Partition-level schemas can't be supported. Which schema do we choose for an inferred partition: the table schema at the time the inferred partition was created, or the latest table schema? And how would we know the table schema at the time the inferred partition was created?
      3) If partitions have to be registered, a partition can be disabled without actually deleting the data. This feature is not currently supported and may not be that useful, but nevertheless it can't be supported with inferred partitions.
      4) Indexes are being added. If partitions are not registered, then indexes for such partitions cannot be maintained automatically.

      I would like to know what the general thinking about this is among users of Hive. If inferred partitions are preferred, can we live with the restricted functionality this imposes?

      1. HIVE-493-2.patch
        8 kB
        Cyrus Katrak
      2. HIVE-493.patch
        10 kB
        Cyrus Katrak

        Issue Links

          Activity

          Zheng Shao added a comment -

          We won't need this any more given HIVE-874.

          Prasad Chakka added a comment -

          Created HIVE-874 for this patch, as it doesn't exactly solve this problem. I would like to keep this open for that 'perfect' solution.

          Cyrus Katrak added a comment -

          Prasad,

          Patch revised: took away the remove-partition functionality.
          1) I took a look at TestHiveMetaStoreChecker.java; it doesn't seem to expose the ability to run an HQL command, and the clientpositive scripts don't seem to offer a way to modify the file system between queries. Any suggestions?
          2) Done. Thanks for the pointer.
          3) Done.
          4) No, removed.

          Prasad Chakka added a comment -

          Cyrus,

          Thanks for providing this patch. Very useful.

          It is possible that on an HDFS with permissions enabled, a partition/table directory is not accessible to the current user, but its metadata would still be deleted here, so I am a little uncomfortable with removing partitions. I am not really sure there is much utility in removing partitions compared to the risk of losing partitions permanently. What do you think?

          Couple of comments on the code:
          1) Can you add a test or two to the msck test package?
          2) REPAIR should be an optional keyword in the MSCK ANTLR clause instead of being a whole separate clause. Look at how KW_EXTERNAL is used in the createStatement clause.
          3) The following line should be outside of the for loop, since there is only one table here.

          Table table = db.getTable(MetaStoreUtils.DEFAULT_DATABASE_NAME,
                          msckDesc.getTableName());
          

          4) Is this cast '(Map <String, String>)' really needed?

          Cyrus Katrak added a comment -

          Patch Attached.
          Modified the HQL command MSCK to take an additional parameter, REPAIR, which adds/removes the partitions that msck notices.
          Example output:

          hive> msck table t;
          OK
          Partitions not in metastore: t:p1=new/p2=5 t:p1=new/p2=6
          Partitions missing from filesystem: t:p1=6/p2=5 t:p1=6/p2=6
          Time taken: 0.084 seconds
          hive> msck repair table t;
          OK
          Partitions not in metastore: t:p1=new/p2=5 t:p1=new/p2=6
          Partitions missing from filesystem: t:p1=6/p2=5 t:p1=6/p2=6
          Repair: Added partition to metastore t:p1=new/p2=5
          Repair: Added partition to metastore t:p1=new/p2=6
          Repair: removed partition from metastore t:p1=6/p2=5
          Repair: removed partition from metastore t:p1=6/p2=6
          Time taken: 0.521 seconds
          hive> msck table t;
          OK
          Time taken: 0.083 seconds

          Cyrus Katrak added a comment -

          Ping? I was halfway through writing a script to do this when I found this thread.

          You can do 'alter table <tbl> add partition <partition spec>' at the end of the map-reduce job that creates the partition. You don't really need 'automatic inference' unless you do not have any control over the partition creation process

          In one of my use cases I don't have control over partition creation (intra-cluster copying of a Hive table).
          Ignoring the other issues (indexes/compaction), I think an HQL solution would be useful.

          e.g.: "ALTER TABLE <tbl> ADD PARTITION AUTOSCAN"
          Adds entries for partitions on hdfs that don't exist in the metastore.

          Schubert Zhang added a comment -

          @Prasad,

          1) We want the data to be available in the Hive table after each run of the MapReduce job. Do you mean we should only 'add partition' once, the first time, and need not do it after adding new files to an existing partition?

          2) Thanks for your good advice. The lease is necessary.

          It seems we should add more features into Hive and let it become a data server (or data warehouse server), since the current Hive is a dumb data tool that translates HQL to MapReduce.

          When thinking of following important data service features, such as:

          • Indexing
          • Compact
          • Lease
          • Real Dynamic Partitioning (or automatic, even varying, partitioning)
          • ....

          It seems that, besides MapReduce data processing and SQL, we should implement a light and flexible BigTable/HBase-like data service solution.

          Prasad Chakka added a comment -

          When to run the 'add partition' command depends on when you want to make the partition available for use. But you should run that command only once per partition, not every time you add new files to the partition.

          In fact, we are thinking of this issue in our project. Are there any good practices?

          if you have a locking server available in your system, such as ZooKeeper, you should use it. Otherwise you could create an HDFS file as a lock. But you need to be careful here, since a client can die after acquiring a lock, creating a lot of orphaned locks.

          The correct solution would be to create a lease server inside Hive using the metastore DB. I might end up doing this if I get a couple of days of time.

          Schubert Zhang added a comment -

          Thanks Prasad,

          1) Do you mean we should run 'alter table <tbl> add partition <partition spec>' at the end of each MapReduce run?
          For example:

          The first run of the above MapReduce job creates directories for partition-1 and partition-2, each with some files underneath. So we should do 'alter table <tbl> add partition partition-1' and 'alter table <tbl> add partition partition-2'.

          The second run of the job generates some files under partition-2 and creates a new partition directory, partition-3. So we should do 'alter table <tbl> add partition partition-2' and 'alter table <tbl> add partition partition-3'.

          Is that right?

          2) After the first run in 1)'s example, is the data in partition-1 and partition-2 available?
          And do you mean that after the second run, the newly added data in partition-2 will become available along with the old existing data?

          3) Yes, it is a serious issue to coordinate reads and writes, since Hive is not a strict data server but a loose solution.
          In fact, we are thinking about this issue in our project. Are there any good practices?

          Prasad Chakka added a comment -

          1) You can do 'alter table <tbl> add partition <partition spec>' at the end of the map-reduce job that creates the partition. You don't really need 'automatic inference' unless you do not have any control over the partition creation process.

          2) Adding new files to existing partitions and tables should not be a problem now. But you may want to run the above add-partition command only when the last of the partition's files has been added; otherwise dependent data processes might see incomplete data. With automatic inference of partitions, dependent processes can see incomplete data.

          3) The Netflix guys wrote a custom map/reduce job for their compact/merge process. But you may want to coordinate the partition readers and the compactors so that the latter does not clobber the directory.
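
          To make point 1) concrete, the end-of-job registration is a single statement. This is only a sketch; the table name, partition column, and location path are illustrative, not taken from this issue:

          ```sql
          -- Run once, after the job has written the last file for the partition.
          ALTER TABLE page_views ADD PARTITION (dt='2009-10-01')
          LOCATION '/user/hive/warehouse/page_views/dt=2009-10-01';
          ```

          Until this statement runs, queries against the table simply do not see the new directory, which is exactly what keeps dependent processes from reading half-written data.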

          Schubert Zhang added a comment -

          We have the following use case:

          1. We have a periodic MapReduce job that pre-processes the source data (files), and we want to put the output data files into an HDFS directory. The HDFS directory corresponds to a Hive table (this table should be partitioned). The MapReduce job outputs data into different partitions based on data analysis.

          2. We want Hive to recognize any newly created partitions from the HDFS sub-directories under the table's root directory. The MapReduce job may add new files to newly created partitions or to existing partitions.

          3. We also need a compact/merge process to periodically compact or merge the existing partitions to get bigger files.

          Prasad Chakka added a comment -

          yeah, we could have a Java toolset similar to the ToolRunner that Hadoop provides. I think Python will be more useful only because it is a natural scripting language and tools can be developed much faster there.

          Edward Capriolo added a comment -

          Prasad,

          I see your point about creating too many tools. Agreed.

          As for the Python integration: should we have a Java-native toolset? It is more direct than Thrift or JDBC, and in a sense we already have it, since the project is in Java. It would/should be a very easy-to-use overlay.

          Prasad Chakka added a comment -

          hive --service compact --table tablea

          this would be easiest for users but not extensible, in the sense that users will not be able to do more. Eventually users start asking for options to do things such as not rebuilding indexes but just dropping them, or not compacting if the number of files or the total size is small. That requires a lot more effort from developers to provide such options. It may be better to provide ways to script these things using Python; maybe we can provide basic tools that do not require a lot of Hive-internal knowledge. There is a JIRA open to create and execute Hive queries natively in Python. We should extend such an API to do these kinds of tasks better.

          i am not sure how you can use queries like 'select columns from table where tablename='tablea'' for your use case without PL/SQL-like support in HiveQL.

          Edward Capriolo added a comment -

          Prasad,

          by 'virtual read-only schema' I mean that MySQL has schema tables. You can run queries like
          'select columns from table where tablename='tablea''. Hive has some support for things like 'show partitions', but some metadata can only be read with the MetaStore API.
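
          For reference, the MySQL facility being alluded to is INFORMATION_SCHEMA; the quoted query is a paraphrase of something like:

          ```sql
          -- MySQL, not HiveQL: list a table's columns via the read-only schema tables
          SELECT column_name, data_type
          FROM information_schema.columns
          WHERE table_name = 'tablea';
          ```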

          For a use case: I want to write my compact utility. As a user, I am faced with options on how to do this. The best way would be a tightly integrated tool.

          hive --service compact --table tablea
          

          This tool would be smart: it would do things like rebuild indexes, and warn or error on external tables or bucketed columns.

          On the other side of the spectrum, one could write a map/reduce job with hard-coded paths to the warehouse that works at the file level; this assumes I know a lot about the underpinnings of Hive.

          Users need to be able to explore the metastore table structure and the warehouse so their utilities can make an informed decision. Should they link to the metastore API, or should the HQL language support most of the operations so the process could be done purely from an HQL script?

          Prasad Chakka added a comment -

          @edward, you can write an external utility that keeps checking directories and compacting them as necessary. I don't understand 'virtual read only schema'.

          Edward Capriolo added a comment -

          I have a use case for something similar, and I wanted to get people's opinions on it. My intake process is a map/reduce job that takes as input a list of servers. I connect to these servers via FTP and take all the new files. We are doing 5-minute logs.

          I have a map-only job that writes the files to a static HDFS folder. After the map process is complete, I am presented with exactly this problem.

          Do I assume the partition is created, and copy the files? I decided to let hive handle this instead.

          // Build a LOAD DATA statement and run it through a forked 'hive -e' process.
          String hql = "load data inpath '" + conf.get("fs.default.name")
              + "/user/ecapriolo/pull/raw_web_log/" + p.getName()
              + "' into table raw_web_data partition (log_date_part='"
              + dateFormat.format(today.getTime()) + "')";
          System.out.println("Running " + hql);
          String[] run = new String[] { "/opt/hive/bin/hive", "-e", hql };

          LoadThread lt = new LoadThread(run); // LoadThread: my wrapper that execs the command
          Thread t = new Thread(lt);
          t.start();

          Personally, I do not think we should let users infer their way into the table layout of Hive. Users should have tools, whether API-based or HQL-based. I should not have to mix and match between hive -e 'something', map/reduce, and bash scripting to get a job done (I spent 4 hours trying to get the environment correct for my forked 'hive -e' query). (I probably should learn more about the Thrift API.)

          But that problem I already solved. My next problem is also important to this discussion: I now have too many files inside my directory. I am partitioned by day, but each server is dropping 5-minute log files. What I really need now is a COMPACT function to merge all these 5-minute data files into one. What would be the proper way to handle this? I could take an all-query approach by selecting all the data into a new table, then dropping the partition and selecting the data back into the original table. However, I could short-circuit the operations (and save time) by building the new partition first, deleting the old data, and then moving the new data back using 'dfs mv'.

          Should this be done through HQL ('COMPACT TABLE x PARTITION y')? Or should it be a service-style command: bin/hive --service compact table X partition Y? Doing it all through HQL is possible now, but not optimized in some cases, unless I am missing something.
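
          An HQL-only version of the compaction step might look like this (a sketch, with illustrative table and column names; note it rewrites the partition in place, so readers racing the rewrite could see partial data):

          ```sql
          -- Rewriting the partition onto itself replaces the many 5-minute files
          -- with the (fewer, larger) output files of the rewrite job.
          INSERT OVERWRITE TABLE raw_web_data PARTITION (log_date_part='2009-10-01')
          SELECT col1, col2, col3   -- list the table's non-partition columns explicitly
          FROM raw_web_data
          WHERE log_date_part = '2009-10-01';
          ```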

          I think we need easier insight into the metastore from HQL, like MySQL provides. 'show tables' is a good step, but we need something like a virtual read-only schema table to query.

          Sorry to be all over the place on this post.


            People

            • Assignee:
              Unassigned
              Reporter:
              Prasad Chakka
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development