HIVE-91: Allow external tables with different partition directory structure

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: Metastore
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      A lot of users have datasets in a directory structure similar to this in HDFS: /dataset/yyyy/MM/dd/<one or more files>
      Instead of loading these into Hive the normal way, it would be useful to create an external table with the /dataset location and then one partition per yyyy/MM/dd. This would require the partition-name-to-directory mapping to be made more flexible.
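
      For context, a sketch of the mismatch (paths are hypothetical examples): Hive's default layout encodes key=value into each partition directory name, while these pre-existing datasets do not:

          layout Hive expects by default:   /dataset/yyyy=2008/mm=12/dd=10/<files>
          layout the datasets already use:  /dataset/2008/12/10/<files>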

      Attachments

      1. HIVE-91.patch
        40 kB
        Johan Oskarsson
      2. HIVE-91.patch
        40 kB
        Johan Oskarsson
      3. HIVE-91.patch
        32 kB
        Johan Oskarsson
      4. HIVE-91.patch
        30 kB
        Johan Oskarsson

          Activity

          Johan Oskarsson added a comment -

          Comment from Joydeep Sen Sarma on the mailing list:
          "Or we could have a 'format' spec in the create table command for how the directories are named. By default it's '%key=%value', but in this case it's '%value'. This might make it more flexible if we encounter other kinds of directory layouts."

          Joydeep Sen Sarma added a comment -

          One thing is not clear to me:

          When an external table is created pointing to a location, do the subdirectories automatically get registered as corresponding partitions in Hive? Similarly, when new subdirectories are added, does Hive recognize them automatically? (The only other alternative would be to call 'load data ...' where the data directory and the target directory are the same, which would probably work, but I don't think we have tried it out.)

          (This is kind of relevant to HIVE-126, since we are getting rid of the logic that recognizes partitions based on HDFS contents.)

          This seems like a usability issue. If the directories already exist and you are unwilling to alter them (so that Hive can convert them into the internal table directory structure), then I presume there are other apps that work directly against the directory namespace, and perhaps there is already a pipeline populating these directories on an ongoing basis. This would suggest that Hive should just learn about partitions from the HDFS namespace, rather than burden those pipelines with calling 'load data' and 'drop partition' on subdirectory creation/deletion.

          Comments?

          Johan Oskarsson added a comment -

          My approach would be to have a command to add partitions manually; I have created a JIRA ticket for it: HIVE-115. There's also already a method for this in the metastore Thrift interface, if I'm not mistaken. For us it would be fairly simple to run another command after loading our data into HDFS.
          It would also be a bit tricky to automatically find partitions in HDFS if they have a custom format. Off the top of my head I can't think of a way to handle directories like /dataset/2008/12/10/spain, where 2008/12/10 is one partition value and spain is another. We would have to save more information about the exact directory structure for each partition, and it seems to get more complex than it has to be at this stage.

          Prasad Chakka added a comment -

          Can't this be achieved by adding a location (or suffix) parameter to the 'create or load partition' command? That may even exist right now. How would one define the template for the flexible partition-naming function? A partition key might have sub-components, or the order of keys could be different. This is doable, but the effort required is not necessary for this functionality.

          Johan Oskarsson added a comment -

          Do you mean something along the lines of:
          ALTER TABLE table_name ADD PARTITIONS (partition_col = '2008/12/04', another='spain') location ('/wherever/2008/12/04/spain')

          That would certainly work for our case and looks like a cleaner solution.

          Prasad Chakka added a comment -

          Yeah, that is what I meant. Optionally the user could give a path relative to the table's location instead of the complete path.

          Alternatively, determining the partition path could be made customizable through a configurable class (the usual way extensions are done in the Hadoop world). This approach is also cleaner in that it leaves the implementation to the user, who can customize it for each table differently; see the sketch after this comment.

          If specifying the location every time a partition is created or loaded turns out to be cumbersome, then we can do the above.
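
          For illustration, the configurable-class idea might look like the sketch below; the interface name, method, and wiring are hypothetical, not actual Hive code:

          import java.util.LinkedHashMap;
          import java.util.Map;

          // Hypothetical extension point for computing a partition's directory
          // path; a table-level config entry could select the implementation.
          interface PartitionPathResolver {
            // Maps a partition spec (key -> value, in partition-column order)
            // to a path relative to the table's location.
            String partitionPath(LinkedHashMap<String, String> partSpec);
          }

          // Default behaviour, matching Hive's standard key=value layout:
          class DefaultPartitionPathResolver implements PartitionPathResolver {
            public String partitionPath(LinkedHashMap<String, String> partSpec) {
              StringBuilder sb = new StringBuilder();
              for (Map.Entry<String, String> e : partSpec.entrySet()) {
                if (sb.length() > 0) sb.append('/');
                sb.append(e.getKey()).append('=').append(e.getValue());
              }
              return sb.toString(); // e.g. "yyyy=2008/mm=12/dd=10"
            }
          }

          // A resolver for the layout in this issue would return only the
          // values, e.g. "2008/12/10".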

          Johan Oskarsson added a comment -

          This patch allows a user to specify a location when adding a partition as seen in the example query above. The location is relative to the table location. There's also a unit test included.
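
          A minimal usage sketch (table, column, and path names are hypothetical, and the exact keyword spelling is an assumption based on the final grammar; the partition LOCATION resolves relative to the table's /dataset location):

          CREATE EXTERNAL TABLE dataset (line STRING)
          PARTITIONED BY (dt STRING, country STRING)
          LOCATION '/dataset';

          -- Registers existing data at /dataset/2008/12/10/spain as a partition:
          ALTER TABLE dataset ADD PARTITION (dt = '2008-12-10', country = 'spain')
          LOCATION '2008/12/10/spain';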

          Prasad Chakka added a comment -

          Hive.java:
          Can you move the logic of creating a new partition into Partition.java (as a new constructor)? I would like to isolate partition creation code in a single class.
          559:560 -> use the log for printing, and also throw an exception back.

          DDLSemanticAnalyzer.java:
          In semantic analysis of the query we just build up a description of the input in a temporary structure and leave the actual creation of Partition objects to DDLTask. Look at the analyzeCreateTable method.

          DDLTask.java:
          Usually hive.metastore interfaces are not exposed to hive.ql except through hive.ql.metadata. The rest of hive.ql just uses hive.ql.metadata to access metadata functionality (there are a couple of instances where hive.metastore is used directly in hive.ql, but there shouldn't be unless they are simple model objects without any logic). It may be cleaner if DDLTask calls Hive.addPartition(tbl, part_vals, location) and lets Hive.java take care of creating the partition object and making the metastore call, as sketched after this comment.

          Also, tbl.isExternal() can be moved out of the for loop. BTW, why do we want to restrict this to external tables only? The same code can be used in cases where the user creates the partition data in the location that internal tables expect but wants to add the metadata, right?
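
          For illustration only, the layering suggested above might look like the following inside org.apache.hadoop.hive.ql.metadata.Hive; the Partition constructor taking a relative location is an assumption drawn from this discussion, not the actual patch:

          public Partition addPartition(Table tbl, Map<String, String> partSpec,
              String relativeLocation) throws HiveException {
            try {
              // Partition.java owns partition-object creation (the new
              // constructor requested above), including resolving
              // relativeLocation against the table's path when it is given:
              Partition part = new Partition(tbl, partSpec, relativeLocation);
              // Only Hive.java talks to the metastore client:
              getMSC().add_partition(part.getTPartition());
              return part;
            } catch (Exception e) {
              // Per the Hive.java 559:560 note, log and also rethrow:
              LOG.error("Unable to add partition", e);
              throw new HiveException(e);
            }
          }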

          Johan Oskarsson added a comment -

          Updated patch with the suggestions mentioned. Prasad, can you have a look and make sure I got everything right?
          Also removed the constraint that it has to be an external table.

          Prasad Chakka added a comment -

          I should have mentioned this before, but could you add some clipositive tests for the grammar changes? (Both regular and external.)

          TestAddPartition.java
          53: this has no effect. SerializationLib should be set using ql.metadata.Table.setSerializationLib(), but it doesn't matter in this test.

          Otherwise everything looks good.
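
          For reference, a minimal sketch of the distinction being made above (the SerDe class name is an early-Hive example; treat the exact call sites as assumptions, not the patch's code):

          import org.apache.hadoop.hive.metastore.api.SerDeInfo;

          public class SerDeExample {
            public static void main(String[] args) {
              // Setting the serialization library on the Thrift-level
              // descriptor directly, as the test at line 53 does:
              SerDeInfo serdeInfo = new SerDeInfo();
              serdeInfo.setSerializationLib(
                  "org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe");
              System.out.println(serdeInfo.getSerializationLib());
              // The review instead suggests going through
              // ql.metadata.Table.setSerializationLib(), which updates the
              // table's SerDeInfo for you.
            }
          }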

          Johan Oskarsson added a comment -

          Updated the patch to include clipositive tests; I have left the rest of the patch as is.
          I left the code at TestAddPartition.java:53 in place because I get an NPE if I remove it.

          Prasad Chakka added a comment -

          Looks good. +1

          Could you post the NPE that you get when line 53 is changed to setSerializationLib()? It shouldn't be happening (TestHiveMetaStore.java does a similar thing) and I am curious as to why it is. But this needn't delay the check-in.

          Ashish Thusoo added a comment -

          TestAddPartition failed in the test run...

          [junit] Running org.apache.hadoop.hive.ql.plan.TestAddPartition
          [junit] FAILED: Error in metadata: MetaException(message:java.lang.NullPointerException null)
          [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 7.01 sec
          [junit] Test org.apache.hadoop.hive.ql.plan.TestAddPartition FAILED

          Prasad Chakka added a comment -

          Could you post the entire stack trace for the exception?

          Johan Oskarsson added a comment -

          Sorry for the confusion, it seems I uploaded an older version of the patch. This is the right one.

          Prasad: I meant that the NPE appears if I don't set the serialization lib. This latest patch does it using setSerializationLib in SerDeInfo.

          Ashish Thusoo added a comment -

          Committed. Thanks Johan!


            People

            • Assignee:
              Johan Oskarsson
            • Reporter:
              Johan Oskarsson
            • Votes:
              1
            • Watchers:
              5
