HIVE-1918

Add export/import facilities to the hive system

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Metastore, Query Processor
    • Labels: None

      Description

      This is an enhancement request to add export/import features to Hive.

      With this language extension, the user can export the data of a table - which may be located in different HDFS locations in the case of a partitioned table - as well as the metadata of the table into a specified output location. This output location can then be moved over to a different Hadoop/Hive instance and imported there.

      This should work independently of the source and target metastore DBMS used; for instance, between Derby and MySQL.

      For partitioned tables, the ability to export/import a subset of the partitions must be supported.

      Howl will add more features on top of this: the ability to create/use the exported data even in the absence of Hive, using MR or Pig. Please see http://wiki.apache.org/pig/Howl/HowlImportExport for the details.

      1. HIVE-1918.patch.5.txt
        394 kB
        Krishna Kumar
      2. HIVE-1918.patch.4.txt
        412 kB
        Krishna Kumar
      3. HIVE-1918.patch.3.txt
        361 kB
        Krishna Kumar
      4. HIVE-1918.patch.2.txt
        317 kB
        Krishna Kumar
      5. hive-metastore-er.pdf
        137 kB
        Krishna Kumar
      6. HIVE-1918.patch.1.txt
        311 kB
        Krishna Kumar
      7. HIVE-1918.patch.txt
        312 kB
        Krishna Kumar

          Activity

          Krishna Kumar created issue -
          Krishna Kumar added a comment -

          Patch for adding export/import.

          Krishna Kumar made changes -
          Status: Open → Patch Available
          Krishna Kumar made changes -
          Attachment: HIVE-1918.patch.txt
          Krishna Kumar added a comment -

          Design notes:

          • Export/Import modeled on existing load functionality. No new tasks added, but existing tasks for copy/move/create table/add partition et al reused.
          • EXPORT TABLE table [PARTITION (partition_col=partition_colval, ...) ] TO location
          • IMPORT [[EXTERNAL] TABLE table [PARTITION (partition_col=partition_colval, ...)] ] FROM sourcelocation [LOCATION targetlocation]
           • The data/metadata are stored as an xml-serialized file for the metadata in the target directory, plus sub-directories for the data files.
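
           For illustration, a concrete use of the proposed syntax might look as follows (the table name, partition spec and paths are hypothetical):

               EXPORT TABLE department PARTITION (dep_org='finance') TO '/user/kk/export/department';

               -- after moving the export directory over to the target instance:
               IMPORT TABLE department PARTITION (dep_org='finance') FROM '/user/kk/export/department';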
          Edward Capriolo added a comment -

          Can we do this without the following change?

          -  public Partition(Table tbl, Map<String, String> partSpec, Path location)
          +  public Partition(Table tbl, Map<String, String> partSpec, Path location, Map<String, String> partParams)
          

          I would like to see some form of API stability in the metastore. It is very disruptive for programs that work against the metastore, of which I have several internally.

          Namit Jain made changes -
          Status: Patch Available → Open
          Krishna Kumar added a comment -

          Ok. Will take care of this via a delegating ctor.

          A process question: I guess I should wait for more comments from other reviewers before I create another patch, in case others are reviewing the current one?

          Carl Steinbach added a comment -

          Review request: https://reviews.apache.org/r/339
          Edward Capriolo added a comment -

          I was not implying that we should definitely not add Map<String, String> partParams. What I am asking is: what is the rationale for doing it? I think we should not need to add things to the metastore to export its information, but I might be missing something.

          Krishna Kumar added a comment -

          Why export/import needs this change: it is not the export part, but rather the import part, that needs this change. While creating a partition as part of an import, we need to be able to create the partition along with its ancillary data, including partition parameters. But the first part of the existing "create partition" flow (AddPartitionDesc -> DDLTask.addPartition -> Hive.createPartition) did not support specifying partition params, while the second part (metastore.api.Partition -> IMetaStoreClient.add_partition -> HiveMetaStore.HMSHandler.add_partition -> ObjectStore.addPartition) does. So I added the ability to pass the partition parameters along in the first part of the flow.

          In terms of options for compatible changes, there are two I can see:

          1. The solution suggested above. Add an additional ctor so that no existing code breaks.

          public Partition(Table tbl, Map<String, String> partSpec, Path location) {
            this(tbl, partSpec, location, null);
          }
          
          public Partition(Table tbl, Map<String, String> partSpec, Path location, Map<String, String> partParams) {...}
          

          2. Have only the current ctor, but in Hive.createPartition get the underlying metastore.api.Partition and set the parameters on it before passing it on to the metastoreClient, as in the sketch below.
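
          A minimal sketch of option 2, assuming ql.metadata.Partition exposes its underlying thrift object via getTPartition() and Hive exposes its metastore client via getMSC(); the surrounding createPartition code is abridged:

              // inside Hive.createPartition (sketch)
              org.apache.hadoop.hive.metastore.api.Partition tpart = part.getTPartition();
              if (partParams != null) {
                tpart.setParameters(partParams); // no ctor change in ql.metadata.Partition
              }
              getMSC().add_partition(tpart);     // hand the fully populated partition to the client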

          Thoughts?

          Edward Capriolo added a comment -

          If Partition currently has only one constructor, I do not think we should add a second constructor; just change it and break the old compatibility. If we do add partParams, we should have DDL statements to view and modify these. For a long time there were many things that were added in create table, such as InputFormat or OutputFormat, that we had no way to view or modify besides going directly to the database via JDBC and changing them. I am not sure if this is the case here, but I do not think we should be adding fields that are not completely manageable.

          Carl Steinbach added a comment -

          @Krishna: I tried applying the patch to trunk and running the tests, and noticed a couple of things up front that need to be fixed:

          • The patch doesn't apply cleanly with 'patch -p0'. To satisfy this using a Git repo you need to generate the patch using 'git diff --no-prefix ...'
          • Most of the new exim* tests fail with diffs. Can you please fix this and update the patch?
          • 'hive.test.exim' needs to be added to HiveConf and hive-default.xml. There is already a 'hive.test.mode.*' namespace defined in HiveConf, so you should probably follow this convention and change the name to 'hive.test.mode.exim'. I think an even better solution would be to instead define a new conf property called 'hive.exim.uri.scheme.whitelist' and make it a comma separated list of acceptable URI schemes for import and export.
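
          For what it's worth, the scheme check against such a whitelist could be as simple as the following sketch (property name as proposed above; the helper name and error message are hypothetical):

              import java.net.URI;
              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.hive.ql.parse.SemanticException;

              // Reject export/import locations whose URI scheme is not in the
              // comma-separated hive.exim.uri.scheme.whitelist property.
              static void checkUriScheme(Configuration conf, URI location) throws SemanticException {
                String whitelist = conf.get("hive.exim.uri.scheme.whitelist", "hdfs,pfile");
                String scheme = location.getScheme();
                for (String allowed : whitelist.split(",")) {
                  if (allowed.trim().equalsIgnoreCase(scheme)) {
                    return; // acceptable scheme
                  }
                }
                throw new SemanticException("invalid scheme for export/import location: " + scheme);
              }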
          Namit Jain added a comment -

          A couple of issues:

          1. Can you add a new test directory? I mean, exporting into /tmp/.. means that there will be problems with concurrent tests on the same machine.
          2. Do you want to support errors at import time? I mean, what happens if one of the rows is bad - should there be an option to specify the number of rows to ignore, and dump them somewhere else?

          Krishna Kumar added a comment -

          @Edward: Both the existing data model (prettified ER diagram attached) and the object model (class org.apache.hadoop.hive.metastore.api.Partition) allow the specification of parameters on a per-partition basis. So I am not adding new fields to either of these models. By proposal 2 above, I will not be adding any ctor parameters to org.apache.hadoop.hive.ql.metadata.Partition either.

          Your point re providing manageability via DDL statements for all aspects of the data/object model is taken. But I am not adding new aspects to either model, so if we do need to address current manageability gaps, should they not be addressed via another enhancement request, rather than this one, which aims simply to add export/import facilities?

          Krishna Kumar added a comment -

          @Carl:

          1. Taken care of in the new patch.
          2. Can you post some of the diffs you get failures on? I had a problem with running the tests on nfs-mounted directories. That had to do with an existing bug in the load functionality, which used to result in a "MetaException: could not delete dir" error while trying to clean up the effects of the previous test. I have created a separate jira HIVE-1924 for this and have attached a patch.
          3. Have taken the whitelist approach; the whitelist is now set to "hdfs,pfile".

          Krishna Kumar added a comment -

          A quick summary of the second derivative (the difference between the diffs):

          • used no-prefix while generating patch
          • hive.test.exim replaced by hive.exim.uri.scheme.whitelist
          • schemaCompare, initializeFromUrl, validateTable all refactored to util methods
          • trailing spaces in some test files removed
          Krishna Kumar made changes -
          Attachment: HIVE-1918.patch.1.txt
          Krishna Kumar added a comment -

          Prettified ER diagram of the existing data model

          Krishna Kumar made changes -
          Attachment: hive-metastore-er.pdf
          Krishna Kumar added a comment -

          @Namit:

          1. Do you have any ideas re how we can get a unique, temporary directory name for use in the test script files? In code, of course, we can use the getScratchDir methods, but how do we solve this problem in the test scripts?

          2. Export/Import, as in the case of Load, operates at the file level rather than at the record level. So there are no record-level filters available.

          Edward Capriolo added a comment -

          But I am not adding new aspects to either model, so if indeed we need to address current manageability gaps, should they not be addressed via another enhancement request, rather than this one, which aims simply to add export/import facilities?

          This depends on the lag time between the feature getting added and the enhancement being added. With wishful thinking the two events will be close, but history does not always agree. Management becomes more important as different people begin using the metastore for different purposes. I believe that if we add any feature, support to manage it completely with DML is mandatory. Currently the metastore is changing for security, import/export, and indexes; if each of these features is only half manageable at time X, this will make releases awkward and half functional.

          Namit Jain added a comment -

          @Krishna

          1. How is it different from running multiple tests currently? The output files are generated in the build/ql/.... directory under the hive installation. Similarly, you can create a new directory under build. This should definitely be done in this patch itself.

          2. Load does not do any checking - nor does it perform any metadata operation. The table already exists, and load is merely a helper function. Since import touches both data and metadata, it makes sense to validate the data (at least optionally) - but that can always be done in a follow-up.

          Krishna Kumar added a comment -

          Patch including

          • no changes to ql.metadata.Partition as per option#2 above
          • use relative paths in tests
          Krishna Kumar made changes -
          Attachment: HIVE-1918.patch.2.txt
          Krishna Kumar added a comment -

          @Namit: The attached patch now uses relative paths in test scripts. (Note that some existing tests [clientpositive/insertexternal1.q, clientpositive/load_fs.q] use absolute paths even today; those need to be changed via another bug report.)

          @Edward: No changes to Partition.java as proposed in option 2 above.

          Krishna Kumar added a comment -

          Patch with all open issues addressed

          Krishna Kumar made changes -
          Attachment: HIVE-1918.patch.3.txt
          Krishna Kumar added a comment -

          With this patch, I think all the above issues are addressed. I have also added 3 bug fixes plus tests for those bugs. Please review.

          Krishna Kumar made changes -
          Status: Open → Patch Available
          Namit Jain added a comment -

          Can you create a review-board request, or is the old one still valid?

          Namit Jain made changes -
          Assignee: Krishna Kumar
          Namit Jain added a comment -

          @Paul, you should definitely review it as well, before it gets in. I am reviewing it right now.

          Carl Steinbach added a comment -

          I updated the diff on reviewboard: https://reviews.apache.org/r/339

          Paul Yang added a comment -

          Looking at it as well.

          Namit Jain added a comment -

          Still looking:

          1. Move the export/import stuff to a new directory/file. All the changes to HiveUtils should be in EximUtil.

          2. DDLTask.java - why are the existing APIs changing? The default behavior of createPartition() should not change.

          3. Inputs not populated for import, and outputs for export. This can break some dependent stuff: authorization, concurrency etc.

          Krishna Kumar added a comment -

          Thanks, Namit, for the comments.

          1. Ok re moving the serialization/deserialization methods to EximUtil, but I did not understand the first part. Are you suggesting moving EximUtil, ImportSemanticAnalyzer and ExportSemanticAnalyzer to a new package? That does not seem warranted; today all parsing/semantic analysis classes are in the o.a.h.h.ql.parse package...

          2. You mean Hive.java's API? The existing first createPartition remains as it is; the second createPartition, used in DDLTask, is changing to allow the creation of a partition with all the partition-specific configurations. Since AddPartitionDesc is initialized with nulls/-1 for these extra parameters, the existing behaviour is not altered.

          3. Can you expand a little? What are inputs/outputs (classes? tables?) - if they are part of the existing object model/data model, I think they are exported and imported.

          Namit Jain added a comment -

          1. Just moving the serialization/deserialization methods.

          3. Inputs and outputs are present in SemanticAnalyzer. Look at ReadEntity/WriteEntity - they are populated by the appropriate semantic analyzer and then used later by:

          a. concurrency: uses inputs/outputs to lock those objects
          b. authorization: uses inputs/outputs for permission checking
          c. execution hooks: they can be used for a variety of things; in facebook, we use them for replication
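
          (For reference, populating these entity sets in the analyzers amounts to something like the sketch below, where inputs/outputs are the sets inherited from BaseSemanticAnalyzer; the exact call sites and constructors used by the patch are assumptions.)

              // ExportSemanticAnalyzer: the exported table/partitions are read
              inputs.add(new ReadEntity(table));

              // ImportSemanticAnalyzer: existing targets are written
              outputs.add(new WriteEntity(table));     // importing into a table
              outputs.add(new WriteEntity(partition)); // importing into a partition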

          Namit Jain made changes -
          Status: Patch Available → Open
          Namit Jain added a comment -

          Reading from:
          http://download.oracle.com/docs/cd/B10500_01/server.920/a96652/ch02.htm#1005081

          Importing into Existing Tables

          This section describes factors to take into account when you import data into existing tables.

          Manually Creating Tables Before Importing Data

          When you choose to create tables manually before importing data into them from an export file, you should use either the same table definition previously used or a compatible format. For example, although you can increase the width of columns and change their order, you cannot do the following:

          Add NOT NULL columns
          Change the datatype of a column to an incompatible datatype (LONG to NUMBER, for example)
          Change the definition of object types used in a table
          Change DEFAULT column values

          Note:
          When tables are manually created before data is imported, the CREATE TABLE statement in the export dump file will fail because the table already exists. To avoid this failure and continue loading data into the table, set the import parameter IGNORE=y. Otherwise, no data will be loaded into the table because of the table creation error.

          Do you want to support this? It seems like a reasonable thing to have - currently, an error is thrown during import if the table already exists.

          Krishna Kumar added a comment -

          Importing into existing tables is now supported, but the checks (to see whether the imported table and the target table are compatible) have been kept fairly simple for now. Please see ImportSemanticAnalyzer.checkTable; a sketch of the column comparison follows below. The schemas (column and partition) of the two must match exactly, except for comments. Since we are just moving files (rather than rewriting records), I think there will be issues if the metadata schema does not match the data serialization exactly (in terms of types, number etc.).
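
          A sketch of the kind of column comparison involved, assuming the thrift FieldSchema accessors getName()/getType(); the helper name is hypothetical:

              import java.util.List;
              import org.apache.hadoop.hive.metastore.api.FieldSchema;

              // Schemas match when names and types agree position by position;
              // comments are deliberately ignored.
              static boolean schemasMatch(List<FieldSchema> imported, List<FieldSchema> existing) {
                if (imported.size() != existing.size()) {
                  return false;
                }
                for (int i = 0; i < imported.size(); i++) {
                  FieldSchema a = imported.get(i);
                  FieldSchema b = existing.get(i);
                  if (!a.getName().equalsIgnoreCase(b.getName())
                      || !a.getType().equalsIgnoreCase(b.getType())) {
                    return false;
                  }
                }
                return true;
              }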

          Re the earlier comment on outputs/inputs: got what you meant. I will add the table/partition to the inputs in ExportSemanticAnalyzer. But in the case of imports, I see that the tasks themselves add the entity operated upon to the inputs/outputs list. Isn't that too late for authorization/concurrency, even though it may work for replication? Or are both the sem. analyzers and the tasks expected to add them? In the case of a newly created table/partition, the sem. analyzer does not have a handle?

          Namit Jain added a comment -

          Tasks only add them when they may not be available at compile time - for example, in the case of dynamic partitions.
          The semantic analyzer is supposed to add them.

          Krishna Kumar added a comment -

          Hmm. LoadSemanticAnalyzer (which knows the table) does not add it to the outputs, but the MoveTask it schedules does.

          Similarly, CREATE-TABLE does not add the entity, but the DDLTask it schedules does. This may be fine only because the entity does not exist at compile time?

          ADD-PARTITION adds the table as an input at compile time and the partition itself is added as an output at execution time. Should not the table be an output (at compile time) as well - for authorization/concurrency purposes?

          Anyway, where the import operates on existing tables/partitions, I will add them at compile time. If the entity is being created as part of the task, then the task will be adding them to inputs/outputs at runtime. Is this fine?

          Namit Jain added a comment -

          Please file bugs for the above cases -

          The changes for import look fine.
          You also need to make similar changes for export.

          Krishna Kumar added a comment -

          Patch with

          • metadata ser/deser methods moved from HiveUtils to EximUtil
          • inputs and outputs populated; authorization related bugfix and tests
          Krishna Kumar made changes -
          Attachment: HIVE-1918.patch.4.txt
          Krishna Kumar added a comment -

          Please review. Will try and see if I can update the reviewboard myself...

          Krishna Kumar made changes -
          Status: Open → Patch Available
          Namit Jain added a comment -

          Can you upload a patch to review-board?

          Krishna Kumar added a comment -

          https://reviews.apache.org/r/430/ added (with hive-git as repository).

          Carl, can you take down 339 as that is now superseded?

          Namit Jain added a comment -

          Krishna, the code changes look good - I had only one concern.

          The functions in EximUtil.java like:

          private static Element createStorageDescriptor(Document doc,
              String location,
              String inputFormatClass,

          have an implicit dependency on the metastore schema. If the schema changes, export/import will break, and it will be difficult to add them.

          Do you want to think about it?

          I mean, add an API in the metastore thrift to generate this, or something like that, so that this code is auto-generated and is amenable to new fields.

          Namit Jain added a comment -

          @Paul, do you have any additional comments ?

          Krishna Kumar added a comment -

          There are a few reasons why I took this approach:

          • The decision on compatibility (forward/backward) checks, as in EximUtil.checkCompatibility, needs to be taken consciously. That is, automatically breaking backward compatibility is not an option here, I think.
          • What needs to be serialized/deserialized also requires a human decision. For instance, even now, authorization details are not transferred by an export/import.
          • The serialization/deserialization methods are also used by the Howl codebase outside of a hive context. It will be good to have this code only loosely coupled to the metastore code.
          Paul Yang added a comment -

          @namit - was working on some other issues, will take another look.

          Paul Yang added a comment -

          Made a couple of comments on reviewboard.

          Krishna Kumar added a comment -

          Thanks, Paul.

          [Your comments are on a superseded review board submission; I will remind Carl again to take it down. The current reviewboard submission is up at https://reviews.apache.org/r/430/, but nevertheless both your comments are still applicable.]

          1. Ok. Will address it.

          2. I am not seeing how compatibility checking and selective serialization/deserialization of an object graph would be possible with auto-generated code. I will look into both thrift and datanucleus serialization (which you mentioned) from this aspect, but fine-grained control over this process is required here, I think.

          Krishna Kumar added a comment -
          • Nested ternaries expanded
          • thrift-based serialization for the metastore objects

          Please review.
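
          (For reference, thrift-based serialization of a metastore object can be done along the lines of the sketch below, using libthrift's TSerializer/TDeserializer with TJSONProtocol; whether the patch wires it up exactly this way is an assumption.)

              import org.apache.hadoop.hive.metastore.api.Table;
              import org.apache.thrift.TException;
              import org.apache.thrift.TSerializer;
              import org.apache.thrift.TDeserializer;
              import org.apache.thrift.protocol.TJSONProtocol;

              // Round-trip a thrift-generated Table through JSON.
              static String tableToJson(Table table) throws TException {
                return new TSerializer(new TJSONProtocol.Factory()).toString(table, "UTF-8");
              }

              static Table tableFromJson(String json) throws TException {
                Table table = new Table();
                new TDeserializer(new TJSONProtocol.Factory()).deserialize(table, json, "UTF-8");
                return table;
              }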

          Krishna Kumar made changes -
          Attachment: HIVE-1918.patch.5.txt
          Carl Steinbach added a comment -

          I discarded my reviewboard request after Krishna's original request. Unfortunately this doesn't appear to prevent people from commenting on it, and reviewboard won't let me delete the request outright.

          Please review the patch here: https://reviews.apache.org/r/430/

          Paul Yang added a comment -

          @Krishna - can you regenerate this patch against trunk?

          Krishna Kumar added a comment -

          Merged with trunk

          Krishna Kumar made changes -
          Attachment: HIVE-1918.patch.5.txt (merged version added, earlier version of the same name removed)
          Paul Yang added a comment -

          +1 Looks good, will test and commit

          Paul Yang added a comment -

          Committed. Thanks Krishna!

          Paul Yang made changes -
          Status: Patch Available → Resolved
          Fix Version/s: 0.8.0
          Resolution: Fixed
          Carl Steinbach made changes -
          Component/s: Metastore
          Carl Steinbach made changes -
          Status: Resolved → Closed
          Carl Steinbach made changes -
          Assignee: Krishna Kumar → Carl Steinbach
          Carl Steinbach made changes -
          Assignee: Carl Steinbach → Krishna Kumar
          Carl Steinbach made changes -
          Link: This issue relates to HIVE-2816
          Sho Shimauchi made changes -
          Link: This issue relates to HIVE-4299
          Gelesh added a comment -

          Wish: Can we have an option to copy the meta information alone?

          Use case: during DistCp, without copying the Hive files (with the partition folder and clustered file structure) to a temp location, we could create a _meta file alone.

          Then, DistCp the Hive files (the partitioned and clustered file structure) as such and re-create the Hive table in the new cluster.

          Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Patch Available → Open | 7d 2h 23m | 2 | Namit Jain | 09/Feb/11 06:40
          Open → Patch Available | 17d 21h 34m | 3 | Krishna Kumar | 11/Feb/11 15:47
          Patch Available → Resolved | 32d 4h 42m | 1 | Paul Yang | 15/Mar/11 20:30
          Resolved → Closed | 276d 3h 26m | 1 | Carl Steinbach | 16/Dec/11 23:56

            People

            • Assignee: Krishna Kumar
            • Reporter: Krishna Kumar
            • Votes: 0
            • Watchers: 9
