Hive / HIVE-6405

Support append feature for HCatalog

    Details

      Description

      HCatalog currently treats all tables as "immutable": every table and partition can be written to only once and cannot be appended to. In practice, this means the following:

      • A non-partitioned table can be written to once; after that, its data never changes unless the table is dropped and recreated.
      • A partitioned table supports a limited form of "appending" by adding new partitions, but once a partition has been written, no new data can be added to it.

      Hive, on the other hand, allows "INSERT INTO" on a table, which gives us append semantics. Both models have their benefits, so our goal is as follows:

      a) Introduce the notion of an immutable table as a table property, with tables not immutable by default. If this property is set and we attempt to write to a table (or partition) that already has data, "INSERT INTO" from Hive is disallowed. Setting this property lets Hive mimic HCatalog's current immutable-table behaviour. (I'm going to create a separate sub-task to cover this bit, and focus on the HCatalog side in this jira.)

      b) As long as that flag is not set, HCatalog should be changed to allow appends as well, rather than simply erroring out if data already exists in the table.
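
The difference between the current write-once behaviour and goals (a) and (b) can be laid out as a small truth table. The following is only an illustrative sketch: `Target`, `old_hcatalog_allows_write`, and `proposed_allows_write` are hypothetical names invented for this example and are not part of Hive or HCatalog, and the model considers only whether the target already holds data.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical stand-in for the table or partition being written to."""
    immutable: bool = False  # proposed default: tables are not immutable
    has_data: bool = False


def old_hcatalog_allows_write(target: Target) -> bool:
    # Current behaviour described above: every target is write-once,
    # regardless of any table property.
    return not target.has_data


def proposed_allows_write(target: Target) -> bool:
    # Proposed behaviour: reject only when the table is flagged immutable
    # AND the target already has data; otherwise the write is an append.
    return not (target.immutable and target.has_data)


# Print every combination so the two policies can be compared side by side.
for immutable in (False, True):
    for has_data in (False, True):
        t = Target(immutable, has_data)
        print(immutable, has_data,
              old_hcatalog_allows_write(t), proposed_allows_write(t))
```

The only row where the two policies differ is a mutable target that already has data, which is exactly the append case this issue is about.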

        Activity

        Sushanth Sowmyan created issue -
        Sushanth Sowmyan made changes -
        Attachment HIVE-6405.patch [ 12629811 ]
        Sushanth Sowmyan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Sushanth Sowmyan made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Sushanth Sowmyan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Release Note Introduces append feature for HCatalog writes.

        Previously, HCat would fail if a user attempted to write to an unpartitioned table that had data in it, or to a partition of a partitioned table that had data in it or even merely existed. Now, that strict behaviour applies only if the table in question has the parameter "immutable" set to "true" (see HIVE-6406).

        With this patch, we can append to existing partitions or non-partitioned tables that already have data in them, as long as the new data being written is compatible with the old data (i.e. one cannot mix file formats when attempting an append).

        As a further note, append is currently not compatible with dynamic partitioning, and a dynamic partitioning job is still unable to append to a table, even if it is a mutable table.
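
The rules in this release note can be read as a sequence of checks on a write. The sketch below is an assumption-laden illustration, not the code in the patch: `check_append` and its parameters are hypothetical, file formats are reduced to plain strings, and the "partition exists but is empty" case mentioned above is not modelled.

```python
from typing import Optional


def check_append(immutable: bool,
                 existing_format: Optional[str],
                 new_format: str,
                 dynamic_partitioning: bool) -> Optional[str]:
    """Return None if the write may proceed, else a reason for rejecting it.

    existing_format is None when the target holds no data yet
    (a simplification made for this example).
    """
    if existing_format is None:
        return None  # first write to the target: nothing to append to
    if immutable:
        return "table is immutable and the target already has data"
    if dynamic_partitioning:
        return "append is not supported with dynamic partitioning"
    if existing_format != new_format:
        return "cannot mix file formats when appending"
    return None


# A mutable table with matching formats accepts the append.
print(check_append(False, "rcfile", "rcfile", False))
# A mismatched file format is rejected even on a mutable table.
print(check_append(False, "rcfile", "orc", False))
```

Returning a reason string rather than a bare boolean keeps each restriction from the note (immutability, dynamic partitioning, format mismatch) distinguishable to the caller.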
        Sushanth Sowmyan made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Sushanth Sowmyan made changes -
        Attachment HIVE-6405.patch [ 12629811 ]

          People

          • Assignee: Sushanth Sowmyan
          • Reporter: Sushanth Sowmyan
          • Votes: 1
          • Watchers: 11
