Hive / HIVE-5317

Implement insert, update, and delete in Hive with full ACID support

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels:
      None

      Description

      Many customers want to be able to insert, update, and delete rows from Hive tables with full ACID support. The use cases are varied, but the forms of the queries that should be supported are:

      • INSERT INTO tbl SELECT …
      • INSERT INTO tbl VALUES ...
      • UPDATE tbl SET … WHERE …
      • DELETE FROM tbl WHERE …
      • MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED THEN ...
      • SET TRANSACTION LEVEL …
      • BEGIN/END TRANSACTION
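
      For illustration, against a hypothetical dimension table the first few statement forms might look like this (the table and column names are examples only, not part of the proposal):

        INSERT INTO customer_dim VALUES (43, 'New Customer', 'Seattle');

        UPDATE customer_dim
        SET city = 'Portland'
        WHERE customer_id = 42;

        DELETE FROM customer_dim
        WHERE customer_id = 42;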

      Use Cases

      • Once an hour, a set of inserts and updates (up to 500k rows) for various dimension tables (e.g. customer, inventory, stores) needs to be processed. The dimension tables have primary keys and are typically bucketed and sorted on those keys.
      • Once a day, a small set (up to 100k rows) of records needs to be deleted for regulatory compliance.
      • Once an hour, a log of transactions is exported from an RDBMS and the fact tables need to be updated (up to 1m rows) to reflect the new data. The transactions are a combination of inserts, updates, and deletes. The table is partitioned and bucketed.
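
      A sketch of what the hourly dimension-table case might look like once MERGE is available (all table and column names here are hypothetical):

        MERGE INTO customer_dim AS t
        USING customer_updates AS s
        ON t.customer_id = s.customer_id
        WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city
        WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.city);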

        Issue Links

          Activity

          Owen O'Malley created issue -
          Carl Steinbach added a comment -

          Will these features place any limitations on which storage formats you can use? Also, I don't think it's possible to support ACID guarantees and HCatalog (i.e. file permission based authorization) simultaneously on top of the same Hive warehouse. Is there a plan in place for fixing that?

          Alan Gates added a comment -

          The only requirement is that the file format must be able to support a rowid. With things like text and sequence file this can be done via a byte offset.

          I'm not seeing why this falls apart with file-based authorization. Are you worried that different users will own the base and delta files? It's no different than the current case where different users may own different partitions. We will need to make sure the compactions can still happen in this case, that is, that the compaction can be run as the user who owns the table, not as Hive.

          Owen O'Malley added a comment -

          Here are my thoughts about how it can be approached.

          Owen O'Malley made changes -
          Attachment: InsertUpdatesinHive.pdf [ 12604051 ]
          Brock Noland added a comment -

          Just curious, I was surprised I didn't see adding transactions to HBase + support in the hbase storage handler as a potential alternative implementation. Could you speak to why your approach is superior to that approach? Also, it'd be great if you posted the design document in the design documents section of the wiki: https://cwiki.apache.org/confluence/display/Hive/DesignDocs

          Alan Gates added a comment -

          Brock, we did look at that. We didn't go that route for a couple of reasons:

          1. Adding transactions to HBase is a fair amount of work. See Google's Percolator paper for one approach to that.
          2. HBase can't offer the same scan speed as HDFS. Since we're choosing to focus this on updates done in OLAP-style workloads, HBase isn't going to be a great storage mechanism for the data. I agree it might make sense to have transactions on HBase for a more OLTP-style workload.
          Owen O'Malley added a comment -

          Expanding on Alan's comments:

          • The HBase scan rate is much lower than HDFS, especially with short-circuit reads.
          • HBase is tuned for write-heavy workloads.
          • HBase doesn't have a columnar format and can't support column projection.
          • HBase doesn't have predicate pushdown into the file format.
          • HBase doesn't have the equivalent of partitions or buckets.
          stack added a comment -

          Alan Gates

          Looks like a bunch of hbase primitives done as mapreduce jobs.

          At first blush, on 1., percolator would be a bunch of work but looks like less than what is proposed here (would you need percolator given you write the transaction id into the row?). On 2., if hbase were made to write ORC, couldn't you MR the files hbase writes after asking hbase to snapshot?

          stack added a comment -

          The HBase scan rate is much lower than HDFS, especially with short-circuit reads.

          What kind of numbers are you talking about, Owen? I'd be interested in knowing what they are. Is the implication also that it cannot be improved? Or would scanning the files written by hbase offline from a snapshot not work for you (snapshots are cheap in hbase; going by your use cases, you'd be doing these runs infrequently enough)?

          HBase is tuned for write-heavy workloads.

          Funny. Often we're accused of the other extreme.

          HBase doesn't have a columnar format and can't support column projection.

          It doesn't. Too much work to add a storage engine that wrote columnar?

          HBase doesn't have the equivalent of partitions or buckets.

          In hbase we call them 'Regions'.

          Bikas Saha added a comment -

          Some questions which I am sure have been considered but are not clear in the document.
          Should the metastore heartbeat be in the job itself rather than the client, since the job is the source of truth and the client can disappear? What happens if the client disappears but the job completes with success and manages to promote the output files?
          Is the transaction id per file or per metastore? Where does the metastore recover the last transaction id(s) from after a restart?

          Owen O'Malley added a comment -

          Bikas,
          In Hive, if the client disappears, the query fails, because the final work (output promotion, display to the user) is done by the client. Also don't forget that a single query may be composed of many MR jobs, although obviously that changes on Tez.

          The transaction id is global for all of the tasks working on the same query.

          The metastore's data is stored in an underlying SQL database, so the transaction information will need to be there also.

          Eric Hanson added a comment -

          Overall this looks like a workable approach given the use cases described (mostly coarse-grained updates with a low transaction rate), and it has the benefit that it doesn't take a dependency on another large piece of software like an update-aware DBMS or NoSQL store.

          Regarding use cases, it appears that this design won't be able to have fast performance for fine-grained inserts. E.g. there might be scenarios where you want to insert one row into a fact table every 10 milliseconds in a separate transaction and have the rows immediately visible to readers. Are you willing to forgo that use case? It sounds like yes. This may be reasonable. If you want to handle it then a different design for the delta insert file information is probably needed, i.e. a store that's optimized for short write transactions.

          I didn't see any obvious problem, due to the versioned scans, but is this design safe from the Halloween problem? That's the problem where an update scan sees its own updates again, causing an infinite loop or incorrect update. An argument that the design is safe from this would be good.

          You mention that you will have one type of delta file that encodes updates directly, for sorted files. Is this really necessary, or can you make updates illegal for sorted files? If updates can always be modelled as insert plus delete, that simplifies things.

          How do you ensure that the delta files are fully written (committed) to the storage system before the metastore treats the transaction that created the delta file as committed?

          It's not completely clear why you need exactly the transaction ID information specified in the delta file names. E.g. would just the transaction ID (start timestamp) be enough? A precise specification of how they are used would be useful.

          Explicitly explaining what happens when a transaction aborts and how its delta files get ignored and then cleaned up would be useful.

          Is there any issue with correctness of task retry in the presence of updates if a task fails? It appears that it is safe due to the snapshot isolation. Explicitly addressing this in the specification would be good.

          Alan Gates added a comment -

          One thing that might help people understand the design: take a look at http://research.microsoft.com/pubs/193599/Apollo3%20-%20Sigmod%202013%20-%20final.pdf , a paper that influenced our thinking and design.

          Alan Gates added a comment -

          Regarding use cases, it appears that this design won't be able to have fast performance for fine-grained inserts. ...

          Agreed, this will fail badly in a one-insert-at-a-time situation. That isn't what we're going after. We would like to be able to handle a batch of inserts every minute, but for the moment that seems like the floor.

          I didn't see any obvious problem, due to the versioned scans, but is this design safe from the Halloween problem?

          As a rule Hive jobs always define their input up front and then scan only once. So even though an update is writing a new record, the delta file it's writing into shouldn't be defined as part of its input. In the future when we move to having one delta file rather than one per write (more details on that to follow), this may be more of an issue, and we'll need to think about how to avoid it.

          How do you ensure that the delta files are fully written (committed) to the storage system before the metastore treats the transaction that created the delta file as committed?

          The OutputCommitter will move the new delta files from a temp directory to the directory of the base file (as is standard in Hadoop apps). Only after this will the Hive client communicate to the metastore that the transaction is committed. If there is a failure between moving the files from temp to base dir, readers will still ignore these files as they will have a transaction id that is listed as aborted.

          It's not completely clear why you need exactly the transaction ID information specified in the delta file names. E.g. would just the transaction ID (start timestamp) be enough?

          The reason for including the end id is so that readers can quickly decide whether they need to scan that file at all, and potentially prune files from their scans. Does that answer the question?
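
          To make that concrete, a partition directory might end up looking something like this (the names are only an illustration of the scheme, not the final layout):

            warehouse/tbl/part=2013-09/base_0000010/bucket_00000
            warehouse/tbl/part=2013-09/delta_0000011_0000011/bucket_00000
            warehouse/tbl/part=2013-09/delta_0000012_0000015/bucket_00000

          A reader whose snapshot does not include transactions 12 through 15 can skip the last delta entirely based on the ids in the name.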

          Is there any issue with correctness of task retry in the presence of updates if a task fails?

          As in standard Hadoop practice, output from tasks will be written to a temp directory. Failed or killed tasks' output will never be promoted to the base file directory and thus will never be seen by readers.

          I'm working on updating the doc with answers to these. One of us will post the updated doc soon.

          Eric Hanson added a comment -

          Okay, thanks for the response!

          Kelly Stirman added a comment -

          I'm curious - why not use ZK to maintain transactional state?

          Hive metastore, if I'm not mistaken, is not HA by default, and it imposes the associated complexity of HA (for MySQL and PG at least) on the user.

          Owen O'Malley added a comment -

          Hive already depends on the metastore being up, so it isn't adding a new SPoF. Zookeeper adds additional semantic complexity, especially for highly dynamic data.

          Roshan Naik made changes -
          Link This issue supercedes HIVE-4196 [ HIVE-4196 ]
          Edward Capriolo added a comment -

          I have two fundamental problems with this concept.

          The only requirement is that the file format must be able to support a rowid. With things like text and sequence file this can be done via a byte offset.

          This is a good reason not to do this. Things that only work for some formats create fragmentation. What about formats that do not have a row id? What if the user is already using the key for something else, like data?

          Once an hour a log of transactions is exported from a RDBS and the fact tables need to be updated (up to 1m rows) to reflect the new data. The transactions are a combination of inserts, updates, and deletes. The table is partitioned and bucketed.

          What this ticket describes seems like a bad use case for hive. Why would the user not simply create a new table partitioned by hour? What is the need to transactionally update a table in place?

          It seems like the better solution would be for the user to log these updates themselves and then export the table with a tool like Sqoop periodically.

          I see this as a really complicated piece of work, for a narrow use case, and I have a very difficult time believing adding transactions to hive to support this is the right answer.

          Edward Capriolo added a comment -

          By the way, I do work like this very often, and having tables that update periodically causes a lot of problems. The first is when you have to re-compute a result 4 days later.

          You do not want a fresh up-to-date table, you want the table as it existed 4 days ago. When you want to troubleshoot a result you do not want your intermediate tables trampled over. When you want to rebuild a month's worth of results you want to launch 31 jobs in parallel, not 31 jobs in series.

          In fact, in programming Hive I suggest ALWAYS partitioning these dimension tables by time and NOT doing what this ticket is describing, for the reasons above (and more).
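
          For example (all names made up), rather than updating a dimension table in place you can keep one partition per snapshot and read whichever snapshot you need:

            CREATE TABLE customer_dim (customer_id INT, name STRING, city STRING)
            PARTITIONED BY (snapshot_date STRING);

            -- reproduce last Tuesday's result by reading that day's snapshot
            SELECT * FROM customer_dim WHERE snapshot_date = '2013-09-17';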

          Owen O'Malley added a comment -

          Ed,
          If you don't use the insert, update, and delete commands, they won't impact your use of Hive. On the other hand, there are a wide number of users who need ACID and updates.

          Thejas M Nair added a comment -

          Ed, for the data re-processing use case, this approach is not what is recommended. This approach is meant for use cases where your changes to a partition are a small fraction of the existing number of rows.
          Even with this approach, it still would make sense to partition your data by time for 'fact tables'. Your dimension table has new records being added periodically, making it more like the 'fact table' use case. This approach will also work with tables partitioned by time.

          Edward Capriolo added a comment -

          Ed,
          If you don't use the insert, update, and delete commands, they won't impact your use of Hive. On the other hand, there are a wide number of users who need ACID and updates.

          Why don't those users just use an ACID database?

          The dimension tables have primary keys and are typically bucketed and sorted on those keys.

          All the use cases defined seem to be exactly what hive is not built for.
          1) Hive does not do much/any optimization of a table when it is sorted.
          2) Hive tables do not have primary keys
          3) Hive is not made to play with tables of only a few rows

          It seems like the idea is to turn hive and hive metastore into a one-shot database for processes that can easily be done differently.

          Once a day a small set (up to 100k rows) of records need to be deleted for regulatory compliance.

          1. Sqoop export to rdbms
          2. run query on rdbms
          3. write back to hive.

          I am not ready to vote -1, but I am struggling to understand why anyone would want to use hive to solve the use cases described. This seems like a square peg in a round hole solution. It feels like something that belongs outside of hive.

          It feels a lot like this:
          http://db.cs.yale.edu/hadoopdb/hadoopdb.html

          Edward Capriolo added a comment -

          "In theory the base can be in any format, but ORC will be required for v1"

          This is exactly what I talk about when I talk about fragmentation. Hive cannot be a system where features only work when using a specific input format. The feature must be applicable to more than just a single file format. Tagging "other file formats" as "LATER" bothers me. Wouldn't the community get more utility if something that worked against a TextFormat was written first, then later against other formats? I know about the "stinger initiative", but developing features that only work with specific input formats does not seem like the correct course of action. It goes against our core design principles:

          https://cwiki.apache.org/confluence/display/Hive/Home

          "Hive does not mandate read or written data be in the "Hive format"---there is no such thing. Hive works equally well on Thrift, control delimited, or your specialized data formats. Please see File Format and SerDe in the Developer Guide for details."

          Sergey Shelukhin added a comment -

          I think "the small number of rows" meant above was for the update, not the entire partition.
          So, large dataset, small number of rows updated. Exporting entire dataset to rdbms to perform a query seems excessive in this case

          Lefty Leverenz added a comment -

          Off topic: This ticket has 100 watchers. Is that a record?

          Alan Gates added a comment -

          MAPREDUCE-279, at 109, currently outscores us. There may be others, but it would be cool to have more watchers than Yarn.

          Roshan Naik made changes -
          Link This issue incorporates HIVE-5687 [ HIVE-5687 ]
          Vinod Kumar Vavilapalli added a comment -

          MAPREDUCE-279, at 109, currently outscores us. There may be others, but it would be cool to have more watchers than Yarn.

          Hehe, looks like we have a race. I'll go ask some of us YARN folks who are also watching this JIRA to stop watching this one.

          Pardeep Kumar added a comment -

          Vinod, it is very much obvious: ACID, updates, and deletes are among the most awaited features of Hive, and many people like me are waiting for them.

          Pardeep Kumar added a comment -

          Will these features be supported on all Hive file formats, i.e. SequenceFile, Text, ORC, RC, etc.?

          Alan Gates added a comment -

          Currently they are being supported in ORC. It is done in such a way that it could be extended to any file format that can support a row id, though there is some code to write to make that happen. It could be extended to support text or sequence file by using the offset in the base file as the surrogate for the rowid. I'm not sure if this would work for RC file or not.

          Lefty Leverenz made changes -
          Link This issue is related to HIVE-6905 [ HIVE-6905 ]
          Pardeep Kumar added a comment -

          Got a quick question from our interested customer - Will Hive be supporting commit and rollback too? And if yes, how will it be done?

          Alan Gates added a comment -

          The intention is to support begin/commit/rollback, hopefully in the next release. See https://issues.apache.org/jira/secure/attachment/12614488/HiveTransactionManagerDetailedDesign%20%281%29.pdf for the design of the transaction manager which will be handling this. Note that with Hive 0.13 we already do transactions in the streaming ingest interface, they just aren't available through SQL yet.

          Satish Thumar added a comment -

          Can someone post an example of updating records in a Hive ORC table?

          Venkat Ankam added a comment -

          Eagerly waiting for this ACID support... Do you know which release this will be implemented in, and by when?

          Regards,
          Venkat

          Alan Gates added a comment -

          My hope is to have the INSERT, UPDATE, DELETE functionality working by the next release of Hive.

          Venkat Ankam added a comment -

          Thanks Alan. When is the next release scheduled?

          Venkat Ankam added a comment -

          Any update on the next release of Hive with this feature?

          Muthupandi K added a comment -

          Waiting for this Hive release. When can we expect this release?

          Alan Gates added a comment -

          The discussion of when to branch for this has been going on on dev@hive.apache.org for a bit now, see http://mail-archives.apache.org/mod_mbox/hive-dev/201408.mbox/%3CCAKjA-pyhnHhxjaCYhWibX3o-RfQ7g2Sk9fyLYBN%3DFx6UofJ33A%40mail.gmail.com%3E

          The summary is, any day now. Once we branch it's usually ~4 weeks for stabilization and release mechanics before the release.

          Alan Gates added a comment -

          All the sub-tasks have been completed.

          Alan Gates made changes -
          Status: Open → Resolved
          Fix Version/s: 0.14.0
          Resolution: Fixed
          Eugene Koifman made changes -
          Link This issue is related to HIVE-8244 [ HIVE-8244 ]
          Dapeng Sun added a comment -

          Hi Owen & Alan
          The feature is great!
          I have a question: if the cluster enables security (Kerberos), does ZooKeeperHiveLockManager support a secured ZooKeeper?

          Thejas M Nair added a comment -

          This has been fixed in 0.14 release. Please open new jira if you see any issues.

          Thejas M Nair made changes -
          Status: Resolved → Closed
          Joyoung Zhang added a comment -

          An error occurred when I used Hive release 0.14.0:

          delete from pokes where foo=97;

          "FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations."

          Alan Gates added a comment -

          See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML and https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions for information on how to set up your cluster and tables to use update and delete.
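
          Roughly, the pieces involved are along these lines (the wiki pages above are the authoritative reference; the settings and table below are only a sketch):

            -- client side / hive-site.xml
            SET hive.support.concurrency = true;
            SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
            SET hive.enforce.bucketing = true;

            -- metastore side, so compactions run
            SET hive.compactor.initiator.on = true;
            SET hive.compactor.worker.threads = 1;

            -- the table must be bucketed, stored as ORC, and flagged transactional
            CREATE TABLE pokes_acid (foo INT, bar STRING)
            CLUSTERED BY (foo) INTO 2 BUCKETS
            STORED AS ORC
            TBLPROPERTIES ('transactional' = 'true');

          Error 10294 is what you get when the transaction manager in use is still the default one rather than DbTxnManager.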

          sanjiv singh added a comment -

          I am looking for named insertion in hive. It seems it is not supported, as you can't explicitly specify the column order.

          INSERT INTO tbl(colm1, colm2...) VALUES....

          Is NAMED INSERTION supported in hive? If no, can we know the rationale behind it? And is there any future plan for it?

          Thanks,
          Sanjiv

          Alan Gates added a comment -

          Is NAMED INSERTION supported in hive? If no, can we know the rationale behind it? And is there any future plan for it?

          This is a larger issue than just ACID, as Hive doesn't support named insertion. I hope to add it soon for all inserts, not just ACID inserts. I don't have a planned release for that yet.
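
          In the meantime an INSERT ... VALUES has to supply a value for every column, in the table's declared order. For a hypothetical table declared as (id INT, name STRING, city STRING) that means:

            INSERT INTO TABLE customer_dim VALUES (1, 'Alice', 'Portland');
            -- there is no way today to write: INSERT INTO customer_dim (id, name) VALUES (1, 'Alice');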

          Madhu added a comment -

          Did you get your issue resolved? I have made all the recommended config changes; INSERT ... VALUES works, but update and delete are still showing the same error. I'd appreciate it if you can help.

          Madhu added a comment -

          I followed all the config recommendations and created the table as ACID, but delete and update are still throwing error 10294 in Hive 0.14. Can someone help in fixing it?

          Alan Gates added a comment -

          Madhu, the place to ask for help is on hive's dev list. Please post the error logs you are getting from both your client and your metastore instance. The DDL you used to create the table will also be very helpful.

          Eugene Koifman made changes -
          Link This issue is related to HIVE-9675 [ HIVE-9675 ]
          Fanhong Li added a comment -

          INSERT INTO TABLE ... VALUES (...) does not store a UTF-8 character correctly:

          insert into table test_acid partition(pt='pt_2')
          values( 2, '中文_2' , 'city_2' )
          ;

          hive> select *
          > from test_acid
          > ;
          OK
          2 -�_2 city_2 pt_2
          Time taken: 0.237 seconds, Fetched: 1 row(s)
          hive>

          CREATE TABLE test_acid(id INT,
          name STRING,
          city STRING)
          PARTITIONED BY (pt STRING)
          clustered by (id) into 1 buckets
          stored as ORCFILE
          TBLPROPERTIES('transactional'='true')
          ;


            People

            • Assignee:
              Owen O'Malley
              Reporter:
              Owen O'Malley
            • Votes:
              34
              Watchers:
              162

              Dates

              • Created:
                Updated:
                Resolved:
