ACID and Transactions in Hive

What is ACID and why should you use it?

ACID stands for Atomicity (an operation either succeeds completely or fails; it does not leave partial data), Consistency (once an application performs an operation, the results of that operation are visible to it in every subsequent operation), Isolation (operations by one user do not cause unexpected side effects for other users), and Durability (once an operation is complete, it will be preserved even in the face of machine or system failure). These traits have long been expected of database systems as part of their transaction functionality.

Up until Hive 0.13, atomicity, consistency, and durability were provided at the partition level. Isolation could be provided by turning on one of the available locking mechanisms (ZooKeeper or in memory). With the addition of this feature in Hive 0.13 it is now possible to provide full ACID semantics at the row level, so that one application can add rows while another reads from the same partition without interfering with each other.

Transactions and ACID semantics have been added to Hive to address the following use cases:

* Streaming ingest of data. Many users have tools such as Apache Flume, Apache Storm, or Apache Kafka that they use to stream data into their Hadoop cluster. While these tools can write data at rates of hundreds or more rows per second, Hive can only add partitions every fifteen minutes to an hour. Adding partitions more often quickly leads to an overwhelming number of partitions in the table. These tools could stream data into existing partitions, but this would cause dirty read problems for readers (that is, they would see data written after they started their queries) and leave many small files in the partitions' directories, which would put pressure on the NameNode. With this new functionality this use case will be supported while allowing readers to get a consistent view of the data and avoiding too many files.
* Slow changing dimensions. In a typical star schema data warehouse, dimension tables change slowly over time. For example, a retailer will open new stores, which need to be added to the stores table, or an existing store may change its square footage or some other tracked characteristic. These changes lead to inserts of individual records or updates of records (depending on the strategy chosen). Hive is not currently able to support this. Once INSERT ... VALUES and UPDATE are supported this will be possible.
* Data restatement. Sometimes collected data is found to be incorrect and needs correction. Or the first instance of the data may be an approximation (90% of servers reporting) with the full data provided later. Or business rules may require that certain transactions be restated due to subsequent transactions (e.g. after making a purchase a customer may purchase a membership and thus be entitled to discount prices, including on the previous purchase). Once INSERT ... VALUES and UPDATE are supported this will be possible.

Limitations

In Hive 0.13, INSERT ... VALUES, UPDATE, and DELETE are not yet supported. BEGIN, COMMIT, and ROLLBACK are also not yet supported. The plan is to support these in the next release.

Only the ORC file format is supported in this first release. The feature has been built such that any storage format that can determine how updates or deletes apply to base records (basically, one that has an explicit or implicit row id) can be used with ACID, but so far the integration work has only been done for ORC.

Use of the streaming ingest interface (see below) does not yet work with the existing INSERT INTO capability of Hive. Once a table uses the streaming ingest interface, any data added to it via INSERT INTO will be lost.

By default transactions are configured to be off in Hive 0.13. See the Configuration section below for a discussion of which values need to be set to turn them on.

Tables must be bucketed to make use of these features (see the sketch following this section). Tables in the same system not using transactions and ACID do not need to be bucketed.

At this time only snapshot-level isolation is supported. That is, when a given query starts it will be provided with a consistent snapshot of the data. There is no support for dirty read, read committed, repeatable read, or serializable. Snapshot consistency is similar to read committed. With the introduction of BEGIN the intention is to support repeatable read and snapshot isolation.

The existing ZooKeeper and in-memory lock managers are not compatible with transactions. There is no intention to address this issue. See Basic Design below for a discussion of how locks are stored for transactions.
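As an illustration of the storage requirements above (ORC file format and bucketing), a table intended for use with ACID might be declared roughly as follows. This is a minimal sketch: the table name, columns, and bucket count are hypothetical, and any additional table properties needed to enable transactions in your release are not shown here (see the Configuration section below).

```sql
-- Hypothetical dimension table laid out for ACID use: bucketed and stored as ORC.
-- Table name, columns, and bucket count are illustrative only.
CREATE TABLE stores (
  store_id    INT,
  store_name  STRING,
  square_feet INT
)
CLUSTERED BY (store_id) INTO 8 BUCKETS   -- tables must be bucketed to use ACID
STORED AS ORC;                           -- ORC is the only supported format in this release
```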
Streaming Ingest Interface

TODO

Grammar Changes

Several new commands have been added to Hive's DDL in support of ACID and transactions, and some existing DDL has been modified.

A new command SHOW TRANSACTIONS has been added. This will show a list of all currently open and aborted transactions in the system. It is intended for use by administrators. It will print out:

* transaction id
* transaction state
* user who started the transaction
* machine where it was started

A new command SHOW COMPACTIONS has been added. This will provide a list of all tables and partitions currently being compacted or scheduled for compaction. For details on compaction see Basic Design below. This will print:

* database name
* table name
* partition name (if this is a partitioned table)
* whether it is a major or minor compaction
* the state the compaction is in, which can be:
** 'initiated', waiting in the queue to be done
** 'working', being compacted
** 'ready for cleaning', the compaction has been done and the old files are scheduled to be cleaned
* thread id of the worker thread doing the compaction (only if in the working state)
* the time at which the compaction started (only if in the working or ready for cleaning state)
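For example, an administrator monitoring the system could run the two new commands directly from a Hive session; the statements below simply invoke the commands described above:

```sql
-- List all currently open and aborted transactions in the system.
SHOW TRANSACTIONS;

-- List all tables and partitions currently being compacted or scheduled for compaction.
SHOW COMPACTIONS;
```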
The SHOW LOCKS command has been altered to provide information about the new locks associated with transactions. If you are using the ZooKeeper or in-memory lock managers you will notice no difference in the output of this command. If you are using transactions, SHOW LOCKS will return:

* database name
* table name
* partition name (if this is a partitioned table)
* the state the lock is in, which can be:
** 'acquired', the requestor holds the lock
** 'waiting', the requestor is waiting for the lock
** 'aborted', the lock has timed out but not yet been cleaned up
* the type of lock, which can be:
** 'exclusive', no one else can hold the lock at the same time (obtained mostly by DDL operations such as drop table)
** 'shared_read', any number of other shared_read locks can lock the same resource at the same time (obtained by reads; confusingly, an insert operation also obtains a shared_read lock)
** 'shared_write', any number of shared_read locks can lock the same resource at the same time, but no other shared_write locks are allowed (obtained by update and delete)
* id of the transaction this lock is associated with, if there is one
* last time the holder of this lock sent a heartbeat indicating it was still alive
* the time the lock was acquired, if it has been acquired
* Hive user who requested the lock
* host the user is running on

A new option has been added to ALTER TABLE to request a compaction of a table or partition. In general users do not need to request compactions, as the system will detect the need for them and initiate the compaction. However, if compaction is turned off for a table, or a user wants to compact the table at a time the system would not choose to, ALTER TABLE can be used to initiate the compaction. The syntax is:

ALTER TABLE tablename [PARTITION (partition_key = 'partition_value' [, ...])] COMPACT 'compaction_type'

where compaction_type can be MAJOR or MINOR. This will enqueue a request for compaction and return. To watch the progress of the compaction the user can use SHOW COMPACTIONS.

Basic Design

HDFS does not support in-place changes to files. It also does not offer read consistency in the face of writers appending to files being read by a user. In order to provide these features on top of HDFS we have used the standard approach used in other data warehousing tools. Data for the table or partition is stored in a set of base files. New records, updates, and deletes are stored in delta files. A new set of delta files is created for each insert, update, or delete operation. At read time the reader merges the base and delta files, applying any updates and deletes as it reads.

Occasionally these changes need to be merged into the base files. To do this a set of threads has been added to the Hive metastore. They determine when this compaction needs to be done, execute the compaction, and then clean up afterwards. There are two types of compaction, minor and major. Minor compaction takes a set of existing delta files and rewrites them to a single delta file per bucket. Major compaction takes one or more delta files and the base file for the bucket and rewrites them into a new base file per bucket. All compactions are done in the background and do not prevent concurrent reads and writes of the data. After a compaction the system waits until all readers of the old files have finished and then removes the old files.

Previously all files for a partition (or a table, if the table is not partitioned) lived in a single directory. With these changes, any partitions (or tables) written with an ACID-aware writer will have a directory for the base files and a directory for each set of delta files.
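As a concrete illustration of the ALTER TABLE ... COMPACT syntax described above, an administrator could manually enqueue compactions as follows; the table and partition names are hypothetical:

```sql
-- Request a minor compaction of an entire (unpartitioned) table.
ALTER TABLE web_logs COMPACT 'MINOR';

-- Request a major compaction of a single partition of a partitioned table.
ALTER TABLE store_sales PARTITION (ds = '2014-04-01') COMPACT 'MAJOR';

-- Both statements only enqueue the request and return; progress can be
-- watched with SHOW COMPACTIONS.
SHOW COMPACTIONS;
```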
A new lock manager, the DbLockManager, has also been added to Hive. This lock manager stores all lock information in the metastore. In addition, all transactions are stored in the metastore. This means that transactions and locks are durable even in the face of server failure. To avoid clients dying and leaving transactions or locks dangling, a heartbeat is sent from lock holders and transaction initiators to the metastore on a regular basis. If a heartbeat is not received within the configured amount of time, the lock or transaction will be aborted.

Configuration

A number of new configuration values have been added to the system to support transactions.

| Configuration key | Default | Value to turn on transactions and ACID | Notes |
|---|---|---|---|
| hive.txn.manager | org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager | org.apache.hadoop.hive.ql.lockmgr.DbTxnManager | DummyTxnManager replicates pre-Hive 0.13 behavior and provides no transactions. |
| hive.txn.driver | | | Class name of the JDBC driver for the metastore database. |
| hive.txn.connection.string | | | Should be the same as javax.jdo.option.ConnectionURL, with username and password appended. |
| hive.txn.timeout | 300 | | Time after which transactions are declared aborted if the client has not sent a heartbeat, in seconds. |
| hive.txn.max.open.batch | 1000 | | Maximum number of transactions that can be fetched in one call to open_txns(). Increasing this will decrease the number of delta files created when streaming data into Hive, but it will also increase the number of open transactions at any given time, possibly impacting read performance. |
| hive.compactor.initiator.on | false | true (for exactly one instance of the Thrift metastore service) | Whether to run the initiator and cleaner threads on this metastore instance. |
| hive.compactor.worker.threads | 0 | > 0 on at least one instance of the Thrift metastore service | How many worker threads to run on this metastore instance. Worker threads spawn MapReduce jobs to do the compaction. |
| hive.compactor.worker.timeout | 86400 | | Time in seconds after which a compaction job will be declared failed and the compaction requeued. |
| hive.compactor.check.interval | 300 | | Time in seconds between checks to see if any partitions need to be compacted. This should be kept high because each check for compaction requires many calls against the NameNode. |
| hive.compactor.delta.num.threshold | 10 | | Number of delta directories in a partition that will trigger a minor compaction. |
| hive.compactor.delta.pct.threshold | 0.1 | | Percentage size (in bytes) of the deltas relative to the base that will trigger a major compaction. 1 = 100%. |
| hive.compactor.abortedtxn.threshold | 1000 | | Number of aborted transactions on a given partition that will trigger a major compaction. |

Table Properties

If a table owner does not wish the system to automatically determine when to compact, then the table property NO_AUTO_COMPACTION can be set. This will prevent all automatic compactions of that table.
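For example, a table owner might opt a table out of automatic compaction with statements along these lines. The table names are hypothetical, and the property value 'true' is an assumption that should be confirmed against your release; only the property name NO_AUTO_COMPACTION comes from the section above.

```sql
-- Hypothetical example: disable automatic compaction on an existing table.
-- The value 'true' is an assumption and may differ by release.
ALTER TABLE store_sales SET TBLPROPERTIES ('NO_AUTO_COMPACTION' = 'true');

-- The property can also be set when the table is created:
CREATE TABLE web_logs (
  ip      STRING,
  request STRING
)
CLUSTERED BY (ip) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('NO_AUTO_COMPACTION' = 'true');
```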